How does it work?
Given an image of an object from the archives, we first classify it as blank (no text), an image (e.g. the front of a photograph), printed text (e.g. a newspaper cutting), or handwritten text (e.g. the back of a postcard). The classification is done with a YOLO11 model trained on 1,000 examples. We can then either skip the item (blank or image only) or send it to a vision language model sized, and priced, appropriately for the job. For printed text, we used the open dots.ocr model on the Bouchet cluster to process 650,000 images in 120 hours of GPU time. In energy terms, this cost less than $60, or about 90 kilowatt-hours, compared with the enormous amount of human time it would take to re-type the text.
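To make the routing step concrete, here is a minimal sketch of classify-then-route logic using the Ultralytics YOLO API. The checkpoint name and class labels are illustrative assumptions, not the project's actual ones.

```python
# A minimal routing sketch, assuming a YOLO11 classification model
# fine-tuned on the four archive classes. The weights file and label
# names below are hypothetical.
from ultralytics import YOLO

classifier = YOLO("archive_page_cls.pt")  # hypothetical fine-tuned weights

def route(image_path: str) -> str:
    result = classifier(image_path)[0]        # single-image inference
    label = result.names[result.probs.top1]   # highest-probability class
    if label in ("blank", "image"):
        return "skip"                         # nothing to transcribe
    if label == "printed":
        return "dots_ocr"                     # batch OCR on the cluster
    return "handwriting_vlm"                  # send to a vision language model
```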
For handwritten text, we further classify which model to use. Gemini Pro 3 is the current state of the art, but not every image requires its power. Instead, we intend to use cheaper or older models, or open fine-tuned models such as Qwen3-VL, for the easier images and reserve Gemini Pro 3 for the most complex. Once we have the text, we can use named entity recognition tools such as GLiNER2, along with language models, to extract references to entities such as people and places.
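As a rough illustration of the entity-extraction step, the sketch below uses the original GLiNER package's API as a stand-in, since the GLiNER2 interface may differ; the model name and example text are assumptions for illustration only.

```python
# A minimal named entity recognition sketch using the original GLiNER API
# as a stand-in for GLiNER2. Model name and sample text are illustrative.
from gliner import GLiNER

ner = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")

text = "Postcard sent from Accra to New York, signed by a W. Johnson, 1961."
labels = ["person", "place", "date"]

for entity in ner.predict_entities(text, labels, threshold=0.5):
    print(entity["text"], "->", entity["label"])
```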