How does it work?
Given an image of an object from the archives, we first classify it as blank (no text), an image (e.g. the front of a photograph), printed text (e.g. a newspaper cutting), or handwritten text (e.g. the back of a postcard). The classification is done with a YOLO11 model trained on 1,000 examples. We can then either skip the item (blank or image only) or send it to a vision language model sized, and priced, appropriately for the job. For printed text, we used the open dots.ocr model on the Bouchet cluster to process 650,000 images in 120 hours of GPU time. In energy terms, this cost less than $60, or about 90 kilowatt-hours, compared with the enormous amount of human time it would take to re-type the text.
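To make the routing step concrete, here is a minimal sketch of classify-then-route logic using the Ultralytics YOLO API. The checkpoint name and class labels are illustrative assumptions, not the project's actual ones.

```python
# A minimal routing sketch, assuming a YOLO11 classification model
# fine-tuned on the four archive classes. The weights file and label
# names below are hypothetical.
from ultralytics import YOLO

classifier = YOLO("archive_page_cls.pt")  # hypothetical fine-tuned weights

def route(image_path: str) -> str:
    result = classifier(image_path)[0]        # single-image inference
    label = result.names[result.probs.top1]   # highest-probability class
    if label in ("blank", "image"):
        return "skip"                         # nothing to transcribe
    if label == "printed":
        return "dots_ocr"                     # batch OCR on the cluster
    return "handwriting_vlm"                  # send to a vision language model
```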
For handwritten text, we further classify which model to use. Gemini Pro 3 is the current state of the art, but not every image requires its power. Instead, we intend to use cheaper or older models, or open fine-tuned models such as Qwen3-VL, for the easier images and reserve Gemini Pro 3 for the most complex. Once we have the text, we can use named entity recognition tools such as GLiNER2, along with language models, to extract references to entities such as people and places.
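As a rough illustration of the entity-extraction step, the sketch below uses the original GLiNER package's API as a stand-in, since the GLiNER2 interface may differ; the model name and example text are assumptions for illustration only.

```python
# A minimal named entity recognition sketch using the original GLiNER API
# as a stand-in for GLiNER2. Model name and sample text are illustrative.
from gliner import GLiNER

ner = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")

text = "Postcard sent from Accra to New York, signed by a W. Johnson, 1961."
labels = ["person", "place", "date"]

for entity in ner.predict_entities(text, labels, threshold=0.5):
    print(entity["text"], "->", entity["label"])
```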