The Smithsonian Institution
Trizna_Dearborn_DD2023.pdf (13.38 MB)

AI models are getting better and better at reading handwriting, but how can we find handwritten text to begin with?

poster
posted on 2023-06-15, 13:51 authored by Michael Trizna, Jacqueline Dearborn

Poster presented at 2023 Digital Data in Biodiversity Conference


The Biodiversity Heritage Library (BHL) has made over 60 million pages of biodiversity literature openly available to the world as part of a global biodiversity community. All of BHL's holdings have been run through optical character recognition (OCR) models to extract text. However, general-purpose OCR models are trained mostly on printed text and are notoriously poor at reading handwriting. Very good handwritten text recognition (HTR) models and products exist, but without accurate page-level metadata it is difficult to identify which pages in BHL are handwritten and therefore eligible for processing by an HTR engine. In this case study, the Smithsonian Data Science Lab collaborated with BHL to build an image classification model that detects pages of handwritten text. This will allow BHL's Technical Team to run the identified sub-corpus through an HTR engine, thereby improving text outputs for data-rich archival materials such as scientific field notes. To build this model, the Data Science Lab took the novel approach of using zero-shot image classification with a bimodal large language model (LLM) to quickly tag pages. The handwritten text image classifier will be made openly available to ensure reproducible results and to allow it to be applied to other datasets.
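
The abstract does not say which model or prompts were used, so the following is only a minimal sketch of how zero-shot image classification is commonly done with a joint image-text embedding model. It assumes, hypothetically, that some model has already turned a page image and a set of text prompts (for example "a page of handwritten text" and "a page of printed text") into vectors in a shared space. The function names, the temperature value, and the toy vectors are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def zero_shot_scores(
    image_embedding: Sequence[float],
    prompt_embeddings: Dict[str, Sequence[float]],
    temperature: float = 100.0,  # assumed scaling factor, chosen for illustration
) -> Dict[str, float]:
    """Softmax over scaled image-to-prompt similarities, one score per label."""
    logits = {
        label: temperature * cosine_similarity(image_embedding, embedding)
        for label, embedding in prompt_embeddings.items()
    }
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits.values())
    exps = {label: math.exp(value - max_logit) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def tag_page(
    image_embedding: Sequence[float],
    prompt_embeddings: Dict[str, Sequence[float]],
) -> str:
    """Return the label whose prompt embedding best matches the page image."""
    scores = zero_shot_scores(image_embedding, prompt_embeddings)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for real model embeddings (made up).
    prompts = {"handwritten": [1.0, 0.0], "printed": [0.0, 1.0]}
    page = [0.9, 0.1]
    print(tag_page(page, prompts), zero_shot_scores(page, prompts))
```

No labeled training pages are needed for this step, which is what makes the tagging "zero-shot": the labels come entirely from the wording of the text prompts. In practice the embeddings would come from a real image-text model rather than hand-written vectors.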

    Office of the Chief Information Officer
