Making BHL scientific illustrations searchable for non-scientists
Presented at 2022 iDigBio Digital Data Conference
The Biodiversity Heritage Library (BHL) has uploaded over 300,000 openly licensed images extracted from its vast collection to the popular image hosting site Flickr. The images -- most of which are colorful scientific illustrations of organisms from across the tree of life -- are meticulously organized by source publication and labeled with taxonomic tags. However, it is very difficult to search by a general description like "a watercolor illustration of an insect with flowers" that might not be captured by the tags. To enable these kinds of searches, I used OpenAI's CLIP (Contrastive Language–Image Pre-training) deep learning model to produce a vector representation of every image in the Flickr dataset. Because CLIP is trained on image-text pairs, a text query can be encoded with the same model and matched against the BHL image vectors to find its nearest neighbors. Finally, I built a front-end interface with the Python framework Streamlit to support searching and exploration. Source code is available at https://github.com/MikeTrizna/bhl_flickr_search.
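Once the images and the text query are embedded in CLIP's shared vector space, the search step reduces to a cosine-similarity nearest-neighbor lookup. Below is a minimal NumPy sketch of that lookup, not the repository's actual implementation; it assumes the image vectors and the query vector were produced by the same CLIP model's image and text encoders, and uses random stand-in vectors at CLIP's 512-dimensional output size.

```python
import numpy as np

def normalize(v):
    # Scale each vector to unit length so dot products equal cosine similarity
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def top_k(image_embeddings, query_embedding, k=5):
    """Return indices of the k images most similar to the query.

    Assumes both inputs come from the same CLIP model:
    image_embeddings from the image encoder (n_images x dim),
    query_embedding from the text encoder (dim,).
    """
    sims = normalize(image_embeddings) @ normalize(query_embedding)
    return np.argsort(-sims)[:k]

# Stand-in data in place of real CLIP embeddings of the BHL images
rng = np.random.default_rng(0)
images = rng.normal(size=(1000, 512))
query = rng.normal(size=512)

print(top_k(images, query, k=3))
```

Because the vectors are normalized before the dot product, the ranking is by cosine similarity, which is the standard way CLIP image and text embeddings are compared.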