Presenting the framework of the whole slide image file Babel fish: An OCR-based file labeling tool

J Pathol Inform. 2024 Oct 23:15:100402. doi: 10.1016/j.jpi.2024.100402. eCollection 2024 Dec.

Abstract

Introduction: Metadata extraction from digitized slides or whole slide image files is a frequent, laborious, and tedious task. In this work, we present a tool to automatically extract all relevant slide information, such as case number, year, slide number, block number, and staining, from the macro-images of the scanned slide. We named the tool Babel fish as it helps translate relevant information printed on the slide. It is written with certain basic assumptions regarding, for example, the location of certain information on the slide; these assumptions can be adapted to where the information is actually located. The extracted metadata can then be used to sort digital slides into databases or to link them with associated case IDs from laboratory information systems.

Material and methods: The tool is based on optical character recognition (OCR). For most information, the easyOCR tool is used. For the block number and cases with insufficient results in the first OCR round, a second OCR with pytesseract is applied. Two datasets are used: one for tool development with 342 slides, and another for testing with 110 slides.
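
The two-round OCR strategy described above can be sketched as a generic fallback. This is a minimal illustration, not the authors' implementation: the function `read_with_fallback`, the validator `is_case_id`, and the case ID pattern are hypothetical, and the OCR engines are passed in as plain callables so that the sketch runs without any OCR library installed.

```python
import re
from typing import Any, Callable, Optional, Tuple

def is_case_id(text: str) -> bool:
    # Hypothetical label convention: up to six digits, a hyphen, two-digit year.
    return re.fullmatch(r"\d{1,6}-\d{2}", text) is not None

def read_with_fallback(
    image: Any,
    primary: Callable[[Any], str],
    secondary: Callable[[Any], str],
    is_valid: Callable[[str], bool],
) -> Tuple[Optional[str], str]:
    """Try the primary OCR engine first; use the secondary one only if
    the primary result fails validation."""
    text = primary(image).strip()
    if is_valid(text):
        return text, "primary"
    text = secondary(image).strip()
    if is_valid(text):
        return text, "secondary"
    # Neither engine produced a plausible value; flag for manual review.
    return None, "failed"

# Stub engines standing in for real OCR calls (assumption: in practice,
# `primary` could wrap easyocr.Reader(["en"]).readtext(image, detail=0)
# and `secondary` could wrap pytesseract.image_to_string(image)).
misread = lambda img: "l2345-24"      # leading "1" misread as "l"
correct = lambda img: " 12345-24\n"   # correct text with stray whitespace

result = read_with_fallback(None, misread, correct, is_case_id)
```

Passing the engines as callables keeps the fallback logic independent of any particular OCR library, which also makes it straightforward to test with stubs as shown.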

Results: For the testing set, the overall accuracy for retrieving all relevant information per slide is 0.982. Of note, the accuracy for most individual metadata fields is 1.000, whereas the accuracy for the block number detection is 0.982.
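
To make the relation between per-field and per-slide accuracy concrete, the following sketch computes both on synthetic records. The data are illustrative only and are not the study's data; the field names and helper functions are assumptions. With 110 slides, an accuracy of 0.982 is arithmetically consistent with 108 fully correct slides (108/110 ≈ 0.982), although the abstract does not state the counts.

```python
FIELDS = ("case_number", "year", "slide_number", "block_number", "staining")

def field_accuracy(predicted, expected, field):
    """Fraction of slides on which a single field was read correctly."""
    hits = sum(p[field] == e[field] for p, e in zip(predicted, expected))
    return hits / len(expected)

def slide_accuracy(predicted, expected):
    """Fraction of slides on which every field was read correctly."""
    hits = sum(
        all(p[f] == e[f] for f in FIELDS) for p, e in zip(predicted, expected)
    )
    return hits / len(expected)

# Synthetic ground truth for 110 slides (illustrative values only).
expected = [
    {"case_number": str(i), "year": "24", "slide_number": "1",
     "block_number": "A", "staining": "HE"}
    for i in range(110)
]
# Copy the ground truth, then corrupt the block number on two slides.
predicted = [dict(record) for record in expected]
predicted[5]["block_number"] = "4"
predicted[60]["block_number"] = "8"

block_acc = round(field_accuracy(predicted, expected, "block_number"), 3)
year_acc = round(field_accuracy(predicted, expected, "year"), 3)
overall_acc = round(slide_accuracy(predicted, expected), 3)
```

Because a slide counts as correct only when all of its fields are correct, the per-slide accuracy can never exceed the accuracy of the weakest field.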

Conclusion: The Babel fish tool can be used to rename vast amounts of whole slide image files in an image analysis pipeline. Furthermore, it could be an essential part of DICOM conversion pipelines, as it extracts relevant metadata like case number, year, block ID, and staining.
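
As an illustration of the renaming use case, the sketch below builds a file name from extracted metadata. The naming scheme, the field names, and the functions `build_name` and `plan_rename` are hypothetical and are not taken from the Babel fish tool; the sketch only computes the target path and does not touch the file system.

```python
import re
from pathlib import Path

# Assumed field order for the file name; adapt to local conventions.
NAME_FIELDS = ("year", "case_number", "block_number", "slide_number", "staining")

def build_name(meta: dict, suffix: str) -> str:
    """Join the metadata fields with underscores, replacing characters
    that are unsafe in file names (e.g. the '&' in 'H&E') with hyphens."""
    parts = [
        re.sub(r"[^A-Za-z0-9]+", "-", str(meta[field])).strip("-")
        for field in NAME_FIELDS
    ]
    return "_".join(parts) + suffix

def plan_rename(path: Path, meta: dict) -> Path:
    """Return the target path in the same directory, keeping the extension."""
    return path.with_name(build_name(meta, path.suffix))

meta = {"year": "2024", "case_number": "12345", "block_number": "A1",
        "slide_number": "3", "staining": "H&E"}
target = plan_rename(Path("scans/slide_0001.svs"), meta)
```

Separating the planning step from the actual rename (for example via `Path.rename`) allows the proposed names to be reviewed or checked for collisions before any file is changed.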

Keywords: Automatization; DICOM; Digital pathology; Optical character recognition.