Automatic Recognition of Learning Resource Category in a Digital Library

S Banerjee, DK Sanyal… - 2021 ACM/IEEE …, 2021 - ieeexplore.ieee.org
2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2021
Digital libraries generally need to process a large volume of diverse document types. Collecting and tagging metadata is a long, error-prone, labor-intensive task. We are attempting to build an automatic metadata extractor for digital libraries. In this work, we present the Heterogeneous Learning Resources (HLR) dataset for document image classification. Each learning resource is first decomposed into its constituent document images (sheets), which are then passed through an OCR tool to obtain their textual representation. Each document image and its textual content are classified with state-of-the-art classifiers. Finally, the labels of the constituent document images are used to predict the label of the overall document.
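The final step, deriving a document-level label from the labels of its sheets, can be illustrated with a minimal sketch. The abstract does not say how the sheet labels are combined, so the majority vote below is an assumption rather than the authors' method. The function name `predict_document_label` and the example category names are hypothetical.

```python
from collections import Counter


def predict_document_label(page_labels):
    """Combine per-sheet labels into one document label by majority vote.

    Assumption: simple majority voting; the source does not specify the
    aggregation rule. Ties go to the tied label that appears first in
    page order.
    """
    if not page_labels:
        raise ValueError("page_labels must contain at least one label")

    counts = Counter(page_labels)
    top_count = max(counts.values())

    # Walk the sheets in order so ties resolve deterministically.
    for label in page_labels:
        if counts[label] == top_count:
            return label


# Hypothetical category names, for illustration only.
sheet_labels = ["slide", "slide", "thesis", "slide"]
document_label = predict_document_label(sheet_labels)
```

A weighted vote based on classifier confidence would be a natural alternative if per-sheet probabilities are available; that is also an assumption and not something the abstract states.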