Automatic Recognition of Learning Resource Category in a Digital Library

Soumya Banerjee1, Debarshi Kumar Sanyal2, Samiran Chattopadhyay3,
Plaban Kumar Bhowmick4, Partha Pratim Das5
1,4,5 IIT Kharagpur, Kharagpur-721302, India
2 Indian Association for the Cultivation of Science, Kolkata-700032, India
1,3 Jadavpur University, Kolkata-700106, India

Abstract

Digital libraries generally need to process a large volume of diverse document types. The collection and tagging of metadata is a long, error-prone, manpower-consuming task. We are attempting to build an automatic metadata extractor for digital libraries. In this work, we present the Heterogeneous Learning Resources (HLR) dataset for document image classification. The individual learning resource is first decomposed into its constituent document images (sheets) which are then passed through an OCR tool to obtain the textual representation. The document image and its textual content are classified with state-of-the-art classifiers. Finally, the labels of the constituent document images are used to predict the label of the overall document.

Index Terms:

deep learning, transfer learning, digital library

I Introduction

A large digital library generally contains resources of different types. For example, the National Digital Library of India (NDLI) curates heterogeneous educational resources including scientific articles, books, paintings, etc. A library may receive curated metadata of resources directly or simply receive the resources from which it has to separately extract the metadata. In the latter case, knowing the type of the document is necessary because metadata extraction mechanisms (manual or automated) generally vary with resource types. Thus, it is worthwhile to explore methods of automatic classification of document types so that the correct metadata extraction process can be identified early.

There is considerable literature on automatic classification of documents, using either textual information in the documents, or layout-specific information, or a combination of both. While most of the research in layout-based classification over the last three decades focused on ingenious hand-crafted features and rule-based or shallow machine learning algorithms [1], the seminal publication by Kang et al. [2] in 2014 sparked interest in the application of deep learning architectures to classify document images (see, e.g., [3]). However, recent approaches mostly operate on the Tobacco dataset, which has 3482 images from 10 classes, or on the larger and more popular RVL-CDIP dataset, which contains 400K document images from 16 classes. A limitation of these approaches is that both datasets consist of single-page English textual documents. For metadata extraction in the context of a digital library, we need to process textual as well as non-textual, multilingual, multi-page document images. Moreover, the existing datasets do not adequately represent the content types in a typical educational digital library.

Figure 1: HLR Dataset Sample

In this paper, we introduce the Heterogeneous Learning Resources (HLR) dataset to address the highlighted problem. We present benchmarks and demonstrate how existing techniques can be extended to address the said limitations.

II The HLR Dataset

The proposed HLR dataset contains $3167$ images from $11$ classes, namely: catalog, handwritten, law reports, maps, music notations, newspaper articles, paintings, presentation, question paper, scientific articles, and thesis. The data has been collected from NDLI and Europeana. For document types with multiple pages, each page is treated as an independent sample. The dataset is divided into training, validation, and test sets in an approximate ratio of $15:4:10$. Additionally, a small set of multi-page documents is included in the dataset for testing multi-page document classification. Figure 1 shows a sample of the HLR dataset. The dataset and code are available at https://github.com/soumyaxyz/DocumentClassify.
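To make the split ratio concrete, the following minimal sketch computes the subset sizes that a $15:4:10$ ratio implies for $3167$ images. The helper `split_counts` and its largest-remainder rounding are assumptions made for illustration; since the ratio is stated only approximately, the actual subset sizes in the released dataset may differ.

```python
def split_counts(total, ratio):
    """Split `total` items into integer parts proportional to `ratio`.

    Hypothetical helper: uses largest-remainder rounding so the parts
    always sum to `total`.
    """
    weight = sum(ratio)
    exact = [total * r / weight for r in ratio]
    counts = [int(x) for x in exact]  # floor of each exact share
    leftover = total - sum(counts)
    # Give the leftover items to the parts with the largest fractional remainder.
    order = sorted(range(len(ratio)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Nominal training, validation, and test sizes for the 3167 single-page images.
train_size, val_size, test_size = split_counts(3167, (15, 4, 10))
```

Under these assumptions, the nominal sizes are 1638 training, 437 validation, and 1092 test images.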

III The Classifier Architecture

We use a transfer learning-based training regime to train a classifier that identifies document classes. The trained classifier is then employed to classify the multi-page documents. Architecturally, the deep learning model has two branches: the image subnetwork and the textual subnetwork. The HLR dataset contains only images; the corresponding textual representations are generated with Tesseract-OCR. It is worth noting that, for a significant fraction of images, this textual representation is an empty string. Thus, the textual subnetwork is, strictly speaking, an auxiliary network that assists the image subnetwork when possible. The image subnetwork uses the VGG16 architecture, pre-trained on ImageNet data, as a feature extractor. The textual branch generates GloVe embeddings for the corresponding textual representation, and these embeddings are passed through a bi-LSTM with self-attention. The outputs of the two branches are concatenated, and the final $11$-dimensional softmax layer is trained with the Adam optimizer against a categorical cross-entropy loss function.
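The fusion step can be illustrated with a small, dependency-free sketch of the forward pass: the two branch outputs are concatenated, passed through a dense layer, and normalized with a softmax over the $11$ classes, after which the categorical cross-entropy is computed for one sample. This is not the authors' implementation. The feature dimensions, the random weights, the function names, and the choice to represent an empty OCR result as an all-zero text feature vector are assumptions made for illustration; the actual model uses VGG16 and bi-LSTM features and learns the weights with Adam.

```python
import math
import random

NUM_CLASSES = 11  # number of HLR classes


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fusion_forward(image_feat, text_feat, weights, bias):
    """Concatenate both branch outputs and apply a dense softmax layer."""
    fused = list(image_feat) + list(text_feat)
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def cross_entropy(probs, true_class):
    """Categorical cross-entropy for a single one-hot encoded label."""
    return -math.log(max(probs[true_class], 1e-12))


# Toy dimensions and random weights (assumptions, not the paper's values).
rng = random.Random(0)
IMG_DIM, TXT_DIM = 8, 4
weights = [[rng.uniform(-0.1, 0.1) for _ in range(IMG_DIM + TXT_DIM)]
           for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

image_feat = [rng.random() for _ in range(IMG_DIM)]
text_feat = [0.0] * TXT_DIM  # assumed encoding of an empty OCR result

probs = fusion_forward(image_feat, text_feat, weights, bias)
loss = cross_entropy(probs, true_class=3)
```

With an all-zero text vector, the text columns of the dense layer contribute nothing to the logits, so the prediction depends on the image features alone. This is consistent with the textual branch acting as an auxiliary signal.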

The classifier, trained on the principal HLR dataset, achieves $94.15$ % accuracy. Figure 3 shows the confusion matrix for the classification task. We also carried out a brief ablation study in which we investigated the subnetworks separately. The text-only branch failed to learn anything, which is expected because the textual data is very sparse. The image-only branch performed almost as well as the full model, achieving $92.1$ % accuracy, but it was worse at distinguishing between highly textual classes such as catalog, law reports, and scientific articles.

IV Classification of Multi-Page Documents

The HLR dataset is primarily a collection of single-page document images. However, most of the classes (i.e., all except handwritten, paintings, and newspaper articles) are derived from multi-page documents, which often contain images from multiple classes. It is easy for a human to identify the primary class of such a document, but doing so programmatically is a nontrivial task. The HLR dataset also contains a small collection of multi-page documents comprising $1483$ document images across $20$ documents.

The trained classifier model is used to generate a label for each page of these documents, and the overall document class is predicted through a majority vote. This approach yields $80$ % accuracy. However, applying a little meta-knowledge about the dataset can significantly improve the performance. The map documents contain many tables and catalogs along with the titular maps, whereas documents from the class catalog do not contain any maps. This knowledge can be incorporated by checking the second label whenever the prediction is catalog with low confidence: if the second label is maps with comparable confidence, the document should be reclassified as maps. Similarly, thesis and scientific articles differ very little apart from the title page. Thus, a document predicted as scientific articles whose title page is classified as thesis should be reclassified as thesis. Applying these corrections improves the accuracy to $95$ %. Figure 3 shows the confusion matrices for the multi-page document classification before and after the corrections are applied.
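The voting procedure and the two corrections can be sketched as follows. This is a minimal illustration, not the released implementation. It assumes that confidence is measured as the share of page votes a label receives, that the first page of a document is its title page, and that the thresholds `low_conf` and `margin` (hypothetical parameters, since no numeric values are given) define "low" and "comparable" confidence.

```python
from collections import Counter


def classify_document(page_labels, low_conf=0.6, margin=0.2):
    """Predict a document label from per-page labels.

    page_labels: predicted label of each page, in page order; the first
    entry is assumed to be the title page.
    low_conf, margin: hypothetical thresholds for "low" and "comparable"
    confidence, measured as vote share.
    """
    if not page_labels:
        raise ValueError("document has no pages")

    total = len(page_labels)
    ranked = Counter(page_labels).most_common()  # labels sorted by vote count
    label, count = ranked[0]
    share = count / total

    # Correction 1: a weak catalog majority with maps close behind becomes maps.
    if label == "catalog" and share < low_conf and len(ranked) > 1:
        second, second_count = ranked[1]
        if second == "maps" and share - second_count / total <= margin:
            label = "maps"

    # Correction 2: scientific articles with a thesis title page becomes thesis.
    if label == "scientific articles" and page_labels[0] == "thesis":
        label = "thesis"

    return label


# Usage: a map document dominated by catalog-like pages.
pages = ["catalog"] * 5 + ["maps"] * 4 + ["thesis"]
predicted = classify_document(pages)
```

In the usage example, catalog wins the plain majority vote with half of the pages, but maps follows closely, so the first correction relabels the document as maps.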

V Conclusion

We have presented a novel heterogeneous multilingual dataset for document image classification, together with a deep learning architecture for classifying heterogeneous document images. We have also presented a system for multi-page document classification. In the future, we will explore whether the results generalize to larger multi-page datasets.

Acknowledgment

This work is supported by the National Digital Library of India Project sponsored by the Ministry of Education, Government of India at IIT Kharagpur.

References

[1] N. Chen and D. Blostein, “A survey of document image classification: problem statement, classifier architecture and performance evaluation,” International Journal of Document Analysis and Recognition (IJDAR), vol. 10, no. 1, pp. 1–16, 2007.
[2] L. Kang, J. Kumar, P. Ye, Y. Li, and D. Doermann, “Convolutional neural networks for document image classification,” in 2014 22nd International Conference on Pattern Recognition. IEEE, 2014, pp. 3168–3172.
[3] A. W. Harley, A. Ufkes, and K. G. Derpanis, “Evaluation of deep convolutional nets for document image classification and retrieval,” in 2015 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2015, pp. 991–995.