Historical Ink: 19th Century Latin American Spanish Newspaper
Corpus with LLM OCR Correction

Laura Manrique-Gómez2  Tony Montes1  Rubén Manrique1
1 Systems and Computing Engineering Department, Universidad de los Andes
2 History and Geography Department, Universidad de los Andes
Bogotá D.C.
{l.manriqueg, t.montes, rf.manrique}@uniandes.edu.co
Abstract

This paper presents two significant contributions: first, a novel dataset of 19th-century Latin American press texts, which addresses the lack of specialized corpora for historical and linguistic analysis in this region. Second, it introduces a framework for OCR error correction and linguistic surface form detection in digitized corpora, utilizing a Large Language Model. This framework is adaptable to various contexts and, in this paper, is specifically applied to the newly created dataset.



1 Introduction

The computational processing of old press texts is a frequently addressed undertaking. Newspapers, as key historical resources, contain a diverse range of information about political, economic, and cultural processes, and they are abundant thanks to focused efforts to preserve them in national archives. Indeed, the discipline of Digital Humanities, which emphasizes the incorporation of digital tools in humanities and social sciences research, has spent much of the past three decades on the task of digitization, resulting in a wealth of curated digital collections Berry and Fagerjord (2017); Dobson (2019). However, digitizing these corpora has brought many challenges in transcribing the images into machine-readable texts.

A significant obstacle in this process is the accuracy of Optical Character Recognition (OCR) technology, especially when dealing with historical documents that often have degraded quality or non-standardized fonts. Traditional OCR methods frequently produce errors that hinder subsequent text analysis and research. Addressing these challenges requires advanced techniques for error correction to ensure the reliability of digitized texts.

To overcome these challenges, we employed a Large Language Model (LLM), specifically GPT-3.5, to perform OCR error correction and enhance the quality of the transcribed texts. This approach leverages the natural language understanding capabilities of GPT-3.5 to detect and correct errors that traditional OCR systems might miss. Incorporating this LLM into our framework improves the accuracy and readability of the digitized texts.

1.1 Related Work

One notable achievement in this realm is the "Chronicling America" initiative. Produced as part of the American National Digital Newspaper Program and funded by the National Endowment for the Humanities and the Library of Congress, this project represents a major stride in the digitization of historical press Humanities . Another substantial project is the "Digging into Data Challenge". A part of the Transatlantic Partnership for Social Sciences and Humanities 2016, this initiative yielded a vast collection of 19th-century press materials known as "Atlas - Oceanic Exchanges. Tracing Global Information Networks in Historical Papers" Exchanges . Other significant works include “Viral Texts: Mapping Networks of Reprinting in 19th-Century Newspapers and Magazines” Cordell and Smith , a project that investigates 19th-century journalistic reports to understand the culture of reprinting in the United States before the Civil War, and the European project “Project Impresso: Media Monitoring the Past” Impresso , which provides significant insights into the specific requirements of the OCR tasks, necessary for transcribing old texts in English and other Germanic languages.

Despite these advancements, there is a lack of specialized corpora for the Latin American press of the 19th century that would allow for an understanding of the region's unique historical, cultural, and Spanish linguistic specificities. To address this gap, our research presents a novel dataset of Latin American press texts written in old Spanish from this period, with the first version predominantly including newspapers from the region formerly known as Nueva Granada, which encompassed Colombia, Panama, Venezuela, and Ecuador. This dataset has been enriched with OCR-LLM models, aiding the detection of character recognition errors and distinguishing them from historical linguistic surface forms. (Footnote 1: The dataset is available at https://huggingface.co/datasets/Flaglab/latam-xix in its three versions: "original", "cleaned", and "corrected".)

2 Sourcing

The dataset was built from the digital catalogs of Colombia's most important newspaper archives, the Colombian National Library and the Luis Ángel Arango Library. The collection focuses on publications with prints or illustrations, with a view to subsequent multimodal modeling. Given that the public collections did not contain metadata about illustrations or cartoons, a manual revision was carried out that extended to the physical collections in situ, since only approximately 50% of the physical collection is digitized. 64 newspaper titles were identified (7% of the total of the 1,655 publications of the collections) whose geographic origin is mostly Bogotá but also includes publications from other cities of the Nueva Granada, as well as Guayaquil, Panamá City, and Lima. This corpus constitutes the version of the dataset used for the research in this paper. However, the intention is to continue broadening the collection to include other countries in Latin America.

The dataset consists of 4,032 pages of scanned images that were processed with a layout model, trained specifically for the task, that was able to separate the images from the texts. The texts were subsequently transcribed with the Azure AI Vision Model, which provides an OCR service. (Footnote 2: Model available through Azure cloud services at https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-read.) In a manually reviewed sample of 2,500 text strings of 1,000 characters each, 8.5% proved unreadable. Moreover, most of the texts contained multiple transcription errors due to the highly artisanal printing techniques and the use of the epoch's different grammatical and lexical forms. These errors reduce the readability of the texts and introduce a bias when the texts are used as input to NLP-LLM models.

2.1 Structuring the data

Once the source newspaper text pictures are processed through Azure’s OCR, the resulting JSON files contain the extracted text. These texts were manually complemented with the newspaper metadata essential for future analysis. This data was structured into a single parquet file with the following column structure: newspaper id, text id within the newspaper (in the format {file number}-{page}-{chunk number}), newspaper title, year, city, and the text itself. For example, the first dataset row id is PD168, 1-page_0-0, for the newspaper El oso from Lima, Peru.
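As an illustration of the text id format described above, the following minimal sketch splits an id such as 1-page_0-0 into its three parts. The class name TextId, the function parse_text_id, and the assumption that the file number and chunk number are integers are ours, inferred only from the single example given; they are not part of the published dataset tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextId:
    """Hypothetical container for the three parts of a text id."""
    file_number: int
    page: str
    chunk_number: int


def parse_text_id(text_id: str) -> TextId:
    # Split once from the left and once from the right, so that a page
    # label containing "-" (an assumption; none is shown) would survive.
    file_number, rest = text_id.split("-", 1)
    page, chunk_number = rest.rsplit("-", 1)
    return TextId(int(file_number), page, int(chunk_number))


example = parse_text_id("1-page_0-0")
print(example)
```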

3 Processing

The dataset contains samples of newspapers that were written by hand or using carving machines. These machines normally wear out with use, and as a result some features of the printed text are easily misread as backward accent marks, unwanted punctuation, or misplaced characters between words. These misreadings interrupt the continuity of the text and add no semantic meaning to it.

Detecting these errors automatically is a challenge because of the language change between 19th-century Spanish and modern Spanish. There is a lack of OCR models trained on this type of text, and many apparent errors result from historical semantic and orthographic shifts: they are not necessarily errors but surface forms of a word. For example, the conjunction "y" (and) used to be written as "i".

There were other texts that, beyond being poorly read, were completely unintelligible to the OCR (and also very difficult for humans to understand) due to the fonts of some newspapers. In addition, the variety of newspaper layouts resulted in texts that contained many scores or numbers, or samples containing only chapter titles or numbering such as "III IV V", which add noise to the dataset.

Some of these errors can be detected with hardcoded rules, which are applied in an initial cleaning and filtering step, but others, such as surface form extraction or more complex OCR errors, are particularly hard to express as coding rules. OCR correction using LLMs is a very useful technique for this task, as shown in Langlais (2024), but it must be used with caution because of the semantic change of text between periods and the fact that LLMs are mostly trained on modern data, which may bias their output for this correction task.

3.1 Cleaning and filtering

Some of the most common cleaning steps for text data consist of removing duplicates and removing noisy data. In particular, in this case, and for purposes of subsequent analysis, these steps are fundamental.

  1. Remove duplicates and empty texts. 1.06% of rows were removed.

  2. Filter out rows where over 50% of the characters are non-alphabetic. 1.28% of rows were removed.

  3. Remove the rows with 4 or fewer tokens. For this, a new tokenizer with a vocabulary size of 52,000 was trained from the BETO pre-trained tokenizer Cañete et al. (2020). 0.81% of rows were removed.
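The three cleaning steps can be sketched as a single pass over the rows. This is a minimal illustration under stated assumptions, not the pipeline used for the dataset: whitespace splitting stands in for the BETO-derived tokenizer, whitespace is ignored when computing the non-alphabetic fraction, and the names non_alpha_fraction and clean_rows are hypothetical.

```python
def non_alpha_fraction(text: str) -> float:
    """Fraction of non-whitespace characters that are not alphabetic."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0
    return sum(not c.isalpha() for c in chars) / len(chars)


def clean_rows(texts, min_tokens=5, max_non_alpha=0.5, tokenize=str.split):
    """Apply the three filters in order and return the surviving texts."""
    seen = set()
    kept = []
    for text in texts:
        stripped = text.strip()
        # Step 1: drop empty texts and exact duplicates.
        if not stripped or stripped in seen:
            continue
        seen.add(stripped)
        # Step 2: drop rows where over 50% of characters are non-alphabetic.
        if non_alpha_fraction(stripped) > max_non_alpha:
            continue
        # Step 3: drop rows with 4 or fewer tokens.
        # ASSUMPTION: str.split is a stand-in for the trained tokenizer.
        if len(tokenize(stripped)) < min_tokens:
            continue
        kept.append(stripped)
    return kept


rows = [
    "La publicacion del Oso se harà dos veces",
    "La publicacion del Oso se harà dos veces",
    "",
    "III IV V",
    "12 34 56 78 90 -- 11 22",
    "Abierta la sesion á las doce",
]
cleaned = clean_rows(rows)
print(cleaned)
```

A subword tokenizer would generally produce more tokens per text than whitespace splitting, so the token threshold here is only indicative.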

3.2 LLM-made Correction

OCR errors from newspapers, on the other hand, are very difficult to detect and correct automatically. Still, they are a major source of noise, as mentioned in Lopresti (2008), and even more so in old sources such as 19th-century newspapers, where these errors appear more often because of the wear and tear of the paper or printing methods that are very different from modern ones.

In this paper, we use GPT to detect and correct OCR errors. Because LLMs were trained mostly on modern language, their corrections also touch historical spellings; using manually checked rules, it is then possible to classify each correction as an OCR error, a word surface form, or neither (a hallucination). These rules, explained in the following section, were revised and selected by a field expert, who also served as an evaluator, testing the precision of the corrections for this case.

To date, most LLMs cannot reliably return all the corrections of a text in a machine-readable format, especially when the texts are very long, as is the case here. For this reason, the LLM was used only to produce the corrected text, making full use of its correction capability, and a diff algorithm was then applied to detect the differences between the original and the corrected text so that they could later be classified. An example of an old original text, a corrected one, and the differences returned by the algorithm can be found in Appendix A, as well as the parameters chosen for this step.
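The diff step can be illustrated with Python's standard difflib module. This is a sketch under our own assumptions: the paper does not name the diff algorithm or its granularity, so the word-level comparison and the function name extract_corrections are hypothetical. The sample strings are taken from the beginning of the Appendix A example.

```python
import difflib


def extract_corrections(original: str, corrected: str):
    """Return (before, after) pairs for every non-equal span of words."""
    a = original.split()
    b = corrected.split()
    # ASSUMPTION: a word-level diff; the granularity used in the paper is not stated.
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


original = "La publicacion del Oso se harà dos veces cada se mana"
corrected = "La publicación del Oso se hará dos veces cada semana"
changes = extract_corrections(original, corrected)
for before, after in changes:
    print(f"{before!r} -> {after!r}")
```

Working on words rather than characters keeps multiword changes such as "se mana" to "semana" together as a single correction, which is convenient for the later classification.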

Figure 1: Distribution of newspapers and years between the whole dataset

3.3 Corrections Classification

Once the corrections are detected and isolated through the diff algorithm, the last step is to classify them. First, however, it is important to state the main differences between the possible labels for each correction:

  • Surface form: In linguistics, the term surface form (or word form) denotes the specific appearance of a word in a given context, contrasting with its lexical form, which pertains to its meaning Sarveswaran et al. (2019). During the 19th century in Latin America, certain words were documented with variant spellings reflecting language shifts over time. It’s important to note that changes in surface forms do not necessarily alter the semantic content of the word, but rather represent orthographic modifications.

  • OCR error: An OCR error, on the other hand, refers to any text misread from the actual newspaper text. OCR errors must be corrected, but they must be carefully distinguished from the genuine linguistic "errors" of the newspaper that reflect the language of the time.

  • Hallucinations: If none of the above is the case, the correction is an LLM hallucination or a translation to modern Spanish, which would be wrong, so these corrections must be omitted.

To enhance the analysis of classification rules, corrections were noted alongside their frequency across the entire dataset to assess their relevance. Additionally, all corrections were converted to lowercase to effectively group them. During this process, many corrections were reviewed and consolidated into a set of linguistic rules for categorization. This framework can be applied to identify and analyze similar changes and classification rules in other languages and specific contexts. In particular, this paper presents a collection of carefully validated rules and exceptions standardized for classifying corrections in the LatamXIX dataset.
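The grouping described above (lowercasing corrections and counting how often each occurs) can be sketched with a Counter. The function name count_corrections and the sample pairs are illustrative assumptions, not data from the corpus.

```python
from collections import Counter


def count_corrections(pairs):
    """Count (before, after) corrections after lowercasing both sides."""
    return Counter((before.lower(), after.lower()) for before, after in pairs)


# Hypothetical corrections as they might come out of the diff step.
pairs = [("Mui", "Muy"), ("mui", "muy"), ("ocasion", "ocasión"), ("MUI", "MUY")]
counts = count_corrections(pairs)
for (before, after), freq in counts.most_common():
    print(before, "->", after, freq)
```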

3.3.1 Accent changes

When the only difference between the original text and the corrected one is an accent change (addition or removal), the correction mostly refers to a surface form. Spanish accent rules in the 19th century were very different from the modern ones Montgomery (1966); in particular, many accent rules did not yet exist, which allowed a diverse set of accentuations for the same word. For instance, the word "antes" was in some cases written as "ántes", with the accent. These surface forms pose a problem for some NLP tasks because, in Spanish, the presence or absence of an accent can change the meaning of a word, notably in the past forms of some verbs, such as "acepto" (present, "I accept") and "aceptó" (past, "he accepted"). Therefore, for certain NLP tasks, it may be preferable to focus solely on the surface forms without accent changes, which is another outcome presented in this paper.
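An accent-only change can be detected by comparing the two words after removing accent marks. The sketch below is one possible implementation, assuming that only acute and grave accents count as accent changes (so that "ñ" and "ü" are preserved); the names strip_accents and is_accent_only_change are hypothetical.

```python
import unicodedata

# ASSUMPTION: only the combining acute (U+0301) and grave (U+0300) marks are
# treated as accents; the tilde of "ñ" and the diaeresis of "ü" are kept.
ACCENT_MARKS = {"\u0301", "\u0300"}


def strip_accents(word: str) -> str:
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in ACCENT_MARKS)
    return unicodedata.normalize("NFC", kept)


def is_accent_only_change(original: str, corrected: str) -> bool:
    o = unicodedata.normalize("NFC", original.lower())
    c = unicodedata.normalize("NFC", corrected.lower())
    return o != c and strip_accents(o) == strip_accents(c)


print(is_accent_only_change("ántes", "antes"))
print(is_accent_only_change("senor", "señor"))
```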

3.3.2 Specific changes

A particular set of changes was extracted to capture an important set of surface forms, and another set of changes to capture some common OCR errors. For example, the most common surface form changes tend to involve the interchange of "i" and "y" or of "j" and "g", as in the words "mui", "jeneral", and "geroglíficos" (currently "muy", "general", and "jeroglíficos"); in fact, the conjunction "y" used to be written as "i" in most early 19th-century texts Bouzouita and Gutiérrez (2015). On the other hand, the most common OCR errors tend to be accent misreadings or number confusions, such as "ó" being read as "6" or "i" being read as "1". A more detailed list of surface form changes, with examples, is available in Appendix B, and a list of OCR error changes in Appendix C.
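A rule lookup of this kind can be sketched as follows. The pairs below are a small subset of Tables B1 and C1, restricted to single-character substitutions; the function name classify_specific_change and the decision to return None for anything else are our assumptions, not the rule engine used for the dataset.

```python
# Subset of the (previous, corrected) pairs from Tables B1 and C1.
SURFACE_PAIRS = {("i", "y"), ("y", "i"), ("j", "g"), ("g", "j"),
                 ("v", "b"), ("b", "v")}
OCR_PAIRS = {("6", "ó"), ("6", "o"), ("1", "y"), ("4", "a")}


def classify_specific_change(original: str, corrected: str):
    """Label a same-length correction whose differing characters all
    belong to one known rule set; return None otherwise."""
    o, c = original.lower(), corrected.lower()
    if o == c or len(o) != len(c):
        return None
    diffs = {(x, y) for x, y in zip(o, c) if x != y}
    if diffs <= OCR_PAIRS:
        return "ocr_error"
    if diffs <= SURFACE_PAIRS:
        return "surface_form"
    return None


for before, after in [("mui", "muy"), ("jeneral", "general"), ("6", "ó")]:
    print(before, "->", after, classify_specific_change(before, after))
```

Multi-character rules such as (k, qu) change the word length and would need an alignment step that this sketch deliberately leaves out.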

3.3.3 Other letter-to-letter changes

In contrast to the changes above, when the original and the corrected text have the same number of letters, the correction generally refers to an OCR error; for example, "la" was misread as "In" and "señor" as "sefor".
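A same-length check is simple, but it is sensitive to Unicode normalization, since a decomposed accented letter counts as two code points. The sketch below normalizes both sides first; the function name letter_to_letter_diff is hypothetical, and the check is assumed to run only after the accent and specific-change rules have been tried.

```python
import unicodedata


def letter_to_letter_diff(original: str, corrected: str):
    """Return the differing positions when both texts have the same
    number of characters, or None when the lengths differ."""
    o = unicodedata.normalize("NFC", original)
    c = unicodedata.normalize("NFC", corrected)
    if o == c or len(o) != len(c):
        return None
    return [(i, x, y) for i, (x, y) in enumerate(zip(o, c)) if x != y]


print(letter_to_letter_diff("In", "la"))
print(letter_to_letter_diff("sefor", "señor"))
```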

3.3.4 Remaining changes

When the correction does not fit any of the previous cases, it does not refer to a surface form, but automatically differentiating between OCR errors and hallucinations is challenging because there may be multiword corrections. For this, a text similarity ratio was computed based on the coincidence of positional characters between the original and the corrected texts. Based on this ratio, the number of words within the corrected text, and the frequency of the correction, each correction was categorized as either an OCR error or a hallucination. For illustration, the OCR error "ascripeión", changed to "suscripción", returned a ratio of 0.76, whereas the hallucination "que", changed to "como", returned a ratio of 0.0, which allows the two to be differentiated in most cases.
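The following sketch shows one way such a decision could be implemented. It uses difflib.SequenceMatcher.ratio() as a stand-in similarity measure; the paper does not state which measure was used, although this one gives about 0.76 and 0.0 for the two examples quoted above. The thresholds min_ratio, max_words, and min_frequency, and the way the three signals are combined, are purely hypothetical.

```python
import difflib


def similarity(original: str, corrected: str) -> float:
    # ASSUMPTION: difflib's ratio stands in for the paper's similarity ratio.
    return difflib.SequenceMatcher(None, original, corrected,
                                   autojunk=False).ratio()


def classify_remaining(original, corrected, frequency,
                       min_ratio=0.5, max_words=3, min_frequency=2):
    """Label a leftover correction as 'ocr_error' or 'hallucination'.
    All thresholds are illustrative, not the values used for LatamXIX."""
    ratio = similarity(original.lower(), corrected.lower())
    n_words = len(corrected.split())
    if ratio >= min_ratio and n_words <= max_words and frequency >= min_frequency:
        return "ocr_error"
    return "hallucination"


print(round(similarity("ascripeión", "suscripción"), 2))
print(similarity("que", "como"))
print(classify_remaining("ascripeión", "suscripción", frequency=5))
print(classify_remaining("que", "como", frequency=5))
```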

4 Results

Feature Value
Size ~26 MB
Rows 10,176
Words ~4.4M
Tokens ~5.5M
Newspapers 58
Years Range 1845 - 1899
Surface Forms 11,397
Non-Accent Surface Forms 2,231
Table 1: Final Historical Ink: LatamXIX corrected dataset

The steps described above yield more than just the LatamXIX dataset; they also produce a set of reusable tools. (Footnote 3: There are 59 newspapers in total, but the only one from the period 1806-1809 was excluded from the dataset overview.) One of the most important outputs of this paper is the LLM OCR correction framework, which was designed so that the dataset or the LLM can easily be exchanged, allowing it to be applied in further research on other datasets. Another important output is the list of surface forms of 19th-century Latin American Spanish from newspapers, together with a general framework for detecting such surface forms in a wide range of contexts where they may play an important role.

In particular, Old Spanish surface forms benefit semantic change detection tasks. These forms capture variations in the meaning of particular words and can help to compare their historical evolution across different periods or among different Spanish-speaking regions. (Footnote 4: The dataset, surface forms, and processing steps are available at https://github.com/historicalink/LatamXIX.)

5 Future Work

This initial version of the dataset primarily includes newspapers from the region formerly known as Nueva Granada, encompassing Colombia, Panama, Venezuela, and Ecuador. Future dataset extensions will aim to incorporate newspapers from a broader range of Latin American countries to ensure a more comprehensive representation of the region. By extending the dataset to include diverse newspapers from various Latin American countries, we aim to provide a richer resource for historical and linguistic research.

Future work will also involve analyzing the semantic changes between 19th-century Spanish in Latin America and modern Spanish. Additionally, comparing these changes to the overall evolution of the Spanish language globally will provide valuable insights into linguistic shifts over time and across different regions.

Moreover, while the OCR correction using LLMs has moved closer to a fully automatic pipeline, a significant portion of rule-defining in the presented framework still requires manual professional effort. An automated method for evaluating OCR accuracy is currently missing, as most of the evaluations and rule definitions were performed manually by an expert. Future work should focus on developing a comprehensive automatic evaluation method for OCR, as well as further automating the rule-defining process, to enhance the efficiency and accuracy of the OCR correction framework.

References

Appendix A Example LLM Correction

The LLM response was successful for most of the texts, except for some cases where Azure's Content Policy was triggered by the text content and for very long texts where the model started to hallucinate the whole text. The following is an example of an original text and its retrieved LLM correction; the changes between them (surface forms and OCR errors) are what the diff algorithm detects:

  • Original: La publicacion del Oso se harà dos veces cada se mana, y constará de un pliego en cuarto ; ofreciendo à mas sus redactores, dar los gravados oportunos, siempre que loexija el asuntode que trate. Redactado por un Num. 8. TEMA del Periodico. POLITICA MILITAR. OCTAVA SESION. Abierta la sesion á las dore y un minuto de la noche , 25 de Febrero de 1845 , con asistencia de todos los Señores Representantes, se leyó y aprobó la acta de la Asamblea anterior , ménos en lo tocante à la torre del Convento de Santo Domingo, punto que quedó para ventilarse en mejor ocasion. En seguida se dió cuenta de una nota del Ejecutivo , referente à que urjía la necesidad de organizar un Ejército ; pues decia el Excmo. Decano: - "Un poder sin bayonetas vale tanto como un cero puesto á la izquierda."

  • Corrected: La publicación del Oso se hará dos veces cada semana, y constará de un pliego en cuarto; ofreciendo además sus redactores, dar los grabados oportunos, siempre que lo exija el asunto de que trate. Redactado por un Num. 8. TEMA del Periódico. POLÍTICA MILITAR. OCTAVA SESIÓN. Abierta la sesión a las dos y un minuto de la noche, 25 de Febrero de 1845, con asistencia de todos los Señores Representantes, se leyó y aprobó la acta de la Asamblea anterior, menos en lo tocante a la torre del Convento de Santo Domingo, punto que quedó para ventilarse en mejor ocasión. Enseguida se dio cuenta de una nota del Ejecutivo, referente a que urgía la necesidad de organizar un Ejército; pues decía el Excmo. Decano: - "Un poder sin bayonetas vale tanto como un cero puesto a la izquierda."

Appendix B Specific Surface Form Changes

For the surface form extraction from the texts and its differentiation from OCR errors and LLM hallucinations, a set of surface form changes was constructed for 19th-century Latin American Spanish. The complete set of known changes with an example for each case is presented in Table B1.

(Previous, Corrected) Example
(á, a) & (a, á) hara → hará
(é, e) & (e, é) fué → fue
(í, i) & (i, í) decia → decía
(ó, o) & (o, ó) ocasion → ocasión
(ú, u) & (u, ú) ningun → ningún
(i, y) & (y, i) mui → muy
(j, g) & (g, j) jente → gente
(v, b) & (b, v) gravado → grabado
(s, x) & (x, s) espiró → expiró
(j, x) & (x, j) méjico → méxico
(c, s) & (s, c) faces → fases
(s, z) & (z, s) dies → diez
(z, c) doze → doce
(q, c) quatro → cuatro
(n, ñ) senor → señor
(ni, ñ) senior → señor
(k, qu) nikel → níquel
(k, c) kiosko → quiosco
(ou, u) boulevar → bulevar
(s, bs) suscriciones → subscripciones
(c, pc) suscriciones → subscripciones
(s, ns) trasportar → transportar
(t, pt) setiembre → septiembre
(rt, r) libertar → liberar
(r, rr) & (rr, r) vireinato → virreinato
(...lo, lo ...) cambiólo → lo cambió
(...se, se ...) acercóse → se acercó
Table B1: Set of Surface Form Changes for its extraction from dataset (2)
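The last two rows of Table B1 differ from the rest in that they reorder text rather than substitute characters. A minimal sketch of how such an enclitic pronoun rewrite could be recognized is given below; the function name is_enclitic_rewrite is hypothetical, and the check assumes the verb keeps exactly the same spelling on both sides, which will not hold when the correction also changes an accent.

```python
CLITICS = ("lo", "se")  # the two pronouns listed in Table B1


def is_enclitic_rewrite(original: str, corrected: str) -> bool:
    """True when 'verb+clitic' was rewritten as 'clitic verb'."""
    parts = corrected.lower().split()
    if len(parts) != 2:
        return False
    clitic, verb = parts
    return clitic in CLITICS and original.lower() == verb + clitic


print(is_enclitic_rewrite("cambiólo", "lo cambió"))
print(is_enclitic_rewrite("acercóse", "se acercó"))
```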

Appendix C OCR Errors Form Changes

There’s also a set of known changes for OCR errors, presented in Table C1.

(Previous, Corrected) Observation
(6, ó)
(6, o)
(1, y) 1 → i → y
(4, a)
Table C1: Set of OCR Error Changes for its classification

Note that these known changes don’t cover the whole dataset’s OCR error cases, and are just a predefined set for quickly detecting some of the most common cases.