Rule-Based, Neural and LLM Back-Translation:
Comparative Insights from a Variant of Ladin

Samuel Frontull Georg Moser
Department of Computer Science
University of Innsbruck, Innsbruck, Austria
{samuel.frontull, georg.moser}@uibk.ac.at

Abstract

This paper explores the impact of different back-translation approaches on machine translation for Ladin, specifically the Val Badia variant. Given the limited amount of parallel data available for this language (only 18k Ladin–Italian sentence pairs), we investigate the performance of a multilingual neural machine translation model fine-tuned for Ladin–Italian. In addition to the available authentic data, we synthesise further translations by using three different models: a fine-tuned neural model, a rule-based system developed specifically for this language pair, and a large language model. Our experiments show that all approaches achieve comparable translation quality in this low-resource scenario, yet round-trip translations highlight differences in model performance.

1 Introduction

In recent years, a variety of methods have been developed to apply neural machine translation (NMT) in low-resource scenarios as well Shi et al. (2022); Haddow et al. (2022); Ranathunga et al. (2023). The back-translation technique has been shown to be particularly effective in such settings Sennrich et al. (2016); Edunov et al. (2018), offering the potential for substantial improvements in translation quality.

This work investigates the influence of the choice of back-translation model for a low-resource language. We do this by comparing the results obtained by fine-tuning a pre-trained multilingual NMT model using synthesised translations generated by (i) an NMT system fine-tuned on the available parallel data, (ii) a rule-based machine translation (RBMT) system developed for the specific language pair, and (iii) a large language model (LLM) prompted to translate given texts, accompanied by 8 exemplary samples.

The quality of the synthesised data, which in turn is determined by the underlying models used to generate it, matters Burlot and Yvon (2018). In our case, the synthesised translations originate from three models, each based on a different paradigm. Thus, the synthesised data is characterised by the specific strengths and weaknesses of the respective paradigms.

Rule-based systems are robust and computationally lightweight, but may face challenges in dealing with ambiguity. Moreover, they lag behind at the grammatical level. Neural models show a high ability to adapt to provided texts, but perform less well when confronted with out-of-domain data Shen et al. (2021). In contrast, large language models (LLMs) are praised for their ability to produce fluent, coherent texts, but they are prone to hallucinations Rawte et al. (2023). It is therefore interesting to investigate how these characteristics affect the quality of the NMT models trained on this data. This comparative analysis sheds light on the nuanced contrasts inherent in the different MT methods.

Our results show that in this low-resource scenario the back-translation model does not have a significant impact, and the performance of the models converges to similar results in terms of BLEU/chrF++ points. This assertion is supported by an empirical analysis carried out on the Val Badia variant of Ladin. Our main contributions are:

• we are the first to explore MT for Ladin in general, with a specific focus on the Val Badia variant,
• we compare an RBMT-, an NMT- and an LLM-based back-translation, providing insights into the efficacy of the methods for Ladin,
• we establish baseline results and make the test data, the RBMT system, as well as the best-performing models publicly available.

In Section 2 we describe our data collection and corpus creation process. In Section 3 we present the three different methods used for the back-translation of monolingual Ladin data into Italian. Section 4 gives an overview of the conducted experiments and Section 5 presents the obtained results. Section 6 discusses related work and similar approaches. In Section 7, we summarize and discuss future work.

The Ladin Language

Ladin is an officially recognised minority language, and thus taught in schools, used in the media, and employed in public administration. For this reason, an effective machine translation system could make a significant contribution to facilitating and supporting communication in this language. However, Ladin is still an unexplored language in the field of machine translation. Indeed, almost no parallel data in a machine-readable format is publicly available for this language, except for a few hundred samples on OPUS Tiedemann (2012). This language, spoken by around 30,000 people in the northern Italian Dolomite regions, exhibits significant diversity across its five main variants (Val Badia, Gherdëina, Fascia, Fodom, Anpezo), each shaped uniquely by its development in different valleys. This diversity is not only evident in the spoken language but has also resulted in distinct standards for written communication. The first author of the paper originates from the Val Badia and is a native speaker of Ladin. Therefore, in this work we concentrate on the standard written language of this valley. In the rest of the paper, we will use lvb as language code to refer to this variant of Ladin, and ita for Italian.

2 Data

This section gives an overview of the linguistic resources available for the Ladin language and describes the method employed to collect data for the specific Val Badia variant of Ladin.

2.1 Available Resources

Publicly accessible parallel data for Ladin is scarce. The Open Parallel Corpus Tiedemann (2012), for example, lists 1543 Ladin–German, 220 Ladin–Italian, and 81 Ladin–English sentences. However, these texts are mainly specific to the variants of Gherdëina and Fassa and were not disseminated by public institutions. For our experiments, we were provided with the archive of the weekly newspaper La Usc di Ladins (https://www.lausc.it/) and a digitised version of the dictionary Ladin Val Badia – Italian Moling et al. (2016). From these data sources we extracted monolingual texts as well as a small dataset of parallel sentences. We furthermore used the dictionary as the basis for implementing an RBMT system. Collecting other parallel texts is time-consuming and has therefore been left for future work.

2.2 Parallel Data

The Ladin (Val Badia) – Italian dictionary Moling et al. (2016) contains, alongside the word entries, also sentences that illustrate their usage. For these sentences the corresponding Italian translation is also given. We have collected this data to create our training dataset, which contains a total of $18,139$ sentences. These sentences are basic and short because they were created specifically to illustrate the use of words and phrases. The average length is $23.43$ and $25.69$ characters for Ladin and Italian respectively. This dataset has been publicly released (https://www.doi.org/10.57967/hf/1878).

2.3 Ladin Monolingual Data

The Ladin newspaper La Usc di Ladins, digitally archived since 2012, provides an extensive dataset of monolingual texts. These texts are published in five different variants, each corresponding to one of the five Ladin valleys. We extracted these texts from the PDF documents and segmented them into individual sentences using the NLTK library Bird et al. (2009), specifically setting Italian as the language to accommodate Ladin. In total, we accumulated $1,937,608$ sentences. These sentences had to be categorised by variant, as described below.

Variant Classification

In order to train a variant classifier, labeled training data is essential. However, the monolingual data from the newspaper PDFs lacked these labels. Therefore, we collected the texts from the newspaper’s website (https://www.lausc.it). Here, the article excerpts are categorized according to their origin valleys and the corresponding language variants, allowing us to create a labeled dataset.

We gathered a corpus of $7,766$ article excerpts with a total of $42,745$ individual sentences for training. These sentences were then split into training data (comprising $75\%$ of the sentences) and test data (the remaining $25\%$). Using the $2,500$ most frequent character $3$-grams as features, we trained an XGBoost variant classifier Chen and Guestrin (2016). On the test data, our classifier achieved $94.48\%$ accuracy in distinguishing the five variants.
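The feature-extraction step described above can be sketched as follows. This is a minimal illustration using only the standard library, not the pipeline actually used: the function names, the lowercasing, and the use of raw counts as feature values are assumptions, the input strings are placeholders rather than Ladin data, and the XGBoost training step itself is omitted.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(sentences, n=3, max_features=2500):
    """Select the most frequent character n-grams across all sentences."""
    counts = Counter()
    for sentence in sentences:
        # Lowercasing is an assumption of this sketch.
        counts.update(char_ngrams(sentence.lower(), n))
    return [gram for gram, _ in counts.most_common(max_features)]


def vectorize(sentence, vocabulary, n=3):
    """Map a sentence to a count vector over the selected n-grams."""
    counts = Counter(char_ngrams(sentence.lower(), n))
    return [counts[gram] for gram in vocabulary]


# Placeholder strings, not real Ladin sentences.
sentences = ["abcabc", "abcd"]
vocabulary = build_vocabulary(sentences, max_features=3)
features = [vectorize(s, vocabulary) for s in sentences]
```

Each resulting vector would then be paired with its variant label and passed to a gradient-boosted classifier.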

The resulting model was used to predict the variant of each of the $1,937,608$ sentences in the monolingual dataset. Table 1 reports the respective number of classifications (and the total number of characters) for each variant. 746,704 sentences were classified as val-badia and were considered for further processing.

variant	# sentences	# characters
val-badia	$746,704$	$71,619,515$
gherdeina	$491,575$	$57,704,414$
fascia	$407,605$	$52,504,357$
fodom	$146,049$	$16,615,059$
anpezo	$145,674$	$16,425,301$

Table 1: Variant classification of monolingual data.

Data Preparation

Because of the spelling reform in 2015, we further processed the sentences classified as val-badia to exclude any with words that are no longer valid. To do this, we used the implementation of our RBMT system, which is explained in more detail in Section 3.2. We used the system to identify unknown words and tried to adapt them to the new spelling according to certain rules. Sentences where this was not possible were left out. This process ensured that the filtered sentences fully adhered to the new spelling, which also facilitated the rule-based translations. We collected a total of 274,665 sentences ($\approx 31\%$ of the extracted sentences), which constitute the monolingual Ladin data we used in our experiments. Among the unused sentences, $\approx 100$k contain only one unknown word or typo, so there would still be potential to acquire additional data if more time were spent analysing and preparing these texts.

2.4 Italian Monolingual Data

As monolingual data for Italian, we used the ELRC-CORDIS_News dataset (https://elrc-share.eu/) from OPUS Tiedemann (2012), which contains 123,691 Italian sentences.

2.5 Test Data

This section introduces the three test sets on which the models were evaluated. This test data differs considerably from the training data, so that it can be considered out-of-domain data.

Testset 1

This dataset includes the statute of the Stiftung Südtiroler Sparkasse, a nonprofit foundation dedicated to supporting and promoting various initiatives and projects, primarily within the province of Bolzano. The document is rich in formal and legal terminology. It contains 424 sentences (https://huggingface.co/datasets/sfrontull/stiftungsparkasse-lld_valbadia-ita).

Testset 2

This dataset is a festive compendium of the history of the region associated with this language Kager (2022). It combines historical narratives with legal and administrative statements. The result is a mixture of stylistic elements and lexical domains. It contains 833 sentences (https://huggingface.co/datasets/sfrontull/autonomia-lld_valbadia-ita).

Testset 3

This dataset delves into the literary realm with the classic story of Pinocchio Collodi (2017), a text rich in narrative prose, dialogue and idiomatic expressions, challenging the models with its creative and figurative language. It contains 1563 sentences (https://huggingface.co/datasets/sfrontull/pinocchio-lld_valbadia-ita).

3 Back-translation Strategies

Back-translation, first introduced in Sennrich et al. (2016), refers to the automatic translation of monolingual texts from the target language into the source language. This method of creating additional training data for the source-to-target translation direction (where the target side remains authentic) has proven effective and is particularly valuable in low-resource scenarios. In this section we present the three different back-translation strategies used in our research to translate monolingual Ladin texts into Italian.
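The data flow of back-translation can be illustrated with a short sketch. It assumes a hypothetical callable `translate_to_source` standing in for any target-to-source MT system (neural, rule-based or LLM); the toy translator and the placeholder strings are for illustration only.

```python
def back_translate(target_sentences, translate_to_source):
    """Create (synthetic source, authentic target) training pairs."""
    return [(translate_to_source(t), t) for t in target_sentences]


def toy_translate(sentence):
    """Hypothetical stand-in for a real target-to-source MT system."""
    return sentence.upper()


# Placeholder data: one authentic pair and two monolingual target sentences.
authentic_pairs = [("SRC A", "tgt a")]
monolingual_target = ["tgt b", "tgt c"]

# The synthetic pairs extend the authentic data; the target side of every
# pair is authentic text, only the source side is machine-generated.
training_pairs = authentic_pairs + back_translate(monolingual_target, toy_translate)
```

Swapping `toy_translate` for a different system changes only the synthetic source side, which is the variable the comparison in this paper isolates.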

3.1 Neural MT

There is evidence that low-resource languages benefit from multilingual models Aharoni et al. (2019). For this reason, we opted to utilise a pre-trained, multilingual model, specifically the Helsinki-NLP/opus-mt-ine-ine model (https://huggingface.co/Helsinki-NLP/opus-mt-ine-ine) available from the Hugging Face Model Hub, as our base model. This model, which is part of OPUS-MT Tiedemann and Thottingal (2020), was trained to translate between 135 Indo-European languages, to which Ladin and Italian also belong.

The Marian MT model, configured for Helsinki-NLP/opus-mt-ine-ine, features 6 encoder and 6 decoder layers, each with 8 attention heads and a feed-forward dimension of 2048. The model employs a beam search size of 6, a dropout rate of 0.1, and an embedding size of 512. It shares embeddings between the encoder and decoder.

We fine-tuned this model for the two translation directions lvb $\rightarrow$ ita and ita $\rightarrow$ lvb on the available authentic training data. We trained a single model for both directions by using the tags >>ita<< for lvb $\rightarrow$ ita and >>lld_Latn<< for the opposite direction as prefixes of the source text. We (re)used the tag >>lld_Latn<< because it is listed as a valid target language ID, as a few Ladin texts were already included in the training of this model. In the rest of the paper, we refer to this fine-tuned model as N1. For fine-tuning, we utilized the AdamW optimizer (https://huggingface.co/docs/transformers/v4.41.0/en/main_classes/optimizer_schedules#transformers.AdamW) with the default settings.
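The tagging scheme for training one model in both directions can be sketched as follows. The helper names and the dictionary layout are hypothetical, and the single space between tag and text is an assumption of this sketch; the sentences are placeholders.

```python
# Target-language tags as described in the text.
LANGUAGE_TAGS = {"ita": ">>ita<<", "lvb": ">>lld_Latn<<"}


def add_target_tag(source_text, target_lang):
    """Prefix the source text with the tag of the desired target language."""
    return f"{LANGUAGE_TAGS[target_lang]} {source_text}"


def make_bidirectional_examples(lvb_sentence, ita_sentence):
    """Turn one parallel pair into one training example per direction."""
    return [
        {"source": add_target_tag(lvb_sentence, "ita"), "target": ita_sentence},
        {"source": add_target_tag(ita_sentence, "lvb"), "target": lvb_sentence},
    ]


# Placeholder strings instead of real sentences.
examples = make_bidirectional_examples("<lvb sentence>", "<ita sentence>")
```

With this layout, every parallel pair contributes to both translation directions of the single model.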

The fine-tuning greatly improves the model in both translation directions, as the scores reported in Tables 3 and 4 show. This demonstrates that the data is reliable and that the model adapts well.

3.2 Rule-based MT

For low-resource languages, RBMT frameworks offer a crucial advantage: leveraging linguistic expertise to overcome the limitations of data-driven methods Khanna et al. (2021). Considering the similar sentence structure and composition of Ladin and Italian (they are both Romance languages), it can be assumed that a rule-based MT system can also perform well without excessive structural transfer work. The available Ladin Val Badia-Italian dictionary served as the foundation for the rule-based MT system we developed in Apertium Forcada and Tyers (2016) for this language pair.

This dictionary provides, in addition to the individual words and word translations, also a list of all inflected forms for each lemma. To effectively utilise this dictionary within our translation system, we mapped the lexicographical data to paradigms within the framework of Apertium (monodix format). Specifically, we created 742 paradigms for a total of 19,034 lemmas. This extensive set includes multiple lexical categories: 597 adverbs, 3,366 adjectives, 11,496 nouns, 162 pronouns, and 2,439 verbs. Additionally, we incorporated proper nouns, short phrases, and word $n$-grams that were identified during the monolingual text extraction process. The resulting bilingual dictionary contains a total of 30,468 entries. The integration with Apertium was facilitated by connecting to and reusing the pre-existing module for Italian (https://github.com/apertium/apertium-ita). The Ladin module (https://github.com/schtailmuel/apertium-lld-ita) and the Ladin–Italian module (https://github.com/schtailmuel/apertium-lld) can be found on GitHub. In the rest of the paper, we refer to this RBMT system as R1.

According to aq-covtest (https://wikis.swarthmore.edu/ling073/Apertium-quality), R1 has a coverage of $96.66\%$ on Testset 1, $95.81\%$ on Testset 2 and $95.90\%$ on Testset 3. However, since we did not develop disambiguation modules, we designed the system to select the first suggestion in cases of morphological and lexical ambiguity, which can sometimes result in incorrect choices that may distort the meaning of the texts. To counteract this, and to further enhance the rule-based translation system, we extracted the $900$ most common word $n$-grams from the texts and added their corresponding translations as entries to the bilingual dictionary.

In addition to the data, we have also included 13 1-level structural transfer rules to avoid common errors. For example, in Ladin, the word pa is used to emphasize a question. In Italian, however, there is no corresponding word for this purpose. We have therefore developed a rule to exclude this word from the translation. The other rules include gender correction, dealing with reflexive verbs and prepositions.
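Two of the behaviours described above, choosing the first candidate under ambiguity and dropping the particle pa, can be mimicked in a toy sketch. This is plain Python rather than Apertium's dictionary or transfer-rule formats; the dictionary entries are abstract placeholders and not taken from the actual Ladin–Italian dictionary, the asterisk for unknown words is a convention chosen for this sketch, and the real rule may depend on context that is ignored here.

```python
# Placeholder entries, NOT real Ladin-Italian dictionary content.
BILINGUAL_DICT = {
    "w_a": ["t_a1", "t_a2"],  # ambiguous entry with two candidates
    "w_b": ["t_b"],
}

# Question particle without an Italian counterpart (see text).
DROPPED_TOKENS = {"pa"}


def translate_tokens(tokens):
    """Translate token by token, always taking the first dictionary candidate."""
    output = []
    for token in tokens:
        if token in DROPPED_TOKENS:
            continue  # excluded from the translation
        candidates = BILINGUAL_DICT.get(token)
        if candidates:
            output.append(candidates[0])  # no disambiguation: first suggestion
        else:
            output.append("*" + token)  # mark unknown words
    return output


result = translate_tokens(["w_a", "pa", "w_b", "w_c"])
```

The share of tokens that are not marked as unknown corresponds to the notion of coverage reported above.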

3.3 MT with a Large Language Model

LLMs have shown remarkable capability in understanding and generating human-like text across various languages and domains Brown et al. (2020). However, their performance in MT tasks exhibits significant variability across languages, especially when comparing high-resource languages to low-resource languages Robinson et al. (2023). We explore the utilisation of an LLM, specifically GPT-3.5 Turbo OpenAI (2024), to generate translations from Ladin to Italian. The model was accessed through the API under the identifier gpt-3.5-turbo-0125. In the rest of the paper, we refer to this LLM as L1.

To enhance throughput and reduce the number of API requests, we generated the translation of 16 Ladin texts in a single request. We provided a set of 8 example translations in JSON format, randomly selected from the available authentic training data and instructed the LLM to generate translations for 16 Ladin texts, which were also provided as a JSON dictionary, with empty Italian translations. Listing 1 (Appendix A) showcases an exemplary prompt.
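The batching and prompt layout described above might look roughly like the following sketch. The instruction wording, the JSON key layout (Ladin text mapped to its Italian translation) and all function names are assumptions and do not reproduce the prompt of Listing 1; no API call is made, and the data are placeholders.

```python
import json
import random


def batched(items, size=16):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_prompt(parallel_pairs, texts, n_examples=8, seed=0):
    """Compose a few-shot prompt with JSON examples and a JSON dict to fill in."""
    shots = random.Random(seed).sample(parallel_pairs, n_examples)
    examples = {lvb: ita for lvb, ita in shots}
    request = {text: "" for text in texts}  # empty Italian translations
    return (
        "Translate the following Ladin (Val Badia) texts into Italian.\n"
        "Examples:\n" + json.dumps(examples, ensure_ascii=False, indent=1) + "\n"
        "Fill in the empty translations and answer with JSON only:\n"
        + json.dumps(request, ensure_ascii=False, indent=1)
    )


def parse_response(raw, texts):
    """Return the translations in order, or None for invalid/incomplete answers."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(not data.get(t) for t in texts):
        return None
    return [data[t] for t in texts]


# Placeholder data instead of real sentences.
parallel = [(f"lvb {i}", f"ita {i}") for i in range(20)]
texts = [f"text {i}" for i in range(40)]
batches = list(batched(texts))
prompt = build_prompt(parallel, batches[0])
```

Validating each answer against the requested keys is one way to detect invalid or incomplete responses, so that a failed batch can be retried.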

With this prompting approach, we translated the entire monolingual Ladin corpus into Italian. By providing the exemplary translations as JSON, we were able to reduce the failure rate (invalid/incomplete answers). The extent to which these examples also helped with the translation itself remains open. The entire process spanned approximately 100 hours, with an average processing time of around 22 seconds per request.
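The reported duration is consistent with the corpus size and batch size. A quick check, assuming the requests were issued one after another and using the 274,665 monolingual sentences from Section 2.3:

```python
import math

n_sentences = 274_665        # monolingual Ladin sentences (Section 2.3)
batch_size = 16              # texts per request
seconds_per_request = 22     # reported average processing time

# Number of requests needed to cover the whole corpus.
n_requests = math.ceil(n_sentences / batch_size)

# Total time, assuming strictly sequential requests.
total_hours = n_requests * seconds_per_request / 3600
```

Under these assumptions, `n_requests` is 17,167 and `total_hours` comes out at roughly 105, in line with the approximately 100 hours stated above.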

4 Experiments

We used the opus-mt-ine-ine model (https://huggingface.co/Helsinki-NLP/opus-mt-ine-ine) as base model for the experiments. In the rest of the paper, we use BM to refer to this model. We fine-tuned BM with the various data sets using the Transformers library Wolf et al. (2020), specifically leveraging the Seq2SeqTrainer module. We always trained a single model for both directions using the corresponding prefixes.

We configured the training to process batches of 16 samples, and restricted the input and output sequences to a maximum of 128 tokens to keep the computational load manageable. The models were evaluated every 16,000 steps. As a stopping criterion, we used three consecutive evaluations resulting in an improvement of less than $0.2$ chrF points on the validation set. For training, we utilised an NVIDIA TITAN RTX graphics card with 24 GB of memory. In total, we have trained 15 models:

• Model N1: BM fine-tuned with the available parallel data consisting of 18,139 sentences.
• Models N2/R2/L2: BM fine-tuned with authentic data and Ladin monolingual data back-translated (BT) to Italian using N1/R1/L1 respectively.
• Models N3/R3/L3: This iteration extends the training base of N2/R2/L2 by integrating Italian monolingual data that has been translated into Ladin utilising N2/R2/L2 respectively.
• Models N4/R4/L4: BM fine-tuned with the same training data as the N3/R3/L3 models, but with the Ladin and Italian monolingual data back-translated with the respective N3/R3/L3 model.
• Models N5/R5/L5: This iteration extends the training base of N4/R4/L4 by also adding the forward-translations (FT) as training data.
• Models A1/A2: A1 was trained on the combined training data used to train N4, R4 and L4. In A2 we additionally included the forward-translations in the training data.
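The stopping criterion used for all of these training runs can be sketched as a small function. This is one possible reading, assuming that improvement is measured against the immediately preceding evaluation (it could also be measured against the best score so far); the function name and the example scores are hypothetical.

```python
def should_stop(chrf_history, min_delta=0.2, patience=3):
    """Stop once `patience` consecutive evaluations each improve by < `min_delta`.

    chrf_history: validation chrF scores in evaluation order.
    """
    if len(chrf_history) <= patience:
        return False  # not enough evaluations to judge yet
    recent_gains = [
        chrf_history[i] - chrf_history[i - 1]
        for i in range(len(chrf_history) - patience, len(chrf_history))
    ]
    return all(gain < min_delta for gain in recent_gains)


# Made-up score sequences for illustration.
plateau = should_stop([30.0, 35.0, 35.1, 35.2, 35.25])
still_improving = should_stop([30.0, 35.0, 35.1, 35.4, 35.45])
```

In the first sequence the last three gains are all below 0.2, so training would stop; in the second, one of the last three gains exceeds the threshold and training continues.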

model	Testset 1	Testset 2	Testset 3
ref	$425.7$	$306.3$	$697.4$
BM	$545.8$	$325.8$	$595.7$
N1	$1237.6$	$437.8$	$805.3$
N2	$633.3$	$414.0$	$695.1$
N3	$484.5$	$331.8$	$606.8$
N4	$367.5$	$323.4$	$605.4$
N5	$476.2$	$320.9$	$593.4$
R1	$559.5$	$421.5$	$727.8$
R2	$593.8$	$405.7$	$722.2$
R3	$434.8$	$309.5$	$601.0$
R4	$402.4$	$305.3$	$594.3$
R5	$387.8$	$306.1$	$608.1$
L1	$380.3$	$294.3$	$517.8$
L2	$695.6$	$406.1$	$675.3$
L3	$396.0$	$345.4$	$634.4$
L4	$377.0$	$318.0$	$563.7$
L5	$393.3$	$316.1$	$569.6$

Table 2: Mean perplexity (ita) of selected models.

We refer to the models N1, …, N5 that were trained with NMT backtranslated data as N-models. Analogously, we use the term R-models and L-models to refer to RBMT and LLM models, respectively.

Models N4/R4/L4 illustrate the gains achieved through iterative back-translation Hoang et al. (2018). Additionally, models N5/R5/L5 demonstrate potential improvements achievable with synthetically generated forward-translation data.

We evaluated these models on the 3 test sets presented in Section 2.5. The results are presented and analysed in the following section.

		Testset 1	Testset 2	Testset 3
Ladin (Val Badia) $\rightarrow$ Italian		BLEU / chrF++	BLEU / chrF++	BLEU / chrF++
NMT opus-mt-ine-ine	BM	$8.17/34.81$	$8.07/34.27$	$2.29/21.12$
BM fine-tuned with authentic data	N1	$12.65/41.55$	$11.49/39.90$	$11.83/36.40$
$+$ lvb monolingual BT with N1	N2	$13.01/42.98$	$12.40/41.26$	$13.23/36.84$
$+$ ita monolingual BT with N2	N3	$21.98/50.32$	$19.37/47.35$	$15.01/39.15$
$+$ lvb and ita monolingual BT with N3	N4	$\underline{22.90}/\underline{50.67}$	$\underline{21.12}/\underline{48.38}$	$\underline{16.17}/\underline{40.41}$
$+$ lvb and ita monolingual FT with N3	N5	$21.49/49.94$	$20.53/48.16$	$15.10/39.47$
RBMT apertium-lld-ita	R1	$11.38/39.72$	$11.60/41.49$	$8.48/34.48$
BM fine-tuned with authentic data
$+$ lvb monolingual BT with R1	R2	$14.43/42.76$	$13.27/42.00$	$13.99/37.37$
$+$ ita monolingual BT with R2	R3	$22.17/50.33$	$19.27/48.17$	$15.89/40.19$
$+$ lvb and ita monolingual BT with R3	R4	$21.36/50.24$	$20.27/\underline{49.08}$	$16.34/\underline{\textbf{40.76}}$
$+$ lvb and ita monolingual FT with R3	R5	$\underline{22.50}/\underline{50.64}$	$\underline{20.37}/49.04$	$\underline{\textbf{16.36}}/40.47$
LLM gpt-3.5-turbo-0125	L1	$\underline{\textbf{26.77}}/\underline{\textbf{53.20}}$	$21.17/48.52$	$10.37/32.36$
BM fine-tuned with authentic data
$+$ lvb monolingual BT with L1	L2	$12.93/43.20$	$12.21/41.21$	$13.22/36.94$
$+$ ita monolingual BT with L2	L3	$22.69/50.74$	$20.37/48.40$	$\underline{15.26}/38.99$
$+$ lvb and ita monolingual BT with L3	L4	$23.01/51.17$	$\underline{21.38}/\underline{49.24}$	$15.12/\underline{39.37}$
$+$ lvb and ita monolingual FT with L3	L5	$23.11/50.84$	$20.86/48.50$	$15.19/39.29$
ALL BM fine-tuned with authentic data
$+$ lvb and ita monolingual BT with N3, R3, L3	A1	$23.58/50.68$	$21.30/48.78$	$15.32/39.56$
$+$ lvb and ita monolingual FT with N3, R3, L3	A2	$\underline{24.12}/\underline{51.42}$	$\underline{\textbf{22.24}}/\underline{\textbf{49.69}}$	$\underline{15.98}/\underline{39.64}$

Table 3: Evaluation Results for Ladin to Italian Translation

5 Results and Discussion

The results of the various experiments conducted are presented in Table 3 and Table 4, where the SacreBLEU Post (2018) and chrF++ Popović (2015) scores for different models and test sets are detailed. To facilitate comparison, the best scores for each approach have been underlined, and the overall best scores for each test set are highlighted in bold.

Additionally, as recommended in Edunov et al. (2020), in Table 2 we report the mean perplexity values for the Italian translations generated by the different models to complement BLEU’s emphasis on adequacy. Perplexity measures how well a language model can predict the next word in a sequence based on the preceding words. A lower perplexity means that the language model finds the text more predictable, which we take as an indication that the text better follows the structure and patterns of the language. We report the mean perplexity values obtained from GPT-2 Radford et al. (2019), computed using the implementation available from Hugging Face (https://huggingface.co/spaces/evaluate-metric/perplexity).
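For reference, the perplexity of a text is the exponential of the negative mean log-probability that the language model assigns to its tokens. The sketch below computes this from given per-token log-probabilities; the input values are made up for illustration, whereas the values in Table 2 were obtained with GPT-2.

```python
import math


def perplexity(token_log_probs):
    """Exponential of the negative mean natural-log token probability."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def mean_perplexity(sentences_log_probs):
    """Average the per-sentence perplexities over a test set."""
    scores = [perplexity(log_probs) for log_probs in sentences_log_probs]
    return sum(scores) / len(scores)


# Made-up log-probabilities: every token predicted with probability 1/4
# in the first sentence and 1/2 in the second.
sentence_a = [math.log(0.25)] * 4
sentence_b = [math.log(0.5)] * 2
average = mean_perplexity([sentence_a, sentence_b])
```

A sentence whose tokens are each predicted with probability 1/4 has perplexity 4, so a lower value corresponds to text the language model finds more predictable.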

Several findings can be deduced from these results, and will be discussed below. In general, there is evidence that augmenting the training data with monolingual data through back-translation is effective. N1 shows that fine-tuning the model with only authentic training data substantially improves the results in both directions (compared to BM) in terms of BLEU/chrF++ points. This shows on the one hand that the training is effective and on the other hand that the available data is adequate. However, it is also evident that the model generates less fluent text, as indicated by the perplexity scores which increase for this model.

The results reveal a progression in difficulty among the test sets, where Testset 3 emerges as the most challenging one. On this test set, all approaches achieve similar low scores, suggesting the presented approach may face limitations with more complex texts.

		Testset 1	Testset 2	Testset 3
Italian $\rightarrow$ Ladin (Val Badia)		BLEU / chrF++	BLEU / chrF++	BLEU / chrF++
NMT opus-mt-ine-ine	BM	$0.08/5.34$	$0.55/13.68$	$0.05/6.86$
BM fine-tuned with authentic data	N1	$10.22/37.11$	$10.14/37.48$	$12.76/35.31$
$+$ lvb monolingual BT with N1	N2	$19.09/46.92$	$18.05/45.44$	$16.50/37.46$
$+$ ita monolingual BT with N2	N3	$19.54/\underline{47.02}$	$\underline{19.45}/\underline{46.21}$	$\underline{\textbf{16.66}}/37.36$
$+$ lvb and ita monolingual BT with N3	N4	$19.61/46.35$	$19.16/45.63$	$16.40/\underline{37.84}$
$+$ lvb and ita monolingual FT with N3	N5	$\underline{20.24}/46.72$	$19.39/45.88$	$15.56/36.97$
RBMT apertium-lld-ita	R1	$4.94/37.50$	$4.50/36.89$	$3.19/27.44$
BM fine-tuned with authentic data
$+$ lvb monolingual BT with R1	R2	$19.18/46.59$	$16.96/44.97$	$15.21/36.76$
$+$ ita monolingual BT with R2	R3	$19.86/46.83$	$17.70/45.69$	$15.04/36.60$
$+$ lvb and ita monolingual BT with R3	R4	$\underline{20.93}/\underline{47.65}$	$\underline{19.32}/\underline{46.58}$	$\underline{16.65}/\underline{\textbf{38.16}}$
$+$ lvb and ita monolingual FT with R3	R5	$19.97/46.88$	$18.65/46.19$	$16.61/38.12$
LLM gpt-3.5-turbo-0125	L1	$5.54/29.03$	$3.84/28.98$	$1.16/18.60$
BM fine-tuned with authentic data
$+$ lvb monolingual BT with L1	L2	$\underline{\textbf{22.09}}/\underline{\textbf{48.69}}$	$19.71/46.59$	$14.16/35.67$
$+$ ita monolingual BT with L2	L3	$21.59/48.23$	$\underline{\textbf{19.96}}/\underline{\textbf{49.96}}$	$14.23/35.81$
$+$ lvb and ita monolingual BT with L3	L4	$20.82/47.86$	$19.87/46.59$	$\underline{16.55}/\underline{38.04}$
$+$ lvb and ita monolingual FT with L3	L5	$20.93/47.70$	$19.38/46.37$	$15.84/37.29$
ALL BM fine-tuned with authentic data
$+$ lvb and ita monolingual BT with N3, R3, L3	A1	$19.83/47.16$	$\underline{19.94}/\underline{46.40}$	$\underline{16.54}/\underline{37.91}$
$+$ lvb and ita monolingual FT with N3, R3, L3	A2	$\underline{20.81}/\underline{47.50}$	$19.71/46.36$	$16.36/37.82$

Table 4: Evaluation Results for Italian to Ladin Translation

In the translation direction lvb $\rightarrow$ ita, the best results were achieved by combining the different back-translations, as the results of model A2 indicate. This emphasises the importance of a broad and diversified dataset. Remarkably, the A2 model is also competitive in the reverse translation direction (ita $\rightarrow$ lvb), although it does not achieve the best results.

A comparison of the models N1, R1 and L1 suggests that the LLM generates more fluent texts (low perplexity) but perhaps does not always accurately reproduce the meaning, as attested by the performance on Testset 3 (low perplexity but also low BLEU score). In this assessment, the RBMT system R1 also performs better than the fine-tuned NMT model N1. One of the reasons for the high perplexity values of N1 is that this model tends to hallucinate because it has been fine-tuned with a small data set. However, this does not seem to affect the performance as the models trained on this data do not perform considerably worse.

The performance of the LLM varies significantly, with pronounced differences between the three test sets in both directions of translation. The significant difference observed between Testset 1 and Testset 2 in the translation direction lvb $\rightarrow$ ita cannot be seen in the R- and N-models. It remains unclear to what extent the LLM benefits from the examples given in the prompt. However, providing examples reduced the rate of invalid or incomplete answers. Even though LLMs are not (yet) suitable for generating texts in low-resource languages out-of-the-box (see performance of L1 in Table 4), Ladin and low-resource languages in general could benefit from this technology. Our experiments show that models trained on back-translations from L1 performed best on Testset 1 and Testset 2 in the translation direction ita $\rightarrow$ lvb.

The inclusion of forward translations in the training data did not consistently improve the models, with the exception of the R-models for lvb $\rightarrow$ ita. This suggests that these synthesised texts introduce too much noise. However, model A2 was able to benefit from this data in the translation direction lvb $\rightarrow$ ita. Filtering this data could slightly improve the model.

As the models achieve similar scores on the test data, we also examined the quality of round-trip translations to gain additional insights. For this, we used 10k sentences from the monolingual Ladin and Italian data (which were also used for training, hence the high scores), translated them into the other language and then back-translated them. Round-trip translation has been proposed as a suitable evaluation method Zhuo et al. (2023). We used the R4/N4/L4 models for this purpose, applying one model $A$ for one direction and the same or a different model $B$ for the opposite direction. Table 5 shows the obtained results. It can be clearly seen that the results are worse when a different model is used for the reverse translation. This shows that although the models achieve similar results on the test data, they work differently. The R4 model proves to be the most stable here, as its translations can be back-translated well by all three models. For other combinations, a high variance can be observed.
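The round-trip setup with a model $A$ for the first direction and a model $B$ for the way back can be sketched as follows. The lookup tables stand in for real translation models, the strings are placeholders, and exact-match rate is used here only as a simple stand-in for BLEU/chrF++.

```python
def round_trip(sentences, forward, backward):
    """Translate with model A (forward), then back with model B (backward)."""
    return [backward(forward(s)) for s in sentences]


def exact_match_rate(originals, reconstructions):
    """Share of sentences reproduced exactly (stand-in for BLEU/chrF++)."""
    matches = sum(o == r for o, r in zip(originals, reconstructions))
    return matches / len(originals)


# Hypothetical toy "models" implemented as lookup tables.
MODEL_A_FORWARD = {"a1": "b1", "a2": "b2"}
MODEL_A_BACKWARD = {"b1": "a1", "b2": "a2"}   # inverts model A exactly
MODEL_B_BACKWARD = {"b1": "a1", "b2": "a2x"}  # disagrees on one sentence

sentences = ["a1", "a2"]
same_model = exact_match_rate(
    sentences, round_trip(sentences, MODEL_A_FORWARD.get, MODEL_A_BACKWARD.get)
)
other_model = exact_match_rate(
    sentences, round_trip(sentences, MODEL_A_FORWARD.get, MODEL_B_BACKWARD.get)
)
```

The toy numbers carry no empirical weight; they only show how the $A/B$ combinations in Table 5 are formed.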

The translation models N4 (https://doi.org/10.57967/HF/2695), R4 (https://doi.org/10.57967/HF/2693), and L4 (https://doi.org/10.57967/HF/2694) have been released on Hugging Face, making them accessible for further research and application.

$A/B$	lvb $\xrightarrow[]{A}$ ita $\xrightarrow[]{B}$ lvb	ita $\xrightarrow[]{A}$ lvb $\xrightarrow[]{B}$ ita
	BLEU / chrF++	BLEU / chrF++
N4/N4	70.57 / 82.56	64.19 / 81.26
N4/R4	58.57 / 74.50	47.16 / 72.09
N4/L4	63.90 / 78.09	59.47 / 78.46
R4/N4	70.80 / 82.20	68.38 / 83.00
R4/R4	80.12 / 88.94	68.51 / 84.73
R4/L4	70.36 / 81.98	67.41 / 82.68
L4/N4	63.72 / 77.53	57.02 / 76.54
L4/R4	57.13 / 73.32	46.95 / 71.52
L4/L4	72.31 / 83.69	65.74 / 82.02

Table 5: Results for Round-Trip Translations

6 Related Work

Data augmentation techniques such as back-translation Sennrich et al. (2016); Hoang et al. (2018) and transfer learning Zoph et al. (2016) are established strategies to improve MT systems. These concepts are discussed in Haddow et al. (2022); Ranathunga et al. (2023), with a focus on low-resource scenarios. The fact that the synthesised data plays a critical role in the quality of the systems trained on it, as it also introduces a certain degree of noise, was discussed extensively in Edunov et al. (2018); Xu et al. (2022). It was shown that tagging synthetic data can be beneficial in the training process Caswell et al. (2019). In our work we do not apply advanced techniques to differentiate synthetic from real translations in training. The fact that RBMT systems can still be valuable for low-resource languages and can even help to achieve better results was also demonstrated for Northern Sámi Aulamo et al. (2021). In our experiments, we could also observe this in the translation direction lvb $\rightarrow$ ita. This could be due to the ability of the RBMT system to provide general knowledge that is not available in the relatively limited parallel training datasets Aulamo et al. (2021). The use of LLMs for MT and different prompting techniques was investigated in Zhang et al. (2023), and their performance in the machine translation of low-resource languages has already been analysed in Moslem et al. (2023). Even if they struggle to generate texts in low-resource languages Robinson et al. (2023), it has already been claimed that they can contribute to advances in machine translation of such languages. Our work is an example of how LLMs can be used in machine translation of a low-resource language; however, further prompt engineering is needed to make better use of such models.

7 Conclusion

In this work, we conducted a detailed comparison of RBMT, NMT and LLMs for back-translation in a low-resource scenario. We tested various back-translation approaches and evaluated them for a language that was previously unexplored in the field of machine translation.

Our current methodology excluded numerous Ladin monolingual sentences. This filtering is, however, less important for the translation direction lvb $\rightarrow$ ita. The discarded data could therefore be re-incorporated to improve the performance of the models in this particular translation direction.

The round-trip translation scores indicate that the initial back-translation with the RBMT system leads to more robust models. Improving the ambiguity resolution of this rule-based translation system could lead to even better results.

The simplicity of the prompts used to query the LLM provides a further starting point for investigation. In particular, the question arises as to whether the results can be improved by further prompt engineering, e.g., by including the meanings of the distinct words occurring in a text, taken from the available dictionary. Investigating the effects of prompt optimisation could provide new insights into maximising the effectiveness of LLMs in machine translation, especially in low-resource scenarios.
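
One possible form of such dictionary-augmented prompting can be sketched as follows: every distinct word of a Ladin sentence is looked up in a glossary, and the entries found are listed in the prompt before the sentence. This is a minimal sketch under stated assumptions. The function names, the prompt wording and the toy glossary are hypothetical and do not reflect the available dictionary or any prompt we actually used.

```python
import re
from typing import Dict, List


def distinct_words(sentence: str) -> List[str]:
    """Return lower-cased distinct words in order of first appearance."""
    seen: Dict[str, None] = {}
    for word in re.findall(r"[^\W\d_]+", sentence.lower()):
        seen.setdefault(word, None)
    return list(seen)


def gloss_lines(sentence: str, glossary: Dict[str, str]) -> List[str]:
    """One line per distinct word that has an entry in the glossary."""
    return [f'- "{w}": {glossary[w]}'
            for w in distinct_words(sentence) if w in glossary]


def build_glossed_prompt(sentence: str, glossary: Dict[str, str]) -> str:
    """Assemble a translation prompt that includes the glossary hits."""
    lines = ["Translate the following Ladin sentence into Italian."]
    glosses = gloss_lines(sentence, glossary)
    if glosses:
        lines.append("Dictionary entries for words in the sentence:")
        lines.extend(glosses)
    lines.append(f'Ladin: "{sentence}"')
    lines.append("Italian:")
    return "\n".join(lines)


# Hypothetical toy glossary, for illustration only.
TOY_GLOSSARY = {"chësc": "questo", "liber": "libro"}
example_prompt = build_glossed_prompt("Chësc liber é to", TOY_GLOSSARY)
```

Words without a glossary entry are simply left unglossed, so the prompt degrades to a plain translation request when the lookup finds nothing.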

We plan to address these research questions in our future work.

Acknowledgements

This research was conducted in collaboration with the Ladin Cultural Institute Micurà de Rü and funded by the Regione Autonoma Trentino-Alto Adige/Südtirol.

References

Aharoni et al. (2019) Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884, Minneapolis, Minnesota. Association for Computational Linguistics.
Aulamo et al. (2021) Mikko Aulamo, Sami Virpioja, Yves Scherrer, and Jörg Tiedemann. 2021. Boosting neural machine translation from Finnish to Northern Sámi with rule-based backtranslation. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 351–356, Reykjavik, Iceland (Online). Linköping University Electronic Press, Sweden.
Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python. O’Reilly Media, Inc.
Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. CoRR, abs/2005.14165.
Burlot and Yvon (2018) Franck Burlot and François Yvon. 2018. Using monolingual data in neural machine translation: a systematic study. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 144–155, Brussels, Belgium. Association for Computational Linguistics.
Caswell et al. (2019) Isaac Caswell, Ciprian Chelba, and David Grangier. 2019. Tagged back-translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 53–63, Florence, Italy. Association for Computational Linguistics.
Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA. Association for Computing Machinery.
Collodi (2017) Carlo Collodi. 2017. Les aventöres de Pinocchio. Istitut Ladin Micurá de Rü, San Martin de Tor.
Edunov et al. (2018) Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500, Brussels, Belgium. Association for Computational Linguistics.
Edunov et al. (2020) Sergey Edunov, Myle Ott, Marc’Aurelio Ranzato, and Michael Auli. 2020. On the evaluation of machine translation systems trained with back-translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2836–2846, Online. Association for Computational Linguistics.
Forcada and Tyers (2016) Mikel L. Forcada and Francis M. Tyers. 2016. Apertium: a free/open source platform for machine translation and basic language technology. In Proceedings of the 19th Annual Conference of the European Association for Machine Translation: Projects/Products, Riga, Latvia. Baltic Journal of Modern Computing.
Haddow et al. (2022) Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Jindřich Helcl, and Alexandra Birch. 2022. Survey of low-resource machine translation. Computational Linguistics, 48(3):673–732.
Hoang et al. (2018) Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. 2018. Iterative back-translation for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 18–24, Melbourne, Australia. Association for Computational Linguistics.
Kager (2022) Thomas Kager. 2022. Alto Adige: un’Europa in piccolo – I 50 anni del Secondo Statuto di autonomia. Provincia autonoma di Bolzano – Alto Adige, Bolzano.
Khanna et al. (2021) Tanmai Khanna, Jonathan N. Washington, Francis M. Tyers, Sevilay Bayatlı, Daniel G. Swanson, Tommi A. Pirinen, Irene Tang, and Hèctor Alòs i Font. 2021. Recent advances in Apertium, a free/open-source rule-based machine translation platform for low-resource languages. Machine Translation, 35(4):475–502.
Moling et al. (2016) Sara Moling, Ulrike Frenademetz, and Marlies Valentin. 2016. Dizionario Italiano–Ladino Val Badia/Dizionar Ladin Val Badia–Talian. Istitut Ladin Micurá de Rü, San Martin de Tor.
Moslem et al. (2023) Yasmin Moslem, Rejwanul Haque, John D. Kelleher, and Andy Way. 2023. Adaptive machine translation with large language models. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 227–237, Tampere, Finland. European Association for Machine Translation.
OpenAI (2024) OpenAI. 2024. GPT-3.5 Turbo [gpt-3.5-turbo-0125]. Available at: https://platform.openai.com. Accessed on: February 2024.
Popović (2015) Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.
Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Ranathunga et al. (2023) Surangika Ranathunga, En-Shiun Annie Lee, Marjana Prifti Skenduli, Ravi Shekhar, Mehreen Alam, and Rishemjit Kaur. 2023. Neural machine translation for low-resource languages: A survey. ACM Comput. Surv., 55(11).
Rawte et al. (2023) Vipula Rawte, Swagata Chakraborty, Agnibh Pathak, Anubhav Sarkar, S.M Towhidul Islam Tonmoy, Aman Chadha, Amit Sheth, and Amitava Das. 2023. The troubling emergence of hallucination in large language models - an extensive definition, quantification, and prescriptive remediations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2541–2573, Singapore. Association for Computational Linguistics.
Robinson et al. (2023) Nathaniel Robinson, Perez Ogayo, David R. Mortensen, and Graham Neubig. 2023. ChatGPT MT: Competitive for high- (but not low-) resource languages. In Proceedings of the Eighth Conference on Machine Translation, pages 392–418, Singapore. Association for Computational Linguistics.
Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.
Shen et al. (2021) Jiajun Shen, Peng-Jen Chen, Matthew Le, Junxian He, Jiatao Gu, Myle Ott, Michael Auli, and Marc’Aurelio Ranzato. 2021. The source-target domain mismatch problem in machine translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1519–1533, Online. Association for Computational Linguistics.
Shi et al. (2022) Shumin Shi, Xing Wu, Rihai Su, and Heyan Huang. 2022. Low-resource neural machine translation: Methods and trends. ACM Transactions on Asian and Low-Resource Language Information Processing, 21(5):1–22.
Tiedemann (2012) Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).
Tiedemann and Thottingal (2020) Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT – building open translation services for the world. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 479–480, Lisboa, Portugal. European Association for Machine Translation.
Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
Xu et al. (2022) Jiahao Xu, Yubin Ruan, Wei Bi, Guoping Huang, Shuming Shi, Lihui Chen, and Lemao Liu. 2022. On synthetic data for back translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 419–430, Seattle, United States. Association for Computational Linguistics.
Zhang et al. (2023) Biao Zhang, Barry Haddow, and Alexandra Birch. 2023. Prompting large language model for machine translation: a case study. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org.
Zhuo et al. (2023) Terry Yue Zhuo, Qiongkai Xu, Xuanli He, and Trevor Cohn. 2023. Rethinking round-trip translation for machine translation evaluation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 319–337, Toronto, Canada. Association for Computational Linguistics.
Zoph et al. (2016) Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. arXiv preprint, arXiv:1604.02201 [cs].

Appendix A Prompt template

I’ll give you samples for the translation from Ladin to Italian:
{
  "translations": [
    {
      "Ladin": "scrí sües minunghes",
      "Italian": "scrivere le proprie opinioni"
    },
    {
      "Ladin": "mëte la secunda",
      "Italian": "mettere la seconda"
    },
    {
      "Ladin": ’"zessa, i á prescia!"’,
      "Italian": ’"scansati, ho fretta!"’
    },
    {
      "Ladin": "passé ia le rü",
      "Italian": "oltrepassare il fiume"
    },
    ...
    {
      "Ladin": "chësc liber é to",
      "Italian": "questo libro è tuo"
    }
  ]
}
Please generate the translation of each of the 16 entries in the given dictionary, where the translations are empty. Return the same JSON dictionary where the values for Italian are filled:
{
  "translations": [
    {
      "Ladin": "Sperun da salvé almanco val’, dijun:",
      "Italian": ""
    },
    {
      "Ladin": "Ince tröc toponims y cognoms ladins desmostra che l’identité ladina é coliada ala natöra y ala cultura da munt",
      "Italian": ""
    },
    {
      "Ladin": "De profesciun este pech... co este pa rové pro chësc laur?",
      "Italian": ""
    },
    ...
    {
      "Ladin": "I dormi n pü’ domisdé y spo ciamó val’ ora dan mesanöt",
      "Italian": ""
    }
  ]
}

Listing 1: Prompt template used to obtain the Italian translations from the LLM.
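
A prompt of the shape shown in Listing 1 can be assembled programmatically from a list of sample pairs and a batch of untranslated sentences. The sketch below is one way to do this with Python's standard `json` module. The function names `build_few_shot_prompt` and `parse_reply` are hypothetical, the exact whitespace of the original prompt is not reproduced, and the call to the LLM itself is deliberately left out.

```python
import json
from typing import Dict, List, Tuple


def build_few_shot_prompt(samples: List[Tuple[str, str]], batch: List[str]) -> str:
    """Build a prompt with translated samples followed by entries to fill in."""
    shots = {"translations": [{"Ladin": lvb, "Italian": ita} for lvb, ita in samples]}
    todo = {"translations": [{"Ladin": lvb, "Italian": ""} for lvb in batch]}
    return (
        "I'll give you samples for the translation from Ladin to Italian:\n"
        f"{json.dumps(shots, ensure_ascii=False, indent=2)}\n"
        f"Please generate the translation of each of the {len(batch)} entries "
        "in the given dictionary, where the translations are empty. "
        "Return the same JSON dictionary where the values for Italian are filled:\n"
        f"{json.dumps(todo, ensure_ascii=False, indent=2)}"
    )


def parse_reply(reply: str, batch: List[str]) -> Dict[str, str]:
    """Parse a JSON reply, keeping only requested sentences with a non-empty translation."""
    requested = set(batch)
    entries = json.loads(reply)["translations"]
    return {
        entry["Ladin"]: entry["Italian"]
        for entry in entries
        if entry.get("Ladin") in requested and entry.get("Italian")
    }


# Sample pairs and sentences taken from Listing 1; a real batch would hold 16 entries.
SAMPLES = [
    ("chësc liber é to", "questo libro è tuo"),
    ("passé ia le rü", "oltrepassare il fiume"),
]
BATCH = [
    "De profesciun este pech... co este pa rové pro chësc laur?",
    "I dormi n pü’ domisdé y spo ciamó val’ ora dan mesanöt",
]
prompt = build_few_shot_prompt(SAMPLES, BATCH)
```

Filtering the reply against the requested sentences is a design choice of this sketch: it discards entries the model invents or leaves empty, so that only usable pairs enter the synthetic training data.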
