Aligning Neural Machine Translation Models:
Human Feedback in Training and Inference

Miguel Moura Ramos^1,2 Patrick Fernandes^1,2,3 António Farinhas^1,2
André F. T. Martins^1,2,4
¹Instituto Superior Técnico, Universidade de Lisboa (ELLIS Unit Lisbon)
²Instituto de Telecomunicações ³Carnegie Mellon University ⁴Unbabel
[email protected]

Abstract

Reinforcement learning from human feedback (RLHF) is a recent technique to improve the quality of the text generated by a language model, making it closer to what humans would generate. A core ingredient in RLHF’s success in aligning and improving large language models (LLMs) is its reward model, trained using human feedback on model outputs. In machine translation (MT), where metrics trained from human annotations can readily be used as reward models, recent methods using minimum Bayes risk decoding and reranking have succeeded in improving the final quality of translation. In this study, we comprehensively explore and compare techniques for integrating quality metrics as reward models into the MT pipeline. This includes using the reward model for data filtering, during the training phase through RL, and at inference time by employing reranking techniques, and we assess the effects of combining these in a unified approach. Our experimental results, conducted across multiple translation tasks, underscore the crucial role of effective data filtering, based on estimated quality, in harnessing the full potential of RL in enhancing MT quality. Furthermore, our findings demonstrate the effectiveness of combining RL training with reranking techniques, showcasing substantial improvements in translation quality.

1 Introduction

Neural machine translation (NMT) models (Bahdanau et al., 2015; Vaswani et al., 2017) are typically trained with maximum likelihood estimation (MLE), maximizing the log-probability of the next word in a translation given the previous words and the source sentence. While this approach has been effective at training high-quality MT systems, the mismatch between the training and inference objectives can lead to exposure bias (Bengio et al., 2015; Ranzato et al., 2016; Wiseman and Rush, 2016), which hinders the model’s ability to recover from early mistakes. Furthermore, the suitability of model likelihood as a proxy for generation quality has been questioned in machine translation (Koehn and Knowles, 2017; Ott et al., 2018) and beyond (Perez et al., 2022). These challenges sparked interest in alternative training and decoding paradigms for MT, such as reinforcement learning (RL) or minimum Bayes risk decoding (MBR).

More recently, the widespread success of reinforcement learning from human feedback (Stiennon et al., 2022) has highlighted the importance of a good reward model that closely approximates human preferences for the task at hand. While, in general, this requires training a reward model from scratch for the specific problem, in the case of machine translation (MT), the evaluation community has achieved significant progress in developing automatic quality estimation and evaluation metrics learned from human quality annotations (e.g., COMET-QE (Rei et al., 2020), COMET (Rei et al., 2022a), BLEURT (Sellam et al., 2020)), which can be repurposed as reward models. As a consequence, recent research integrating these metrics into the training (Gulcehre et al., 2023) or decoding (Fernandes et al., 2022) procedures has had considerable success in improving the quality of translations. However, no previous work has systematically compared the effect of integrating metrics at different stages of the MT pipeline or attempted to combine these techniques in a unified approach.

In this work, we perform a comprehensive study on the integration of MT quality metrics into the MT pipeline as reward models. As illustrated in Figure 1, we assess their use at different stages: as a means for data filtering, during the training process through RL, and at inference time by way of reranking techniques. Furthermore, we explore the results of combining these methods.

Figure 1: Preference models can have multifaceted roles within the MT pipeline. They can serve as effective data filters, refining datasets by incorporating user preferences. They can also assume a pivotal role in classic RL training by providing rewards to optimize the MT model performance. Finally, they can act as rerankers during the decoding phase, selecting the final translation by maximizing their scores derived from user preferences.

We attempt to answer the following research questions:

• Can data filtering based on estimated quality help minimize RL training instability?

• Which metrics are more suitable as reward models in RL training? Are reference-free metrics competitive with reference-based ones?

• How does the quality of translations achieved through RL training compare with those produced through reranking approaches? Can these two approaches be effectively combined to further enhance translation quality?

Our main contributions arise from the research questions mentioned above:

• Inspired by prior work that uses cross-lingual encoders to score translation representations in an aligned multilingual vector space, we propose an alternative data filtering method that uses COMET-QE (Rei et al., 2020), a more robust model, to curate a high-quality dataset that empirically helps to minimize RL training instability.

• We show that neural metrics such as COMET(-QE) (Rei et al., 2022a; Rei et al., 2020) are more suitable than BLEU (Papineni et al., 2002) for RL training. Contrary to what happens with MBR decoding, RL training results in improved scores across all types of metrics, not only neural ones. In particular, using a reward model based on QE works surprisingly well, possibly paving the way for unsupervised training of NMT systems.

• Experiments in EN→DE and EN→FR show that both RL training and reranking techniques enhance translation quality, with RL training often outperforming reranking methods. Furthermore, combining RL and MBR decoding results in more consistent improvements across various evaluation metrics.

• We quantify and discuss the trade-offs in running time at both training and inference, clarifying the efficiency and suitability of each approach.

2 Background

2.1 Neural Machine Translation

An NMT model has learnable parameters, $\theta$, to estimate the probability distribution $p_{\theta}(y|x)$ over a set of hypotheses $\mathcal{Y}$, conditioned on a source sentence $x$. MLE is the training principle of estimating $\theta$, given parallel data, formalized as

$$\mathcal{L}(\theta,y_{1:L})=-\frac{1}{L}\sum_{t=1}^{L}\log p_{\theta}(y_{t}|y_{0},\ldots,y_{t-1}). \tag{1}$$
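As an illustration of Eq. 1 only (not the code used in our experiments), the length-normalized negative log-likelihood of a single reference can be sketched as follows. The function name `mle_loss` and the token probabilities are hypothetical; in practice the probabilities would come from the NMT model.

```python
import math

def mle_loss(token_probs):
    """Length-normalized negative log-likelihood of one reference (Eq. 1).

    token_probs[t] is the probability the model assigns to reference
    token y_t given the preceding tokens (assumed to be precomputed).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Toy three-token reference with made-up probabilities.
loss = mle_loss([0.5, 0.25, 0.5])
print(round(loss, 4))
```

A perfectly predicted reference (all probabilities equal to 1) gives a loss of zero, and the loss grows as the model assigns less probability to the reference tokens.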

NMT systems typically employ maximum a posteriori (MAP) decoding to generate translations,

$$\hat{y}_{\mathrm{MAP}}=\arg\max_{y\in\mathcal{Y}}\log p_{\theta}(y|x), \tag{2}$$

where algorithms such as greedy decoding or beam search (Reddy, 1977) approximate the most probable translation given the source. An alternative approach is to sample translations according to $p_{\theta}(y|x)$, using techniques such as top-$k$ or nucleus sampling (Fan et al., 2018; Holtzman et al., 2020).
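To make the difference between these decoding strategies concrete, the sketch below applies greedy selection, top-$k$ filtering, and nucleus filtering to a single next-token distribution. It is a toy illustration under our own assumptions: the helper names and the four-token distribution are hypothetical, and a real system would apply these steps to the model's output distribution at every decoding step.

```python
def greedy_pick(probs):
    """Greedy decoding step: choose the single most probable token."""
    return max(probs, key=probs.get)

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their mass."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    kept, mass = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        mass += prob
        if mass >= p:
            break
    return {tok: prob / mass for tok, prob in kept}

# Made-up next-token distribution.
next_token = {"house": 0.5, "home": 0.3, "building": 0.15, "hut": 0.05}
print(greedy_pick(next_token))
print(sorted(top_k_filter(next_token, 2)))
print(sorted(nucleus_filter(next_token, 0.6)))
```

Sampling would then draw the next token from the filtered, renormalized distribution instead of always taking its mode.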

In §3.3 of this paper, we also consider two distinct reranking approaches (Fernandes et al., 2022), namely $N$-best reranking and MBR decoding. While $N$-best reranking selects the candidate translation that maximizes a given (reference-free) metric, MBR decoding ranks candidates using reference-based metrics, maximizing the expected utility (or minimizing the risk).

2.2 MT Evaluation

Human evaluations are the most reliable way to assess the performance of MT systems, but they are time-consuming and costly. For that reason, the standard way to evaluate MT is through automatic evaluation metrics, which can be reference-based or quality estimation (QE) metrics.

Reference-based metrics compare the generated translation to human-written reference texts. Lexical reference-based metrics, such as the widely used BLEU (Papineni et al., 2002), rely on word overlap and n-gram matching, making them ineffective for translations that have the same meaning but are substantially different from the reference. On the other hand, neural metrics, such as COMET (Rei et al., 2022a), are a recent alternative that relies on neural networks trained on human-annotated data and that leverages contextual embeddings to address semantic similarity.

QE assesses translation quality without human references, being particularly useful in dynamic, data-intensive environments, where references are costly and time-consuming to obtain. This paper focuses on sentence-level QE as a reward model, providing a single quality assessment for each translation. COMET-QE (Rei et al., 2020) is a state-of-the-art reference-free quality estimation metric derived from COMET used to evaluate MT performance.

Neural reference-based and QE metrics are valuable preference models because they offer a more accurate and contextually-aware measure of translation quality, aligning better with human preferences and judgments (Freitag et al., 2022b).

2.3 Reinforcement Learning Training in NMT

In MT, approaches based on reinforcement learning (RL) cast the problem as a Markov decision process (MDP), where a source sentence $x=(x_{1},\ldots,x_{n})$ is translated into a target sentence $y=(y_{1},\ldots,y_{m})$. Under this perspective, the NMT system can be viewed as the agent with a conditional probability distribution based on its parameters, $p_{\theta}(y_{t}|x,y_{<t})$. The states of the MDP are defined by the target sentence that has already been decoded, $s_{t}=(y_{1},\ldots,y_{t})$ with $t<m$, and the action corresponds to the selection of the next word, $y_{t+1}$. Given the states and actions, all transitions are deterministic, and the reward function, $R$, is provided by the MT evaluation model, which returns a quality score for the generated translation $\hat{y}$. The main purpose of using RL in NMT is to provide learning signals that go beyond a single reference translation, by providing reward signals for arbitrary translations. MLE, by contrast, provides less robust learning signals that are more susceptible to the shortcomings of noisy references. However, if the reward model relies on reference-based metrics, some vulnerability to noisy references may still persist. Accordingly, the goal of RL training is to maximize the expected reward, $L_{\mathrm{rl}}(\theta)=\mathbb{E}_{p_{\theta}(\hat{y}|x)}[R(\hat{y})]$. Commonly used RL training procedures include REINFORCE (Williams, 1992), minimum risk training (Och, 2003; Shen et al., 2016), and proximal policy optimization (PPO).
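The expected-reward objective can be illustrated with a deliberately simplified toy, in which the policy is a softmax over three enumerated candidate translations rather than a token-level NMT model. All names and reward values below are hypothetical. For a softmax policy, the policy gradient of $\mathbb{E}[R]$ with respect to logit $i$ is $p_{i}(R_{i}-\mathbb{E}[R])$, which is the REINFORCE estimator with an expected-reward baseline computed exactly instead of by sampling.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(logits, rewards):
    """E_p[R] for a softmax policy over enumerated candidates."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def policy_gradient(logits, rewards):
    """Exact gradient of E_p[R] w.r.t. the logits: p_i * (R_i - E[R])."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - baseline) for p, r in zip(probs, rewards)]

logits = [0.0, 0.0, 0.0]        # uniform initial policy
rewards = [0.2, 0.9, 0.5]       # made-up metric scores per candidate
before = expected_reward(logits, rewards)
grad = policy_gradient(logits, rewards)
logits = [z + 1.0 * g for z, g in zip(logits, grad)]  # one ascent step
after = expected_reward(logits, rewards)
print(before < after)
```

The update shifts probability mass toward the candidate with the highest reward, which is the behavior that token-level algorithms such as REINFORCE and PPO approximate with sampled or decoded translations.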

3 Aligning MT with Reward Models

3.1 Data Filtering

The success of fine-tuning NMT models with MLE is highly dependent on the quantity and quality of the training dataset (Wang et al., 2018; Koehn and Knowles, 2017; Khayrallah and Koehn, 2018). This is because accurate references are crucial for computing meaningful learning signals that correctly guide the NMT model towards improved translations (Kong et al., 2018). Despite its recent successes, RL-based training can be unstable, so using only high-quality data could help mitigate this instability. This can be addressed via data filtering, by seeking a good balance between the aggressiveness of filtering and the resulting dataset size: if the original dataset is already small, too much filtering can be detrimental to the performance of NMT systems (Zoph et al., 2016; Jiao et al., 2020). Furthermore, when looking at the RL scenario, having a sufficiently large training dataset can help guarantee that the NMT model explores a wide range of scenarios for policy improvement.

We apply our data filtering method to the considerably large and noisy WMT datasets (Bojar et al., 2015; Bojar et al., 2016), since they have been reported to contain less relevant and uncorrelated sentences that can lead to sub-optimal results when used during training (Koehn et al., 2020; Malli and Tambouratzis, 2022). We do not apply data filtering to the IWSLT2017 (Cettolo et al., 2012; Cettolo et al., 2017) dataset due to concerns about its limited amount of available data. Further dataset filtering could result in a training dataset that is too small, which is not desirable for training MT systems.

As illustrated in Figure 1, to perform the training dataset filtering, we use a filter that reranks the sentence pairs according to quality scores that indicate the correlation and relevance of each sentence and its given reference. This approach allows us to filter out low-quality sentence pairs, thereby improving the overall quality of the data. In our approach, we use a robust preference model called COMET-QE (Rei et al., 2020) as the data filter, which combines the use of encoders and a regression model trained on human-annotated data to estimate the quality score of each sentence pair. This reference-less model is expected to be more accurate in quality score estimation and to align better with human judgments than the currently used cross-lingual encoders, which only take into account vector-space mapping similarity (Bane and Zaretskaya, 2021). Furthermore, COMET-QE seems particularly suitable as our preference model during data filtering, as it is a multilingual reference-free neural-based metric trained on human annotations of translation quality, and therefore can be used to filter by thresholding on predicted quality or on the number of sentences in the training set. After scoring all sentence pairs, we select the threshold based on the number of high-quality sentence pairs to use as the filtered dataset for RL training. For that, we apply different thresholds and sizes to the reranked sentences. We then fine-tune our baseline with MLE on these subsets and select the subset that gives the overall best-performing model on the development set. These best-performing models serve as baselines for our RL-based training and reranking methods during decoding.
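The ranking-and-truncation step described above can be sketched as follows. This is a minimal illustration and not our actual pipeline: `filter_by_quality` is a hypothetical helper, and the scores are made-up stand-ins for the sentence-level values that a QE model such as COMET-QE would assign to each (source, target) pair.

```python
def filter_by_quality(pairs, scores, keep=None, threshold=None):
    """Rank sentence pairs by QE score, then keep the best ones.

    keep:      maximum number of top-ranked pairs to retain (size-based).
    threshold: minimum score a pair must reach (score-based).
    Either criterion, or both, may be given.
    """
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    ranked = sorted(zip(pairs, scores), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [item for item in ranked if item[1] >= threshold]
    if keep is not None:
        ranked = ranked[:keep]
    return [pair for pair, _ in ranked]

pairs = [
    ("good morning", "guten Morgen"),
    ("thank you", "Seite 3"),          # misaligned, noisy pair
    ("see you soon", "bis bald"),
]
qe_scores = [0.82, 0.11, 0.74]         # hypothetical QE scores
top_two = filter_by_quality(pairs, qe_scores, keep=2)
above = filter_by_quality(pairs, qe_scores, threshold=0.8)
print(top_two)
```

Filtering by size (`keep`) and by score (`threshold`) correspond to the two ways of thresholding mentioned in the text.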

In conclusion, it is worth noting that our data filtering method is, as shown in Figure 1, one of three methods we cover for employing a preference model in the MT pipeline. This filtering method can significantly increase the performance of MT systems by introducing feedback in an earlier stage of the pipeline.

3.2 Training Phase

The use of RL-based training has the potential to bridge the gap between MLE training objectives, MT evaluation metrics and human-like translations. However, it faces challenges of instability and inefficiency, especially in gradient variance and reward computation. As illustrated in Figure 1, the RL training process is composed of an NMT model that generates translations that are evaluated by the reward model through rewards that represent the quality of the translation. This reward is used by the policy gradient algorithm to update the NMT model’s policy. To address the problem of gradient variance, we employ PPO (Schulman et al., 2017) as our policy gradient algorithm since it is a stable and efficient algorithm that updates the policy parameters in a controlled way with a predetermined proximity bound, avoiding sudden changes that might destabilize the learning.
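The proximity bound can be illustrated with the standard clipped surrogate objective of PPO, evaluated here for a single sample. This is a sketch of the general technique and not our training code; the clipping range `epsilon=0.2` is an illustrative assumption and not a value taken from our setup.

```python
def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Per-sample PPO clipped surrogate objective (to be maximized).

    ratio:     new-policy probability divided by old-policy probability
               for the sampled action.
    advantage: estimated advantage of that action.
    epsilon:   clipping range (assumed value, for illustration only).
    """
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# A large policy change on a good action earns no extra credit ...
print(ppo_clipped_objective(1.5, 1.0))
# ... while moving away from the old policy on a bad action is still penalized.
print(ppo_clipped_objective(0.5, -1.0))
```

Because the objective stops improving once the probability ratio leaves the interval $[1-\epsilon, 1+\epsilon]$ in the favorable direction, each update has little incentive to move the policy far from the one that generated the translations.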

Reward computation is the most crucial part of this entire process as it guides the NMT model during training. Previous work on RL-based NMT systems predominantly used BLEU as the reward function. However, BLEU has several limitations, as discussed in §2.2. To address these shortcomings, we leverage robust preference models during RL training, such as the reference-based COMET (Rei et al., 2022a) and the reference-free COMET-QE (Rei et al., 2020), as highlighted in Figure 1. Since learning these models is a complex task, we incorporate these pre-trained preference models, which have already been shown to correlate well with human judgments (Freitag et al., 2022b; Rei et al., 2022a; Rei et al., 2020), to ensure that RL systems can better capture the nuanced preferences of the user by receiving human-like feedback as rewards. These models assign numerical quality scores to each translation hypothesis based on their desirability, making them similar to utility functions. Our study aims to demonstrate that training with RL can generate higher-quality NMT models using neural metrics and investigate the competitiveness of COMET-QE as a reward model.

Another crucial decision was related to the exploitation vs. exploration problem of RL in the context of MT (Wu et al., 2018). The beam search algorithm generates more accurate translations by exploiting the probability distribution/policy of the NMT model, while sampling aims to explore more diverse candidates. During generation, we observed that sampling techniques generally led to candidates of lower quality when compared to beam search, according to the preference models used. Therefore, all RL-based models used beam search during their training and inference.

3.3 Decoding Phase

Reranking methods (Ng et al., 2019; Bhattacharyya et al., 2021; Fernandes et al., 2022; Eikema and Aziz, 2022) are an alternative to MAP-based decoding that presupposes access to $N$ candidate translations for each source sentence, generated by the NMT system through methods like beam search or sampling. The generated candidates are reranked according to their quality under a predetermined metric/reward model.

We employ two reranking methods to select a final translation: $N$-best reranking (Ng et al., 2019; Bhattacharyya et al., 2021) and minimum Bayes risk decoding (MBR).

$N$-best reranking (Eq. 3) employs a reference-free metric, $M_{\mathrm{QE}}$, to reorder a set of $N$ candidate translations, denoted as $\bar{\mathcal{Y}}$, and selects the candidate with the highest estimated quality score as the final translation, $\hat{y}_{\mathrm{RR}}$,

$$\hat{y}_{\mathrm{RR}}=\arg\max_{y\in\bar{\mathcal{Y}}}M_{\mathrm{QE}}(y). \tag{3}$$

Considering the previous equation, and assuming $C_{M_{\mathrm{QE}}}$ as the computational cost of evaluating a candidate translation with the QE metric $M_{\mathrm{QE}}$, we obtain the final computational cost of finding the best translation from $N$ candidate translations as $O(N\times C_{M_{\mathrm{QE}}})$.
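A minimal sketch of $N$-best reranking, assuming a generic scoring callable in place of a real QE model, makes the linear cost explicit: the metric is called exactly once per candidate. The candidates, the scores in `toy_scores`, and the helper names are all hypothetical.

```python
def n_best_rerank(candidates, qe_metric):
    """Return the candidate with the highest reference-free score (Eq. 3)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=qe_metric)

# Hypothetical QE scores; a real system would query a QE model instead.
toy_scores = {"das ist gut": 0.7, "das ist gute": 0.4, "es ist gut": 0.6}
calls = []

def toy_qe(candidate):
    calls.append(candidate)     # record each metric evaluation
    return toy_scores[candidate]

best = n_best_rerank(list(toy_scores), toy_qe)
print(best, len(calls))
```

With $N=3$ candidates the metric is evaluated three times, matching the $O(N\times C_{M_{\mathrm{QE}}})$ cost.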

MBR decoding, in contrast, relies on a reference-based metric and chooses the candidate that has the highest quality when compared to other possible translations (in expectation). We define $u(y^{*},y)$ as the utility function, quantifying the similarity between a hypothesis $y\in\mathcal{Y}$ and a reference $y^{*}\in\bar{\mathcal{Y}}$. In our context, the utility function is represented by either BLEU or COMET. Therefore, MBR decoding can be mathematically expressed as

$$\hat{y}_{\mathrm{MBR}}=\operatorname*{arg\,max}_{y\in\bar{\mathcal{Y}}}\ \underbrace{\mathbb{E}_{Y\sim p_{\theta}(y\mid x)}[u(Y,y)]}_{\approx\ \frac{1}{N}\sum_{j=1}^{N}u(y^{(j)},y)}, \tag{4}$$

where in Eq. 4 the expectation is approximated as a Monte Carlo sum using model samples $y^{(1)},\ldots,y^{(N)}\sim p_{\theta}(y|x)$. These samples may be obtained through biased sampling (e.g., nucleus or top-$k$ sampling) or beam search. Knowing that the utility function is a reference-based metric $M_{\mathrm{REF}}$ with computational cost $C_{M_{\mathrm{REF}}}$, and that to find the best translation we need to do pairwise comparisons between hypotheses, we obtain the final computational cost as $O(N^{2}\times C_{M_{\mathrm{REF}}})$. These reranking methods become particularly effective when $N$ is not excessively large, making the process computationally more manageable.
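The Monte Carlo approximation in Eq. 4 can be sketched as follows, using the candidate list itself as the set of pseudo-references. As a loudly stated assumption, the utility here is a toy unigram Jaccard overlap standing in for BLEU or COMET, and the candidates are invented for illustration.

```python
def jaccard(reference, hypothesis):
    """Toy utility: unigram Jaccard overlap (stand-in for BLEU/COMET)."""
    ref, hyp = set(reference.split()), set(hypothesis.split())
    union = ref | hyp
    return len(ref & hyp) / len(union) if union else 1.0

def mbr_decode(candidates, utility):
    """Pick the candidate with the highest average utility against all
    candidates, each of which is treated as a pseudo-reference (Eq. 4)."""
    def expected_utility(hyp):
        return sum(utility(ref, hyp) for ref in candidates) / len(candidates)
    return max(candidates, key=expected_utility)

candidates = ["the cat sat", "the cat sat down", "the cat", "a dog ran"]
best = mbr_decode(candidates, jaccard)
print(best)
```

Each of the $N$ hypotheses is scored against all $N$ pseudo-references, so the utility is called $N^{2}$ times (16 here), which is the quadratic cost noted above. The selected translation is the one most similar, on average, to the other candidates, and the outlier "a dog ran" is never chosen.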

Preference models capture the preferences of human evaluators and can be used during the decoding stage to influence MT systems, as shown in Figure 1. By doing this, the MT system will prioritize translations that are more aligned with human judgments, therefore reducing the chances of generating severely incorrect translations. We believe that incorporating preference models during the decoding stage can lead to even better translation quality, even if the underlying model has already been RL-trained using the same or a different preference model. The benefits we expect to see include improved fluency, adequacy, and consistency compared to the respective baselines since our preference models have been trained on annotations that aim to optimize these linguistic aspects.

4 Experiments

4.1 Setup

During the training phase, we investigate the advantages of RL training (with and without data filtering, §3.1) for enhancing the performance of NMT systems. We employ a T5 model (the T5-Large model available in Huggingface’s Transformers framework; Wolf et al., 2020), pre-trained on the C4 dataset (Raffel et al., 2019). First, we fine-tune the models using MLE training with Adam (Kingma and Ba, 2017) as the optimization algorithm, learning rate decay starting from $5\times 10^{-6}$ and early stopping. For RL training, our implementation relies on the Transformer Reinforcement Learning X framework (trlX; Castricato et al., 2023). We use PPO with the learning rate set to $2\times 10^{-5}$, $\gamma$ set to $0.99$, the trajectory limit set to $10{,}000$, and the beam search size set to $5$; mini-batch updates were conducted using stochastic gradient descent with a batch size of $32$, gathered over $4$ PPO epochs. In the inference phase, our emphasis shifts towards reranking techniques and their impact on the performance of NMT systems. As for the candidate generation method used, early experiments, omitted for brevity, show that the best configuration is to generate 100 candidates per source sentence and then use sampling with $p=0.6$ and $k=300$ to select the best translation. Consequently, the evaluation encompasses all the baseline and RL-trained models, both with and without $N$-best reranking and MBR decoding. These evaluations are conducted across the following datasets:

• The small IWSLT2017 datasets (Cettolo et al., 2012; Cettolo et al., 2017) for English to German (EN→DE) and English to French (EN→FR), featuring 215k and 242k training examples, respectively.

• The large WMT16 dataset (Bojar et al., 2016) for English to German (EN→DE) with 4.5M training examples.

• The large WMT15 dataset (Bojar et al., 2015) for English to French (EN→FR) with over 40M training examples.

We assess the performance of each NMT system using well-established evaluation metrics, which include BLEU, chrF (Popović, 2015), METEOR (Banerjee and Lavie, 2005), COMET, COMET-QE, and BLEURT. Additionally, for certain experiments executed on a single NVIDIA RTX A6000 GPU, we provide wall clock time measurements to offer insights into computational efficiency.

4.2 Finding the Optimal Quality Subset Size

In this section, we discuss our approach to quality-aware data filtering as a stabilizing strategy (§3.1) for the WMT datasets. Figure 2(a) summarizes our findings for the WMT16 EN→DE dataset (Bojar et al., 2016) on the influence of a high-quality subset on translation performance as we vary the subset size, based on various evaluation metrics and COMET-QE sentence filtering. Across all metrics, a consistent trend emerges: once the training size exceeds $500{,}000$ examples, there is a notable decline in performance. Notably, this decline is less prominent for lexical metrics, possibly due to their inherent limitations (Freitag et al., 2022b). A similar analysis for WMT15 EN→FR, shown in Figure 2(b), yields an optimal training size of $300{,}000$ examples.
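The selection procedure amounts to a sweep over candidate subset sizes, keeping the size whose fine-tuned model scores best on the development set. The sketch below shows only this selection logic. `pick_subset_size` is a hypothetical helper, and `toy_dev_score` is a synthetic placeholder, not a measured result; in practice it would fine-tune on the top-ranked pairs and return a development-set metric.

```python
def pick_subset_size(sizes, dev_score):
    """Return the subset size whose fine-tuned model scores best on dev.

    dev_score: callable mapping a subset size to a development-set score
               (assumed to train on the top-`size` pairs and evaluate).
    """
    return max(sizes, key=dev_score)

def toy_dev_score(size):
    # Synthetic placeholder that peaks at 500k; NOT a measured result.
    return -abs(size - 500_000)

best_size = pick_subset_size([100_000, 300_000, 500_000, 1_000_000],
                             toy_dev_score)
print(best_size)
```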

While the data filtering process has led to remarkable improvements in performance, it is important to note that the effectiveness of this process is dependent on the selected reranking metric. Using metrics that are not closely aligned with human judgments can result in poorly correlated and misaligned sentences, which can make the training process more unstable. Therefore, it is recommended to use robust QE models, such as COMET-QE. The more recent COMETKIWI (Rei et al., 2022b) model may offer even greater performance improvements.

4.3 Impact of Quality-aware Data Filtering

WMT16 EN→DE
	Training Data		Lexical Metrics			Neural Metrics		
Training	SL Data	RL Data	BLEU	ChrF	METEOR	COMET	COMET-QE	BLEURT
MLE	Original	-	35.04	61.30	61.91	84.40	39.50	74.70
MLE	Random	-	34.43	61.00	61.36	83.90	39.10	74.30
MLE	XLM-R	-	33.24	60.35	60.20	84.80	41.80	72.60
MLE	MUSE	-	35.10	61.90	62.20	85.10	40.40	74.30
MLE	COMET-QE	-	35.45	62.00	62.75	85.50	42.00	75.90
RL w/ BLEU	Original	Original	34.70	60.90	61.45	85.60	42.20	74.60
RL w/ BLEU	Random	Random	34.49	61.10	61.49	85.60	42.20	74.40
RL w/ BLEU	XLM-R	XLM-R	33.21	60.41	60.10	85.10	42.70	73.10
RL w/ BLEU	MUSE	MUSE	35.34	62.10	62.73	85.60	40.80	74.50
RL w/ BLEU	Original	COMET-QE	35.37	61.70	62.04	85.40	41.00	74.20
RL w/ BLEU	COMET-QE	COMET-QE	35.55	62.10	62.77	86.80	45.00	76.10
RL w/ COMET	Original	Original	35.05	61.30	61.82	85.60	41.80	74.40
RL w/ COMET	Random	Random	34.96	61.40	61.80	85.60	41.80	74.20
RL w/ COMET	XLM-R	XLM-R	33.60	60.74	60.40	85.00	42.00	72.90
RL w/ COMET	MUSE	MUSE	35.18	61.90	62.56	85.50	41.90	74.60
RL w/ COMET	Original	COMET-QE	35.58	61.80	62.20	85.70	41.70	74.50
RL w/ COMET	COMET-QE	COMET-QE	35.90	62.20	63.06	86.70	44.10	75.70
RL w/ COMET-QE	Original	Original	34.21	60.50	61.10	85.60	42.40	74.80
RL w/ COMET-QE	Random	Random	34.88	61.30	61.69	85.50	41.80	74.10
RL w/ COMET-QE	XLM-R	XLM-R	33.57	60.73	60.40	85.10	42.20	73.20
RL w/ COMET-QE	MUSE	MUSE	35.03	61.90	62.57	85.70	41.30	74.70
RL w/ COMET-QE	Original	COMET-QE	35.48	61.70	62.10	85.70	41.70	74.50
RL w/ COMET-QE	COMET-QE	COMET-QE	35.96	62.30	63.07	86.70	44.70	75.90

WMT15 EN→FR
	Training Data		Lexical Metrics			Neural Metrics		
Training	SL Data	RL Data	BLEU	ChrF	METEOR	COMET	COMET-QE	BLEURT
MLE	Original	-	31.49	57.18	55.80	78.60	5.30	66.20
MLE	Random	-	31.27	57.07	60.01	80.00	12.80	65.20
MLE	XLM-R	-	25.04	48.78	48.60	77.40	12.10	57.10
MLE	MUSE	-	35.49	59.10	60.55	80.10	13.10	67.50
MLE	COMET-QE	-	35.62	59.90	61.11	80.50	13.50	68.10
RL w/ BLEU	Original	Original	35.47	59.90	61.03	80.20	16.90	67.10
RL w/ BLEU	Random	Random	32.75	58.10	60.20	80.03	14.10	66.35
RL w/ BLEU	XLM-R	XLM-R	25.78	49.69	49.30	77.70	13.30	57.80
RL w/ BLEU	MUSE	MUSE	35.55	60.10	60.56	81.90	17.10	67.50
RL w/ BLEU	Original	COMET-QE	35.67	60.10	61.01	81.20	17.10	67.30
RL w/ BLEU	COMET-QE	COMET-QE	36.26	60.40	61.51	82.10	17.50	67.70
RL w/ COMET	Original	Original	35.50	59.90	61.00	80.40	16.80	67.00
RL w/ COMET	Random	Random	34.15	59.50	60.93	80.50	15.50	67.10
RL w/ COMET	XLM-R	XLM-R	25.08	48.84	48.60	77.50	12.40	57.20
RL w/ COMET	MUSE	MUSE	36.00	60.10	61.20	80.80	17.00	67.30
RL w/ COMET	Original	COMET-QE	35.98	60.00	61.09	81.80	17.10	67.20
RL w/ COMET	COMET-QE	COMET-QE	36.62	60.60	61.79	82.20	17.40	67.60
RL w/ COMET-QE	Original	Original	35.50	60.00	61.10	82.20	17.50	68.00
RL w/ COMET-QE	Random	Random	32.10	58.30	60.50	81.00	14.40	66.70
RL w/ COMET-QE	XLM-R	XLM-R	24.67	48.38	48.10	77.60	12.60	56.80
RL w/ COMET-QE	MUSE	MUSE	35.62	60.45	59.30	82.22	17.45	67.80
RL w/ COMET-QE	Original	COMET-QE	35.90	60.10	61.22	82.27	17.53	68.02
RL w/ COMET-QE	COMET-QE	COMET-QE	36.25	60.50	61.58	82.40	17.70	68.10

Table 1: Automatic evaluation metrics for the MLE and RL-trained models on the WMT16 EN→DE (top) and WMT15 EN→FR (bottom) original datasets, quality subsets obtained from COMET-QE, XLM-R and MUSE, and a randomly selected subset. The training data used for MLE and RL can be found in the SL Data and RL Data columns, respectively. We experimented with BLEU, COMET and COMET-QE as reward models for the RL training. The best overall values are bolded and the best for each specific group are underlined.

After obtaining the best configuration for our data filtering process, we experiment with the use of the curated high-quality training subset from COMET-QE and assess its impact on the MLE and RL training performance. We compare our filtering method with no filtering (using the original full training dataset), random filtering, and cross-lingual embedding similarity filtering using MUSE (Lample et al., 2017) and XLM-R (Conneau et al., 2019).

Table 1 provides a comprehensive overview of the experimental results using BLEU, COMET and COMET-QE as reward models. Both MT tasks demonstrate the same tendency when trained using MLE. The COMET-QE and MUSE high-quality subsets have sufficiently reduced noise to provide more stable training, as evidenced by the overall increase in performance across all metrics compared to the baseline training on the full original dataset. Moreover, a randomly selected subset fine-tuned with MLE performs worse or at most not significantly better than the baseline trained on the original dataset, as expected. Furthermore, in accordance with our expectations (Bane and Zaretskaya, 2021), XLM-R filtering does not improve training and is actually the worst-performing model.

Regarding RL-based training on both MT tasks, we observe that most RL-trained models outperform their MLE-trained baseline counterparts across various metrics. Notably, the best-performing models are the ones that were MLE fine-tuned and then RL-trained on the COMET-QE high-quality subset using both COMET and COMET-QE as reward models. On top of that, we can see that in some cases RL training alone does not yield significant improvements, but when combined with high-quality training subsets, it results in substantial enhancements and a competitive edge over the normal, random and XLM-R baselines. Additionally, we see impressive BLEU scores with RL training with COMET(-QE) as the reward model. This finding underscores that optimizing for COMET(-QE) yields superior BLEU scores compared to direct optimization for BLEU. This phenomenon is likely attributable to COMET(-QE) providing more effective reward signals during training, thus highlighting the limitations of BLEU.

The excellent performance gains with COMET-QE as a data filter and also as a reward model emphasize the potential of RL-based NMT models trained with a QE reward model (which does not require a corpus with references) to outperform other RL-trained models, offering promising opportunities for unsupervised NMT training with monolingual data, especially for low-resource languages, by eliminating the need for reference translations in evaluation and reward signal generation.

In conclusion, we highlight the importance of thoughtful data selection for achieving better translation quality, showing that COMET-QE can consistently outperform the remaining filtering methods. Furthermore, the top-performing models were RL-trained with neural metrics, showing once again that human-aligned preference models can consistently outperform simpler metrics, such as BLEU.

Model	WMT16 EN→DE						WMT15 EN→FR
Model	BLEU	METEOR	ChrF	COMET	COMET-QE	BLEURT	BLEU	METEOR	ChrF	COMET	COMET-QE	BLEURT
High-Quality Subset Baseline (HQSB)	35.45	62.00	62.75	85.50	42.00	75.90	35.62	59.90	61.11	80.50	13.50	68.10
BLEU
HQSB + RL	35.55	62.10	62.77	86.80	45.00	76.10	36.26	60.40	61.51	82.10	17.50	67.70
HQSB + MBR	35.53	62.30	62.80	86.70	44.20	75.90	35.73	60.40	61.42	81.60	15.60	67.20
HQSB + RL + MBR	35.22	61.90	62.62	86.20	43.10	75.50	36.72	60.80	61.89	82.00	16.30	67.20
COMET
HQSB + RL	35.90	62.20	63.06	86.70	44.10	75.70	36.62	60.60	61.79	82.20	17.40	67.60
HQSB + MBR	33.58	60.70	61.48	88.00	47.90	76.50	34.89	59.60	60.94	85.00	27.00	69.80
HQSB + RL + MBR	34.92	61.80	62.84	88.10	47.60	76.90	35.97	60.20	61.45	84.40	24.50	69.20
COMET-QE
HQSB + RL	35.96	62.30	63.07	86.70	44.70	75.90	36.25	60.50	61.58	82.40	17.70	68.10
HQSB + $N$-RR	31.46	58.70	60.41	87.10	53.80	75.90	29.99	54.80	56.87	82.80	39.10	66.20
HQSB + RL + $N$-RR	32.73	59.80	61.32	87.30	53.20	76.30	32.61	57.40	58.96	83.40	36.10	67.60
HQSB + $N$-RR + MBR w/ COMET	33.73	60.90	61.79	88.10	49.60	76.70	34.34	59.40	60.69	84.80	29.40	69.50
HQSB + RL + MBR w/ COMET	34.61	61.60	62.72	88.20	50.10	77.20	35.47	59.90	61.26	84.90	28.80	70.00

Model	IWSLT2017 EN→DE						IWSLT2017 EN→FR
Model	BLEU	METEOR	ChrF	COMET	COMET-QE	BLEURT	BLEU	METEOR	ChrF	COMET	COMET-QE	BLEURT
Normal Baseline (NB)	32.75	62.40	60.04	84.80	38.30	74.80	41.47	68.40	66.20	84.40	21.70	73.30
BLEU
NB + RL	34.48	62.90	60.51	85.20	39.70	74.40	44.58	68.60	66.76	85.20	24.70	72.70
NB + MBR	33.87	62.20	60.05	85.00	38.90	74.50	44.08	68.70	66.52	85.20	24.40	73.20
NB + RL + MBR	34.46	62.50	60.22	85.00	39.00	74.10	44.25	68.30	66.50	85.00	24.20	72.40
COMET
NB + RL	34.17	62.20	59.88	85.10	39.30	74.40	44.48	68.70	66.74	85.20	24.60	72.80
NB + MBR	33.33	62.10	59.97	86.70	43.80	75.60	39.04	65.30	63.32	86.80	37.40	75.00
NB + RL + MBR	33.75	61.90	59.72	86.10	41.80	74.90	44.24	68.50	66.62	86.30	28.30	73.60
COMET-QE
NB + RL	34.53	62.90	60.49	85.30	40.00	74.70	44.56	68.70	66.87	85.30	24.90	72.90
NB + $N$-RR	32.31	60.70	59.06	86.40	50.00	75.60	42.48	67.20	65.38	86.60	38.30	74.00
NB + RL + $N$-RR	32.98	61.50	59.48	86.40	48.70	75.40	43.29	67.50	65.90	86.50	36.00	73.70
NB + $N$-RR + MBR w/ COMET	33.53	61.90	59.95	86.70	46.00	75.80	39.41	65.40	63.42	87.00	40.00	75.30
NB + RL + MBR w/ COMET	34.18	62.50	60.27	86.60	43.50	75.40	44.07	68.20	66.55	86.70	32.50	74.00
