Sentence-level Aggregation of Lexical Metrics Correlate
Stronger with Human Judgements than Corpus-level Aggregation

Paulo Cavalin (IBM Research), Pedro H. Domingues (PUC-Rio), Claudio Pinhanez (IBM Research)

Abstract

In this paper we show that corpus-level aggregation considerably hinders the capability of lexical metrics to accurately evaluate machine translation (MT) systems. With empirical experiments we demonstrate that averaging individual segment-level scores can make metrics such as BLEU and chrF correlate much more strongly with human judgements and behave considerably more like neural metrics such as COMET and BLEURT. We show that this difference exists because corpus- and segment-level aggregation differ considerably, owing to the classical average-of-ratios versus ratio-of-averages mathematical problem. Moreover, as we also show, this difference considerably affects the statistical robustness of corpus-level aggregation. Considering that neural metrics currently cover only a small set of sufficiently-resourced languages, the results in this paper can help make the evaluation of MT systems for low-resource languages more trustworthy.

1 Introduction

We can currently group machine translation (MT) metrics into two main groups: lexical and neural metrics [3]. While lexical metrics such as the BLEU score [15] have considerably contributed to the progress in MT in the past 20 years, neural metrics emerged in the past few years as viable alternatives to overcome not only the shortcomings of lexical matches of n-grams, but also to leverage corpus-based training to improve MT evaluation [23, 26, 28].

Although it is quite clear that neural metrics are more robust and will eventually replace the lexical ones [13, 5, 4], in this paper we argue not only that lexical metrics are still very much needed but also that there is room to improve the robustness of such metrics. We say that they are needed because most progress with neural metrics is observed on the one hundred or so most-resourced languages in the world, while almost 7,000 languages still lack the minimum amount of data to train an MT model [12]. It is thus unrealistic to think that a neural metric will be applied in such scenarios in the near future. With this perspective in mind, we argue that there is a very under-explored and important component of lexical metrics, which is the aggregation method.

Lexical metrics such as BLEU and chrF usually rely on corpus-level aggregation (CLA) [15, 17], but one can easily rely on segment-level aggregation (SLA) instead. The main difference between CLA and SLA is that, in the former, we first compute n-gram matching statistics for all samples and then compute a global score for the entire test set, while in the latter we compute the statistics and the score for each sample individually and then use the mean of these scores to evaluate a test set. We notice that SLA is a very under-explored method for lexical metrics (see Appendix A for detailed statistics) and, to the best of our knowledge, there is no previous work demonstrating why corpus-level aggregation should be preferred beyond theoretical assumptions.

At a glance, it may seem that, even in the worst-case scenario, CLA and SLA present comparable results, so there would be no good reason to question the choice of aggregation method. But in this paper we show that, contrary to the common belief that CLA is better and should be the aggregation method of choice for lexical metrics, SLA is far more correlated with human judgements and with more robust neural metrics. To support this claim, we first show that there is a conceptual difference between these two aggregation methods, which we can mathematically relate to the average of ratios vs ratio of averages problem. Then, based on empirical experiments, we show not only that the choice of aggregation method significantly impacts the resulting system-level scores but also that CLA is not statistically robust.

In greater detail, in a first set of experiments we investigated whether the aforementioned mathematical differences between CLA and SLA statistically impact the scores provided by the metrics. For this, we considered 492 system outputs from the WMT23 metrics shared task [3], and computed system-level scores with these two aggregation methods, using both BLEU and chrF as base metrics, as well as scores given by the mean of bootstrap-resampled scores (BRS), as a reference point for a more statistically-robust approach. We then computed the pairwise Pearson correlation of the scores of each pair of metrics. The results clearly show not only that the scores from CLA and SLA differ considerably but also that the scores from SLA correlate considerably more strongly with those from BRS.

We then conducted a deeper investigation of the statistical robustness of the two aggregation methods, where we focused on evaluating the impact of the size of the test set on the correlation of scores. For that, we relied on downsampled test sets and computed the correlations of scores from downsampled versions of both CLA and SLA against each other, and against the statistically-robust BRS. The results corroborate our previous findings: SLA is not only more statistically robust than CLA but also as statistically robust as BRS, and can replace it as a computationally much cheaper alternative.

However, the most surprising and relevant result, in our opinion, is that CLA is not only statistically weaker than SLA but it actually lacks any statistical robustness. In other words, when compared to BRS, the correlations of CLA scores computed on larger test sets are quite close to those computed on a test set with only a single sample. That means that the corpus-level evaluation could be simply replaced by single-sample evaluations.

Finally, in order to make concrete what the previous mathematical and statistical differences between the aggregation methods actually mean for the quality of system-level scores, we computed correlations between the metrics and human judgements. For this, we considered human annotations from the WMT23 Metrics Shared Task, and we also included additional neural metrics, i.e. COMET, BLEURT, and BERTScore, to provide a better view of the impact of the aggregation method. Our results provide strong evidence that SLA scores correlate much more strongly with human judgements, and are much more comparable to the outcomes of BERTScore. Considering that BERTScore is the only neural metric among these three that does not take the input sentence into account to compute the score, which is compatible with the way lexical metrics work, we believe that our results show that the use of segment-level aggregation considerably reduces the gap between lexical and neural metrics.

2 Related work

The most well-known lexical metric for machine translation evaluation is the BLEU score, introduced more than two decades ago as a solution to make the development of MT systems more scalable [16]. The idea was to take advantage of a set of translations created by humans and to measure the discrepancy between the outputs generated by an MT system and the reference translations. With that approach, one could develop different systems and select the one that produced the highest BLEU score in a completely automated fashion.

BLEU works by computing a Precision-like metric on overlaps of n-grams between the MT outputs and the references. That is done by counting the number of n-grams generated in the MT outputs that also appear in the references. This computation is then heuristically refined to address issues such as wrongly-generated repetitions and very short texts, and to combine different n-gram levels. BLEU also inspired other popular lexical metrics such as chrF [18] and chrF++ [19], which play important roles in expanding current NLP efforts into low-resource languages.

Despite the wide adoption of BLEU for about two decades, several works have focused on exposing and overcoming its limitations [7, 13, 4, 3]. One issue already tackled by the community is the reliance on a large set of parameters and the lack of standardization and transparency in reporting results [20]. But another limitation, which is the reliance on lexical matches, resulted in the proposal of different alternative metrics, notably neural metrics such as COMET [24], BLEURT [26], and UniTE [27].

As outlined in the WMT22 shared task results [4], across diverse domains and tasks, neural-based metrics like MetricX XXL [9], COMET-22 [22], UniTE [27] and BLEURT-20 [25] consistently outperformed BLEU and other non-neural counterparts in capturing evaluation nuances. In the subsequent WMT23 shared task [3], the evaluation framework was enhanced, expanding the set of metrics and relying on a global score calculated through a weighted average across tasks. The results underscored that neural-based metrics align better with human judgments than non-neural ones do.

Nevertheless, it is worth highlighting that neural metrics come with additional cost. Some metrics such as UniTE and COMET compute scores by relying also on the input provided to generate the translation, which obviously limits the application to cases where both the source and the target languages were used to train the underlying model. Even the neural metrics that consider only the MT outputs and the reference to compute the scores are quite limited, since they are usually trained with just dozens of languages. That limits neural metrics to, at best, hundreds of languages, the high- and mid-resourced ones.

Considering the currently-limited application of neural metrics and the vast number of under-studied languages in the world, there is still a vast application field for metrics based on overlaps of n-grams such as BLEU and chrF. At the same time, we observe that the shortcomings of corpus-level aggregation are still poorly understood, especially considering that most of the recent metrics rely on averages of segment-level scores.

As we show in Appendix A, from 345 papers we inspected, only one relied on segment-level aggregation for the BLEU metric [2]. And we can find the use of segment-level aggregation for chrF in very few works, such as in the reports for the WMT Metrics shared task [5, 4, 3]. What seems to be missing is a work showing the impact of the choice of the aggregation method. This paper aims to bridge this gap.

3 Corpus- vs Segment-level Aggregation

In this section we describe in detail both the corpus-level aggregation (CLA) and segment-level aggregation (SLA) methods, and explain why the choice of one over the other is mathematically different and might affect the resulting scores. For the sake of simplicity, we will focus on the BLEU score, but in our empirical evaluations presented afterwards we demonstrate that our hypotheses are not limited to BLEU and are at least also applicable to chrF.

3.1 A case study with BLEU

Through a case study with BLEU, in this section we discuss why we expect differences in the results according to the aggregation method. For that we will rely on a simplified abstraction of BLEU, considering that this metric consists of computing a Precision-like score for the generated translation or for a set of generated translations [16].

Before explaining the aggregation method, it is worth explaining how BLEU is computed for a single sentence, i.e. the so-called sentence-level BLEU score. For this case, we basically compute the total of n-gram matches between the candidate sentence, i.e. the sentence generated by the ML model, and the references, representing the ground-truth translations generated by a person that is fluent enough in the target language. The matching is computed by evaluating the number of n-grams present in the candidate sentence that also appear in the references. After we computed that number of matches, we divide this number by the total of n-grams contained in the candidate output, to compute a Precision-ish score that represents the sentence-level evaluation score. Of course, since this is a simplified abstraction, we are not taking into account modified n-gram precision, clipping, combination of different n-gram levels, and brevity penalty, but this does not affect our rationale.
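The simplified sentence-level score described above can be sketched in a few lines of Python. This is only an illustration of the abstraction used in this section, not the SacreBLEU implementation: the function names `ngrams` and `simplified_precision`, the whitespace tokenization, and the example sentences are assumptions made for this sketch, and clipping, brevity penalty, smoothing, and the combination of n-gram orders are deliberately left out.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def simplified_precision(candidate, reference, n=1):
    """Precision-like score: fraction of candidate n-grams found in the reference.

    Simplified abstraction only: no clipping, no brevity penalty, no smoothing,
    and a single n-gram order. Returns (matches, total, score).
    """
    cand = ngrams(candidate.split(), n)          # naive whitespace tokenization
    ref = set(ngrams(reference.split(), n))
    matches = sum(1 for gram in cand if gram in ref)
    total = len(cand)
    score = matches / total if total else 0.0
    return matches, total, score


# Hypothetical candidate/reference pair, used only for illustration.
m, w, score = simplified_precision("the cat sat on the mat",
                                   "the cat is on the mat")
print(m, w, score)
```

The pair `(m, w)` returned here corresponds to the per-sentence match and n-gram counts that the two aggregation methods combine in different ways.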

When we expand sentence-level BLEU to evaluate the MT system on a corpus of text or on an entire test set, for instance, the main approach is to rely on the so-called corpus-level BLEU (check Section 2.2.1 in [16] for further details). The corpus-level aggregation (CLA) of BLEU consists of a global computation of n-gram matches and a single scoring for all samples in the test set, done at once. That is, all n-gram matches are counted and summed up, and this number is divided by the sum of the lengths of all candidate sentences.

Although corpus-level BLEU is the usual choice and the default option in tools such as SacreBLEU (https://github.com/mjpost/sacrebleu), it is quite easy to adopt averages of segment-level scores, or simply segment-level aggregation (SLA), to compute system-level scores [1, 14]. The implementation is straightforward: one simply needs to compute sentence-level BLEU scores for each individual test sample, and then use the mean of such scores as the final system-level score. This aggregation method presents some advantages, such as allowing the computation of statistics such as standard deviations, which is not possible with corpus-level aggregation. Notice that bootstrap resampling is a somewhat popular method to compute statistical significance tests on corpus-level scores [10, 8, 6] and can also be used to compute such statistics, but it is a more expensive approach in terms of computation requirements.
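As a minimal sketch of the two aggregation methods, assume that each test sample has already been reduced to a pair (matches, candidate n-grams), as in the simplified abstraction of BLEU used in this section. The function names `corpus_level` and `segment_level` and the three example pairs are hypothetical; with SacreBLEU one would call `corpus_bleu` and `sentence_bleu` instead.

```python
from statistics import mean, stdev


def corpus_level(stats):
    """CLA: sum all matches and all n-gram counts, then take a single ratio."""
    total_matches = sum(m for m, _ in stats)
    total_ngrams = sum(w for _, w in stats)
    return total_matches / total_ngrams


def segment_level(stats):
    """SLA: score each segment on its own, then average the scores.

    Also returns the sample standard deviation, which is directly
    available because per-segment scores are kept.
    """
    scores = [m / w for m, w in stats]
    return mean(scores), stdev(scores)


# Hypothetical (matches, candidate n-grams) pairs for three segments.
stats = [(1, 2), (9, 10), (3, 6)]

cla = corpus_level(stats)
sla, sla_std = segment_level(stats)
print(f"CLA={cla:.3f}  SLA={sla:.3f} (std={sla_std:.3f})")
```

Keeping the per-segment scores is what makes the standard deviation available at no extra cost in the segment-level case.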

At first sight, it is reasonable to believe that CLA and SLA provide the same results, so that the main advantage of the latter would only be the less expensive way to compute statistics. But there is a conceptual mathematical difference between these two aggregation methods, which we explain next.

3.2 CLA and SLA as differently weighted ratio averages

We dive now into a mathematical explanation of the difference between corpus-level and segment-level aggregations, to demonstrate why these two methods may present differences in the results. We argue that the main difference between the two aggregation methods can be seen as a classical case of ratio of averages vs average of ratios.

To understand this, let us adopt a simplified definition of BLEU as the ratio of the number of matches $m$ to the number of n-grams $w$ in a corpus, as in the previous section. And let us also refer to the corpus-level and segment-level BLEU as simply BLEU and m-BLEU, respectively. Considering all the $n$ sentences $i$ of the corpus, BLEU can be computed as the ratio between the sum of the partial matches $m_{i}$ of each sentence $i$ and the sum of the n-gram counts $w_{i}$ of all sentences $i$:

$$\mbox{BLEU}=\frac{m}{w}=\frac{\sum_{i=1}^{n}m_{i}}{\sum_{i=1}^{n}w_{i}} \tag{1}$$

Accordingly, m-BLEU is the average of the ratios between the number of matches $m_{i}$ and the number of n-grams $w_{i}$ of each sentence $i$:

$$\mbox{m-BLEU}=\frac{1}{n}\sum_{i=1}^{n}\frac{m_{i}}{w_{i}}=\sum_{i=1}^{n}\left(\frac{1}{n}\right)\frac{m_{i}}{w_{i}} \tag{2}$$

It is easy to derive that BLEU is the weighted average of the sentence ratios by the proportional length of each sentence $i$ :

$$\mbox{BLEU}=\frac{\sum_{i=1}^{n}m_{i}}{\sum_{i=1}^{n}w_{i}}=\sum_{i=1}^{n}\frac{m_{i}}{w}=\sum_{i=1}^{n}\frac{w_{i}}{w_{i}}\frac{m_{i}}{w}=\sum_{i=1}^{n}\left(\frac{w_{i}}{w}\right)\frac{m_{i}}{w_{i}} \tag{3}$$

As we see, m-BLEU weights the ratios equally, with weights $1/n$, while BLEU weights each ratio with the value $w_{i}/w$, which is proportional to the length of sentence $i$. Therefore, BLEU results are likely to be biased by the relation between matches and candidate sentence lengths, while m-BLEU considers the performance on each sentence independently of its length.
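To make the difference concrete, consider a hypothetical corpus with $n=2$ sentences (the numbers are chosen only for illustration): a short one with $m_{1}=1$, $w_{1}=2$ and a long one with $m_{2}=9$, $w_{2}=10$, so that $w=12$. Under the simplified definitions above:

$$\mbox{BLEU}=\frac{1+9}{2+10}=\frac{10}{12}\approx 0.833,\qquad \mbox{m-BLEU}=\frac{1}{2}\left(\frac{1}{2}+\frac{9}{10}\right)=0.700$$

$$\mbox{BLEU}=\left(\frac{2}{12}\right)\frac{1}{2}+\left(\frac{10}{12}\right)\frac{9}{10}=\frac{1}{12}+\frac{9}{12}=\frac{10}{12}$$

The second line shows the length-proportional weights $w_{i}/w$ at work: the long sentence receives five times the weight of the short one, which pulls the corpus-level score towards its ratio of 0.9, whereas m-BLEU gives both sentences the same weight of $1/2$.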

4 Empirical evaluation

In this section we present experiments aiming at investigating whether the choice of aggregation method actually impacts the score provided by a lexical metric. For that, we consider three different implementations of BLEU and chrF and, considering the outputs from 492 different systems, we present a detailed analysis on the distribution of scores provided by these metrics.

4.1 The dataset

For this investigation we rely on the WMT 2023 Metrics Shared Task dataset (https://wmt-metrics-task.github.io/), or simply the WMT23 dataset, comprising the results of 492 different MT systems, involving different languages and domains. This dataset contains 468,850 system outputs with the corresponding inputs and references, in a quite diverse setting, containing 147 different language pairs that cover 48 different source languages and 44 target languages.

We converted the 468,850 raw entries to 492 system evaluations by grouping the data by dataset type (challengesets2023 or generaltest2023), dataset (challenge_ACES, challenge_DFKI, challenge_NRC-MSLC23, or generaltest2023), language pair (147 options), and system (two systems for the challengesets2023 dataset type and 14 systems for generaltest2023). Those are all inner structures of the dataset, and the aggregated data will be made publicly available (anonymous link to dataframe).

4.2 Three implementations of BLEU

We considered three different implementations for BLEU. Two of them are based on the corpus- (CLA) and segment-level aggregation (SLA), and the third relies on bootstrap-resampled scores (BRS) for providing more robust statistical estimates, so that we can use this approach as a reference point for statistically-reliable scores.

Accordingly, the first implementation is referred to as simply BLEU, consisting of the traditional BLEU score with CLA, here computed with the SacreBLEU tool with default parameters [21] (SacreBLEU signature: nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0). For this metric, given a test set with the MT outputs and the corresponding references, we take all MT outputs and references at once, in a single list, and compute the score with the corpus_bleu function in Python code.

The second metric, referred to as m-BLEU, implements SLA. For this we simply compute the score of each segment (i.e. a test sample) with SacreBLEU’s sentence_bleu function, again with default parameters, and calculate the overall average to provide the system-level score.

The third metric, named x-BLEU, consists of implementing the BRS method, as in [14, 11]. This is an alternative for computing system-level scores with higher statistical robustness, where the resamplings represent varied rearrangements of the test set on which corpus-level BLEU is computed. We rely on 1,000 resamplings, with replacement, of 1,000 samples for each system (notice that 1,000 samples is very close to the mean number of samples per system in the WMT23 dataset, which is 952.94), and we apply the same SacreBLEU corpus_bleu function, with default parameters, on each resampled set, generating 1,000 scores for each system. We then provide the average of those 1,000 scores as the system-level score.
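The bootstrap-resampled score can be sketched as follows. This is a simplified illustration under stated assumptions rather than the exact pipeline used in the experiments: each segment is represented by a hypothetical pair (matches, candidate n-grams), the corpus-level score is the simplified ratio from Section 3 instead of SacreBLEU's `corpus_bleu`, and the names `corpus_score` and `bootstrap_mean_score` are introduced only for this example.

```python
import random


def corpus_score(stats):
    """Simplified corpus-level score: total matches over total n-grams."""
    return sum(m for m, _ in stats) / sum(w for _, w in stats)


def bootstrap_mean_score(stats, n_resamples=1000, sample_size=1000, seed=0):
    """Mean corpus-level score over bootstrap resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    scores = []
    for _ in range(n_resamples):
        resample = rng.choices(stats, k=sample_size)  # sampling with replacement
        scores.append(corpus_score(resample))
    return sum(scores) / len(scores)


# Hypothetical (matches, candidate n-grams) pairs for one system.
stats = [(1, 2), (9, 10), (3, 6)]
x_score = bootstrap_mean_score(stats, n_resamples=200, sample_size=50)
print(round(x_score, 3))
```

Each resample requires a full corpus-level computation, which is why this approach costs roughly `n_resamples` times as much as a single corpus-level or segment-level evaluation.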

4.3 Three implementations of chrF

In order to investigate whether our observations also generalize to other lexical metrics, we also explored three different implementations of chrF. Notice that chrF is quite similar to BLEU, except that overlaps of character n-grams are computed instead of overlaps of words, and the final score is based on an F-score instead of the precision-like score used by BLEU. More importantly, the same variety of aggregation methods can be used with chrF, and they can be put into practice with the corpus_chrf and sentence_chrf functions from SacreBLEU.

Consequently, the three different chrF implementations that we consider in this work are: chrF, the CLA version computed with SacreBLEU's corpus_chrf function; m-chrF, the SLA implementation relying on the average of segment-level scores computed with sentence_chrf; and x-chrF, the BRS metric computed with averages of corpus_chrf on 1,000 resamplings with 1,000 samples. Notice that we always rely on SacreBLEU's default parameters for such functions.

4.4 CLA and SLA correlate weakly to each other, and SLA correlates strongly to BRS

Our first evaluation focused on investigating the distribution of scores provided by the three different aggregation methods, how these distributions correlate to each other, and how they correlate to more statistically-robust scores. We focus first on analysing BLEU, and then show results with chrF to understand the impact of the base metric.

Figure 1: Correlation plots of the different BLEU metrics to each other and, in the diagonal, the distribution of the scores of the 3 metrics.

Figure 1, displaying a grid of 9 plots, presents the main results of this analysis. Notice that in the diagonal we show the histogram of the distribution of scores for each metric, considering the scores for the 492 systems, and in the upper and lower non-diagonal cells we present the pairwise scatter plots and Pearson correlations, in the -1 to 1 range, considering the metrics scores. The main intuition of computing Pearson correlation is to assess the linear correlation between the metrics. That is, a highly-positive correlation, i.e. a value close to 1.0, indicates that high scores in one metric correspond to high scores in another metric, and vice-versa. This correlation helps determine whether the metrics are aligned in capturing what are the good and bad translation results and what is in-between.
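For reference, the Pearson correlation used throughout this section can be computed as in the sketch below. The function name `pearson` and the example score lists are hypothetical and serve only to illustrate the computation; they are not taken from the experiments.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical system-level scores from two metrics for five systems.
metric_a = [0.21, 0.35, 0.40, 0.52, 0.60]
metric_b = [0.30, 0.41, 0.47, 0.55, 0.66]
print(round(pearson(metric_a, metric_b), 3))
```

Because the coefficient is invariant to shifting and positively rescaling either list, two metrics can have different score ranges and still correlate perfectly, as long as they order and space the systems in the same way.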

From both the distributions and the correlation values, we can clearly see that x-BLEU and m-BLEU tend to produce results much closer to one another than to BLEU. We can see that the distribution of BLEU is right-skewed, while the other distributions are more centralized. In detail, the means of the distributions are 0.29, 0.43, and 0.39 for BLEU, x-BLEU, and m-BLEU, respectively; the correlations of BLEU to x-BLEU and m-BLEU are 0.53 and 0.48, respectively; and x-BLEU and m-BLEU present a Pearson correlation of 0.95.

The results of repeating the experiments with chrF as the base metric are presented in Figure 2, and we observe quite similar outcomes. That is, chrF correlates weakly with the other metrics and m-chrF correlates strongly with x-chrF. However, we observe that, unlike the BLEU metrics, the distributions of chrF, x-chrF, and m-chrF can all be seen as centralized, presenting means of 0.54, 0.57, and 0.58, respectively. In terms of Pearson correlation, x-chrF and m-chrF present an almost perfect correlation to each other, of 0.99, while chrF correlates weakly to the other methods, with correlations of 0.56 and 0.55.

What is very noticeable from these experiments is that SLA presents a considerably higher correlation to BRS than CLA does, which is strong evidence that SLA is not only as capable as BRS of computing statistically-robust scores but also a more cost-effective option. Moreover, it also indicates that CLA is a considerably less statistically-robust method.

4.5 Aggregation methods present strong cross-metric correlation

We now focus on investigating the correlations between the different aggregation methods of the different BLEU and chrF implementations. The experiments are analogous to those presented in the previous sections, so in Figure 3 we present the scatter plot and the Pearson correlation of each pair of similar implementations, i.e. BLEU vs chrF, x-BLEU vs x-chrF, and m-BLEU vs m-chrF.

Interestingly, these results indicate that similar aggregation methods correlate strongly with each other. Although the correlation of BLEU to chrF is the weakest, at 0.77, it is above the 0.50 to 0.55 correlation values that such metrics presented to the non corpus-level ones. And the non corpus-level metrics correlate quite strongly to each other, since x-BLEU vs x-chrF present a correlation of 0.88, and m-BLEU vs m-chrF of 0.93. These results indicate that, as we mentioned in Section 3.2, corpus-level aggregation might be biased by the ratio between n-gram matches and sentence lengths, which can explain why BLEU and chrF correlate strongly to one another and weakly to the other methods.

In order to gather additional evidence, in Figure 4 we present a cross-metric comparison between m-BLEU and m-chrF, against BLEU, chrF, x-BLEU, and x-chrF, respectively. Again, a weak correlation of BLEU and chrF to the non corpus-level metrics is seen, given that BLEU correlates weakly to x-chrF and m-chrF, and vice-versa. Moreover, m-BLEU and x-BLEU also correlate strongly to x-chrF and m-chrF, respectively, indicating that the corpus-level aggregation introduces a bias that can hinder the statistical robustness of lexical metrics.

4.6 Characterizing the statistical robustness of CLA and SLA

Given that one main outcome from the previous section is that corpus-level aggregation (CLA) metrics correlate weakly to their more statistically robust counterparts such as bootstrap-resampled scores (BRS), and that the segment-level aggregated (SLA) metrics correlate strongly to BRS, in this section we focus on investigating the statistical robustness of the aggregation methods.

Our methodology consists of using BRS as an upper bound on the statistical robustness of system-level scores, since it relies on the well-known bootstrap resampling method. We evaluate not only the correlation between the distributions of BRS-based scores and downsampled versions of both CLA- and SLA-based scores, but also the correlation of these downsampled metrics to each other. With this approach we believe we can understand the sensitivity of the aggregation methods to the number of samples and how their statistical robustness is affected by it.

In greater detail, we created downsampled test sets for each of the 492 datasets, considering three different sizes, i.e. $N\in\{1,10,100\}$, and computed scores with both CLA- and SLA-based metrics on each of these downsampled test sets. Then, we computed the Pearson correlation between the scores of these downsampled versions of each aggregation method, and between these scores and the more statistically-robust ones computed with BRS. This process was repeated 1,000 times for a better statistical estimate. The distributions of correlations, for both BLEU- and chrF-related evaluations, are presented in Figure 5.
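The downsampling procedure can be sketched as below, under several assumptions that are not part of the original experiments: the data is synthetic and generated at random, each segment is represented by a hypothetical pair (matches, candidate n-grams), scores follow the simplified ratio of Section 3 rather than SacreBLEU, and the names `cla`, `sla`, `pearson`, and `downsampled_scores` are introduced only for this example. Correlations against BRS would be obtained in the same way, by passing a list of bootstrap-resampled scores to `pearson`.

```python
import random
from math import sqrt


def cla(stats):
    """Corpus-level aggregation: ratio of summed matches to summed n-grams."""
    return sum(m for m, _ in stats) / sum(w for _, w in stats)


def sla(stats):
    """Segment-level aggregation: mean of per-segment ratios."""
    return sum(m / w for m, w in stats) / len(stats)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def downsampled_scores(systems, n, rng):
    """Draw n segments per system (without replacement) and score both ways."""
    subsets = [rng.sample(stats, n) for stats in systems]
    return [cla(s) for s in subsets], [sla(s) for s in subsets]


# Synthetic data: 20 hypothetical systems with 200 segments each.
data_rng = random.Random(0)
systems = []
for _ in range(20):
    lengths = [data_rng.randint(1, 30) for _ in range(200)]
    systems.append([(data_rng.randint(0, w), w) for w in lengths])

rng = random.Random(1)
for n in (1, 10, 100):
    # A few repetitions only; the experiments above use 1,000.
    corrs = [pearson(*downsampled_scores(systems, n, rng)) for _ in range(20)]
    print(n, round(sum(corrs) / len(corrs), 3))
```

With a single segment per system, both aggregation methods reduce to the same ratio, so their correlation is 1 by construction; the two methods can only diverge once several segments of different lengths are aggregated. The values printed for this synthetic data are not expected to reproduce the correlations reported in this section.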

The results show that CLA-based metrics are statistically weak. That is, as we can see in both the top and bottom left-most plots, the downsampled CLA- and SLA-based metrics correlate strongly only in the case of datasets with a single sample, i.e. $N=1$, with correlation values approaching 1.0. Nevertheless, as the number of samples increases, their correlation drops quite drastically, to about 0.7 with $N=10$ and about 0.6 with $N=100$. That means that, statistically speaking, we can claim that those metrics differ considerably from each other.

We can go further and dare to claim that evaluating an entire test set with CLA is similar to evaluating the same system with just a single sample. In the central plots of Figure 5, we can see that the correlation of the downsampled CLA scores to the more statistically-robust BRS-based scores does not increase, or increases only negligibly, with the number of samples. On the other hand, as we can observe in the right-most plots of the same figure, the downsampled SLA-based metrics correlate weakly with the bootstrapped scores with $N=1$, but the correlation increases significantly with $N=10$ and $N=100$, showing that this aggregation method does not suffer from the same lack of statistical robustness.

5 Impact of the aggregation method

In order to make concrete the actual impact of the aggregation method on system-level scores, in this section we compare the scores of the metrics described in Section 4 against ground-truth data from human judgements. For that we rely on three language pairs annotated with Multidimensional Quality Metrics (MQM), used to benchmark metrics for the WMT23 Metrics Shared Task [3]. The human scores are computed as the weighted average of the multiple dimensions. To present a more extensive evaluation, we also consider the eight language pairs annotated with Direct Assessment (DA) scores (we relied on the DA assessments available in the wmt23 folder of the data downloaded with the code in https://github.com/google-research/mt-metrics-eval).

Evaluating MQM and DA data separately, we compute the mean, over language pairs, of the Pearson correlation between the human system scores and each implementation of BLEU and chrF described in Section 4. Again, we consider plain Pearson correlation values, ranging from -1 to 1. For comparison purposes, we also include three neural metrics, to provide a better reference point regarding the impact of the aggregation method compared with these more robust metrics: COMET (Unbabel/wmt22-cometkiwi-da base model; https://github.com/Unbabel/COMET), BLEURT (BLEURT-20 base model; https://github.com/google-research/bleurt), and BERTScore (bert-base-multilingual-cased base model; https://github.com/Tiiiger/bert_score).

The results are presented in Table 1, where we list the rankings of the metrics according to their correlation to the human-annotated data. Interestingly, the results show that the SLA-based methods, i.e. m-BLEU and m-chrF, not only considerably outperform BLEU and chrF, the CLA-based ones, but also provide correlations that are much closer to those of the neural metrics. We can also see that, with chrF and MQM annotations, we can improve from a moderate-to-low correlation of 0.392 to a strong correlation of 0.806, which is just 0.049 correlation points below that of BERTScore and just 0.066 below that of BLEURT. With the DA annotations, we can observe that m-chrF performs even closer to the neural metrics. Notice that the SLA-based metrics also outperform their BRS counterparts, i.e. x-BLEU and x-chrF, demonstrating again the statistical robustness of the SLA method.

Table 1: Rankings of Pearson correlations from metric scores to human judgements

rank	MQM metric	MQM corr.	DA metric	DA corr.
1	COMET	0.923	COMET	0.926
2	BLEURT	0.872	BLEURT	0.917
3	BERTScore	0.855	BERTScore	0.821
4	m-chrF	0.806	m-chrF	0.802
5	x-chrF	0.804	x-chrF	0.793
6	m-BLEU	0.776	m-BLEU	0.729
7	x-BLEU	0.762	x-BLEU	0.456
8	BLEU	0.425	chrF	0.285
9	chrF	0.392	BLEU	-0.006

6 Final discussion

In this work we presented a comparison of the traditional corpus-level aggregation against the less popular method based on averaging individual segment-level scores, showing that the latter can generate system evaluation scores that correlate more strongly with human judgements and with neural metrics. We demonstrate that this difference happens because of a fundamental mathematical difference: CLA metrics are biased towards the performance on long sentences, which considerably reduces the capability of lexical metrics to correlate with human judgements when corpus-level aggregation is used. The main issue we observed is that corpus-level aggregation voids the statistical robustness of a test set-based evaluation, providing scores that are comparable to evaluations with a single sample.

One main outcome of this paper should be regarded as a clear recommendation to MT researchers: stop using corpus-level aggregation. As we show, segment-level aggregation is not only better than corpus-level aggregation in terms of correlation to human judgments, but also comparable to, if not better than, robust statistical evaluations based on bootstrap resampling. Similarly, MT researchers should use segment-level scores for statistical evaluation instead of the expensive computations of bootstrap resampling, although one could also bootstrap resample segment-level scores to get even more robust estimates.

Finally, we would like to draw attention to the vast application field that lexical metrics still have, since neural metrics cover only about a hundred or so languages. Moreover, it is important to understand that some of the bad reputation that lexical metrics have received in the past years might be undeserved, owing to the wide, but wrong, adoption of corpus-level aggregation.

7 Limitations

One clear limitation of this work is its reliance on a single data source, the WMT23 Shared Task dataset. Nevertheless, since this dataset is very recent, considerably large, and comes from a very well-known workshop on machine translation, we believe that it is strong enough to provide the evidence used in our empirical evaluations.

Another limitation is the reliance on a single tool to compute BLEU and chrF, namely SacreBLEU, even though other implementations are available. Again, since this tool can be viewed as a de facto standard BLEU implementation [21], we believe that it is strong enough to support our assumptions in the empirical evaluation.

Additionally, we have not thoroughly evaluated the metrics, in the sense of changing parameters such as the maximum n-gram length and so forth. We used SacreBLEU’s default parameters only. But again, given the level of impact that changing the aggregation method has, as we show, we do not believe that simply changing the base metrics’ parameters would significantly affect the outcomes of this paper.

8 Ethical considerations

We are not aware of any ethical issue that this paper might raise. All of the data used for this research are publicly available, and the outcomes of this paper are likely to contribute to improving the quality of research across the whole machine translation community.

References

[1] Emanuele Bugliarello, Sabrina J. Mielke, Antonios Anastasopoulos, Ryan Cotterell, and Naoaki Okazaki. It’s easier to translate out of English than into it: Measuring neural translation difficulty by cross-mutual information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1640–1649, Online, July 2020. Association for Computational Linguistics.
[2] Mingda Chen, Paul-Ambroise Duquenne, Pierre Andrews, Justine Kao, Alexandre Mourachko, Holger Schwenk, and Marta R. Costa-jussà. BLASER: A text-free speech-to-speech translation evaluation metric. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9064–9079, Toronto, Canada, July 2023. Association for Computational Linguistics.
[3] Markus Freitag, Nitika Mathur, Chi-kiu Lo, Eleftherios Avramidis, Ricardo Rei, Brian Thompson, Tom Kocmi, Frederic Blain, Daniel Deutsch, Craig Stewart, Chrysoula Zerva, Sheila Castilho, Alon Lavie, and George Foster. Results of WMT23 metrics shared task: Metrics might be guilty but references are not innocent. In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christof Monz, editors, Proceedings of the Eighth Conference on Machine Translation, pages 578–628, Singapore, December 2023. Association for Computational Linguistics.
[4] Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, Eleftherios Avramidis, Tom Kocmi, George Foster, Alon Lavie, and André F. T. Martins. Results of WMT22 metrics shared task: Stop using BLEU – neural metrics are better and more robust. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 46–68, Abu Dhabi, United Arab Emirates (Hybrid), December 2022. Association for Computational Linguistics.
[5] Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, George Foster, Alon Lavie, and Ondřej Bojar. Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain. In Proceedings of the Sixth Conference on Machine Translation, pages 733–774, Online, November 2021. Association for Computational Linguistics.
[6] Dennis Fucci, Marco Gaido, Sara Papi, Mauro Cettolo, Matteo Negri, and Luisa Bentivogli. Integrating language models into direct speech translation: An inference-time solution to control gender inflection. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 11505–11517, Singapore, December 2023. Association for Computational Linguistics.
[7] Yvette Graham and Timothy Baldwin. Testing for significance of increased correlation with human judgment. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 172–176, Doha, Qatar, October 2014. Association for Computational Linguistics.
[8] Josef Jon and Ondřej Bojar. Breeding machine translations: Evolutionary approach to survive and thrive in the world of automated evaluation. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2191–2212, Toronto, Canada, July 2023. Association for Computational Linguistics.
[9] Juraj Juraska, Mara Finkelstein, Daniel Deutsch, Aditya Siddhant, Mehdi Mirzazadeh, and Markus Freitag. MetricX-23: The Google Submission to the WMT 2023 Metrics Shared Task. In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christof Monz, editors, Proceedings of the Eighth Conference on Machine Translation, pages 756–767, Singapore, December 2023. Association for Computational Linguistics.
[10] Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395, Barcelona, Spain, July 2004. Association for Computational Linguistics.
[11] Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, and Jie Zhou. Scheduled sampling based on decoding steps for neural machine translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3285–3296, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
[12] Michela Lorandi and Anya Belz. High-quality data-to-text generation for severely under-resourced languages with out-of-the-box large language models. In Yvette Graham and Matthew Purver, editors, Findings of the Association for Computational Linguistics: EACL 2024, pages 1451–1461, St. Julian’s, Malta, March 2024. Association for Computational Linguistics.
[13] Nitika Mathur, Timothy Baldwin, and Trevor Cohn. Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4984–4997, Online, July 2020. Association for Computational Linguistics.
[14] Xing Niu, Prashant Mathur, Georgiana Dinu, and Yaser Al-Onaizan. Evaluating robustness to input perturbations for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8538–8544, Online, July 2020. Association for Computational Linguistics.
[15] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics.
[16] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics.
[17] Maja Popović. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal, September 2015. Association for Computational Linguistics.
[18] Maja Popović. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal, September 2015. Association for Computational Linguistics.
[19] Maja Popović. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.
[20] Matt Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium, October 2018. Association for Computational Linguistics.
[21] Matt Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium, October 2018. Association for Computational Linguistics.
[22] Ricardo Rei, Ana C Farinha, José G.C. de Souza, Pedro G. Ramos, André F.T. Martins, Luisa Coheur, and Alon Lavie. Searching for COMETINHO: The little metric that could. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 61–70, Ghent, Belgium, June 2022. European Association for Machine Translation.
[23] Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. COMET: A neural framework for MT evaluation. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online, November 2020. Association for Computational Linguistics.
[24] Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online, November 2020. Association for Computational Linguistics.
[25] Thibault Sellam, Dipanjan Das, and Ankur Parikh. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online, July 2020. Association for Computational Linguistics.
[26] Thibault Sellam, Dipanjan Das, and Ankur P Parikh. BLEURT: Learning robust metrics for text generation. In Proceedings of ACL, 2020.
[27] Yu Wan, Dayiheng Liu, Baosong Yang, Haibo Zhang, Boxing Chen, Derek Wong, and Lidia Chao. UniTE: Unified translation evaluation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8117–8127, Dublin, Ireland, May 2022. Association for Computational Linguistics.
[28] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations, 2020.

Appendix A A comprehensive evaluation of the use of lexical metrics in recent machine translation papers

Table 2: Metrics used in 345 recent MT papers (a single paper may use more than one metric).

Metric	Total	%
BLEU	341	98.8
COMET	36	10.4
chrF	30	8.4
METEOR	17	4.9
BERTScore	15	4.3
TER	15	4.3
BLEURT	10	2.9
BLONDE	3	0.8
UniTE	3	0.8
Others	17	4.9

Table 3: Distribution of BLEU implementations in 341 recent MT papers which used BLEU as a metric.

Metric	Total	%
BLEU (inferred)	210	60.1
Unclear	122	36.2
BLEU (explicitly)	6	1.7
x-BLEU	3	0.9
m-BLEU	1	0.3

In this section we present an evaluation of the usage of lexical metrics in MT papers. To demonstrate the impact of our results, we analysed MT-related papers published in recent editions of ACL and EMNLP, two of the major venues for MT research, and catalogued the ways in which the evaluation of MT systems is reported.

We inspected 345 papers which contained evaluations of MT systems published in the past four editions of ACL and EMNLP conferences, i.e. from 2020 to 2023. We selected those papers both by searching for terms such as Translation, Machine Translation, MT, and BLEU in the title, and by a further fine-grained manual inspection. We considered all papers that conducted evaluations of systems for MT-related tasks, an average of 43.12 per conference edition.

The distribution of metrics that appear in those papers is presented in Table 2. Only 4 out of 345 papers did not report the use of BLEU, meaning that BLEU was employed by 98.8% of the papers we inspected. Also, the adoption of neural metrics is still low: COMET is the most widely used neural metric, but appears in only 10.4% of the papers. Notice that all neural metrics combined are reported in less than 1/4 of the papers, i.e. about 80 papers, even when counting all 17 metrics with fewer than 3 mentions that are grouped in the Others row.

Given the dominance of the BLEU score in these papers, we then studied which aggregation methods are used for this specific lexical metric, and report the results in Table 3. Only 10 papers explicitly mention the aggregation method used for BLEU: 6 of them explicitly state that they use corpus-level aggregation (simply BLEU in the table), 3 state that they use averages of bootstrapped resamplings (x-BLEU), and only a single paper relies on averages of segment-level BLEU scores (m-BLEU). For the remaining 335 papers, we inferred the aggregation method by looking for references to specific tools they used, such as SacreBLEU [21] and Moses. We observed that 210 papers, 60.1%, contained at least some minimal reference to such tools, with 183 referring to SacreBLEU and 36 to Moses. Considering that both tools implement corpus-level aggregation by default, we assumed that most of these papers relied on corpus-level BLEU, so there are likely 216 papers (63%) which used corpus-level aggregation for BLEU.

Considering that it is likely that about 2/3 of these papers relied on corpus-level BLEU scores, one first remarkable conclusion is that the results of those papers should be interpreted with care, given the low statistical robustness of corpus-level aggregation. However, we noticed that there are at least 40 of those papers (about 20%) that rely on bootstrap resampling to compute significance tests for system comparisons [10], which can make the statistical evaluation more robust, even though in the end they reported plain corpus-level BLEU scores [8, 6].
