GPT-4 vs. Human Translators: A Comprehensive Evaluation of Translation Quality Across Languages, Domains, and Expertise Levels

Jianhao Yan¹,² Pingchuan Yan³∗ Yulong Chen⁴∗
Judy Li⁵ Xianchao Zhu⁵ Yue Zhang²,⁶

¹ Zhejiang University ² School of Engineering, Westlake University

³ University College London ⁴ University of Cambridge ⁵ Lan-Bridge Group

⁶ Institute of Advanced Technology, Westlake Institute for Advanced Study

[email protected]
∗ These authors contributed equally to this work.

Abstract

This study comprehensively evaluates the translation quality of Large Language Models (LLMs), specifically GPT-4, against human translators of varying expertise levels across multiple language pairs and domains. Through carefully designed annotation rounds, we find that GPT-4 performs comparably to junior translators in terms of total errors made but lags behind medium and senior translators. We also observe imbalanced performance across languages and domains, with GPT-4’s translation capability gradually weakening from resource-rich to resource-poor directions. In addition, we qualitatively study the translations produced by GPT-4 and human translators, and find that GPT-4 suffers from literal translations, whereas human translators sometimes overthink the background information. To our knowledge, this study is the first to evaluate LLMs against human translators and analyze the systematic differences between their outputs, providing valuable insights into the current state of LLM-based translation and its potential limitations.

1 Introduction

Recent studies show that LLMs can serve as strong translation systems and good substitutes for NMT models Jiao et al. (2023a); Wang et al. (2023a); Enis and Hopkins (2024); Huang et al. (2023); Wu et al. (2024a); Hendy et al. (2023); Peng et al. (2023). For example, Jiao et al. (2023a) and Wang et al. (2023a) find that GPT-4 can outperform commercial machine translation systems under both automatic and human evaluation. Such impressive results have spurred a wide range of applications, including the use of GPT-4 for literary translation Wu et al. (2024b).

Despite their impressive capabilities, the nature of LLM output compared to human translators remains unclear. This raises two critical questions: (1) How do LLMs compare to human experts in translation quality? and (2) Are there fundamental differences in their outputs? These inquiries are particularly relevant in light of recent research demonstrating significant distinctions between LLM-generated and human-generated texts in general Li et al. (2023); Bao et al. (2023). Such findings suggest that even if LLMs produce high-quality translations, their outputs may possess unique characteristics that distinguish them from human-produced translations.

To determine where LLMs fall within the spectrum of human translation proficiency, which ranges from novice translators to seasoned professionals, we study the problem by taking the current representative LLM, i.e., GPT-4, and comparing it against human translators with different levels of expertise. We first conduct a preliminary study comparing human translations against GPT-4 translations, finding that even experts cannot reach a consensus on which translation is better. Given these findings, we conduct a finer-grained evaluation across different languages and domains, so that translation quality can be better calibrated and systematic differences can be measured. Our evaluation covers three language pairs from resource-rich to resource-poor, i.e., Chinese $\leftrightarrow$ English, Russian $\leftrightarrow$ English, and Chinese $\leftrightarrow$ Hindi, and three domains, i.e., News, Technology, and Biomedical. Given a source sentence, we ask junior, medium, and senior translators and GPT-4 to generate the corresponding translation in the target language. Then, for each translation pair, we hire independent expert annotators to label the errors in the target sentence under the MQM schema Freitag et al. (2021). We find that GPT-4 performs comparably to junior translators in terms of total errors made, and lags behind senior ones by a considerable gap.

Our further analyses and qualitative studies show that performance is imbalanced across languages and domains. From resource-rich to resource-poor directions, GPT-4’s translation capability gradually weakens. For resource-rich directions like Chinese $\leftrightarrow$ English, GPT-4 performs comparably with junior translators and even close to medium translators, but in Chinese $\leftrightarrow$ Hindi, it even lags behind our machine translation baseline. These weaknesses are shared by large models in general and suggest that, although such models have achieved broad translation ability centered on one dominant language, translation between low-resource languages remains a relative weakness.

To our knowledge, we are the first to evaluate LLMs against human translators and analyze the systematic differences between LLMs and human translators.

2 Related Work

Benchmarking LLMs

Previous studies have benchmarked LLMs on various NLP tasks. Xu et al. (2020) benchmark several LLMs on Chinese text, evaluating their Chinese ability. Ye et al. (2024) assess LLMs through Question Answering (QA), MMLU Hendrycks et al. (2021), and other benchmarks. These tests generally show that larger LLMs are more accurate, except on certain tasks. Yuan et al. (2023) demonstrate that LLMs perform well in long-context understanding and handle out-of-distribution inputs better, suggesting that LLMs have a certain degree of generalization ability.

In the MT field, Jiao et al. (2023b) find that GPT-4 performs competitively with other SotA translation products. Wang et al. (2023a) further investigate the capability of GPT-4 in document-level translation; their results show that GPT-4 performs better than commercial translation products and document-level NMT methods. Compared to them, our work empirically shows that GPT-4 is comparable to junior human translators.

LLMs as Human Experts

Given the strong capabilities of GPT-4 relative to traditional NLP models, researchers have compared the performance of GPT-4 with that of human experts on multiple NLP tasks. Zhu et al. (2024) highlight that GPT-4 and GPT-4-turbo show top performance on a Chinese financial language understanding task. Liu et al. (2023b) find that LLMs can be beneficial to biomedical NLP tasks. Goyal et al. (2022) compare GPT models with several summarization models and humans, and find that GPT can generate summaries preferred by humans. In the area of AI for education, Nguyen and Allan (2024) show that GPT-4 can provide teaching feedback for students. Maloney et al. (2024) find that GPT-4 performs close to human participants in coordination games. Siu (2023) shows that GPT-4 is comparable to humans on technical translation tasks. Bojic et al. (2023) find that GPT-4 can outperform human experts on linguistic pragmatic tasks. In clinical diagnostics, Han et al. (2023) find that GPT-4 can give comparable performance to humans, and GPT-4v (the vision version) can even outperform human experts.

Human Evaluation for MT

Graham et al. (2013) first propose Direct Assessment (DA), which uses a continuous score from 0 to 100 to represent the quality of a hypothesis. DA has been adopted in WMT translation tasks for the past few years Farhad et al. (2021); Kocmi et al. (2022, 2023). MQM Lommel et al. (2014), the annotation used in this paper, is another widely used annotation scheme Klubička et al. (2018); Rei et al. (2020a). It requires the annotators to annotate the error span for each hypothesis and is shown to be more accurate and reliable than DA Freitag et al. (2021). Thus, it is utilized in the metrics tasks of 2022 and 2023 WMT challenges Freitag et al. (2022, 2023).

Human Parity

Human parity for machine translation systems is first claimed by Hassan et al. (2018), who describe performance comparable to professional human translations on the WMT 2017 Chinese-to-English news translation task. However, this claim is challenged by subsequent research, which raises concerns about the limited scope of human parity. These limitations include the expertise of human evaluators (Fischer and Läubli, 2020), the origin and quality of source sentences (Toral et al., 2018; Kim et al., 2023), the limited scenario of comparison (Poibeau, 2022), and the difficulty of translation (Graham et al., 2020), indicating significant gaps between NMT models and professional translators. In this work, we evaluate whether the SOTA LLM GPT-4 performs comparably to professional translators and what differs between human translators and LLMs. With the above lessons in mind, we address these limitations by hiring expert annotators, avoiding target-origin source text, manually evaluating source sentences, and covering high-resource to low-resource language pairs and various domains.

3 Preliminary Study

This section presents our preliminary study. We aim to first compare GPT-4 translations with human translations qualitatively, in a coarse manner. Our comparison is simple and direct. We sample human-translated texts and prompt GPT-4 to translate the same source sentence. Then, we ask expert annotators to determine which translation is better.

In particular, to get a quick overview of the quality of human translations against GPT-4 translations, we first use COMET-QE (Unbabel/wmt23-cometkiwi-da-xl) to score our in-house Chinese-to-English human-translated documents, and select the two documents with the highest and the lowest score. Note that our in-house translated documents are all translated by professional translators. In this way, we gather 40 pairs of translations from professional translators and GPT-4, respectively. Recent findings Freitag et al. (2021) have demonstrated that crowd-sourced human ratings are less reliable for high-quality MT evaluation. Thus, we hire six expert annotators to compare the two translations and select the one they find better. We randomly shuffle the GPT-4 and human translations to prevent annotators from identifying GPT-4.
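The blinding step described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function names `blind_pairs` and `gpt4_win_rate`, the slot labels "A"/"B", and the toy sentences are all hypothetical. Each pair is randomly assigned to two anonymous slots, and the answer key is kept separately so that win rates can be computed after annotation.

```python
import random

def blind_pairs(pairs, seed=0):
    """Randomly assign each (human, gpt4) pair to anonymous slots A and B.

    Returns the blinded items shown to annotators and a separate answer key.
    """
    rng = random.Random(seed)  # fixed seed so the answer key is reproducible
    blinded, key = [], []
    for human, gpt4 in pairs:
        if rng.random() < 0.5:
            blinded.append({"A": human, "B": gpt4})
            key.append({"A": "human", "B": "gpt4"})
        else:
            blinded.append({"A": gpt4, "B": human})
            key.append({"A": "gpt4", "B": "human"})
    return blinded, key

def gpt4_win_rate(choices, key):
    """Fraction of pairs where the slot chosen by the annotator held GPT-4's output."""
    wins = sum(1 for choice, k in zip(choices, key) if k[choice] == "gpt4")
    return wins / len(choices)

# Hypothetical toy data standing in for the 40 real translation pairs.
pairs = [("human translation 1", "gpt4 translation 1"),
         ("human translation 2", "gpt4 translation 2")]
blinded, key = blind_pairs(pairs, seed=42)
```

Keeping the key apart from the blinded items means annotators never see which system produced which slot.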

The average win rate of GPT is 15.5/40 (36.25%). This looks like a clear win for human translators, but on closer inspection, we find that the expert annotators agree with each other only at a low rate. As Table 1 shows, most pairs of annotators agree on the winner for only around 60% of the source sentences (the chance baseline is 50%). We further conduct a significance test: only annotator B finds human translation significantly better than GPT’s translation, while the other annotators have high p-values. Given the annotators’ expertise and the straightforwardness of the task, these results indicate that even expert annotators find it difficult to agree on which translation is better, and that GPT-generated translations might have different advantages over human-generated ones. These results motivate us to conduct a finer-grained and comprehensive evaluation to reveal the systematic differences between GPT-4 and human translations.

Annotators	A	B	C	D	E	F
A	100.0	57.5	65.0	65.0	62.5	67.5
B	-	100.0	52.5	52.5	50.0	50.0
C	-	-	100.0	65.0	82.5	67.5
D	-	-	-	100.0	57.5	62.5
E	-	-	-	-	100.0	70.0
F	-	-	-	-	-	100.0
p-value	1.000	0.038	0.268	0.081	0.154	0.875

Table 1: Ratio(%) of agreed winner across expert annotators and significance p-value for binomial test. P-value < 0.05 denotes a significant difference between GPT-4 and Human.
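As an illustration of the two quantities in Table 1, the sketch below computes a pairwise agreement ratio and an exact two-sided binomial p-value using only the standard library. It assumes a null win probability of 0.5 and a two-sided test, which the text does not state explicitly; the function names and the example count (13 wins out of 40) are hypothetical and not taken from the annotation data.

```python
from math import comb

def binomial_two_sided_p(wins, n, p=0.5):
    """Exact two-sided binomial test: total probability of all outcomes
    that are no more likely than the observed one under the null."""
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    observed = pmf[wins]
    # A small relative tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-7)))

def agreement_ratio(choices_a, choices_b):
    """Percentage of items on which two annotators picked the same winner."""
    same = sum(a == b for a, b in zip(choices_a, choices_b))
    return 100.0 * same / len(choices_a)

# Hypothetical annotator who prefers GPT-4 in 13 of 40 pairs.
p_value = binomial_two_sided_p(13, 40)
print(round(p_value, 3))  # 0.038
```

With 40 pairs, an even 20/20 split gives a p-value of 1.0, so an annotator needs a fairly lopsided preference before the difference becomes significant at the 0.05 level.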

Type	Error Name	Explanations
Accuracy	Mistranslation	Translation does not accurately represent the source.
	Addition	Information not present in the source.
	MT Hallucination	Information unrelated to the source, gibberish, or repetitions.
	Omission	Missing content from the source.
	Untranslated	Not translated.
	Wrong Name Entity and Term	Wrong usage of NE and Terminology.
Fluency	Grammar	Problems with grammar of target language.
	Punctuation	Incorrect punctuation (for locale or style)
	Spelling	Incorrect spelling or capitalization.
	Register	Wrong grammatical register (e.g., inappropriately informal pronouns).
	Inconsistent Style	Internal inconsistency (not related to terminology).
	Unnatural Flow	Translations that are too literal or sound unnatural.
Other	Non-translation	-

Table 2: Error category and explanations. We mainly follow the guidelines from Unbabel, and merge some errors to reduce the efforts for annotators to understand the annotation system. Concrete examples for each error category can be found in the Appendix.

4 Main Experimental Setup

Motivated by the results of our preliminary study, we conduct a comprehensive and fine-grained evaluation to reveal the systematic differences between human translators and GPT-4. Specifically, we employ the widely recognized Multidimensional Quality Metrics (MQM) framework Lommel et al. (2014) and compare human translators with varying levels of expertise to GPT-4. Our evaluation spans multiple languages and domains, aiming to furnish broad insights into these comparisons.

4.1 Data Collection

We collect multilingual and multi-domain source sentences. Our multilingual evaluation data contains six language directions, covering high-resource to low-resource settings: English to Chinese, Chinese to English, English to Russian, Russian to English, Chinese to Hindi, and Hindi to Chinese.

For general-domain Chinese $\leftrightarrow$ English and English $\leftrightarrow$ Russian, we sample source sentences from the test sets of WMT2023 and WMT2022, respectively. For Chinese $\leftrightarrow$ Hindi, we extract source news text from public websites. For multi-domain evaluation, we cover two additional domains, i.e., biomedical and technology, in the Chinese-to-English direction. The source sentences are extracted news texts from public websites. We ensure that all sources originate in the source language to avoid the effect of translationese. We manually evaluate all source sentences for these tasks and ensure the source sentences are not too easy or too short. Finally, each task contains 200 sentences, for a total of 1600 sentences across our eight tasks.

4.2 Human Translators and Machine Translators

We ask different human translators to translate our source sentences into the target language. The translators fall into three levels of expertise: junior, medium, and senior. The level of expertise is ranked by in-house criteria covering the translators’ educational background, translation experience, and practical proficiency. See Appendix A for more details. For a fair comparison, we ask the translators not to use machine translation or GPTs as assistance. For all directions except Zh-Hi and Hi-Zh, we collect three human translation results from each level of expertise. For Zh-Hi and Hi-Zh, we only have medium-level and senior-level translators due to the scarcity of translators.

In addition to human translators, we use gpt-4-1106-preview, the current state-of-the-art large language model released by OpenAI, and SeamlessM4T Communication et al. (2023) as the representative of traditional machine translation systems to complement our experiments. We directly prompt GPT-4 to obtain the translation, as this is the most common practice for normal users, is the easiest to reproduce, and avoids confounds introduced by more elaborate prompting techniques.

4.3 Prompt Search

Previous studies Zhao et al. (2021); Liu et al. (2023a) show that different prompts can lead to markedly different LLM performance. Thus, we collect three candidate prompts used in previous research Xu et al. (2023); Jiao et al. (2023a) and use COMET-QE Rei et al. (2020b) to select the prompt that makes the best use of GPT-4, as shown in Table 3. In particular, we use each of the three prompts to have GPT-4 translate 100 source sentences from our Chinese-to-English test set and adopt COMET-QE to evaluate the quality of the translations. We find that the third prompt yields the best performance, and hence we adopt this prompt for all following experiments.

Prompt	COMET-QE
Please translate the following sentence from Chinese into English. Your language and style should align with the language conventions of a native speaker. \n{SOURCE}\n	0.775
You are an expert translator for translating Chinese to English. Your language and style should align with the language conventions of a native speaker. \n[Chinese]: {SOURCE}\n[English]:	0.755
Please provide the English translation for these sentences. Your language and style should align with the language conventions of a native speaker. \n{SOURCE}\n	0.780

Table 3: Taking Chinese to English as an example, our three prompts and corresponding scores with COMET-QE. {SOURCE} represents the source sentence to be translated.
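The selection procedure can be sketched as below. The three templates are copied from Table 3, but everything else is an assumption made for illustration: the helper names `build_prompt` and `select_best_prompt` are hypothetical, and the per-sentence scores are invented placeholders, since only the averaged scores are reported. In practice the scores would come from running a quality-estimation model on GPT-4's outputs.

```python
from statistics import mean

# Candidate templates from Table 3; {SOURCE} marks where the source sentence goes.
PROMPTS = [
    "Please translate the following sentence from Chinese into English. "
    "Your language and style should align with the language conventions "
    "of a native speaker. \n{SOURCE}\n",
    "You are an expert translator for translating Chinese to English. "
    "Your language and style should align with the language conventions "
    "of a native speaker. \n[Chinese]: {SOURCE}\n[English]:",
    "Please provide the English translation for these sentences. "
    "Your language and style should align with the language conventions "
    "of a native speaker. \n{SOURCE}\n",
]

def build_prompt(template, source):
    """Fill the {SOURCE} placeholder with the sentence to translate."""
    return template.replace("{SOURCE}", source)

def select_best_prompt(scores_per_prompt):
    """Return the index of the prompt with the highest mean score, plus all means."""
    means = [mean(scores) for scores in scores_per_prompt]
    best_index = max(range(len(means)), key=means.__getitem__)
    return best_index, means

# Hypothetical per-sentence quality scores (two sentences per prompt for brevity).
scores_per_prompt = [[0.77, 0.78], [0.75, 0.76], [0.78, 0.78]]
best_index, means = select_best_prompt(scores_per_prompt)
print(best_index)  # 2, i.e. the third prompt
```

Using `str.replace` rather than `str.format` avoids errors when a source sentence itself contains curly braces.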

4.4 Annotation Protocol

To evaluate the outputs of the candidate systems, we hire experts to annotate translation errors blindly. The annotation platform is Doccano Nakayama et al. (2018), and the error tags are made according to MQM standards. MQM requires the annotators to annotate the span of errors in each hypothesis. All hypotheses of the same source sentence are shown to the annotator together to help decide which is better. We have 13 error categories and two severities, as shown in Table 2. Our error categorization mostly follows Unbabel’s practice (https://help.unbabel.com/hc/en-us/articles/6444304419479-Annotation-Guidelines-Typology-3-0), and we focus on the most common error types. Each tag has subtags for the two severities, i.e., Minor and Major. A screenshot of the annotation system is given in Figure 5.

For each task, we first ask the two expert annotators to carefully read our manual and conduct a training round on the first 10 groups of translations. Then, we manually check these annotations to provide feedback and ask the two annotators to check their disagreements and revise their results. After two rounds of such training processes, we ask the annotators to finish the remaining sentences without knowing each other’s results.

After the first round of annotation, we conduct a second round to further refine the evaluation results. In particular, we hire another two experts for each task and show them the previous annotation results. They are asked to approve and make necessary modifications to previous round annotations.
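One way to represent such span annotations and count errors per system is sketched below. The category names follow Table 2 and the severities follow the text, but the `ErrorSpan` record, its field names, and the example spans are hypothetical; this is not the export format of the annotation platform.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

# Error typology from Table 2, grouped by top-level type.
CATEGORIES = {
    "Accuracy": ["Mistranslation", "Addition", "MT Hallucination",
                 "Omission", "Untranslated", "Wrong Name Entity and Term"],
    "Fluency": ["Grammar", "Punctuation", "Spelling", "Register",
                "Inconsistent Style", "Unnatural Flow"],
    "Other": ["Non-translation"],
}
ALL_CATEGORIES = {name for names in CATEGORIES.values() for name in names}
SEVERITIES = ("Minor", "Major")

@dataclass(frozen=True)
class ErrorSpan:
    system: str    # which translator or model produced the hypothesis
    start: int     # character offset where the error span begins
    end: int       # character offset where the error span ends (exclusive)
    category: str
    severity: str

def validate(span):
    """Reject spans with unknown labels or empty extents."""
    if span.category not in ALL_CATEGORIES:
        raise ValueError(f"unknown category: {span.category}")
    if span.severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {span.severity}")
    if span.start >= span.end:
        raise ValueError("span must cover at least one character")

def tally(spans):
    """Count errors per system, once by severity and once by category."""
    by_severity = defaultdict(Counter)
    by_category = defaultdict(Counter)
    for span in spans:
        validate(span)
        by_severity[span.system][span.severity] += 1
        by_category[span.system][span.category] += 1
    return by_severity, by_category

# Hypothetical annotations for a single source sentence.
spans = [
    ErrorSpan("gpt-4", 0, 5, "Unnatural Flow", "Minor"),
    ErrorSpan("gpt-4", 10, 18, "Wrong Name Entity and Term", "Major"),
    ErrorSpan("junior", 3, 9, "Mistranslation", "Major"),
]
by_severity, by_category = tally(spans)
```

Counting by severity and by category in a single pass mirrors the two views of the results reported later: total errors per severity and errors per category.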

Reference, Re-Annotated by Freitag et al. (2021)
Task	Cohen’s Kappa (Segment)	Krippendorff’s Alpha (Span)
WMT 2020 En-De	0.208	0.456
WMT 2021 En-De	0.230	0.501
Ours
General Zh-En	0.257	0.436
General En-Zh	0.544	0.579
General En-Ru	0.461	0.566
General Ru-En	0.341	0.875
General Zh-Hi	0.256	0.443
General Hi-Zh	0.234	0.495
Technology Zh-En	0.306	0.581
Biomedical Zh-En	0.373	0.616
Average	0.321	0.555

Table 4: Cohen’s Kappa (segment-level) and Krippendorff’s Alpha (span-level) agreement of our annotations.

4.5 Inter-Annotator Agreement

Error annotation with MQM is challenging, and previous work demonstrates that the agreement scores between MQM annotations are relatively low Lommel et al. (2014). Reasons for this could be disagreement on precise spans and ambiguous error categorization Lommel et al. (2014). Despite the low agreement scores, MQM is more reliable than other evaluation protocols like Direct Assessment Freitag et al. (2021).

To compute inter-annotator agreement for MQM, we employ segment-level Cohen’s Kappa Cohen (1960) and span-level Krippendorff’s alpha (Krippendorff, 1980). For reference, we calculate the agreement on the annotated results of the 2020 and 2021 WMT English-to-German tasks by Freitag et al. (2021). Our IAA results are shown in Table 4. Thanks to our two-round annotation process, our IAA scores show a favorable agreement, indicating a good annotation quality.
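For reference, Cohen’s Kappa is defined as $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance from each annotator’s label distribution. The sketch below implements this definition directly. How segment-level labels are derived from span annotations is not specified here, so the example assumes, purely for illustration, a binary label per segment (1 if the annotator marked any error, 0 otherwise); the label sequences are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same segments."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of segments with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label]
              for label in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators always use the same single label
    return (p_o - p_e) / (1 - p_e)

# Hypothetical segment-level labels: 1 = segment contains an error, 0 = clean.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.6
```

In this toy case the annotators agree on 8 of 10 segments, but half of that agreement is expected by chance, which is why the kappa of 0.6 is lower than the raw 80% agreement.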

5 Main Results

5.1 Overall Results

Analysis of Error Severity

The upper part of Figure 1 plots the average number of errors for different systems and translators. Compared to our MT baseline (Seamless), GPT-4 makes far fewer errors. It performs almost as well as junior-level translators in terms of total errors, as GPT-4 is annotated with only slightly more minor and major errors than junior translators. However, GPT-4 still shows a clear performance gap relative to medium and senior human translators, as it makes considerably more mistakes than experienced translators. To our knowledge, we are the first to report how GPT-4 compares with human translators on translation.

Figure 1: *Upper*: Error severity for each system. The gray line represents the standard deviation for each system across tasks. *Bottom*: Error category analysis for each system.

Analysis of Error Categories

Furthermore, we plot the categories of errors in the bottom part of Figure 1. Compared with junior human translators, GPT-4 makes more accuracy errors, which account for most of the disparity. Interestingly, GPT-4 makes fewer fluency errors than junior translators, indicating a better command of language usage.

In addition, Figure 2 shows the top 5 categories of errors made by different systems. ‘Mistranslation’ is the most frequent error made by all systems. Improving considerably over the Seamless baseline, GPT-4 makes a number of ‘Mistranslation’ errors comparable to that of junior and medium human translators.

For all translators, ‘Unnatural Flow’ is among the most frequent errors. Seamless, GPT-4, and junior translators have similar levels of ‘Unnatural Flow’, indicating possible issues of literal translation and not following language conventions. In contrast, medium and senior translators are annotated with significantly fewer errors of ‘Unnatural Flow’.

In addition, we notice that even though GPT-4 makes far fewer ‘Wrong Name Entity (NE)’ errors than Seamless, likely benefiting from the extensive knowledge acquired in the pre-training stage, it still has a gap compared to human translators.

Finally, we notice that GPT-4 does not have Omission or Addition problems in its top-5 errors, whereas even senior translators have Addition errors.

5.2 Detailed Results for Each Language

In Figure 3, we present detailed results for each language pair, averaged over two directions.

English-Chinese

From Figure 3(a), GPT-4 shows strong capability in translating English to Chinese and vice versa. From the radar chart, we can see that GPT-4 makes almost the same or slightly fewer semantic errors (Omission, Addition, and Mistranslation errors) than Junior and Senior translators. In particular, for mistranslation errors, which are generally considered the most semantically detrimental, GPT-4 does better than junior and senior translators. For omission and addition errors, GPT-4 reaches almost the same level as senior translators. However, GPT-4 makes significantly more lexical, stylistic, and grammatical errors than human translators do. This error distribution meets our expectations: in the absence of references, GPT-4 translates unfamiliar words directly and literally, instead of consulting online materials or seeking other forms of help as human translators do. Furthermore, due to the complexity and variability of Chinese, the translation of entity names or proper nouns is usually not one-to-one. Together, these two factors explain GPT-4’s weaker performance in these aspects.

English-Russian

For the English-Russian translation tasks, GPT-4 makes slightly more semantic errors, but its number of mistranslation errors is almost at the same level as that of medium and senior translators. Moreover, GPT-4 generally makes fewer stylistic, grammatical, and Wrong Name Entity & Term errors than junior translators. The English-Russian translation tasks are quite challenging and the performance of translators varies significantly, but GPT-4 still maintains an average level overall.

Hindi-Chinese

On Hindi-Chinese, the low-resource language pair we evaluate, GPT-4 demonstrates the worst performance among all evaluated translators. We observe that GPT-4 is inferior to our MT baseline. This may be due to the small portion of Hindi and Chinese corpora in its pre-training dataset. Specifically, GPT-4 makes the most ‘Mistranslation’ errors, indicating a gap from the language understanding of human translators. In comparison, SeamlessM4T makes fewer semantic and lexical errors.

Discussion

Our results here reveal an imbalance in the multilingual ability of LLMs Wang et al. (2023b). They imply that GPT-4 can serve as a reliable translator for high-resource directions such as Chinese to English, but is questionable for low-resource directions like Chinese-Hindi. In the low-resource scenario, a dedicated machine translation system is more reliable.

5.3 Detailed Results for Different Domains

Figure 4 presents our results for different domains in Chinese-to-English translation. We compare three different domains, including news, technology, and biomedical.

General News Domain

GPT-4 performs worse in the news domain than human translators of all three levels. The number of semantic errors made by GPT-4 is quite close to that of junior and medium translators. Nonetheless, GPT-4 makes more lexical and grammatical errors than human translators. We hypothesize that this is mainly due to the literariness and timeliness of news text, as GPT-4 cannot access online materials to confirm the name of a specific entity or event.

Technology Domain

The performance of GPT-4 is relatively close to that of medium-level translators. Except for Wrong Name Entity & Terms, GPT-4 makes almost the same number of errors as medium-level translators, or even fewer, across all aspects. Specifically, the number of semantic errors made by GPT-4 is almost the same as that of medium-level translators, and it makes far fewer structural and grammatical errors. This suggests that in this field, GPT-4 might understand the original text better than junior or medium-level translators and be able to produce a translation that is more in line with the original meaning.

Biomedical Domain

Similar to the technology domain, the qualities of the translations made by GPT-4 and medium-level translators stand at the same level. Despite slightly more Wrong Name Entity & Terms errors made, GPT-4 performs better than junior and medium-level translators in other aspects.

Discussion

For specific domains like technology, we show that GPT-4 is comparable with junior/medium translators. We still notice an imbalance similar to that in the multilingual setting, but GPT-4’s performance is less sensitive to a change of domain than to a change of language.

Source	巨人网络有限公司
GPT-4	Giant Network Group Inc.
Human	Giant Interactive Group Inc.

Table 5: Named Entity cases.

5.4 Case Study

We also qualitatively examine the differences between the translations given by GPT-4 and human translators.

Literal Translations

Among the error cases, the typical one is literal translation. Specifically, we find that GPT-4 sometimes produces translations that are semantically correct but non-native and literal. This is problematic for named entities, especially those occurring less frequently. As shown in Table 5, when it does not know the correct translation of ‘巨人网络有限公司’, GPT-4 translates the term word by word. In contrast, the issue of named entities occurs less for human translators, partially because they would search online to find the correct translation. Thus, this issue might be resolved by incorporating web search into agent-like translation Feng et al. (2024); Wu et al. (2024c).

Beyond named entities, we notice that literal translation causes Unnatural Flow errors. As shown in Table 6, when translating ‘It’s just a white screen’, GPT-4 renders the phrase as ‘它(it)只是(is just)一个(a)白屏(white screen)’, whereas the human translator renders it as ‘页面显示空白’ (The page display is white), which conveys a more precise meaning and follows local conventions.

Source	It’s just a white screen or it times out loading it, or the page becomes unresponsive!
GPT-4	它只是一个白屏，要么是加载时超时，要么页面变得无响应了！
Human	页面要么显示空白，要么加载超时或是无响应。

Table 6: Unnatural-Flow cases. Red represents the literal translation and green is more natural and native in Chinese.

Source	He has health concerns atm but we also have Daley entering his 2nd year and is a decent safety net.
GPT-4	他目前有健康问题，但我们还有戴利进入他的第二年，他是一个不错的安全保障。
Human	他目前有健康问题。不过，戴利两岁了，是个不错的备选人。

Table 7: Human imagination cases. Red denotes the imagined part.

Human Imagination

We find that human translators also have drawbacks compared to GPT-4. When the source sentence contains insufficient information to translate, human translators tend to fill the gap by imagination or overthinking. An example is given in Table 7. The translator wrongly interprets the phrase ‘entering his 2nd year’ as meaning that Daley is a two-year-old child, whereas the sentence describes a sports player entering his second year. This may be due to everyday language habits, misunderstanding, or inattention, and resembles the hallucination of LLMs Zhang et al. (2023). GPT-4’s literal translation helps here, as it stays faithful to the source sentence. This also aligns with our finding in Section 5.1 that GPT-4 has fewer Addition or Omission errors.

6 Conclusion

In this study, we comprehensively evaluated the translation quality of GPT-4 against human translators of varying expertise levels across multiple language pairs and domains. Our findings showed that GPT-4 performs comparably to junior translators in terms of total errors made but lags behind medium and senior translators. We also found that GPT-4’s translation capability gradually weakens from resource-rich to resource-poor language pairs. Qualitative analysis revealed that GPT-4 tends to produce more literal translations than human translators but suffers less from imagined information.

The results of this study demonstrate that GPT-4 has made significant strides in approaching human-level translation quality, as well as highlighting the nuanced difference between them. This suggests promising opportunities for collaboration and enhancement of translation workflows. As research continues to advance, we anticipate that LLMs will become increasingly valuable tools in the translation industry, working alongside human translators to improve productivity, efficiency, and overall translation quality.

7 Limitations

Our work is limited in the following aspects: (1) We benchmark GPT-4 for translation tasks, as it is a representative large language model and shows state-of-the-art performance on many text-based tasks. However, our evaluations can be extended to other LLMs such as Claude-3. (2) Our evaluation covers three language pairs and six directions from resource-rich to resource-poor. However, for other languages, there might be language-specific phenomena that are not covered in this paper.

References

Bao et al. (2023) Guangsheng Bao, Yanbin Zhao, Zhiyang Teng, Linyi Yang, and Yue Zhang. 2023. Fast-DetectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature. arXiv preprint arXiv:2310.05130.
Bojic et al. (2023) Ljubisa Bojic, Predrag Kovacevic, and Milan Cabarkapa. 2023. GPT-4 surpassing human performance in linguistic pragmatics. arXiv preprint arXiv:2312.09545.
Cohen (1960) Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1):37–46.
Communication et al. (2023) Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-jussà, Onur Celebi, Maha Elbayad, Cynthia Gao, Francisco Guzmán, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, and Skyler Wang. 2023. Seamlessm4t: Massively multilingual & multimodal machine translation.
Enis and Hopkins (2024) Maxim Enis and Mark Hopkins. 2024. From llm to nmt: Advancing low-resource machine translation with claude. arXiv preprint arXiv:2404.13813.
Farhad et al. (2021) Akhbardeh Farhad, Arkhangorodsky Arkady, Biesialska Magdalena, Bojar Ondřej, Chatterjee Rajen, Chaudhary Vishrav, Marta R Costa-jussa, España-Bonet Cristina, Fan Angela, Federmann Christian, et al. 2021. Findings of the 2021 conference on machine translation (wmt21). In Proceedings of the Sixth Conference on Machine Translation, pages 1–88. Association for Computational Linguistics.
Feng et al. (2024) Zhaopeng Feng, Yan Zhang, Hao Li, Bei Wu, Jiayu Liao, Wenqiang Liu, Jun Lang, Yang Feng, Jian Wu, and Zuozhu Liu. 2024. Tear: Improving llm-based machine translation with systematic self-refinement.
Fischer and Läubli (2020) Lukas Fischer and Samuel Läubli. 2020. What’s the difference between professional human and machine translation? a blind multi-language study on domain-specific MT. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 215–224, Lisboa, Portugal. European Association for Machine Translation.
Freitag et al. (2021) Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. 2021. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460–1474.
Freitag et al. (2023) Markus Freitag, Nitika Mathur, Chi-kiu Lo, Eleftherios Avramidis, Ricardo Rei, Brian Thompson, Tom Kocmi, Frederic Blain, Daniel Deutsch, Craig Stewart, Chrysoula Zerva, Sheila Castilho, Alon Lavie, and George Foster. 2023. Results of WMT23 metrics shared task: Metrics might be guilty but references are not innocent. In Proceedings of the Eighth Conference on Machine Translation, pages 578–628, Singapore. Association for Computational Linguistics.
Freitag et al. (2022) Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, Eleftherios Avramidis, Tom Kocmi, George Foster, Alon Lavie, and André F. T. Martins. 2022. Results of WMT22 metrics shared task: Stop using BLEU – neural metrics are better and more robust. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 46–68, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Goyal et al. (2022) Tanya Goyal, Junyi Jessy Li, and Greg Durrett. 2022. News summarization and evaluation in the era of gpt-3. arXiv preprint arXiv:2209.12356.
Graham et al. (2013) Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33–41, Sofia, Bulgaria. Association for Computational Linguistics.
Graham et al. (2020) Yvette Graham, Christian Federmann, Maria Eskevich, and Barry Haddow. 2020. Assessing human-parity in machine translation on the segment level. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4199–4207, Online. Association for Computational Linguistics.
Han et al. (2023) Tianyu Han, Lisa C Adams, Keno Bressem, Felix Busch, Luisa Huck, Sven Nebelung, and Daniel Truhn. 2023. Comparative analysis of gpt-4vision, gpt-4 and open source llms in clinical diagnostic accuracy: A benchmark against human expertise. medRxiv, pages 2023–11.
Hassan et al. (2018) Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic chinese to english news translation. arXiv preprint arXiv:1803.05567.
Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding.
Hendy et al. (2023) Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How good are gpt models at machine translation? a comprehensive evaluation. arXiv preprint arXiv:2302.09210.
Huang et al. (2023) Hui Huang, Shuangzhi Wu, Xinnian Liang, Bing Wang, Yanrui Shi, Peihao Wu, Muyun Yang, and Tiejun Zhao. 2023. Towards making the most of llm for translation quality estimation. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 375–386. Springer.
Jiao et al. (2023a) Wenxiang Jiao, Wenxuan Wang, Jen tse Huang, Xing Wang, Shuming Shi, and Zhaopeng Tu. 2023a. Is chatgpt a good translator? yes with gpt-4 as the engine.
Jiao et al. (2023b) Wenxiang Jiao, Wenxuan Wang, Jen tse Huang, Xing Wang, Shuming Shi, and Zhaopeng Tu. 2023b. Is chatgpt a good translator? yes with gpt-4 as the engine.
Kim et al. (2023) Ahrii Kim, Yunju Bak, Jimin Sun, Sungwon Lyu, and Changmin Lee. 2023. The suboptimal wmt test sets and its impact on human parity. Preprints.
Klubička et al. (2018) Filip Klubička, Antonio Toral, and Víctor M Sánchez-Cartagena. 2018. Quantitative fine-grained human evaluation of machine translation systems: a case study on english to croatian. Machine Translation, 32(3):195–215.
Kocmi et al. (2023) Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, et al. 2023. Findings of the 2023 conference on machine translation (wmt23): Llms are here but not quite there yet. In Proceedings of the Eighth Conference on Machine Translation, pages 1–42.
Kocmi et al. (2022) Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, et al. 2022. Findings of the 2022 conference on machine translation (wmt22). In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1–45.
Krippendorff (1980) Klaus Krippendorff. 1980. Validity in content analysis. Computerstrategien für die Kommunikationsanalyse, 69:45p.
Li et al. (2023) Yafu Li, Qintong Li, Leyang Cui, Wei Bi, Longyue Wang, Linyi Yang, Shuming Shi, and Yue Zhang. 2023. Deepfake text detection in the wild. arXiv preprint arXiv:2305.13242.
Liu et al. (2023a) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35.
Liu et al. (2023b) Zhengliang Liu, Tianyang Zhong, Yiwei Li, Yutong Zhang, Yi Pan, Zihao Zhao, Peixin Dong, Chao Cao, Yuxiao Liu, Peng Shu, et al. 2023b. Evaluating large language models for radiology natural language processing. arXiv preprint arXiv:2307.13693.
Lommel et al. (2014) Arle Lommel, Maja Popovic, and Aljoscha Burchardt. 2014. Assessing inter-annotator agreement for translation error annotation. In MTE: Workshop on Automatic and Manual Metrics for Operational Translation Evaluation, pages 31–37. Language Resources and Evaluation Conference Reykjavik.
Maloney et al. (2024) Laurence T Maloney, Maria F Dal Martello, Vivian Fei, and Valerie Ma. 2024. A comparison of human and gpt-4 use of probabilistic phrases in a coordination game. Scientific reports, 14(1):6835.
Nakayama et al. (2018) Hiroki Nakayama, Takahiro Kubo, Junya Kamura, Yasufumi Taniguchi, and Xu Liang. 2018. doccano: Text annotation tool for human. Software available from https://github.com/doccano/doccano.
Nguyen and Allan (2024) Ha Nguyen and Vicki Allan. 2024. Using gpt-4 to provide tiered, formative code feedback. In Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1, pages 958–964.
Peng et al. (2023) Keqin Peng, Liang Ding, Qihuang Zhong, Li Shen, Xuebo Liu, Min Zhang, Yuanxin Ouyang, and Dacheng Tao. 2023. Towards making the most of chatgpt for machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5622–5633.
Poibeau (2022) Thierry Poibeau. 2022. On "human parity" and "super human performance" in machine translation evaluation. In Language Resource and Evaluation Conference.
Rei et al. (2020a) Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020a. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
Rei et al. (2020b) Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020b. Comet: A neural framework for mt evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702.
Siu (2023) Sai Cheong Siu. 2023. Chatgpt and gpt-4 for professional translators: Exploring the potential of large language models in translation. Available at SSRN 4448091.
Toral et al. (2018) Antonio Toral, Sheila Castilho, Ke Hu, and Andy Way. 2018. Attaining the unattainable? reassessing claims of human parity in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 113–123, Brussels, Belgium. Association for Computational Linguistics.
Wang et al. (2023a) Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang, Dian Yu, Shuming Shi, and Zhaopeng Tu. 2023a. Document-level machine translation with large language models.
Wang et al. (2023b) Wenxuan Wang, Wenxiang Jiao, Jingyuan Huang, Ruyi Dai, Jen-tse Huang, Zhaopeng Tu, and Michael R Lyu. 2023b. Not all countries celebrate thanksgiving: On the cultural dominance in large language models. CoRR.
Wu et al. (2024a) Minghao Wu, Thuy-Trang Vu, Lizhen Qu, George Foster, and Gholamreza Haffari. 2024a. Adapting large language models for document-level machine translation. arXiv preprint arXiv:2401.06468.
Wu et al. (2024b) Minghao Wu, Yulin Yuan, Gholamreza Haffari, and Longyue Wang. 2024b. (perhaps) beyond human translation: Harnessing multi-agent collaboration for translating ultra-long literary texts. arXiv preprint arXiv:2405.11804.
Wu et al. (2024c) Minghao Wu, Yulin Yuan, Gholamreza Haffari, and Longyue Wang. 2024c. (perhaps) beyond human translation: Harnessing multi-agent collaboration for translating ultra-long literary texts.
Xu et al. (2023) Haoran Xu, Young Jin Kim, Amr Sharaf, and Hany Hassan Awadalla. 2023. A paradigm shift in machine translation: Boosting translation performance of large language models.
Xu et al. (2020) Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, et al. 2020. Clue: A chinese language understanding evaluation benchmark. arXiv preprint arXiv:2004.05986.
Ye et al. (2024) Fanghua Ye, Mingming Yang, Jianhui Pang, Longyue Wang, Derek F. Wong, Emine Yilmaz, Shuming Shi, and Zhaopeng Tu. 2024. Benchmarking llms via uncertainty quantification.
Yuan et al. (2023) Lifan Yuan, Yangyi Chen, Ganqu Cui, Hongcheng Gao, Fangyuan Zou, Xingyi Cheng, Heng Ji, Zhiyuan Liu, and Maosong Sun. 2023. Revisiting out-of-distribution robustness in nlp: Benchmark, analysis, and llms evaluations.
Zhang et al. (2023) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023. Siren’s song in the ai ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.
Zhao et al. (2021) Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International conference on machine learning, pages 12697–12706. PMLR.
Zhu et al. (2024) Jie Zhu, Junhui Li, Yalong Wen, and Lifan Guo. 2024. Benchmarking large language models on cflue–a chinese financial language understanding evaluation dataset. arXiv preprint arXiv:2405.10542.

Appendix A Expertise of Human Annotators

To categorize translators into junior, medium, or senior levels, we have established a comprehensive set of criteria that take into account various factors indicative of a translator’s expertise and experience. These factors include the translator’s educational background, particularly the prestige of the institution from which they graduated, as well as their length of service in the translation industry, the duration of their translation career, the number of translations completed, and any professional certifications they have obtained. To ensure the ongoing competence of our translators, we conduct quarterly assessments to evaluate their performance. For instance, to be classified as a senior-level translator, an individual must possess a minimum of ten years of translation experience, demonstrate exceptional proficiency by achieving a score of 99% on our assessments, and hold the distinguished CATTI++ translation certification. By considering these stringent criteria, we aim to maintain a highly qualified and skilled pool of translators across all levels of expertise.

Appendix B Annotation Requirements

B.1 Error Types

Our annotation system is built upon the open-source doccano system (https://github.com/doccano/doccano). Figure 5 shows a screenshot of our annotation system. For each source sentence, the outputs of the different systems are displayed, and annotators can select spans of text and label each span with an error type and a severity.

Appendix C Detailed Explanation and Guidance for Each Error Type

Our evaluation protocol largely follows the MQM criteria released by Unbabel. We provide annotators with a detailed annotation manual that explains each error type and gives illustrative examples. The manual is reproduced below:

C.1 Annotation Requirements

The minimum unit that can be selected and annotated is a whole word, a whitespace, a punctuation mark, or an isolated character. In the following example, the version in French has an extra exclamation mark, so it’s necessary to annotate it as a Punctuation error:

[EN] Thank you very much.

[FR] Merci beaucoup!

Wrong selection $\rightarrow$ Merci [beaucoup!]PUNCTUATION

Correct selection $\rightarrow$ Merci beaucoup[!]PUNCTUATION

If the issue occurs in a multiword expression, you will need to select the whole expression; if, for example, an entire sentence was translated and it shouldn’t have been, you should select the entire sentence.

In the following example, we have an Unnatural Flow error:

[EN] Hi, Mary here.

[ES] Hola, Mary aquí.

Wrong selection $\rightarrow$ Hola, [Mary aquí.]UNNATURAL FLOW

Correct selection $\rightarrow$ Hola, [Mary aquí]UNNATURAL FLOW.
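
The bracket notation used in these examples, `[span]LABEL`, can be converted into character offsets mechanically. The following is a minimal Python sketch and not the tooling used in this study: the function name `parse_annotation` and the regular expression are illustrative assumptions, and it assumes that labels are written in upper case immediately after the closing bracket.

```python
import re

# Matches "[selected text]LABEL", where LABEL is one or more upper-case words
# separated by single spaces (e.g. PUNCTUATION, UNNATURAL FLOW).
# Assumption for illustration: the text following a label does not itself
# start with a fully upper-case word, which would be absorbed into the label.
_SPAN = re.compile(r"\[([^\[\]]*)\]([A-Z]+(?: [A-Z]+\b)*)")

def parse_annotation(marked):
    """Return (plain_text, spans); each span is (start, end, label) in plain_text."""
    plain_parts = []
    spans = []
    cursor = 0  # read position in the marked string
    offset = 0  # length of the plain text emitted so far
    for m in _SPAN.finditer(marked):
        before = marked[cursor:m.start()]
        plain_parts.append(before)
        offset += len(before)
        selected = m.group(1)
        spans.append((offset, offset + len(selected), m.group(2)))
        plain_parts.append(selected)
        offset += len(selected)
        cursor = m.end()
    plain_parts.append(marked[cursor:])
    return "".join(plain_parts), spans

text, spans = parse_annotation("Hola, [Mary aquí]UNNATURAL FLOW.")
start, end, label = spans[0]
print(text[start:end], label)
```

Working with offsets into the plain text makes it easy to check the selection rules above, for example that a span does not include the sentence-final period.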

C.2 Error Types

Accuracy

• Mistranslation
  – Description: Translation does not accurately represent the source.
  – Example:
    [EN] It has to be done by the book.
    [FR] Il doit être fait [par le livre]MISTRANSLATION
    [Reason]: The word-for-word translation into French doesn’t work.
• Addition
  – Description: Information not present in the source.
  – Example:
    [EN] That way you can be sure that you were the one who made the changes.
    [ES] Así puedes estar seguro de que fuiste tú quien hizo [todos]ADDITION los cambios.
    [Reason]: [Todos] (meaning ’all’ in Spanish) is not present in the source and is incorrectly added in the target text.
• MT Hallucination
  – Description: Information unrelated to the source, gibberish, or repeated content.
  – Example:
    [EN] You can send us a follow-up email at this address [EMAIL].
    [ES] [Hágame saber si tiene alguna otra pregunta]MT HALLUCINATION.
    [Reason]: The Spanish translation reads “please let me know if you have any other questions”. It is grammatically correct and fluent, but it has no relation at all to the source.
• Omission
  – Description: Missing content from the source.
  – Example:
    [EN] We do not have much information on this.
    [FR] Nous ne disposons pas []OMISSION beaucoup d’informations à ce sujet.
    [Reason]: The French sentence requires the preposition [de] (disposer de).
• Untranslated
  – Description: Not translated.
  – Example:
    [EN] How To Make Pizza Dough
    [FR] Comment faire de [Pizza Dough]UNTRANSLATED
    [Reason]: [Pizza Dough] is not a named entity and is untranslated in the French version.
• Wrong Name Entity & Term
  – Description: Wrong usage of named entities (NE) and terminology.
  – Example:
    [EN] Dear Wiley,
    [IT] Gentile [Wilar]WRONG NAMED ENTITY,
    [Reason]: The name in the Italian version doesn’t match the original.

Fluency

• Grammar
  – Description: Problems with the grammar of the target language.
  – Example:
    [EN] I understand that you want to check in online.
    [CS] Chápu, že se chcete [odbavení]GRAMMAR online.
    [Reason]: Wrong part of speech makes the sentence ungrammatical in Czech.
• Punctuation
  – Description: Incorrect punctuation (for locale or style).
  – Example:
    [EN] Original copy of the Proof of Purchase or Invoice (not a screenshot):
    [PT] Cópia original do comprovante de compra ou nota fiscal (não uma captura de tela)[.]PUNCTUATION
    [Reason]: There’s a period instead of a colon in the Brazilian Portuguese version of this sentence.
• Spelling
  – Description: Incorrect spelling or capitalization.
  – Example:
    [EN] This sort of damage is not covered under the warranty, but we will seek assistance from a higher support and see what we can do regarding this issue.
    [IT] Questo tipo di danno non è coperto dalla garanzia, ma chiederò comunque aiuto ai responsabili dell’assistenza per capire che cosa [zi]SPELLING può fare per quanto riguarda questo problema.
    [Reason]: There’s a typo in the sentence in Italian: the word [zi] should be [si] instead.
• Register
  – Description: Wrong grammatical register (e.g., inappropriately informal pronouns).
  – Example:
    [EN] Wishing you a great day ahead.
    [DE] Ich wünsche [Ihnen]REGISTER einen schönen Tag.
    [Reason]: The required register for the German translation is Informal but the pronoun [Ihnen] is Formal.
• Inconsistent Style
  – Description: Internal inconsistency (not related to terminology).
  – Example:
    [EN] Please click on this link. […] This link will expire in 24 hours.
    [NN] Klikk på denne [lenken]. […] Denne [linken]INCONSISTENCY utløper om 24 timer.
    [Reason]: Both [lenk] and [link] are correct in Norwegian, but in the same document, only one should be used. Note: this is a single error, not two.
• Unnatural Flow
  – Description: Translations that are too literal or sound unnatural.
  – Example:
    [EN] Zebras are ideal for animal matching.
    [DE] [Zebras sind ideal, um bestimmte Tiere zu finden]UNNATURAL FLOW.
    [Reason]: The German translation is too literal and reads like a translation: it uses the verb [finden] (finding) as a translation for “matching”. The verb should be translated as [detektieren] (detect) so that the sentence reads as if it was originally written in the target language: [Zebras sind ein ideales Beispiel zur Detektion von Wildtieren.]

Other

• Non-translation
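
The taxonomy above can be written down as a small lookup table, which makes it straightforward to tally annotated errors per top-level category. This is a sketch under assumptions: the dictionary layout, the `tally` helper, and the input format (a plain list of error-type names) are hypothetical and are not part of the annotation system described in this paper.

```python
from collections import Counter

# Top-level categories and their error types, as listed in Section C.2.
TAXONOMY = {
    "Accuracy": ["Mistranslation", "Addition", "MT Hallucination",
                 "Omission", "Untranslated", "Wrong Name Entity & Term"],
    "Fluency": ["Grammar", "Punctuation", "Spelling", "Register",
                "Inconsistent Style", "Unnatural Flow"],
    "Other": ["Non-translation"],
}

# Reverse lookup: error type -> top-level category.
CATEGORY_OF = {etype: cat for cat, etypes in TAXONOMY.items() for etype in etypes}

def tally(error_types):
    """Count errors per category; raises KeyError for labels outside the taxonomy."""
    return Counter(CATEGORY_OF[e] for e in error_types)

# Hypothetical annotations for one translated segment.
print(tally(["Grammar", "Omission", "Spelling"]))
```

Rejecting unknown labels with a `KeyError` is a deliberate choice in this sketch, so that a misspelled error type surfaces immediately instead of being silently dropped from the counts.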

Appendix D Extra Details

D.1 Translation Prompt in Preliminary Study

In two experiments, the translation prompt we use is as follows:

• Please translate the following sentences from <SRC_LANG> to <TGT_LANG>. Ensure line alignment across the document while maintaining the fluency of overall translation.

The prompt asks GPT-4 to maintain the sentence alignment of the given document, so that each translated sentence can be aligned back to its source sentence while the translation is still carried out at the document level. In practice, we find that GPT-4 follows our instructions most of the time. Occasionally, it fails to keep the sentence structure of the document and merges several sentences into one line. In these cases, we manually split the merged sentences.
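
The alignment requirement can be checked automatically before any manual repair. The sketch below is illustrative only: the helper names `build_prompt` and `is_line_aligned` are hypothetical, and the way the source document is appended to the instruction (one sentence per line after a blank line) is an assumption, since the exact input layout is not specified here.

```python
PROMPT_TEMPLATE = (
    "Please translate the following sentences from <SRC_LANG> to <TGT_LANG>. "
    "Ensure line alignment across the document while maintaining the fluency "
    "of overall translation."
)

def build_prompt(src_lang, tgt_lang, source_lines):
    """Fill in the language placeholders and append the document, one sentence per line."""
    header = PROMPT_TEMPLATE.replace("<SRC_LANG>", src_lang).replace("<TGT_LANG>", tgt_lang)
    # Assumed layout: instruction, blank line, then the source sentences.
    return header + "\n\n" + "\n".join(source_lines)

def is_line_aligned(source_lines, output):
    """True if the output has exactly one non-empty line per source sentence."""
    output_lines = [line for line in output.splitlines() if line.strip()]
    return len(output_lines) == len(source_lines)

source = ["Thank you very much.", "Hi, Mary here."]
print(build_prompt("English", "Spanish", source))
print(is_line_aligned(source, "Muchas gracias.\nHola, soy Mary."))
```

A line-count check of this kind only flags documents in which sentences were merged or split; deciding where to split a merged line still requires manual inspection, as described above.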

D.2 Model and Decoding

For GPT-4, we use greedy search for decoding, to ensure the reproducibility of the results. For SeamlessM4T, we use the 2.3B version of seamlessM4T_v2_large and adopt beam search with beam size 5.
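
To illustrate the difference between the two decoding strategies, the following toy Python sketch contrasts greedy search with beam search. It does not call GPT-4 or SeamlessM4T: `TOY_MODEL`, its vocabulary, and its probabilities are invented for illustration, and the functions are simplified versions of the general algorithms rather than either system's implementation.

```python
import math

# Toy next-token distributions keyed by the prefix generated so far.
# All tokens and probabilities are invented for illustration only.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def greedy_search(model, steps):
    """Pick the single most probable token at every step (deterministic)."""
    prefix, logp = (), 0.0
    for _ in range(steps):
        token, p = max(model[prefix].items(), key=lambda kv: kv[1])
        prefix, logp = prefix + (token,), logp + math.log(p)
    return prefix, logp

def beam_search(model, steps, beam_size):
    """Keep the beam_size highest-scoring prefixes at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (prefix + (token,), logp + math.log(p))
            for prefix, logp in beams
            for token, p in model[prefix].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0]

greedy_seq, greedy_logp = greedy_search(TOY_MODEL, steps=2)
beam_seq, beam_logp = beam_search(TOY_MODEL, steps=2, beam_size=5)
print(greedy_seq, round(math.exp(greedy_logp), 2))
print(beam_seq, round(math.exp(beam_logp), 2))
```

In this toy model, greedy search commits to the locally best first token and ends with a less probable sequence than beam search finds. Greedy search is equivalent to beam search with a beam size of 1, and both are deterministic, which is why greedy decoding supports reproducibility.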