{{short description|Approach to machine translation using artificial neural networks}}
'''Neural machine translation''' ('''NMT''') is an approach to [[machine translation]] that uses an [[artificial neural network]] to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model.
 
It is the dominant approach today{{r|Koehn2020|p=293}}{{r|Stahlberg2020|p=1}} and can produce translations that rival human quality when translating between high-resource languages under specific conditions.{{r|Popel2020}} However, challenges remain, especially with languages for which less high-quality data is available,{{r|Haddow2022}}{{r|Poibeau2022}}{{r|Koehn2020|p=293}} and with [[Domain adaptation#Domain shift|domain shift]] between the data a system was trained on and the texts it is supposed to translate.{{r|Koehn2020|p=293}} NMT systems also tend to produce fairly literal translations.{{r|Poibeau2022}}
==Properties==
 
NMT models require only a fraction of the memory needed by traditional [[statistical machine translation]] (SMT) models. Furthermore, unlike conventional translation systems, all parts of the neural translation model are trained jointly (end-to-end) to maximize translation performance.{{r|KalchbrennerBlunsom2013}}{{r|Sutskever2014}}{{r|Cho2014Properties}}
==Overview==
 
In the translation task, a sentence <math>\mathbf{x} = x_{1,I}</math> (consisting of <math>I</math> tokens <math>x_i</math>) in the source language is to be translated into a sentence <math>\mathbf{y} = y_{1,J}</math> (consisting of <math>J</math> tokens <math>y_j</math>) in the target language. The source and target tokens (which in the simplest case correspond to words) are embedded as vectors so that they can be processed mathematically.
 
NMT models assign a probability <math>P(\mathbf{y}|\mathbf{x})</math>{{r|Stahlberg2020|p=5}}{{r|Tan2020|p=1}} to each potential translation <math>\mathbf{y}</math> and then search a subset of potential translations for the one with the highest probability. Most NMT models are ''auto-regressive'': they model the probability of each target token as a function of the source sentence and the previously predicted target tokens. The probability of the whole translation is then the product of the probabilities of the individual predicted tokens:{{r|Stahlberg2020|p=5}}{{r|Tan2020|p=2}}
 
<math display=block>P(\mathbf{y}|\mathbf{x}) = \prod_{j=1}^{J} P(y_j | y_{1,j-1}, \mathbf{x})</math>
 
NMT models differ in how exactly they model this function <math>P</math>, but most use some variation of the ''encoder-decoder'' architecture:{{r|Tan2020|p=2}}{{r|Goodfellow2013|p=469}} They first use an encoder network to process <math>\mathbf{x}</math> and encode it into a vector or matrix representation of the source sentence. Then they use a decoder network that usually produces one target word at a time, taking into account the source representation and the tokens it has previously produced. As soon as the decoder produces a special ''end of sentence'' token, the decoding process is finished. Since the decoder refers to its own previous outputs during decoding, this way of decoding is called ''auto-regressive''.
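
For illustration, the following Python sketch implements this greedy auto-regressive decoding loop. The functions <code>encode</code> and <code>decode_step</code> are hypothetical stand-ins for a trained encoder and decoder network, not part of any particular NMT system, and real systems typically use [[beam search]] rather than the purely greedy choice shown here.

<syntaxhighlight lang="python">
# Minimal sketch of greedy auto-regressive decoding with an encoder-decoder model.
# `encode` and `decode_step` are hypothetical stand-ins for trained networks.

def greedy_translate(source_tokens, encode, decode_step, eos="</s>", max_len=100):
    """Translate a tokenized source sentence one target token at a time."""
    source_repr = encode(source_tokens)        # vector/matrix representation of the source
    target_tokens = []
    for _ in range(max_len):
        # The decoder sees the source representation and all previously produced tokens.
        probs = decode_step(source_repr, target_tokens)  # dict: token -> P(token | previous tokens, source)
        next_token = max(probs, key=probs.get)           # pick the most probable token
        if next_token == eos:                            # end-of-sentence token stops decoding
            break
        target_tokens.append(next_token)
    return target_tokens
</syntaxhighlight>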
 
==History==
 
===Early approaches===
In 1987, Robert B. Allen demonstrated the use of [[feedforward neural network|feed-forward neural networks]] for translating auto-generated English sentences with a limited vocabulary of 31 words into Spanish. In this experiment, the size of the network's input and output layers was chosen to be just large enough for the longest sentences in the source and target language, respectively, because the network did not have any mechanism to encode sequences of arbitrary length into a fixed-size representation. In his summary, Allen also already hinted at the possibility of using auto-associative models, one for encoding the source and one for decoding the target.{{r|Allen1987}}
 
Lonnie Chrisman built upon Allen's work in 1991 by training separate [[recursive auto-associative memory]] (RAAM) networks (developed by [[Jordan Pollack|Jordan B. Pollack]]{{r|Pollack1990}}) for the source and the target language. Each of the RAAM networks is trained to encode an arbitrary-length sentence into a fixed-size hidden representation and to decode the original sentence again from that representation. Additionally, the two networks are also trained to share their hidden representation; this way, the source encoder can produce a representation that the target decoder can decode.{{r|Chrisman1991}} Forcada and Ñeco simplified this procedure in 1997 to directly train a source encoder and a target decoder in what they called a ''recursive hetero-associative memory''.{{r|Forcada1997}}
 
Also in 1997, Castaño and Casacuberta employed an [[Elman network|Elman recurrent neural network]] in another machine translation task with very limited vocabulary and complexity.{{r|Castano1997a}}{{r|Castano1997b}}
 
Even though these early approaches were already similar to modern NMT, the computing resources of the time were not sufficient to process datasets large enough for the computational complexity of the machine translation problem on real-world texts.{{r|Koehn2020|p=39}}{{r|Yang2020|p=2}} Instead, other methods like [[statistical machine translation]] rose to become the state of the art of the 1990s and 2000s.
 
===Hybrid approaches===
 
During the time when statistical machine translation was prevalent, some works used neural methods to replace various parts in the statistical machine translation while still using the log-linear approach to tie them together.{{r|Koehn2020|p=39}}{{r|Stahlberg2020|p=1}} For example, in various works together with other researchers, Holger Schwenk replaced the usual [[n-gram language model]] with a [[neural language model|neural one]]{{r|Schwenk2006}}{{r|Schwenk2007}} and estimated phrase translation probabilities using a feed-forward network.{{r|Schwenk2012}}
===seq2seq===
{{Main|seq2seq}}
In 2013 and 2014, end-to-end neural machine translation had its breakthrough, with Kalchbrenner & Blunsom using a [[convolutional neural network]] (CNN) for encoding the source{{r|KalchbrennerBlunsom2013}} and both Cho et al. and Sutskever et al. using a [[recurrent neural network]] (RNN) instead.{{r|Cho2014EncDec}}{{r|Sutskever2014}} All three used an RNN conditioned on a fixed encoding of the source as their decoder to produce the translation. However, these models performed poorly on longer sentences.{{r|Cho2014Properties|p=107}}{{r|Koehn2020|p=39}}{{r|Stahlberg2020|p=7}} This problem was addressed when Bahdanau et al. introduced [[attention (machine learning)|attention]] to their encoder-decoder architecture: at each decoding step, the state of the decoder is used to calculate a source representation that focuses on different parts of the source, and that representation is then used in the calculation of the probabilities for the next token.{{r|Bahdanau2015}} Coverage models were later proposed to address weaknesses of such attention mechanisms, such as the neglect of past alignment information, which can lead to over-translation and under-translation.<ref>{{Cite arXiv|eprint=1601.04811|class=cs.CL|first1=Zhaopeng|last1=Tu|first2=Zhengdong|last2=Lu|title=Modeling Coverage for Neural Machine Translation|last3=Liu|first3=Yang|last4=Liu|first4=Xiaohua|last5=Li|first5=Hang|year=2016}}</ref> Based on these RNN-based architectures, [[Baidu]] launched the "first large-scale NMT system"{{r|Wang2022|p=144}} in 2015, followed by [[Google Neural Machine Translation]] in 2016.{{r|Wang2022|p=144}}{{r|Wu2016}} Neural models also quickly became the prevailing choice at the main machine translation conference, the Workshop on Statistical Machine Translation (WMT): WMT'15 had its first NMT contender, and the following year, 90% of the winning systems were already neural.{{r|WMT2016}}
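
The context-vector computation at the heart of such an attention mechanism can be illustrated with the following NumPy sketch, which scores each encoder state against the current decoder state and returns their weighted average. It uses a simple dot-product score for brevity, whereas Bahdanau et al. compute the score with a small feed-forward network; the arrays are placeholders rather than outputs of a real model.

<syntaxhighlight lang="python">
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Weight the encoder states by their relevance to the current decoder state.

    decoder_state:  vector of shape (d,)
    encoder_states: matrix of shape (I, d), one row per source token
    Returns the context vector of shape (d,) and the attention weights of shape (I,).
    """
    scores = encoder_states @ decoder_state      # one relevance score per source token
    weights = np.exp(scores - scores.max())      # softmax over the source positions
    weights /= weights.sum()
    context = weights @ encoder_states           # weighted average of the encoder states
    return context, weights

# Toy example: 4 source tokens, hidden size 3
encoder_states = np.random.rand(4, 3)
decoder_state = np.random.rand(3)
context, weights = attention_context(decoder_state, encoder_states)
</syntaxhighlight>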
 
Gehring et al. combined a CNN encoder with an attention mechanism in 2017, which handled long-range dependencies in the source better than previous approaches and also increased translation speed because a CNN encoder is parallelizable, whereas an [[RNN encoder]] has to encode one token at a time due to its recurrent nature.{{r|Gehring2017|p=230}} In the same year, Microsoft Translator released AI-powered online neural machine translation.<ref>{{Cite web |last=Translator |first=Microsoft |date=2018-04-18 |title=Microsoft brings AI-powered translation to end users and developers, whether you're online or offline |url=https://www.microsoft.com/en-us/translator/blog/2018/04/18/microsoft-brings-ai-powered-translation-to-end-users-and-developers-whether-youre-online-or-offline/ |access-date=2024-04-19 |website=Microsoft Translator Blog |language=en-US}}</ref> [[DeepL Translator]], which was at the time based on a [[CNN encoder]], was also released in the same year and was judged by several news outlets to outperform its competitors.{{r|DeepLTechCrunch}}{{r|DeepLLeMonde}}{{r|DeepLGolem}} Other established machine translation systems, such as SYSTRAN, have likewise integrated neural networks into their operations, and [[OpenAI]]'s [[GPT-3]], released in 2020, has also been shown to function as a neural machine translation system.
=== Transformer ===
{{Main|Transformer (deep learning architecture)}}
Another network architecture that lends itself to parallelization is the [[Transformer (machine learning model)|transformer]], which was introduced by Vaswani et al. also in 2017.{{r|Vaswani2017}} Like previous models, the transformer still uses the attention mechanism for weighting encoder output for the decoding steps. However, the transformer's encoder and decoder networks themselves are also based on attention instead of recurrence or convolution: Each layer weights and transforms the previous layer's output in a process called ''self-attention''. Since the attention mechanism does not have any notion of token order, but the order of words in a sentence is obviously relevant, the token embeddings are combined with an [[Transformer_(machine_learning_model)#Positional_encoding|explicit encoding of their position in the sentence]].{{r|Stahlberg2020|p=15}}{{r|Tan2020|p=7}} Since both the transformer's encoder and decoder are free from recurrent elements, they can both be parallelized during training. However, the original transformer's decoder is still auto-regressive, which means that decoding still has to be done one token at a time during inference.
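
As an illustration, the following NumPy sketch implements a sinusoidal positional encoding and a single head of scaled dot-product self-attention. The random projection matrices stand in for learned parameters; this is a simplified sketch, not an excerpt from an actual transformer implementation.

<syntaxhighlight lang="python">
import numpy as np

def positional_encoding(length, d_model):
    """Sinusoidal position encodings as described by Vaswani et al. (2017)."""
    pos = np.arange(length)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X of shape (length, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # how strongly each position attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the sequence positions
    return weights @ V                                 # each output is a weighted mix of all positions

length, d_model = 5, 8
X = np.random.rand(length, d_model) + positional_encoding(length, d_model)  # token embeddings + positions
Wq, Wk, Wv = (np.random.rand(d_model, d_model) for _ in range(3))           # placeholder projections
output = self_attention(X, Wq, Wk, Wv)
</syntaxhighlight>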
 
The transformer model quickly became the dominant choice for machine translation systems{{r|Stahlberg2020|p=44}}<ref name="WMT2019">{{Cite journal |last1=Barrault |first1=Loïc |last2=Bojar |first2=Ondřej |last3=Costa-jussà |first3=Marta R. |last4=Federmann |first4=Christian |last5=Fishel |first5=Mark |last6=Graham |first6=Yvette |last7=Haddow |first7=Barry |last8=Huck |first8=Matthias |last9=Koehn |first9=Philipp |last10=Malmasi |first10=Shervin |last11=Monz |first11=Christof |date=August 2019 |title=Findings of the 2019 Conference on Machine Translation (WMT19) |url=https://www.aclweb.org/anthology/W19-5301 |journal=Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1) |location=Florence, Italy |publisher=Association for Computational Linguistics |pages=1–61 |doi=10.18653/v1/W19-5301 |doi-access=free}}</ref> and was still by far the most-used architecture in the Workshop on Statistical Machine Translation in 2022 and 2023.{{r|WMT2022|p=35–40}}{{r|WMT2023|p=28–31}} Because self-attention is conceptually simpler than the gating mechanisms employed by RNNs, it has also enabled researchers to develop high-quality translation models even in low-resource settings.<ref name="sicilian">{{cite arXiv |last=Wdowiak |first=Eryk |eprint=2110.01938 |title=Sicilian Translator: A Recipe for Low-Resource NMT |class=cs.CL |date=2021-09-27}}</ref>
 
Usually, NMT models’ weights are initialized randomly and then learned by training on parallel datasets. However, since using [[Large language model|large language models]] (LLMs) such as [[BERT (language model)|BERT]] pre-trained on large amounts of monolingual data as [[Fine-tuning (deep learning)|a starting point for learning other tasks]] has proven very successful in wider [[Natural language processing|NLP]], this paradigm is also becoming more prevalent in NMT. This is especially useful for low-resource languages, where large parallel datasets do not exist.{{r|Haddow2022|p=689–690}} An example of this is the mBART model, which first trains one transformer on a multilingual dataset to recover masked tokens in sentences, and then fine-tunes the resulting [[autoencoder]] on the translation task.{{r|Liu2020}}
 
====Generative LLMs====
 
Instead of fine-tuning a pre-trained language model on the translation task, sufficiently large [[Generative model#Deep generative models|generative models]] can also be directly prompted to translate a sentence into the desired language. This approach was first comprehensively tested and evaluated for [[GPT-3.5]] in 2023 by Hendy et al. They found that "GPT systems can produce highly fluent and competitive translation outputs even in the [[Zero-shot learning|zero-shot]] setting especially for the high-resource language translations".{{r|Hendy2023|p=22}} The WMT23 shared task evaluated the same approach (but using [[GPT-4]]) and found that it was on par with the state of the art when translating into English, but not quite when translating into lower-resource languages.{{r|WMT2023|p=16–17}} This is plausible considering that GPT models are trained mainly on English text.{{r|GPT3LanguagesByCharacterCount2020}}
 
==Comparison with statistical machine translation==
 
NMT models use [[deep learning]] and [[representation learning]]. They depart from phrase-based statistical approaches that use separately engineered subcomponents,<ref name="Medical" /> yet NMT is not a drastic step beyond what had traditionally been done in statistical machine translation (SMT). Its main departure is the use of vector representations ("embeddings", "continuous space representations") for words and internal states. The structure of the models is simpler than that of phrase-based models: there is no separate language model, translation model, and reordering model, but just a single sequence model that predicts one word at a time. However, this sequence prediction is conditioned on the entire source sentence and the entire already produced target sequence.

NMT has overcome several challenges that were present in SMT:
 
* NMT's full reliance on continuous representation of tokens overcame sparsity issues caused by rare words or phrases. Models were able to generalize more effectively.{{r|KalchbrennerBlunsom2013|p=1}}{{r|Russell2020|p=900–901}}
* The limited n-gram length used in SMT's n-gram language models caused a loss of context. NMT systems overcome this by not having a hard cut-off after a fixed number of tokens and by using attention to choose which tokens to focus on when generating the next token.{{r|Russell2020|p=900–901}}
* End-to-end training of a single model improved translation performance and also simplified the whole process.{{Citation needed|date=December 2023}}
* The huge n-gram models (up to 7-gram) used in SMT required large amounts of memory,{{r|Federico2007|p=88}} whereas NMT requires less.
 
==Training procedure==
 
===Cross-entropy loss===
NMT models are usually trained to maximize the likelihood of observing the training data. That is, for a dataset of <math>T</math> source sentences <math>X = \mathbf{x}^{(1)}, ..., \mathbf{x}^{(T)}</math> and corresponding target sentences <math>Y = \mathbf{y}^{(1)}, ..., \mathbf{y}^{(T)}</math>, the goal is to find the model parameters <math>\theta^*</math> that maximize the product of the likelihoods of the individual target sentences given their corresponding source sentences:
 
<math display=block>\theta^* = \underset{\theta}{\operatorname{arg\,max}} \prod_i^T P_{\theta}(\mathbf{y}^{(i)}|\mathbf{x}^{(i)})</math>
 
Expanding to token level yields:
 
<math display=block>\theta^* = \underset{\theta}{\operatorname{arg\,max}} \prod_i^T \prod_{j=1}^{J^{(i)}} P(y_j^{(i)} | y_{1,j-1}^{(i)}, \mathbf{x}^{(i)})</math>
 
Since we are only interested in the maximum, we can just as well search for the maximum of the logarithm instead (which has the advantage that it avoids [[arithmetic underflow|floating point underflow]] that could happen with the product of low probabilities).
 
<math display=block>\theta^* = \underset{\theta}{\operatorname{arg\,max}} \sum_i^T \log\prod_{j=1}^{J^{(i)}} P(y_j^{(i)} | y_{1,j-1}^{(i)}, \mathbf{x}^{(i)})</math>
 
Using the fact that [[List of logarithmic identities#Logarithm of a product|the logarithm of a product is the sum of the factors’ logarithms]] and flipping the sign yields the classic [[Cross-entropy#Cross-entropy loss function and logistic regression|cross-entropy loss]]:
 
<math display=block>\theta^* = \underset{\theta}{\operatorname{arg\,min}} - \sum_i^T \sum_{j=1}^{J^{(i)}} \log P(y_j^{(i)} | y_{1,j-1}^{(i)}, \mathbf{x}^{(i)})</math>
 
In practice, this minimization is done iteratively on small subsets (mini-batches) of the training set using [[stochastic gradient descent]].
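
In code, this loss for a single training pair amounts to summing the negative log-probabilities of the reference tokens. The following NumPy sketch computes that quantity from hypothetical per-step output distributions; in a real system, these distributions come from the model and the loss is backpropagated to update the parameters.

<syntaxhighlight lang="python">
import numpy as np

def sentence_cross_entropy(step_probs, reference_ids):
    """Negative log-likelihood of one reference translation.

    step_probs:    array of shape (J, vocab_size); row j holds the model's
                   distribution P(y_j | previous reference tokens, source).
    reference_ids: length-J list of the reference target token ids.
    """
    loss = 0.0
    for j, token_id in enumerate(reference_ids):
        loss -= np.log(step_probs[j, token_id])   # -log P(y_j | ...)
    return loss

# Toy example: vocabulary of 6 tokens, reference translation of length 3
probs = np.full((3, 6), 0.1)
probs[[0, 1, 2], [4, 2, 5]] = 0.5                  # the model puts mass 0.5 on each reference token
print(sentence_cross_entropy(probs, [4, 2, 5]))    # 3 * -log(0.5) ≈ 2.08
</syntaxhighlight>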
 
===Teacher forcing===
{{Main|Teacher forcing}}
 
During inference, auto-regressive decoders use the token generated in the previous step as the input token. However, the vocabulary of target tokens is usually very large, so at the beginning of the training phase, untrained models will almost always pick the wrong token; subsequent steps would then have to work with wrong input tokens, which would slow down training considerably. Instead, ''teacher forcing'' is used during the training phase: the model (the “student” in the teacher forcing metaphor) is always fed the previous ground-truth tokens as input for the next token, regardless of what it predicted in the previous step.
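
As a minimal illustration, the following Python sketch shows how decoder inputs are constructed under teacher forcing; the helper function and the toy German reference translation are only examples, not part of any specific training framework.

<syntaxhighlight lang="python">
def decoder_inputs_teacher_forcing(reference_tokens, bos="<s>"):
    """Decoder inputs during training: always the ground-truth prefix of the reference."""
    # To predict reference token j, the decoder is fed tokens 1..j-1 of the reference,
    # shifted right and preceded by a beginning-of-sentence token.
    return [bos] + reference_tokens[:-1]

# Toy example: the reference translation "das Wetter ist schön </s>"
reference = ["das", "Wetter", "ist", "schön", "</s>"]
print(decoder_inputs_teacher_forcing(reference))
# ['<s>', 'das', 'Wetter', 'ist', 'schön'] -- even if the model mispredicts a token,
# the next step still receives the correct ground-truth token as input.
</syntaxhighlight>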
 
==Translation by prompt engineering LLMs==
 
As outlined in the history section above, instead of using an NMT system that is trained on parallel text, one can also prompt a generative LLM to translate a text. These models differ from an encoder-decoder NMT system in a number of ways:{{r|Hendy2023|p=1}}
 
* Generative language models are not trained on the translation task, let alone on a parallel dataset. Instead, they are trained on a language modeling objective, such as predicting the next word in a sequence drawn from a large dataset of text. This dataset can contain documents in many languages, but is in practice dominated by English text.{{r|GPT3LanguagesByCharacterCount2020}} After this pre-training, they are [[Large language model#Training and architecture|fine-tuned on another task]], usually to follow instructions.{{r|Radford2018}}
* Since they are not trained on translation, they also do not feature an encoder-decoder architecture. Instead, they just consist of a transformer's decoder.
* In order to be competitive on the machine translation task, LLMs need to be much larger than other NMT systems. E.g., GPT-3 has 175 billion parameters,{{r|Brown2020|p=5}} while mBART has 680 million{{r|Liu2020|p=727}} and the original transformer-big has “only” 213 million.{{r|Vaswani2017|p=9}} This means that they are computationally more expensive to train and use.
 
A generative LLM can be prompted in a [[Zero-shot learning|zero-shot]] fashion by just asking it to translate a text into another language without giving any further examples in the prompt. Alternatively, one can include one or several example translations in the prompt before asking it to translate the text in question. This is then called [[Few-shot learning (natural language processing)|one-shot or few-shot learning]], respectively. For example, the following prompts were used by Hendy et al. (2023) for zero-shot and one-shot translation:{{r|Hendy2023}}
 
<pre>### Translate this sentence from [source language] to [target language], Source:
[source sentence]
### Target:
</pre>
 
<pre>Translate this into 1. [target language]:
[shot 1 source]
1. [shot 1 reference]
Translate this into 1. [target language]:
[input]
1.</pre>
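
Such prompts can also be assembled programmatically, as in the following Python sketch, which fills the one-shot template above; the helper function and the example sentences are illustrative placeholders, not taken from Hendy et al.

<syntaxhighlight lang="python">
def one_shot_prompt(target_language, shot_source, shot_reference, input_sentence):
    """Build a one-shot translation prompt in the format shown above."""
    return (
        f"Translate this into 1. {target_language}:\n"
        f"{shot_source}\n"
        f"1. {shot_reference}\n"
        f"Translate this into 1. {target_language}:\n"
        f"{input_sentence}\n"
        "1."
    )

print(one_shot_prompt("German", "The weather is nice today.",
                      "Das Wetter ist heute schön.", "Where is the train station?"))
</syntaxhighlight>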
 
==Applications==
One application of NMT is low-resource machine translation, where only a small amount of data and few examples are available for training. One such use case is ancient languages like [[Akkadian language|Akkadian]] and its dialects, Babylonian and Assyrian.<ref>{{Cite journal |last=Gutherz |first=Gai |last2=Gordin |first2=Shai |last3=Sáenz |first3=Luis |last4=Levy |first4=Omer |last5=Berant |first5=Jonathan |date=2023-05-02 |editor-last=Kearns |editor-first=Michael |title=Translating Akkadian to English with neural machine translation |url=https://academic.oup.com/pnasnexus/article/doi/10.1093/pnasnexus/pgad096/7147349 |journal=PNAS Nexus |language=en |volume=2 |issue=5 |doi=10.1093/pnasnexus/pgad096 |issn=2752-6542 |pmc=10153418 |pmid=37143863}}</ref>

Since 2017, neural machine translation has also been used by the European Patent Office to make information from the global patent system instantly accessible.<ref name="vid"/> The system, developed in collaboration with [[Google]], covers 31 languages, and as of 2018, it had translated over nine million documents.<ref name="vid"/>

==Problems==
The most common translation problem found in NMT output is a lack of cohesion between sentences. The same term is often translated with a different word in adjacent sentences, causing the reader to wonder whether the same concept is being mentioned. Other problems include translating very similar terms as the same term (e.g., in computer security terminology, rendering all three of German ''Zutrittskontrolle'', ''Zugangskontrolle'', and ''Zugriffskontrolle'' simply as ''access control'', although they are distinct concepts that should be translated as ''physical access control'', ''network access control'', and ''data access control'') and translating proper names as common nouns due to the capitalization of nouns in German.

==Literature==
 
* [[Philipp Koehn|Koehn, Philipp]] (2020). [http://www2.statmt.org/nmt-book/ Neural Machine Translation.] Cambridge University Press.
* Stahlberg, Felix (2020). [https://arxiv.org/abs/1912.02047v2 Neural Machine Translation: A Review and Survey.]
 
==See also==
 
* [[Attention (machine learning)]]
* [[Transformer (machine learning model)]]
* [[Seq2seq]]
 
==References==
{{reflist|refs=
<ref name="Goodfellow2013">{{cite book |last1=Goodfellow |first1=Ian |last2=Bengio |first2=Yoshua |last3=Courville |first3=Aaron |title=Deep Learning |date=2016 |publisher=MIT Press |url=https://www.deeplearningbook.org/ |access-date=2022-12-29 |chapter=12.4.5 Neural Machine Translation |chapter-url=https://www.deeplearningbook.org/contents/applications.html |pages=468–471}}</ref>
<ref name="Koehn2020">{{cite book |last=Koehn |first=Philipp |date=2020 |title=Neural Machine Translation |publisher=Cambridge University Press |url=http://www2.statmt.org/nmt-book/}}</ref>
<ref name="Medical">{{cite journal |last1=Wołk |first1=Krzysztof |last2=Marasek |first2=Krzysztof |title=Neural-based Machine Translation for Medical Text Domain. Based on European Medicines Agency Leaflet Texts |year=2015 |journal=Procedia Computer Science |volume=64 |issue=64 |pages=2–9 |doi=10.1016/j.procs.2015.08.456|bibcode=2015arXiv150908644W |arxiv=1509.08644|s2cid=15218663 }}</ref>
<ref name="Stahlberg2020">{{cite arXiv |last=Stahlberg |first=Felix |date=2020-09-29 |title=Neural Machine Translation: A Review and Survey |eprint=1912.02047v2 |class=cs.CL}}</ref>
<ref name="Tan2020">{{cite arXiv |last1=Tan |first1=Zhixing |last2=Wang |first2=Shuo |last3=Yang |first3=Zonghan |last4=Chen |first4=Gang |last5=Huang |first5=Xuancheng |last6=Sun |first6=Maosong |last7=Liu |first7=Yang |date=2020-12-31 |title=Neural Machine Translation: A Review of Methods, Resources, and Tools |eprint=2012.15515 |class=cs.CL}}</ref>
<ref name="Yang2020">{{cite arXiv |last1=Yang |first1=Shuoheng |last2=Wang |first2=Yuxin |last3=Chu |first3=Xiaowen |title=A Survey of Deep Learning Techniques for Neural Machine Translation |date=2020-02-18 |eprint=2002.07526 |class=cs.CL}}</ref>
<ref name="Russell2020">{{cite book |last1=Russell |first1=Stuart |last2=Norvig |first2=Peter |title=Artificial Intelligence: A Modern Approach |edition=4th, global |publisher=Pearson |url=http://aima.cs.berkeley.edu/global-index.html}}</ref>
<ref name="Wang2022">{{cite journal |last1=Wang |first1=Haifeng |last2=Wu |first2=Hua |last3=He |first3=Zhongjun |last4=Huang |first4=Liang |last5=Church |first5=Kenneth Ward |date=2022-11-01 |title=Progress in Machine Translation |url=https://www.sciencedirect.com/science/article/pii/S2095809921002745 |journal=Engineering |language=en |volume=18 |pages=143–153 |doi=10.1016/j.eng.2021.03.023}}</ref>
<ref name="Haddow2022">{{cite journal |last1=Haddow |first1=Barry |last2=Bawden |first2=Rachel |last3=Miceli Barone |first3=Antonio Valerio |last4=Helcl |first4=Jindřich |last5=Birch |first5=Alexandra |date=2022 |title=Survey of Low-Resource Machine Translation |url=https://aclanthology.org/2022.cl-3.6 |journal=Computational Linguistics |volume=48 |issue=3 |pages=673–732 |doi=10.1162/coli_a_00446|arxiv=2109.00486 }}</ref>
 
<ref name="vid">{{cite web|url=https://www.youtube.com/watch?v=-ZVplhqhyYM|title=Neural Machine Translation|date=16 July 2018|access-date=14 June 2021|publisher=European Patent Office}}</ref>
<ref name="Allen1987">{{cite conference |last=Allen |first=Robert B. |date=1987 |title=Several Studies on Natural Language and Back-Propagation |url=https://www.researchgate.net/publication/243614356 |access-date=2022-12-30 |conference=IEEE First International Conference on Neural Networks |location=San Diego |volume=2 |pages=335–341}}</ref>
<ref name="Chrisman1991">{{cite journal |last=Chrisman |first=Lonnie |date=1991 |title=Learning Recursive Distributed Representations for Holistic Computation |journal=Connection Science |volume=3 |issue=4 |pages=345–366 |doi=10.1080/09540099108946592 |issn=0954-0091 |url=https://figshare.com/articles/journal_contribution/6606899}}</ref>
<ref name="Pollack1990">{{cite journal |last=Pollack |first=Jordan B. |date=1990 |title=Recursive distributed representations |url=https://dx.doi.org/10.1016/0004-3702%2890%2990005-K |journal=Artificial Intelligence |volume=46 |issue=1 |pages=77–105 |doi=10.1016/0004-3702(90)90005-K}}</ref>
<ref name="Forcada1997">{{cite book |last1=Forcada |first1=Mikel L. |last2=Ñeco |first2=Ramón P. |date=1997 |chapter=Recursive hetero-associative memories for translation |title=Biological and Artificial Computation: From Neuroscience to Technology |series=Lecture Notes in Computer Science |volume=1240 |pages=453–462|doi=10.1007/BFb0032504 |isbn=978-3-540-63047-0 }}</ref>
<ref name="Castano1997a">{{cite conference |last1=Castaño |first1=Asunción |last2=Casacuberta |first2=Francisco |date=1997 |title=A connectionist approach to machine translation |url=https://www.isca-speech.org/archive/eurospeech_1997/castano97_eurospeech.html |conference=5th European Conference on Speech Communication and Technology (Eurospeech 1997) |pages=91–94 |location= Rhodes, Greece |doi=10.21437/Eurospeech.1997-50}}</ref>
<ref name="Castano1997b">{{cite conference |last1=Castaño |first1=Asunción |last2=Casacuberta |first2=Francisco |last3=Vidal |first3=Enrique |date=1997-07-23 |title=Machine translation using neural networks and finite-state models |url=https://aclanthology.org/1997.tmi-1.19 |conference=Proceedings of the 7th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages |location=St John's College, Santa Fe}}</ref>
 
<ref name="Schwenk2006">{{cite conference |last1=Schwenk |first1=Holger |last2=Dechelotte |first2=Daniel |last3=Gauvain |first3=Jean-Luc |date=2006 |title=Continuous Space Language Models for Statistical Machine Translation |conference=Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions |location=Sydney, Australia |pages=723–730 |url=https://aclanthology.org/P06-2093}}</ref>
<ref name="Schwenk2007">{{cite journal |last=Schwenk |first=Holger |date=2007 |title=Continuous space language models |journal=Computer Speech and Language |volume=3 |issue=21 |pages=492–518 |doi=10.1016/j.csl.2006.09.003}}</ref>
<ref name="Federico2007">{{cite journal |last1=Federico |first1=Marcello |last2=Cettolo |first2=Mauro |date=2007 |editor-last=Callison-Burch |editor-first=Chris |editor2-last=Koehn |editor2-first=Philipp |editor3-last=Fordyce |editor3-first=Cameron Shaw |editor4-last=Monz |editor4-first=Christof |title=Efficient Handling of N-gram Language Models for Statistical Machine Translation |url=https://aclanthology.org/W07-0712 |journal=Proceedings of the Second Workshop on Statistical Machine Translation |location=Prague, Czech Republic |publisher=Association for Computational Linguistics |pages=88–95|doi=10.3115/1626355.1626367 }}</ref>
<ref name="Schwenk2012">{{cite conference |last1=Schwenk |first1=Holger |date=2012 |title=Continuous Space Translation Models for Phrase-Based Statistical Machine Translation |conference=Proceedings of COLING 2012: Posters |location=Mumbai, India |pages=1071–1080 |url=https://aclanthology.org/C12-2104}}</ref>
 
<ref name="Cho2014Properties">{{cite conference |last1=Cho |first1=Kyunghyun |last2=van Merriënboer |first2=Bart |last3=Bahdanau |first3=Dzmitry |last4=Bengio |first4=Yoshua | title=On the Properties of Neural Machine Translation: Encoder–Decoder Approaches | conference=Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation |date=2014 |location=Doha, Qatar |pages=103–111 |doi=10.3115/v1/W14-4012 |publisher=Association for Computational Linguistics|arxiv=1409.1259 }}</ref>
<ref name="Cho2014EncDec">{{cite conference |last1=Cho |first1=Kyunghyun |last2=van Merriënboer |first2=Bart |last3=Gulcehre |first3=Caglar |last4=Bahdanau |first4=Dzmitry |last5=Bougares |first5=Fethi |last6=Schwenk |first6=Holger |last7=Bengio |first7=Yoshua |title=Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation |conference=Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) |date=2014 |location=Doha, Qatar |pages=1724–1734 |doi=10.3115/v1/D14-1179 |publisher=Association for Computational Linguistics|arxiv=1406.1078 }}</ref>
<ref name="Sutskever2014">{{cite journal |last1=Sutskever |first1=Ilya |last2=Vinyals |first2=Oriol |last3=Le |first3=Quoc V. |date=2014 |title=Sequence to Sequence Learning with Neural Networks |url=https://proceedings.neurips.cc/paper/2014/hash/a14ac55a4f27472c5d894ec1c3c743d2-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=27}}</ref>
<ref name="Bahdanau2015">{{cite arXiv |last1=Bahdanau |first1=Dzmitry |last2=Cho |first2=Kyunghyun |last3=Bengio |first3=Yoshua |title = Neural Machine Translation by Jointly Learning to Align and Translate | year=2014 |eprint=1409.0473 |class=cs.CL}}</ref>
<ref name="KalchbrennerBlunsom2013">{{cite journal |last1=Kalchbrenner |first1=Nal |last2=Blunsom |first2=Philip |title=Recurrent Continuous Translation Models |journal=Proceedings of the Association for Computational Linguistics |pages=1700–1709 |date=2013 |url=http://www.aclweb.org/anthology/D13-1176}}</ref>
<ref name="Wu2016">{{cite arXiv |last1=Wu |first1=Yonghui |last2=Schuster |first2=Mike |last3=Chen |first3=Zhifeng |last4=Le |first4=Quoc V. |last5=Norouzi |first5=Mohammad |last6=Macherey |first6=Wolfgang |last7=Krikun |first7=Maxim |last8=Cao |first8=Yuan |last9=Gao |first9=Qin |last10=Macherey |first10=Klaus |last11=Klingner |first11=Jeff |last12=Shah |first12=Apurva |last13=Johnson |first13=Melvin |last14=Liu |first14=Xiaobing |last15=Kaiser |first15=Łukasz |date=2016 |title=Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation |eprint=1609.08144 |class=cs.CL}}</ref>
<ref name="Gehring2017">{{cite conference |last1=Gehring |first1=Jonas |last2=Auli |first2=Michael |last3=Grangier |first3=David |last4=Dauphin |first4=Yann |date=2017 |title=A Convolutional Encoder Model for Neural Machine Translation |url=https://aclanthology.org/P17-1012 |conference=Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |location=Vancouver, Canada |publisher=Association for Computational Linguistics |pages=123–135 |doi=10.18653/v1/P17-1012|arxiv=1611.02344 }}</ref>
<ref name="Vaswani2017">{{cite conference |last1=Vaswani |first1=Ashish |last2=Shazeer |first2=Noam |last3=Parmar |first3=Niki |last4=Uszkoreit |first4=Jakob |last5=Gomez |first5=Aidan N. |last6=Kaiser |first6=Łukasz |last7=Polosukhin |first7=Illia |date=2017 |title=Attention Is All You Need |conference=Advances in Neural Information Processing Systems 30 (NIPS 2017) |pages=5998–6008 |url=https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html}}</ref>
<ref name="Radford2018">{{cite tech report |last1=Radford |first1=Alec |last2=Narasimhan |first2=Karthik |last3=Salimans |first3=Tim |last4=Sutskever |first4=Ilya |title=Improving Language Understanding by Generative Pre-Training |url=https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf |institution=OpenAI |date=2018 |access-date=2023-12-26}}</ref>
<ref name="Liu2020">{{cite journal |last1=Liu |first1=Yinhan |last2=Gu |first2=Jiatao |last3=Goyal |first3=Naman |last4=Li |first4=Xian |last5=Edunov |first5=Sergey |last6=Ghazvininejad |first6=Marjan |last7=Lewis |first7=Mike |last8=Zettlemoyer |first8=Luke |date=2020 |title=Multilingual Denoising Pre-training for Neural Machine Translation |url=https://doi.org/10.1162/tacl_a_00343 |journal=Transactions of the Association for Computational Linguistics |volume=8 |pages=726–742 |doi=10.1162/tacl_a_00343|arxiv=2001.08210 }}</ref>
<ref name="Brown2020">{{cite journal |last1=Brown |first1=Tom |last2=Mann |first2=Benjamin |last3=Ryder |first3=Nick |last4=Subbiah |first4=Melanie |last5=Kaplan |first5=Jared D |last6=Dhariwal |first6=Prafulla |last7=Neelakantan |first7=Arvind |last8=Shyam |first8=Pranav |last9=Sastry |first9=Girish |last10=Askell |first10=Amanda |last11=Agarwal |first11=Sandhini |last12=Herbert-Voss |first12=Ariel |last13=Krueger |first13=Gretchen |last14=Henighan |first14=Tom |last15=Child |first15=Rewon |date=2020 |title=Language Models are Few-Shot Learners |url=https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=33 |pages=1877–1901}}</ref>
<ref name="Popel2020">{{cite journal |last1=Popel |first1=Martin |last2=Tomkova |first2=Marketa |last3=Tomek |first3=Jakub |last4=Kaiser |first4=Łukasz |last5=Uszkoreit |first5=Jakob |last6=Bojar |first6=Ondřej |last7=Žabokrtský |first7=Zdeněk |date=2020-09-01 |title=Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals |journal=Nature Communications |volume=11 |issue=1 |pages=4381 |doi=10.1038/s41467-020-18073-9 |pmid=32873773 |pmc=7463233 |issn=2041-1723|hdl=11346/BIBLIO@id=368112263610994118 |hdl-access=free }}</ref>
<ref name="Poibeau2022">{{cite journal |last=Poibeau |first=Thierry |date=2022 |editor-last=Calzolari |editor-first=Nicoletta |editor2-last=Béchet |editor2-first=Frédéric |editor3-last=Blache |editor3-first=Philippe |editor4-last=Choukri |editor4-first=Khalid |editor5-last=Cieri |editor5-first=Christopher |editor6-last=Declerck |editor6-first=Thierry |editor7-last=Goggi |editor7-first=Sara |editor8-last=Isahara |editor8-first=Hitoshi |editor9-last=Maegaard |editor9-first=Bente |title=On "Human Parity" and "Super Human Performance" in Machine Translation Evaluation |url=https://aclanthology.org/2022.lrec-1.647 |journal=Proceedings of the Thirteenth Language Resources and Evaluation Conference |location=Marseille, France |publisher=European Language Resources Association |pages=6018–6023}}</ref>
<ref name="Hendy2023">{{cite arXiv |last1=Hendy |first1=Amr |last2=Abdelrehim |first2=Mohamed |last3=Sharaf |first3=Amr |last4=Raunak |first4=Vikas |last5=Gabr |first5=Mohamed |last6=Matsushita |first6=Hitokazu |last7=Kim |first7=Young Jin |last8=Afify |first8=Mohamed |last9=Awadalla |first9=Hany |title=How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation |date=2023-02-18 |eprint=2302.09210 |class=cs.CL}}</ref>
 
<ref name="DeepLTechCrunch">{{cite news |url=https://techcrunch.com/2017/08/29/deepl-schools-other-online-translators-with-clever-machine-learning/ |title=DeepL schools other online translators with clever machine learning |last=Coldewey |first=Devin |work=TechCrunch |date=2017-08-29 |access-date=2023-12-26}}</ref>
<ref name="DeepLLeMonde">{{cite news |url=https://www.lemonde.fr/pixels/article/2017/08/29/quel-est-le-meilleur-service-de-traduction-en-ligne_5177956_4408996.html |title=Quel est le meilleur service de traduction en ligne? |trans-title=Which is the best online translation service? |last1=Leloup |first1=Damien |last2=Larousserie |first2=David |work=Le Monde |date=2017-08-29 |access-date=2023-01-10}}</ref>
<ref name="DeepLGolem">{{cite news |url=https://www.golem.de/news/deepl-im-hands-on-neues-tool-uebersetzt-viel-besser-als-google-und-microsoft-1708-129715.html |title=DeepL im Hands On: Neues Tool übersetzt viel besser als Google und Microsoft |last=Pakalski |first=Ingo |work=Golem |date=2017-08-29 |access-date=2023-01-10}}</ref>
<ref name="GPT3LanguagesByCharacterCount2020">{{cite web |url=https://github.com/openai/gpt-3/blob/master/dataset_statistics/languages_by_character_count.csv |title=GPT 3 dataset statistics: languages by character count |date=2020-06-01 |access-date=2023-12-23 |publisher=OpenAI}}</ref>
 
<ref name="WMT2016">{{cite journal|last1=Bojar|first1=Ondrej|last2=Chatterjee|first2=Rajen|last3=Federmann|first3=Christian|last4=Graham|first4=Yvette|last5=Haddow|first5=Barry|last6=Huck|first6=Matthias|last7=Yepes|first7=Antonio Jimeno|last8=Koehn|first8=Philipp|last9=Logacheva|first9=Varvara|last10=Monz|first10=Christof|last11=Negri|first11=Matteo|last12=Névéol|first12=Aurélie|last13=Neves|first13=Mariana|last14=Popel|first14=Martin|last15=Post|first15=Matt|last16=Rubino|first16=Raphael|last17=Scarton|first17=Carolina|last18=Specia|first18=Lucia|last19=Turchi|first19=Marco|last20=Verspoor|first20=Karin|last21=Zampieri|first21=Marcos|title=Findings of the 2016 Conference on Machine Translation|journal=ACL 2016 First Conference on Machine Translation (WMT16)|date=2016|pages=131–198|url=https://cris.fbk.eu/retrieve/handle/11582/307240/14326/W16-2301.pdf|publisher=The Association for Computational Linguistics|access-date=2018-01-27|archive-url=https://web.archive.org/web/20180127202851/https://cris.fbk.eu/retrieve/handle/11582/307240/14326/W16-2301.pdf|archive-date=2018-01-27|url-status=dead}}</ref>
 
<ref name="WMT2022">{{cite conference |last1=Kocmi |first1=Tom |last2=Bawden |first2=Rachel |last3=Bojar |first3=Ondřej |last4=Dvorkovich |first4=Anton |last5=Federmann |first5=Christian |last6=Fishel |first6=Mark |last7=Gowda |first7=Thamme |last8=Graham |first8=Yvette |last9=Grundkiewicz |first9=Roman |last10=Haddow |first10=Barry |last11=Knowles |first11=Rebecca |last12=Koehn |first12=Philipp |last13=Monz |first13=Christof |last14=Morishita |first14=Makoto |last15=Nagata |first15=Masaaki |date=2022 |editor-last=Koehn |editor-first=Philipp |editor2-last=Barrault |editor2-first=Loïc |editor3-last=Bojar |editor3-first=Ondřej |editor4-last=Bougares |editor4-first=Fethi |editor5-last=Chatterjee |editor5-first=Rajen |editor6-last=Costa-jussà |editor6-first=Marta R. |editor7-last=Federmann |editor7-first=Christian |editor8-last=Fishel |editor8-first=Mark |editor9-last=Fraser |editor9-first=Alexander |title=Findings of the 2022 Conference on Machine Translation (WMT22) |url=https://aclanthology.org/2022.wmt-1.1 |conference=Proceedings of the Seventh Conference on Machine Translation (WMT) |location=Abu Dhabi, United Arab Emirates (Hybrid) |publisher=Association for Computational Linguistics |pages=1–45}}</ref>
 
<ref name="WMT2023">{{cite conference |last1=Kocmi |first1=Tom |last2=Avramidis |first2=Eleftherios |last3=Bawden |first3=Rachel |last4=Bojar |first4=Ondřej |last5=Dvorkovich |first5=Anton |last6=Federmann |first6=Christian |last7=Fishel |first7=Mark |last8=Freitag |first8=Markus |last9=Gowda |first9=Thamme |last10=Grundkiewicz |first10=Roman |last11=Haddow |first11=Barry |last12=Koehn |first12=Philipp |last13=Marie |first13=Benjamin |last14=Monz |first14=Christof |last15=Morishita |first15=Makoto |date=2023 |editor-last=Koehn |editor-first=Philipp |editor2-last=Haddow |editor2-first=Barry |editor3-last=Kocmi |editor3-first=Tom |editor4-last=Monz |editor4-first=Christof |title=Findings of the 2023 Conference on Machine Translation (WMT23): LLMs Are Here but Not Quite There Yet |url=https://aclanthology.org/2023.wmt-1.1 |journal=Proceedings of the Eighth Conference on Machine Translation |location=Singapore |publisher=Association for Computational Linguistics |pages=1–42 |doi=10.18653/v1/2023.wmt-1.1|doi-access=free }}</ref>
 
}}