-
PyMarian: Fast Neural Machine Translation and Evaluation in Python
Authors:
Thamme Gowda,
Roman Grundkiewicz,
Elijah Rippeth,
Matt Post,
Marcin Junczys-Dowmunt
Abstract:
The deep learning language of choice these days is Python; measured by factors such as available libraries and technical support, it is hard to beat. At the same time, software written in lower-level programming languages like C++ retain advantages in speed. We describe a Python interface to Marian NMT, a C++-based training and inference toolkit for sequence-to-sequence models, focusing on machine…
▽ More
The deep learning language of choice these days is Python; measured by factors such as available libraries and technical support, it is hard to beat. At the same time, software written in lower-level programming languages like C++ retain advantages in speed. We describe a Python interface to Marian NMT, a C++-based training and inference toolkit for sequence-to-sequence models, focusing on machine translation. This interface enables models trained with Marian to be connected to the rich, wide range of tools available in Python. A highlight of the interface is the ability to compute state-of-the-art COMET metrics from Python but using Marian's inference engine, with a speedup factor of up to 7.8$\times$ the existing implementations. We also briefly spotlight a number of other integrations, including Jupyter notebooks, connection with prebuilt models, and a web app interface provided with the package. PyMarian is available in PyPI via $\texttt{pip install pymarian}$.
△ Less
Submitted 14 August, 2024;
originally announced August 2024.
-
Recovering document annotations for sentence-level bitext
Authors:
Rachel Wicks,
Matt Post,
Philipp Koehn
Abstract:
Data availability limits the scope of any given task. In machine translation, historical models were incapable of handling longer contexts, so the lack of document-level datasets was less noticeable. Now, despite the emergence of long-sequence methods, we remain within a sentence-level paradigm and without data to adequately approach context-aware machine translation. Most large-scale datasets hav…
▽ More
Data availability limits the scope of any given task. In machine translation, historical models were incapable of handling longer contexts, so the lack of document-level datasets was less noticeable. Now, despite the emergence of long-sequence methods, we remain within a sentence-level paradigm and without data to adequately approach context-aware machine translation. Most large-scale datasets have been processed through a pipeline that discards document-level metadata. In this work, we reconstruct document-level information for three (ParaCrawl, News Commentary, and Europarl) large datasets in German, French, Spanish, Italian, Polish, and Portuguese (paired with English). We then introduce a document-level filtering technique as an alternative to traditional bitext filtering. We present this filtering with analysis to show that this method prefers context-consistent translations rather than those that may have been sentence-level machine translated. Last we train models on these longer contexts and demonstrate improvement in document-level translation without degradation of sentence-level translation. We release our dataset, ParaDocs, and resulting models as a resource to the community.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies
Authors:
Tom Kocmi,
Vilém Zouhar,
Christian Federmann,
Matt Post
Abstract:
Ten years ago a single metric, BLEU, governed progress in machine translation research. For better or worse, there is no such consensus today, and consequently it is difficult for researchers to develop and retain the kinds of heuristic intuitions about metric deltas that drove earlier research and deployment decisions. This paper investigates the "dynamic range" of a number of modern metrics in a…
▽ More
Ten years ago a single metric, BLEU, governed progress in machine translation research. For better or worse, there is no such consensus today, and consequently it is difficult for researchers to develop and retain the kinds of heuristic intuitions about metric deltas that drove earlier research and deployment decisions. This paper investigates the "dynamic range" of a number of modern metrics in an effort to provide a collective understanding of the meaning of differences in scores both within and among metrics; in other words, we ask what point difference X in metric Y is required between two systems for humans to notice? We conduct our evaluation on a new large dataset, ToShip23, using it to discover deltas at which metrics achieve system-level differences that are meaningful to humans, which we measure by pairwise system accuracy. We additionally show that this method of establishing delta-accuracy is more stable than the standard use of statistical p-values in regards to testset size. Where data size permits, we also explore the effect of metric deltas and accuracy across finer-grained features such as translation direction, domain, and system closeness.
△ Less
Submitted 10 June, 2024; v1 submitted 12 January, 2024;
originally announced January 2024.
-
Improving Word Sense Disambiguation in Neural Machine Translation with Salient Document Context
Authors:
Elijah Rippeth,
Marine Carpuat,
Kevin Duh,
Matt Post
Abstract:
Lexical ambiguity is a challenging and pervasive problem in machine translation (\mt). We introduce a simple and scalable approach to resolve translation ambiguity by incorporating a small amount of extra-sentential context in neural \mt. Our approach requires no sense annotation and no change to standard model architectures. Since actual document context is not available for the vast majority of…
▽ More
Lexical ambiguity is a challenging and pervasive problem in machine translation (\mt). We introduce a simple and scalable approach to resolve translation ambiguity by incorporating a small amount of extra-sentential context in neural \mt. Our approach requires no sense annotation and no change to standard model architectures. Since actual document context is not available for the vast majority of \mt training data, we collect related sentences for each input to construct pseudo-documents. Salient words from pseudo-documents are then encoded as a prefix to each source sentence to condition the generation of the translation. To evaluate, we release \docmucow, a challenge set for translation disambiguation based on the English-German \mucow \cite{raganato-etal-2020-evaluation} augmented with document IDs. Extensive experiments show that our method translates ambiguous source words better than strong sentence-level baselines and comparable document-level baselines while reducing training costs.
△ Less
Submitted 26 November, 2023;
originally announced November 2023.
-
Identifying Context-Dependent Translations for Evaluation Set Production
Authors:
Rachel Wicks,
Matt Post
Abstract:
A major impediment to the transition to context-aware machine translation is the absence of good evaluation metrics and test sets. Sentences that require context to be translated correctly are rare in test sets, reducing the utility of standard corpus-level metrics such as COMET or BLEU. On the other hand, datasets that annotate such sentences are also rare, small in scale, and available for only…
▽ More
A major impediment to the transition to context-aware machine translation is the absence of good evaluation metrics and test sets. Sentences that require context to be translated correctly are rare in test sets, reducing the utility of standard corpus-level metrics such as COMET or BLEU. On the other hand, datasets that annotate such sentences are also rare, small in scale, and available for only a few languages. To address this, we modernize, generalize, and extend previous annotation pipelines to produce CTXPRO, a tool that identifies subsets of parallel documents containing sentences that require context to correctly translate five phenomena: gender, formality, and animacy for pronouns, verb phrase ellipsis, and ambiguous noun inflections. The input to the pipeline is a set of hand-crafted, per-language, linguistically-informed rules that select contextual sentence pairs using coreference, part-of-speech, and morphological features provided by state-of-the-art tools. We apply this pipeline to seven languages pairs (EN into and out-of DE, ES, FR, IT, PL, PT, and RU) and two datasets (OpenSubtitles and WMT test sets), and validate its performance using both overlap with previous work and its ability to discriminate a contextual MT system from a sentence-based one. We release the CTXPRO pipeline and data as open source.
△ Less
Submitted 4 November, 2023;
originally announced November 2023.
-
SLIDE: Reference-free Evaluation for Machine Translation using a Sliding Document Window
Authors:
Vikas Raunak,
Tom Kocmi,
Matt Post
Abstract:
Reference-based metrics that operate at the sentence-level typically outperform quality estimation metrics, which have access only to the source and system output. This is unsurprising, since references resolve ambiguities that may be present in the source. In this paper, we investigate whether additional source context can effectively substitute for a reference. We present a metric named SLIDE (S…
▽ More
Reference-based metrics that operate at the sentence-level typically outperform quality estimation metrics, which have access only to the source and system output. This is unsurprising, since references resolve ambiguities that may be present in the source. In this paper, we investigate whether additional source context can effectively substitute for a reference. We present a metric named SLIDE (SLIding Document Evaluator), which operates on blocks of sentences. SLIDE leverages a moving window that slides over each document in the test set, feeding each chunk of sentences into an unmodified, off-the-shelf quality estimation model. We find that SLIDE obtains significantly higher pairwise system accuracy than its sentence-level baseline, in some cases even eliminating the gap with reference-base metrics. This suggests that source context may provide the same information as a human reference in disambiguating source ambiguities. This finding is especially pertinent for reference-free document-level evaluation, wherein SLIDE could provide higher-quality pairwise system assessments while only requiring document boundary annotations.
△ Less
Submitted 2 April, 2024; v1 submitted 15 September, 2023;
originally announced September 2023.
-
SOTASTREAM: A Streaming Approach to Machine Translation Training
Authors:
Matt Post,
Thamme Gowda,
Roman Grundkiewicz,
Huda Khayrallah,
Rohit Jain,
Marcin Junczys-Dowmunt
Abstract:
Many machine translation toolkits make use of a data preparation step wherein raw data is transformed into a tensor format that can be used directly by the trainer. This preparation step is increasingly at odds with modern research and development practices because this process produces a static, unchangeable version of the training data, making common training-time needs difficult (e.g., subword…
▽ More
Many machine translation toolkits make use of a data preparation step wherein raw data is transformed into a tensor format that can be used directly by the trainer. This preparation step is increasingly at odds with modern research and development practices because this process produces a static, unchangeable version of the training data, making common training-time needs difficult (e.g., subword sampling), time-consuming (preprocessing with large data can take days), expensive (e.g., disk space), and cumbersome (managing experiment combinatorics). We propose an alternative approach that separates the generation of data from the consumption of that data. In this approach, there is no separate pre-processing step; data generation produces an infinite stream of permutations of the raw training data, which the trainer tensorizes and batches as it is consumed. Additionally, this data stream can be manipulated by a set of user-definable operators that provide on-the-fly modifications, such as data normalization, augmentation or filtering. We release an open-source toolkit, SOTASTREAM, that implements this approach: https://github.com/marian-nmt/sotastream. We show that it cuts training time, adds flexibility, reduces experiment management complexity, and reduces disk space, all without affecting the accuracy of the trained models.
△ Less
Submitted 14 August, 2023;
originally announced August 2023.
-
Do GPTs Produce Less Literal Translations?
Authors:
Vikas Raunak,
Arul Menezes,
Matt Post,
Hany Hassan Awadalla
Abstract:
Large Language Models (LLMs) such as GPT-3 have emerged as general-purpose language models capable of addressing many natural language generation or understanding tasks. On the task of Machine Translation (MT), multiple works have investigated few-shot prompting mechanisms to elicit better translations from LLMs. However, there has been relatively little investigation on how such translations diff…
▽ More
Large Language Models (LLMs) such as GPT-3 have emerged as general-purpose language models capable of addressing many natural language generation or understanding tasks. On the task of Machine Translation (MT), multiple works have investigated few-shot prompting mechanisms to elicit better translations from LLMs. However, there has been relatively little investigation on how such translations differ qualitatively from the translations generated by standard Neural Machine Translation (NMT) models. In this work, we investigate these differences in terms of the literalness of translations produced by the two systems. Using literalness measures involving word alignment and monotonicity, we find that translations out of English (E-X) from GPTs tend to be less literal, while exhibiting similar or better scores on MT quality metrics. We demonstrate that this finding is borne out in human evaluations as well. We then show that these differences are especially pronounced when translating sentences that contain idiomatic expressions.
△ Less
Submitted 5 June, 2023; v1 submitted 26 May, 2023;
originally announced May 2023.
-
Multilingual Pixel Representations for Translation and Effective Cross-lingual Transfer
Authors:
Elizabeth Salesky,
Neha Verma,
Philipp Koehn,
Matt Post
Abstract:
We introduce and demonstrate how to effectively train multilingual machine translation models with pixel representations. We experiment with two different data settings with a variety of language and script coverage, demonstrating improved performance compared to subword embeddings. We explore various properties of pixel representations such as parameter sharing within and across scripts to better…
▽ More
We introduce and demonstrate how to effectively train multilingual machine translation models with pixel representations. We experiment with two different data settings with a variety of language and script coverage, demonstrating improved performance compared to subword embeddings. We explore various properties of pixel representations such as parameter sharing within and across scripts to better understand where they lead to positive transfer. We observe that these properties not only enable seamless cross-lingual transfer to unseen scripts, but make pixel representations more data-efficient than alternatives such as vocabulary expansion. We hope this work contributes to more extensible multilingual models for all languages and scripts.
△ Less
Submitted 24 October, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Escaping the sentence-level paradigm in machine translation
Authors:
Matt Post,
Marcin Junczys-Dowmunt
Abstract:
It is well-known that document context is vital for resolving a range of translation ambiguities, and in fact the document setting is the most natural setting for nearly all translation. It is therefore unfortunate that machine translation -- both research and production -- largely remains stuck in a decades-old sentence-level translation paradigm. It is also an increasingly glaring problem in lig…
▽ More
It is well-known that document context is vital for resolving a range of translation ambiguities, and in fact the document setting is the most natural setting for nearly all translation. It is therefore unfortunate that machine translation -- both research and production -- largely remains stuck in a decades-old sentence-level translation paradigm. It is also an increasingly glaring problem in light of competitive pressure from large language models, which are natively document-based. Much work in document-context machine translation exists, but for various reasons has been unable to catch hold. This paper suggests a path out of this rut by addressing three impediments at once: what architectures should we use? where do we get document-level information for training them? and how do we know whether they are any good? In contrast to work on specialized architectures, we show that the standard Transformer architecture is sufficient, provided it has enough capacity. Next, we address the training data issue by taking document samples from back-translated data only, where the data is not only more readily available, but is also of higher quality compared to parallel document data, which may contain machine translation output. Finally, we propose generative variants of existing contrastive metrics that are better able to discriminate among document systems. Results in four large-data language pairs (DE$\rightarrow$EN, EN$\rightarrow$DE, EN$\rightarrow$FR, and EN$\rightarrow$RU) establish the success of these three pieces together in improving document-level performance.
△ Less
Submitted 16 May, 2024; v1 submitted 25 April, 2023;
originally announced April 2023.
-
Operationalizing Specifications, In Addition to Test Sets for Evaluating Constrained Generative Models
Authors:
Vikas Raunak,
Matt Post,
Arul Menezes
Abstract:
In this work, we present some recommendations on the evaluation of state-of-the-art generative models for constrained generation tasks. The progress on generative models has been rapid in recent years. These large-scale models have had three impacts: firstly, the fluency of generation in both language and vision modalities has rendered common average-case evaluation metrics much less useful in dia…
▽ More
In this work, we present some recommendations on the evaluation of state-of-the-art generative models for constrained generation tasks. The progress on generative models has been rapid in recent years. These large-scale models have had three impacts: firstly, the fluency of generation in both language and vision modalities has rendered common average-case evaluation metrics much less useful in diagnosing system errors. Secondly, the same substrate models now form the basis of a number of applications, driven both by the utility of their representations as well as phenomena such as in-context learning, which raise the abstraction level of interacting with such models. Thirdly, the user expectations around these models and their feted public releases have made the technical challenge of out of domain generalization much less excusable in practice. Subsequently, our evaluation methodologies haven't adapted to these changes. More concretely, while the associated utility and methods of interacting with generative models have expanded, a similar expansion has not been observed in their evaluation practices. In this paper, we argue that the scale of generative models could be exploited to raise the abstraction level at which evaluation itself is conducted and provide recommendations for the same. Our recommendations are based on leveraging specifications as a powerful instrument to evaluate generation quality and are readily applicable to a variety of tasks.
△ Less
Submitted 19 November, 2022;
originally announced December 2022.
-
Additive Interventions Yield Robust Multi-Domain Machine Translation Models
Authors:
Elijah Rippeth,
Matt Post
Abstract:
Additive interventions are a recently-proposed mechanism for controlling target-side attributes in neural machine translation. In contrast to tag-based approaches which manipulate the raw source sequence, interventions work by directly modulating the encoder representation of all tokens in the sequence. We examine the role of additive interventions in a large-scale multi-domain machine translation…
▽ More
Additive interventions are a recently-proposed mechanism for controlling target-side attributes in neural machine translation. In contrast to tag-based approaches which manipulate the raw source sequence, interventions work by directly modulating the encoder representation of all tokens in the sequence. We examine the role of additive interventions in a large-scale multi-domain machine translation setting and compare its performance in various inference scenarios. We find that while the performance difference is small between intervention-based systems and tag-based systems when the domain label matches the test domain, intervention-based systems are robust to label error, making them an attractive choice under label uncertainty. Further, we find that the superiority of single-domain fine-tuning comes under question when training data size is scaled, contradicting previous findings.
△ Less
Submitted 23 October, 2022;
originally announced October 2022.
-
SALTED: A Framework for SAlient Long-Tail Translation Error Detection
Authors:
Vikas Raunak,
Matt Post,
Arul Menezes
Abstract:
Traditional machine translation (MT) metrics provide an average measure of translation quality that is insensitive to the long tail of behavioral problems in MT. Examples include translation of numbers, physical units, dropped content and hallucinations. These errors, which occur rarely and unpredictably in Neural Machine Translation (NMT), greatly undermine the reliability of state-of-the-art MT…
▽ More
Traditional machine translation (MT) metrics provide an average measure of translation quality that is insensitive to the long tail of behavioral problems in MT. Examples include translation of numbers, physical units, dropped content and hallucinations. These errors, which occur rarely and unpredictably in Neural Machine Translation (NMT), greatly undermine the reliability of state-of-the-art MT systems. Consequently, it is important to have visibility into these problems during model development. Towards this direction, we introduce SALTED, a specifications-based framework for behavioral testing of MT models that provides fine-grained views of salient long-tail errors, permitting trustworthy visibility into previously invisible problems. At the core of our approach is the development of high-precision detectors that flag errors (or alternatively, verify output correctness) between a source sentence and a system output. We demonstrate that such detectors could be used not just to identify salient long-tail errors in MT systems, but also for higher-recall filtering of the training data, fixing targeted errors with model fine-tuning in NMT and generating novel data for metamorphic testing to elicit further bugs in models.
△ Less
Submitted 20 May, 2022;
originally announced May 2022.
-
Large-Scale Streaming End-to-End Speech Translation with Neural Transducers
Authors:
Jian Xue,
Peidong Wang,
Jinyu Li,
Matt Post,
Yashesh Gaur
Abstract:
Neural transducers have been widely used in automatic speech recognition (ASR). In this paper, we introduce it to streaming end-to-end speech translation (ST), which aims to convert audio signals to texts in other languages directly. Compared with cascaded ST that performs ASR followed by text-based machine translation (MT), the proposed Transformer transducer (TT)-based ST model drastically reduc…
▽ More
Neural transducers have been widely used in automatic speech recognition (ASR). In this paper, we introduce it to streaming end-to-end speech translation (ST), which aims to convert audio signals to texts in other languages directly. Compared with cascaded ST that performs ASR followed by text-based machine translation (MT), the proposed Transformer transducer (TT)-based ST model drastically reduces inference latency, exploits speech information, and avoids error propagation from ASR to MT. To improve the modeling capacity, we propose attention pooling for the joint network in TT. In addition, we extend TT-based ST to multilingual ST, which generates texts of multiple languages at the same time. Experimental results on a large-scale 50 thousand (K) hours pseudo-labeled training set show that TT-based ST not only significantly reduces inference time but also outperforms non-streaming cascaded ST for English-German translation.
△ Less
Submitted 1 July, 2022; v1 submitted 11 April, 2022;
originally announced April 2022.
-
The JHU-Microsoft Submission for WMT21 Quality Estimation Shared Task
Authors:
Shuoyang Ding,
Marcin Junczys-Dowmunt,
Matt Post,
Christian Federmann,
Philipp Koehn
Abstract:
This paper presents the JHU-Microsoft joint submission for WMT 2021 quality estimation shared task. We only participate in Task 2 (post-editing effort estimation) of the shared task, focusing on the target-side word-level quality estimation. The techniques we experimented with include Levenshtein Transformer training and data augmentation with a combination of forward, backward, round-trip transla…
▽ More
This paper presents the JHU-Microsoft joint submission for WMT 2021 quality estimation shared task. We only participate in Task 2 (post-editing effort estimation) of the shared task, focusing on the target-side word-level quality estimation. The techniques we experimented with include Levenshtein Transformer training and data augmentation with a combination of forward, backward, round-trip translation, and pseudo post-editing of the MT output. We demonstrate the competitiveness of our system compared to the widely adopted OpenKiwi-XLM baseline. Our system is also the top-ranking system on the MT MCC metric for the English-German language pair.
△ Less
Submitted 17 September, 2021;
originally announced September 2021.
-
Levenshtein Training for Word-level Quality Estimation
Authors:
Shuoyang Ding,
Marcin Junczys-Dowmunt,
Matt Post,
Philipp Koehn
Abstract:
We propose a novel scheme to use the Levenshtein Transformer to perform the task of word-level quality estimation. A Levenshtein Transformer is a natural fit for this task: trained to perform decoding in an iterative manner, a Levenshtein Transformer can learn to post-edit without explicit supervision. To further minimize the mismatch between the translation task and the word-level QE task, we pro…
▽ More
We propose a novel scheme to use the Levenshtein Transformer to perform the task of word-level quality estimation. A Levenshtein Transformer is a natural fit for this task: trained to perform decoding in an iterative manner, a Levenshtein Transformer can learn to post-edit without explicit supervision. To further minimize the mismatch between the translation task and the word-level QE task, we propose a two-stage transfer learning procedure on both augmented data and human post-editing data. We also propose heuristics to construct reference labels that are compatible with subword-level finetuning and inference. Results on WMT 2020 QE shared task dataset show that our proposed method has superior data efficiency under the data-constrained setting and competitive performance under the unconstrained setting.
△ Less
Submitted 15 September, 2021; v1 submitted 12 September, 2021;
originally announced September 2021.
-
Robust Open-Vocabulary Translation from Visual Text Representations
Authors:
Elizabeth Salesky,
David Etter,
Matt Post
Abstract:
Machine translation models have discrete vocabularies and commonly use subword segmentation techniques to achieve an 'open vocabulary.' This approach relies on consistent and correct underlying unicode sequences, and makes models susceptible to degradation from common types of noise and variation. Motivated by the robustness of human language processing, we propose the use of visual text represent…
▽ More
Machine translation models have discrete vocabularies and commonly use subword segmentation techniques to achieve an 'open vocabulary.' This approach relies on consistent and correct underlying unicode sequences, and makes models susceptible to degradation from common types of noise and variation. Motivated by the robustness of human language processing, we propose the use of visual text representations, which dispense with a finite set of text embeddings in favor of continuous vocabularies created by processing visually rendered text with sliding windows. We show that models using visual text representations approach or match performance of traditional text models on small and larger datasets. More importantly, models with visual embeddings demonstrate significant robustness to varied types of noise, achieving e.g., 25.9 BLEU on a character permuted German-English task where subword models degrade to 1.9.
△ Less
Submitted 9 December, 2021; v1 submitted 16 April, 2021;
originally announced April 2021.
-
The Multilingual TEDx Corpus for Speech Recognition and Translation
Authors:
Elizabeth Salesky,
Matthew Wiesner,
Jacob Bremerman,
Roldano Cattoni,
Matteo Negri,
Marco Turchi,
Douglas W. Oard,
Matt Post
Abstract:
We present the Multilingual TEDx corpus, built to support speech recognition (ASR) and speech translation (ST) research across many non-English source languages. The corpus is a collection of audio recordings from TEDx talks in 8 source languages. We segment transcripts into sentences and align them to the source-language audio and target-language translations. The corpus is released along with op…
▽ More
We present the Multilingual TEDx corpus, built to support speech recognition (ASR) and speech translation (ST) research across many non-English source languages. The corpus is a collection of audio recordings from TEDx talks in 8 source languages. We segment transcripts into sentences and align them to the source-language audio and target-language translations. The corpus is released along with open-sourced code enabling extension to new talks and languages as they become available. Our corpus creation methodology can be applied to more languages than previous work, and creates multi-way parallel evaluation sets. We provide baselines in multiple ASR and ST settings, including multilingual models to improve translation performance for low-resource language pairs.
△ Less
Submitted 14 June, 2021; v1 submitted 2 February, 2021;
originally announced February 2021.
-
Paraphrase Generation as Zero-Shot Multilingual Translation: Disentangling Semantic Similarity from Lexical and Syntactic Diversity
Authors:
Brian Thompson,
Matt Post
Abstract:
Recent work has shown that a multilingual neural machine translation (NMT) model can be used to judge how well a sentence paraphrases another sentence in the same language (Thompson and Post, 2020); however, attempting to generate paraphrases from such a model using standard beam search produces trivial copies or near copies. We introduce a simple paraphrase generation algorithm which discourages…
▽ More
Recent work has shown that a multilingual neural machine translation (NMT) model can be used to judge how well a sentence paraphrases another sentence in the same language (Thompson and Post, 2020); however, attempting to generate paraphrases from such a model using standard beam search produces trivial copies or near copies. We introduce a simple paraphrase generation algorithm which discourages the production of n-grams that are present in the input. Our approach enables paraphrase generation in many languages from a single multilingual NMT model. Furthermore, the amount of lexical diversity between the input and output can be controlled at generation time. We conduct a human evaluation to compare our method to a paraphraser trained on the large English synthetic paraphrase database ParaBank 2 (Hu et al., 2019c) and find that our method produces paraphrases that better preserve meaning and are more gramatical, for the same level of lexical diversity. Additional smaller human assessments demonstrate our approach also works in two non-English languages.
△ Less
Submitted 27 October, 2020; v1 submitted 11 August, 2020;
originally announced August 2020.
-
A Study in Improving BLEU Reference Coverage with Diverse Automatic Paraphrasing
Authors:
Rachel Bawden,
Biao Zhang,
Lisa Yankovskaya,
Andre Tättar,
Matt Post
Abstract:
We investigate a long-perceived shortcoming in the typical use of BLEU: its reliance on a single reference. Using modern neural paraphrasing techniques, we study whether automatically generating additional diverse references can provide better coverage of the space of valid translations and thereby improve its correlation with human judgments. Our experiments on the into-English language direction…
▽ More
We investigate a long-perceived shortcoming in the typical use of BLEU: its reliance on a single reference. Using modern neural paraphrasing techniques, we study whether automatically generating additional diverse references can provide better coverage of the space of valid translations and thereby improve its correlation with human judgments. Our experiments on the into-English language directions of the WMT19 metrics task (at both the system and sentence level) show that using paraphrased references does generally improve BLEU, and when it does, the more diverse the better. However, we also show that better results could be achieved if those paraphrases were to specifically target the parts of the space most relevant to the MT outputs being evaluated. Moreover, the gains remain slight even when human paraphrases are used, suggesting inherent limitations to BLEU's capacity to correctly exploit multiple references. Surprisingly, we also find that adequacy appears to be less important, as shown by the high results of a strong sampling approach, which even beats human paraphrases when used with sentence-level BLEU.
△ Less
Submitted 8 October, 2020; v1 submitted 30 April, 2020;
originally announced April 2020.
-
Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing
Authors:
Brian Thompson,
Matt Post
Abstract:
We frame the task of machine translation evaluation as one of scoring machine translation output with a sequence-to-sequence paraphraser, conditioned on a human reference. We propose training the paraphraser as a multilingual NMT system, treating paraphrasing as a zero-shot translation task (e.g., Czech to Czech). This results in the paraphraser's output mode being centered around a copy of the in…
▽ More
We frame the task of machine translation evaluation as one of scoring machine translation output with a sequence-to-sequence paraphraser, conditioned on a human reference. We propose training the paraphraser as a multilingual NMT system, treating paraphrasing as a zero-shot translation task (e.g., Czech to Czech). This results in the paraphraser's output mode being centered around a copy of the input sequence, which represents the best case scenario where the MT system output matches a human reference. Our method is simple and intuitive, and does not require human judgements for training. Our single model (trained in 39 languages) outperforms or statistically ties with all prior metrics on the WMT 2019 segment-level shared metrics task in all languages (excluding Gujarati where the model had no training data). We also explore using our model for the task of quality estimation as a metric--conditioning on the source instead of the reference--and find that it significantly outperforms every submission to the WMT 2019 shared task on quality estimation in every language pair.
△ Less
Submitted 27 October, 2020; v1 submitted 29 April, 2020;
originally announced April 2020.
-
Simulated Multiple Reference Training Improves Low-Resource Machine Translation
Authors:
Huda Khayrallah,
Brian Thompson,
Matt Post,
Philipp Koehn
Abstract:
Many valid translations exist for a given sentence, yet machine translation (MT) is trained with a single reference translation, exacerbating data sparsity in low-resource settings. We introduce Simulated Multiple Reference Training (SMRT), a novel MT training method that approximates the full space of possible translations by sampling a paraphrase of the reference sentence from a paraphraser and…
▽ More
Many valid translations exist for a given sentence, yet machine translation (MT) is trained with a single reference translation, exacerbating data sparsity in low-resource settings. We introduce Simulated Multiple Reference Training (SMRT), a novel MT training method that approximates the full space of possible translations by sampling a paraphrase of the reference sentence from a paraphraser and training the MT model to predict the paraphraser's distribution over possible tokens. We demonstrate the effectiveness of SMRT in low-resource settings when translating to English, with improvements of 1.2 to 7.0 BLEU. We also find SMRT is complementary to back-translation.
△ Less
Submitted 13 October, 2020; v1 submitted 29 April, 2020;
originally announced April 2020.
-
Generating Commit Messages from Git Diffs
Authors:
S. R. P. van Hal,
M. Post,
K. Wendel
Abstract:
Commit messages aid developers in their understanding of a continuously evolving codebase. However, developers not always document code changes properly. Automatically generating commit messages would relieve this burden on developers. Recently, a number of different works have demonstrated the feasibility of using methods from neural machine translation to generate commit messages. This work aims…
▽ More
Commit messages aid developers in their understanding of a continuously evolving codebase. However, developers not always document code changes properly. Automatically generating commit messages would relieve this burden on developers. Recently, a number of different works have demonstrated the feasibility of using methods from neural machine translation to generate commit messages. This work aims to reproduce a prominent research paper in this field, as well as attempt to improve upon their results by proposing a novel preprocessing technique. A reproduction of the reference neural machine translation model was able to achieve slightly better results on the same dataset. When applying more rigorous preprocessing, however, the performance dropped significantly. This demonstrates the inherent shortcoming of current commit message generation models, which perform well by memorizing certain constructs. Future research directions might include improving diff embeddings and focusing on specific groups of commits.
△ Less
Submitted 26 November, 2019;
originally announced November 2019.
-
A Discriminative Neural Model for Cross-Lingual Word Alignment
Authors:
Elias Stengel-Eskin,
Tzu-Ray Su,
Matt Post,
Benjamin Van Durme
Abstract:
We introduce a novel discriminative word alignment model, which we integrate into a Transformer-based machine translation model. In experiments based on a small number of labeled examples (~1.7K-5K sentences) we evaluate its performance intrinsically on both English-Chinese and English-Arabic alignment, where we achieve major improvements over unsupervised baselines (11-27 F1). We evaluate the mod…
▽ More
We introduce a novel discriminative word alignment model, which we integrate into a Transformer-based machine translation model. In experiments based on a small number of labeled examples (~1.7K-5K sentences) we evaluate its performance intrinsically on both English-Chinese and English-Arabic alignment, where we achieve major improvements over unsupervised baselines (11-27 F1). We evaluate the model extrinsically on data projection for Chinese NER, showing that our alignments lead to higher performance when used to project NER tags from English to Chinese. Finally, we perform an ablation analysis and an annotation experiment that jointly support the utility and feasibility of future manual alignment elicitation.
△ Less
Submitted 1 September, 2019;
originally announced September 2019.
-
Membership Inference Attacks on Sequence-to-Sequence Models: Is My Data In Your Machine Translation System?
Authors:
Sorami Hisamoto,
Matt Post,
Kevin Duh
Abstract:
Data privacy is an important issue for "machine learning as a service" providers. We focus on the problem of membership inference attacks: given a data sample and black-box access to a model's API, determine whether the sample existed in the model's training data. Our contribution is an investigation of this problem in the context of sequence-to-sequence models, which are important in applications…
▽ More
Data privacy is an important issue for "machine learning as a service" providers. We focus on the problem of membership inference attacks: given a data sample and black-box access to a model's API, determine whether the sample existed in the model's training data. Our contribution is an investigation of this problem in the context of sequence-to-sequence models, which are important in applications such as machine translation and video captioning. We define the membership inference problem for sequence generation, provide an open dataset based on state-of-the-art machine translation models, and report initial results on whether these models leak private information against several kinds of membership inference attacks.
△ Less
Submitted 16 March, 2020; v1 submitted 10 April, 2019;
originally announced April 2019.
-
ParaBank: Monolingual Bitext Generation and Sentential Paraphrasing via Lexically-constrained Neural Machine Translation
Authors:
J. Edward Hu,
Rachel Rudinger,
Matt Post,
Benjamin Van Durme
Abstract:
We present ParaBank, a large-scale English paraphrase dataset that surpasses prior work in both quantity and quality. Following the approach of ParaNMT, we train a Czech-English neural machine translation (NMT) system to generate novel paraphrases of English reference sentences. By adding lexical constraints to the NMT decoding procedure, however, we are able to produce multiple high-quality sente…
▽ More
We present ParaBank, a large-scale English paraphrase dataset that surpasses prior work in both quantity and quality. Following the approach of ParaNMT, we train a Czech-English neural machine translation (NMT) system to generate novel paraphrases of English reference sentences. By adding lexical constraints to the NMT decoding procedure, however, we are able to produce multiple high-quality sentential paraphrases per source sentence, yielding an English paraphrase resource with more than 4 billion generated tokens and exhibiting greater lexical diversity. Using human judgments, we also demonstrate that ParaBank's paraphrases improve over ParaNMT on both semantic similarity and fluency. Finally, we use ParaBank to train a monolingual NMT model with the same support for lexically-constrained decoding for sentence rewriting tasks.
△ Less
Submitted 11 January, 2019;
originally announced January 2019.
-
A Call for Clarity in Reporting BLEU Scores
Authors:
Matt Post
Abstract:
The field of machine translation faces an under-recognized problem because of inconsistency in the reporting of scores from its dominant metric. Although people refer to "the" BLEU score, BLEU is in fact a parameterized metric whose values can vary wildly with changes to these parameters. These parameters are often not reported or are hard to find, and consequently, BLEU scores between papers cann…
▽ More
The field of machine translation faces an under-recognized problem because of inconsistency in the reporting of scores from its dominant metric. Although people refer to "the" BLEU score, BLEU is in fact a parameterized metric whose values can vary wildly with changes to these parameters. These parameters are often not reported or are hard to find, and consequently, BLEU scores between papers cannot be directly compared. I quantify this variation, finding differences as high as 1.8 between commonly used configurations. The main culprit is different tokenization and normalization schemes applied to the reference. Pointing to the success of the parsing community, I suggest machine translation researchers settle upon the BLEU scheme used by the annual Conference on Machine Translation (WMT), which does not allow for user-supplied reference processing, and provide a new tool, SacreBLEU, to facilitate this.
△ Less
Submitted 12 September, 2018; v1 submitted 23 April, 2018;
originally announced April 2018.
-
Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation
Authors:
Matt Post,
David Vilar
Abstract:
The end-to-end nature of neural machine translation (NMT) removes many ways of manually guiding the translation process that were available in older paradigms. Recent work, however, has introduced a new capability: lexically constrained or guided decoding, a modification to beam search that forces the inclusion of pre-specified words and phrases in the output. However, while theoretically sound, e…
▽ More
The end-to-end nature of neural machine translation (NMT) removes many ways of manually guiding the translation process that were available in older paradigms. Recent work, however, has introduced a new capability: lexically constrained or guided decoding, a modification to beam search that forces the inclusion of pre-specified words and phrases in the output. However, while theoretically sound, existing approaches have computational complexities that are either linear (Hokamp and Liu, 2017) or exponential (Anderson et al., 2017) in the number of constraints. We present a algorithm for lexically constrained decoding with a complexity of O(1) in the number of constraints. We demonstrate the algorithms remarkable ability to properly place these constraints, and use it to explore the shaky relationship between model and BLEU scores. Our implementation is available as part of Sockeye.
△ Less
Submitted 9 November, 2018; v1 submitted 18 April, 2018;
originally announced April 2018.
-
Sockeye: A Toolkit for Neural Machine Translation
Authors:
Felix Hieber,
Tobias Domhan,
Michael Denkowski,
David Vilar,
Artem Sokolov,
Ann Clifton,
Matt Post
Abstract:
We describe Sockeye (version 1.12), an open-source sequence-to-sequence toolkit for Neural Machine Translation (NMT). Sockeye is a production-ready framework for training and applying models as well as an experimental platform for researchers. Written in Python and built on MXNet, the toolkit offers scalable training and inference for the three most prominent encoder-decoder architectures: attenti…
▽ More
We describe Sockeye (version 1.12), an open-source sequence-to-sequence toolkit for Neural Machine Translation (NMT). Sockeye is a production-ready framework for training and applying models as well as an experimental platform for researchers. Written in Python and built on MXNet, the toolkit offers scalable training and inference for the three most prominent encoder-decoder architectures: attentional recurrent neural networks, self-attentional transformers, and fully convolutional networks. Sockeye also supports a wide range of optimizers, normalization and regularization techniques, and inference improvements from current NMT literature. Users can easily run standard training recipes, explore different model settings, and incorporate new ideas. In this paper, we highlight Sockeye's features and benchmark it against other NMT toolkits on two language arcs from the 2017 Conference on Machine Translation (WMT): English-German and Latvian-English. We report competitive BLEU scores across all three architectures, including an overall best score for Sockeye's transformer implementation. To facilitate further comparison, we release all system outputs and training scripts used in our experiments. The Sockeye toolkit is free software released under the Apache 2.0 license.
△ Less
Submitted 1 June, 2018; v1 submitted 15 December, 2017;
originally announced December 2017.
-
Grammatical Error Correction with Neural Reinforcement Learning
Authors:
Keisuke Sakaguchi,
Matt Post,
Benjamin Van Durme
Abstract:
We propose a neural encoder-decoder model with reinforcement learning (NRL) for grammatical error correction (GEC). Unlike conventional maximum likelihood estimation (MLE), the model directly optimizes towards an objective that considers a sentence-level, task-specific evaluation metric, avoiding the exposure bias issue in MLE. We demonstrate that NRL outperforms MLE both in human and automated ev…
▽ More
We propose a neural encoder-decoder model with reinforcement learning (NRL) for grammatical error correction (GEC). Unlike conventional maximum likelihood estimation (MLE), the model directly optimizes towards an objective that considers a sentence-level, task-specific evaluation metric, avoiding the exposure bias issue in MLE. We demonstrate that NRL outperforms MLE both in human and automated evaluation metrics, achieving the state-of-the-art on a fluency-oriented GEC corpus.
△ Less
Submitted 2 July, 2017;
originally announced July 2017.
-
Using of heterogeneous corpora for training of an ASR system
Authors:
Jan Trmal,
Gaurav Kumar,
Vimal Manohar,
Sanjeev Khudanpur,
Matt Post,
Paul McNamee
Abstract:
The paper summarizes the development of the LVCSR system built as a part of the Pashto speech-translation system at the SCALE (Summer Camp for Applied Language Exploration) 2015 workshop on "Speech-to-text-translation for low-resource languages". The Pashto language was chosen as a good "proxy" low-resource language, exhibiting multiple phenomena which make the speech-recognition and and speech-to…
▽ More
The paper summarizes the development of the LVCSR system built as a part of the Pashto speech-translation system at the SCALE (Summer Camp for Applied Language Exploration) 2015 workshop on "Speech-to-text-translation for low-resource languages". The Pashto language was chosen as a good "proxy" low-resource language, exhibiting multiple phenomena which make the speech-recognition and and speech-to-text-translation systems development hard.
Even when the amount of data is seemingly sufficient, given the fact that the data originates from multiple sources, the preliminary experiments reveal that there is little to no benefit in merging (concatenating) the corpora and more elaborate ways of making use of all of the data must be worked out.
This paper concentrates only on the LVCSR part and presents a range of different techniques that were found to be useful in order to benefit from multiple different corpora
△ Less
Submitted 1 June, 2017;
originally announced June 2017.
-
Robsut Wrod Reocginiton via semi-Character Recurrent Neural Network
Authors:
Keisuke Sakaguchi,
Kevin Duh,
Matt Post,
Benjamin Van Durme
Abstract:
Language processing mechanism by humans is generally more robust than computers. The Cmabrigde Uinervtisy (Cambridge University) effect from the psycholinguistics literature has demonstrated such a robust word processing mechanism, where jumbled words (e.g. Cmabrigde / Cambridge) are recognized with little cost. On the other hand, computational models for word recognition (e.g. spelling checkers)…
▽ More
Language processing mechanism by humans is generally more robust than computers. The Cmabrigde Uinervtisy (Cambridge University) effect from the psycholinguistics literature has demonstrated such a robust word processing mechanism, where jumbled words (e.g. Cmabrigde / Cambridge) are recognized with little cost. On the other hand, computational models for word recognition (e.g. spelling checkers) perform poorly on data with such noise. Inspired by the findings from the Cmabrigde Uinervtisy effect, we propose a word recognition model based on a semi-character level recurrent neural network (scRNN). In our experiments, we demonstrate that scRNN has significantly more robust performance in word spelling correction (i.e. word recognition) compared to existing spelling checkers and character-based convolutional neural network. Furthermore, we demonstrate that the model is cognitively plausible by replicating a psycholinguistics experiment about human reading difficulty using our model.
△ Less
Submitted 7 February, 2017; v1 submitted 7 August, 2016;
originally announced August 2016.
-
GLEU Without Tuning
Authors:
Courtney Napoles,
Keisuke Sakaguchi,
Matt Post,
Joel Tetreault
Abstract:
The GLEU metric was proposed for evaluating grammatical error corrections using n-gram overlap with a set of reference sentences, as opposed to precision/recall of specific annotated errors (Napoles et al., 2015). This paper describes improvements made to the GLEU metric that address problems that arise when using an increasing number of reference sets. Unlike the originally presented metric, the…
▽ More
The GLEU metric was proposed for evaluating grammatical error corrections using n-gram overlap with a set of reference sentences, as opposed to precision/recall of specific annotated errors (Napoles et al., 2015). This paper describes improvements made to the GLEU metric that address problems that arise when using an increasing number of reference sets. Unlike the originally presented metric, the modified metric does not require tuning. We recommend that this version be used instead of the original version.
△ Less
Submitted 9 May, 2016;
originally announced May 2016.