-
Definition generation for lexical semantic change detection
Authors:
Mariia Fedorova,
Andrey Kutuzov,
Yves Scherrer
Abstract:
We use contextualized word definitions generated by large language models as semantic representations in the task of diachronic lexical semantic change detection (LSCD). In short, generated definitions are used as `senses', and the change score of a target word is retrieved by comparing their distributions in two time periods under comparison. On the material of five datasets and three languages,…
▽ More
We use contextualized word definitions generated by large language models as semantic representations in the task of diachronic lexical semantic change detection (LSCD). In short, generated definitions are used as `senses', and the change score of a target word is retrieved by comparing their distributions in two time periods under comparison. On the material of five datasets and three languages, we show that generated definitions are indeed specific and general enough to convey a signal sufficient to rank sets of words by the degree of their semantic change over time. Our approach is on par with or outperforms prior non-supervised sense-based LSCD methods. At the same time, it preserves interpretability and allows to inspect the reasons behind a specific shift in terms of discrete definitions-as-senses. This is another step in the direction of explainable semantic change modeling.
△ Less
Submitted 31 July, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.
-
Explainability of machine learning approaches in forensic linguistics: a case study in geolinguistic authorship profiling
Authors:
Dana Roemling,
Yves Scherrer,
Aleksandra Miletic
Abstract:
Forensic authorship profiling uses linguistic markers to infer characteristics about an author of a text. This task is paralleled in dialect classification, where a prediction is made about the linguistic variety of a text based on the text itself. While there have been significant advances in recent years in variety classification, forensic linguistics rarely relies on these approaches due to the…
▽ More
Forensic authorship profiling uses linguistic markers to infer characteristics about an author of a text. This task is paralleled in dialect classification, where a prediction is made about the linguistic variety of a text based on the text itself. While there have been significant advances in recent years in variety classification, forensic linguistics rarely relies on these approaches due to their lack of transparency, among other reasons. In this paper we therefore explore the explainability of machine learning approaches considering the forensic context. We focus on variety classification as a means of geolinguistic profiling of unknown texts based on social media data from the German-speaking area. For this, we identify the lexical items that are the most impactful for the variety classification. We find that the extracted lexical features are indeed representative of their respective varieties and note that the trained models also rely on place names for classifications.
△ Less
Submitted 1 July, 2024; v1 submitted 29 April, 2024;
originally announced April 2024.
-
Findings of the VarDial Evaluation Campaign 2023
Authors:
Noëmi Aepli,
Çağrı Çöltekin,
Rob Van Der Goot,
Tommi Jauhiainen,
Mourhaf Kazzaz,
Nikola Ljubešić,
Kai North,
Barbara Plank,
Yves Scherrer,
Marcos Zampieri
Abstract:
This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2023. The campaign is part of the tenth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2023. Three separate shared tasks were included this year: Slot and intent detection for low-resource language varieties (SID4LR),…
▽ More
This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2023. The campaign is part of the tenth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2023. Three separate shared tasks were included this year: Slot and intent detection for low-resource language varieties (SID4LR), Discriminating Between Similar Languages -- True Labels (DSL-TL), and Discriminating Between Similar Languages -- Speech (DSL-S). All three tasks were organized for the first time this year.
△ Less
Submitted 31 May, 2023;
originally announced May 2023.
-
Democratizing Neural Machine Translation with OPUS-MT
Authors:
Jörg Tiedemann,
Mikko Aulamo,
Daria Bakshandaeva,
Michele Boggia,
Stig-Arne Grönroos,
Tommi Nieminen,
Alessandro Raganato,
Yves Scherrer,
Raul Vazquez,
Sami Virpioja
Abstract:
This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our on-going mission of increasing language coverage and translation quality, and also describe on-going work on the development of modular translation models and speed-opt…
▽ More
This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our on-going mission of increasing language coverage and translation quality, and also describe on-going work on the development of modular translation models and speed-optimized compact solutions for real-time translation on regular desktops and small devices.
△ Less
Submitted 4 July, 2023; v1 submitted 4 December, 2022;
originally announced December 2022.
-
Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation
Authors:
Alessandro Raganato,
Yves Scherrer,
Jörg Tiedemann
Abstract:
Transformer-based models have brought a radical change to neural machine translation. A key feature of the Transformer architecture is the so-called multi-head attention mechanism, which allows the model to focus simultaneously on different parts of the input. However, recent works have shown that most attention heads learn simple, and often redundant, positional patterns. In this paper, we propos…
▽ More
Transformer-based models have brought a radical change to neural machine translation. A key feature of the Transformer architecture is the so-called multi-head attention mechanism, which allows the model to focus simultaneously on different parts of the input. However, recent works have shown that most attention heads learn simple, and often redundant, positional patterns. In this paper, we propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns that are solely based on position and do not require any external knowledge. Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality and even increases BLEU scores by up to 3 points in low-resource scenarios.
△ Less
Submitted 5 October, 2020; v1 submitted 24 February, 2020;
originally announced February 2020.
-
The University of Helsinki submissions to the WMT19 news translation task
Authors:
Aarne Talman,
Umut Sulubacak,
Raúl Vázquez,
Yves Scherrer,
Sami Virpioja,
Alessandro Raganato,
Arvi Hurskainen,
Jörg Tiedemann
Abstract:
In this paper, we present the University of Helsinki submissions to the WMT 2019 shared task on news translation in three language pairs: English-German, English-Finnish and Finnish-English. This year, we focused first on cleaning and filtering the training data using multiple data-filtering approaches, resulting in much smaller and cleaner training sets. For English-German, we trained both senten…
▽ More
In this paper, we present the University of Helsinki submissions to the WMT 2019 shared task on news translation in three language pairs: English-German, English-Finnish and Finnish-English. This year, we focused first on cleaning and filtering the training data using multiple data-filtering approaches, resulting in much smaller and cleaner training sets. For English-German, we trained both sentence-level transformer models and compared different document-level translation approaches. For Finnish-English and English-Finnish we focused on different segmentation approaches, and we also included a rule-based system for English-Finnish.
△ Less
Submitted 10 June, 2019;
originally announced June 2019.
-
Measuring Semantic Abstraction of Multilingual NMT with Paraphrase Recognition and Generation Tasks
Authors:
Jörg Tiedemann,
Yves Scherrer
Abstract:
In this paper, we investigate whether multilingual neural translation models learn stronger semantic abstractions of sentences than bilingual ones. We test this hypotheses by measuring the perplexity of such models when applied to paraphrases of the source language. The intuition is that an encoder produces better representations if a decoder is capable of recognizing synonymous sentences in the s…
▽ More
In this paper, we investigate whether multilingual neural translation models learn stronger semantic abstractions of sentences than bilingual ones. We test this hypotheses by measuring the perplexity of such models when applied to paraphrases of the source language. The intuition is that an encoder produces better representations if a decoder is capable of recognizing synonymous sentences in the same language even though the model is never trained for that task. In our setup, we add 16 different auxiliary languages to a bidirectional bilingual baseline model (English-French) and test it with in-domain and out-of-domain paraphrases in English. The results show that the perplexity is significantly reduced in each of the cases, indicating that meaning can be grounded in translation. This is further supported by a study on paraphrase generation that we also include at the end of the paper.
△ Less
Submitted 3 May, 2019; v1 submitted 21 August, 2018;
originally announced August 2018.
-
Neural Machine Translation with Extended Context
Authors:
Jörg Tiedemann,
Yves Scherrer
Abstract:
We investigate the use of extended context in attention-based neural machine translation. We base our experiments on translated movie subtitles and discuss the effect of increasing the segments beyond single translation units. We study the use of extended source language context as well as bilingual context extensions. The models learn to distinguish between information from different segments and…
▽ More
We investigate the use of extended context in attention-based neural machine translation. We base our experiments on translated movie subtitles and discuss the effect of increasing the segments beyond single translation units. We study the use of extended source language context as well as bilingual context extensions. The models learn to distinguish between information from different segments and are surprisingly robust with respect to translation quality. In this pilot study, we observe interesting cross-sentential attention patterns that improve textual coherence in translation at least in some selected cases.
△ Less
Submitted 20 August, 2017;
originally announced August 2017.
-
The Helsinki Neural Machine Translation System
Authors:
Robert Östling,
Yves Scherrer,
Jörg Tiedemann,
Gongbo Tang,
Tommi Nieminen
Abstract:
We introduce the Helsinki Neural Machine Translation system (HNMT) and how it is applied in the news translation task at WMT 2017, where it ranked first in both the human and automatic evaluations for English--Finnish. We discuss the success of English--Finnish translations and the overall advantage of NMT over a strong SMT baseline. We also discuss our submissions for English--Latvian, English--C…
▽ More
We introduce the Helsinki Neural Machine Translation system (HNMT) and how it is applied in the news translation task at WMT 2017, where it ranked first in both the human and automatic evaluations for English--Finnish. We discuss the success of English--Finnish translations and the overall advantage of NMT over a strong SMT baseline. We also discuss our submissions for English--Latvian, English--Chinese and Chinese--English.
△ Less
Submitted 20 August, 2017;
originally announced August 2017.