-
FastSpell: the LangId Magic Spell
Authors:
Marta Bañón,
Jaume Zaragoza-Bernabeu,
Gema Ramírez-Sánchez,
Sergio Ortiz-Rojas
Abstract:
Language identification is a crucial component in the automated production of language resources, particularly in multilingual and big data contexts. However, commonly used language identifiers struggle to differentiate between similar or closely-related languages. This paper introduces FastSpell, a language identifier that combines fastText (a pre-trained language identifier tool) and Hunspell (a…
▽ More
Language identification is a crucial component in the automated production of language resources, particularly in multilingual and big data contexts. However, commonly used language identifiers struggle to differentiate between similar or closely-related languages. This paper introduces FastSpell, a language identifier that combines fastText (a pre-trained language identifier tool) and Hunspell (a spell checker) with the aim of having a refined second-opinion before deciding which language should be assigned to a text. We provide a description of the FastSpell algorithm along with an explanation on how to use and configure it. To that end, we motivate the need of such a tool and present a benchmark including some popular language identifiers evaluated during the development of FastSpell. We show how FastSpell is useful not only to improve identification of similar languages, but also to identify new ones ignored by other tools.
△ Less
Submitted 12 April, 2024;
originally announced April 2024.
-
A New Massive Multilingual Dataset for High-Performance Language Technologies
Authors:
Ona de Gibert,
Graeme Nail,
Nikolay Arefyev,
Marta Bañón,
Jelmer van der Linde,
Shaoxiong Ji,
Jaume Zaragoza-Bernabeu,
Mikko Aulamo,
Gema Ramírez-Sánchez,
Andrey Kutuzov,
Sampo Pyysalo,
Stephan Oepen,
Jörg Tiedemann
Abstract:
We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performa…
▽ More
We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ~5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.
△ Less
Submitted 20 March, 2024;
originally announced March 2024.
-
Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages
Authors:
Rik van Noord,
Taja Kuzman,
Peter Rupnik,
Nikola Ljubešić,
Miquel Esplà-Gomis,
Gema Ramírez-Sánchez,
Antonio Toral
Abstract:
Large, curated, web-crawled corpora play a vital role in training language models (LMs). They form the lion's share of the training data in virtually all recent LMs, such as the well-known GPT, LLaMA and XLM-RoBERTa models. However, despite this importance, relatively little attention has been given to the quality of these corpora. In this paper, we compare four of the currently most relevant larg…
▽ More
Large, curated, web-crawled corpora play a vital role in training language models (LMs). They form the lion's share of the training data in virtually all recent LMs, such as the well-known GPT, LLaMA and XLM-RoBERTa models. However, despite this importance, relatively little attention has been given to the quality of these corpora. In this paper, we compare four of the currently most relevant large, web-crawled corpora (CC100, MaCoCu, mC4 and OSCAR) across eleven lower-resourced European languages. Our approach is two-fold: first, we perform an intrinsic evaluation by performing a human evaluation of the quality of samples taken from different corpora; then, we assess the practical impact of the qualitative differences by training specific LMs on each of the corpora and evaluating their performance on downstream tasks. We find that there are clear differences in quality of the corpora, with MaCoCu and OSCAR obtaining the best results. However, during the extrinsic evaluation, we actually find that the CC100 corpus achieves the highest scores. We conclude that, in our experiments, the quality of the web-crawled corpora does not seem to play a significant role when training LMs.
△ Less
Submitted 13 March, 2024;
originally announced March 2024.
-
OpusCleaner and OpusTrainer, open source toolkits for training Machine Translation and Large language models
Authors:
Nikolay Bogoychev,
Jelmer van der Linde,
Graeme Nail,
Barry Haddow,
Jaume Zaragoza-Bernabeu,
Gema Ramírez-Sánchez,
Lukas Weymann,
Tudor Nicolae Mateiu,
Jindřich Helcl,
Mikko Aulamo
Abstract:
Developing high quality machine translation systems is a labour intensive, challenging and confusing process for newcomers to the field. We present a pair of tools OpusCleaner and OpusTrainer that aim to simplify the process, reduce the amount of work and lower the entry barrier for newcomers.
OpusCleaner is a data downloading, cleaning, and proprocessing toolkit. It is designed to allow researc…
▽ More
Developing high quality machine translation systems is a labour intensive, challenging and confusing process for newcomers to the field. We present a pair of tools OpusCleaner and OpusTrainer that aim to simplify the process, reduce the amount of work and lower the entry barrier for newcomers.
OpusCleaner is a data downloading, cleaning, and proprocessing toolkit. It is designed to allow researchers to quickly download, visualise and preprocess bilingual (or monolingual) data that comes from many different sources, each of them with different quality, issues, and unique filtering/preprocessing requirements.
OpusTrainer is a data scheduling and data augmenting tool aimed at building large scale, robust machine translation systems and large language models. It features deterministic data mixing from many different sources, on-the-fly data augmentation and more.
Using these tools, we showcase how we can use it to create high quality machine translation model robust to noisy user input; multilingual models and terminology aware models.
△ Less
Submitted 24 November, 2023;
originally announced November 2023.
-
Statistical sentiment analysis performance in Opinum
Authors:
Boyan Bonev,
Gema Ramírez-Sánchez,
Sergio Ortiz Rojas
Abstract:
The classification of opinion texts in positive and negative is becoming a subject of great interest in sentiment analysis. The existence of many labeled opinions motivates the use of statistical and machine-learning methods. First-order statistics have proven to be very limited in this field. The Opinum approach is based on the order of the words without using any syntactic and semantic informati…
▽ More
The classification of opinion texts in positive and negative is becoming a subject of great interest in sentiment analysis. The existence of many labeled opinions motivates the use of statistical and machine-learning methods. First-order statistics have proven to be very limited in this field. The Opinum approach is based on the order of the words without using any syntactic and semantic information. It consists of building one probabilistic model for the positive and another one for the negative opinions. Then the test opinions are compared to both models and a decision and confidence measure are calculated. In order to reduce the complexity of the training corpus we first lemmatize the texts and we replace most named-entities with wildcards. Opinum presents an accuracy above 81% for Spanish opinions in the financial products domain. In this work we discuss which are the most important factors that have impact on the classification performance.
△ Less
Submitted 2 March, 2013;
originally announced March 2013.