Search | arXiv e-print repository

Sentence-level Aggregation of Lexical Metrics Correlate Stronger with Human Judgements than Corpus-level Aggregation

Authors: Paulo Cavalin, Pedro Henrique Domingues, Claudio Pinhanez

Abstract: In this paper we show that corpus-level aggregation hinders considerably the capability of lexical metrics to accurately evaluate machine translation (MT) systems. With empirical experiments we demonstrate that averaging individual segment-level scores can make metrics such as BLEU and chrF correlate much stronger with human judgements and make them behave considerably more similar to neural metrics such as COMET and BLEURT. We show that this difference exists because corpus- and segment-level aggregation differs considerably owing to the classical average of ratio versus ratio of averages Mathematical problem. Moreover, as we also show, such difference affects considerably the statistical robustness of corpus-level aggregation. Considering that neural metrics currently only cover a small set of sufficiently-resourced languages, the results in this paper can help make the evaluation of MT systems for low-resource languages more trustworthy.

Submitted 3 July, 2024; originally announced July 2024.
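
The "average of ratio versus ratio of averages" point in the abstract above can be illustrated with a toy precision-style score. This is only a sketch under stated assumptions: the per-segment counts are made up for illustration, the function names `corpus_level` and `segment_level` are hypothetical, and the score is a plain matched/total ratio, not BLEU or chrF as computed in the paper.

```python
# Toy illustration only: the (matched, total) n-gram counts below are
# hypothetical, and the score is a simple ratio, not BLEU or chrF.

def corpus_level(segments):
    """Ratio of sums: pool the counts over all segments, then divide once."""
    matched = sum(m for m, _ in segments)
    total = sum(t for _, t in segments)
    return matched / total


def segment_level(segments):
    """Average of ratios: score each segment on its own, then take the mean."""
    return sum(m / t for m, t in segments) / len(segments)


# One short, poorly matched segment and one long, well matched segment.
segments = [(1, 2), (90, 100)]

# The pooled ratio is dominated by the long segment; the mean of per-segment
# ratios gives both segments equal weight, so the two aggregates disagree.
print("corpus-level :", corpus_level(segments))
print("segment-level:", segment_level(segments))
```

In this toy case the pooled score is 91/102 while the per-segment mean is roughly 0.7, which shows how the two aggregation schemes can rank the same outputs differently.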

arXiv:2407.12620

Harnessing the Power of Artificial Intelligence to Vitalize Endangered Indigenous Languages: Technologies and Experiences

Authors: Claudio Pinhanez, Paulo Cavalin, Luciana Storto, Thomas Finbow, Alexander Cobbinah, Julio Nogima, Marisa Vasconcelos, Pedro Domingues, Priscila de Souza Mizukami, Nicole Grell, Majoí Gongora, Isabel Gonçalves

Abstract: Since 2022 we have been exploring application areas and technologies in which Artificial Intelligence (AI) and modern Natural Language Processing (NLP), such as Large Language Models (LLMs), can be employed to foster the usage and facilitate the documentation of Indigenous languages which are in danger of disappearing. We start by discussing the decreasing diversity of languages in the world and how working with Indigenous languages poses unique ethical challenges for AI and NLP. To address those challenges, we propose an alternative development AI cycle based on community engagement and usage. Then, we report encouraging results in the development of high-quality machine learning translators for Indigenous languages by fine-tuning state-of-the-art (SOTA) translators with tiny amounts of data and discuss how to avoid some common pitfalls in the process. We also present prototypes we have built in projects done in 2023 and 2024 with Indigenous communities in Brazil, aimed at facilitating writing, and discuss the development of Indigenous Language Models (ILMs) as a replicable and scalable way to create spell-checkers, next-word predictors, and similar tools. Finally, we discuss how we envision a future for language documentation where dying languages are preserved as interactive language models.

Submitted 29 July, 2024; v1 submitted 17 July, 2024; originally announced July 2024.

arXiv:2402.19204

PeLLE: Encoder-based language models for Brazilian Portuguese based on open data

Authors: Guilherme Lamartine de Mello, Marcelo Finger, Felipe Serras, Miguel de Mello Carpi, Marcos Menon Jose, Pedro Henrique Domingues, Paulo Cavalim

Abstract: In this paper we present PeLLE, a family of large language models based on the RoBERTa architecture, for Brazilian Portuguese, trained on curated, open data from the Carolina corpus. Aiming at reproducible results, we describe details of the pretraining of the models. We also evaluate PeLLE models against a set of existing multilingual and PT-BR refined pretrained Transformer-based LLM encoders, contrasting performance of large versus smaller-but-curated pretrained models in several downstream tasks. We conclude that several tasks perform better with larger models, but some tasks benefit from smaller-but-curated data in its pretraining.

Submitted 29 February, 2024; originally announced February 2024.

Comments: 15 pages

ACM Class: I.2.7

arXiv:1707.06336

Open Source Software for Digital Preservation Repositories: a Survey

Authors: Carlos André Rosa, Olga Craveiro, Patricio Domingues

Abstract: In the digital age, the amount of data produced is growing exponentially. Governments and institutions can no longer rely on old methods for storing data and passing on the knowledge to future generations. Digital data preservation is a mandatory issue that needs proper strategies and tools. With this awareness, efforts are being made to create and perfect software solutions capable of responding to the challenge of properly preserving digital information. This paper focuses on the state-of-the-art in open-source software solutions for the digital preservation and curation field used to assimilate and disseminate information to designated audiences. Eleven open source projects for digital preservation are surveyed in areas such as supported standards and protocols, strategies for preservation, methodologies for reporting, dynamic of development, targeted operating systems, multilingual support and open source license. Furthermore, five of these open source projects, are further analysed, with focus on features deemed important for the area. Along open source solutions, the paper also briefly surveys the standards and protocols relevant for digital data preservation. The area of digital data preservation repositories has several open source solutions, which can form the base to overcome the challenges to reach mature and reliable digital data preservation.

Submitted 19 July, 2017; originally announced July 2017.

Comments: http://airccse.org/journal/ijcses/

Journal ref: International Journal of Computer Science & Engineering Survey (IJCSES) Vol.8, No.3, June 2017

Showing 1–4 of 4 results for author: Domingues, P