Showing 1–22 of 22 results for author: Tonja, A L

  1. arXiv:2408.17024  [pdf, other]

    cs.CL

    InkubaLM: A small language model for low-resource African languages

    Authors: Atnafu Lambebo Tonja, Bonaventure F. P. Dossou, Jessica Ojo, Jenalea Rajab, Fadel Thior, Eric Peter Wairagala, Aremu Anuoluwapo, Pelonomi Moiloa, Jade Abbott, Vukosi Marivate, Benjamin Rosman

    Abstract: High-resource language models often fall short in the African context, where there is a critical need for models that are efficient, accessible, and locally relevant, even amidst significant computing and data constraints. This paper introduces InkubaLM, a small language model with 0.4 billion parameters, which achieves performance comparable to models with significantly larger parameter counts an…

    Submitted 30 August, 2024; originally announced August 2024.

  2. arXiv:2406.05967  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

    Authors: David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo Tonja, Bontu Fufa Balcha, Chenxi Whitehouse, Christian Salamea, Dan John Velasco, David Ifeoluwa Adelani, David Le Meur, Emilio Villa-Cueva, Fajri Koto, Fauzan Farooqui, Frederico Belcavello, Ganzorig Batnasan, Gisela Vallejo, Grainne Caulfield, Guido Ivetta, Haiyue Song, et al. (50 additional authors not shown)

    Abstract: Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recen…

    Submitted 9 June, 2024; originally announced June 2024.

  3. arXiv:2404.05365  [pdf, other]

    cs.CL

    NLP Progress in Indigenous Latin American Languages

    Authors: Atnafu Lambebo Tonja, Fazlourrahman Balouchzahi, Sabur Butt, Olga Kolesnikova, Hector Ceballos, Alexander Gelbukh, Thamar Solorio

    Abstract: The paper focuses on the marginalization of indigenous language communities in the face of rapid technological advancements. We highlight the cultural richness of these languages and the risk they face of being overlooked in the realm of Natural Language Processing (NLP). We aim to bridge the gap between these communities and researchers, emphasizing the need for inclusive technological advancemen…

    Submitted 12 May, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

    Comments: Accepted at NAACL 2024

  4. arXiv:2403.19365  [pdf, other]

    cs.CL

    EthioMT: Parallel Corpus for Low-resource Ethiopian Languages

    Authors: Atnafu Lambebo Tonja, Olga Kolesnikova, Alexander Gelbukh, Jugal Kalita

    Abstract: Recent research in natural language processing (NLP) has achieved impressive performance in tasks such as machine translation (MT), news classification, and question-answering in high-resource languages. However, the performance of MT leaves much to be desired for low-resource languages. This is due to the smaller size of available parallel corpora in these languages, if such corpora are available…

    Submitted 28 March, 2024; originally announced March 2024.

    Comments: Accepted at the Fifth Workshop on Resources for African Indigenous Languages (RAIL) 2024 (LREC-COLING 2024)

  5. arXiv:2403.13737  [pdf, ps, other]

    cs.CL

    EthioLLM: Multilingual Large Language Models for Ethiopian Languages with Task Evaluation

    Authors: Atnafu Lambebo Tonja, Israel Abebe Azime, Tadesse Destaw Belay, Mesay Gemeda Yigezu, Moges Ahmed Mehamed, Abinew Ali Ayele, Ebrahim Chekol Jibril, Michael Melese Woldeyohannis, Olga Kolesnikova, Philipp Slusallek, Dietrich Klakow, Shengwu Xiong, Seid Muhie Yimam

    Abstract: Large language models (LLMs) have gained popularity recently due to their outstanding performance in various downstream Natural Language Processing (NLP) tasks. However, low-resource languages are still lagging behind current state-of-the-art (SOTA) developments in the field of NLP due to insufficient resources to train LLMs. Ethiopian languages exhibit remarkable linguistic diversity, encompassin…

    Submitted 23 June, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

    Comments: Accepted at LREC-COLING 2024

  6. arXiv:2402.08015  [pdf, other]

    cs.CL

    Walia-LLM: Enhancing Amharic-LLaMA by Integrating Task-Specific and Generative Datasets

    Authors: Israel Abebe Azime, Atnafu Lambebo Tonja, Tadesse Destaw Belay, Mitiku Yohannes Fuge, Aman Kassahun Wassie, Eyasu Shiferaw Jada, Yonas Chanie, Walelign Tewabe Sewunetie, Seid Muhie Yimam

    Abstract: Large language models (LLMs) have received a lot of attention in natural language processing (NLP) research because of their exceptional performance in understanding and generating human languages. However, low-resource languages are left behind due to the unavailability of resources. In this work, we focus on enhancing the LLaMA-2-Amharic model by integrating task-specific and generative datasets…

    Submitted 29 April, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

  7. arXiv:2312.04764  [pdf, other]

    cs.CL

    First Attempt at Building Parallel Corpora for Machine Translation of Northeast India's Very Low-Resource Languages

    Authors: Atnafu Lambebo Tonja, Melkamu Mersha, Ananya Kalita, Olga Kolesnikova, Jugal Kalita

    Abstract: This paper presents the creation of initial bilingual corpora for thirteen very low-resource languages of India, all from Northeast India. It also presents the results of initial translation efforts in these languages. It creates the first-ever parallel corpora for these languages and provides initial benchmark neural machine translation results for these languages. We intend to extend these corpo…

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: Accepted to ICON 2023

  8. arXiv:2310.13228  [pdf, other]

    cs.CL

    The Less the Merrier? Investigating Language Representation in Multilingual Models

    Authors: Hellina Hailu Nigatu, Atnafu Lambebo Tonja, Jugal Kalita

    Abstract: Multilingual Language Models offer a way to incorporate multiple languages in one model and utilize cross-language transfer learning to improve performance for different Natural Language Processing (NLP) tasks. Despite progress in multilingual models, not all languages are supported as well, particularly in low-resource settings. In this work, we investigate the linguistic representation of differ…

    Submitted 19 October, 2023; originally announced October 2023.

    Comments: Accepted to EMNLP 2023 (Findings)

  9. arXiv:2310.00274  [pdf, other]

    cs.CL

    AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR

    Authors: Tobi Olatunji, Tejumade Afonja, Aditya Yadavalli, Chris Chinenye Emezue, Sahib Singh, Bonaventure F. P. Dossou, Joanne Osuchukwu, Salomey Osei, Atnafu Lambebo Tonja, Naome Etori, Clinton Mbataku

    Abstract: Africa has a very low doctor-to-patient ratio. At very busy clinics, doctors could see 30+ patients per day -- a heavy patient burden compared with developed countries -- but productivity tools such as clinical automatic speech recognition (ASR) are lacking for these overworked clinicians. However, clinical ASR is mature, even ubiquitous, in developed nations, and clinician-reported performance of…

    Submitted 30 September, 2023; originally announced October 2023.

    Comments: Accepted to TACL 2023. This is a pre-MIT Press publication version

  10. arXiv:2306.01261  [pdf, other]

    cs.CL

    Automatic Translation of Hate Speech to Non-hate Speech in Social Media Texts

    Authors: Yevhen Kostiuk, Atnafu Lambebo Tonja, Grigori Sidorov, Olga Kolesnikova

    Abstract: In this paper, we investigate the issue of hate speech by presenting a novel task of translating hate speech into non-hate speech text while preserving its meaning. As a case study, we use Spanish texts. We provide a dataset and several baselines as a starting point for further research in the task. We evaluated our baseline results using multiple metrics, including BLEU scores. The aim of this st…

    Submitted 2 June, 2023; originally announced June 2023.

  11. arXiv:2306.00253  [pdf, other]

    cs.CL cs.CY

    AfriNames: Most ASR models "butcher" African Names

    Authors: Tobi Olatunji, Tejumade Afonja, Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Chris Chinenye Emezue, Amina Mardiyyah Rufai, Sahib Singh

    Abstract: Useful conversational agents must accurately capture named entities to minimize error for downstream tasks, for example, asking a voice assistant to play a track from a certain artist, initiating navigation to a specific location, or documenting a laboratory result for a patient. However, where named entities such as "Ukachukwu" (Igbo), "Lakicia" (Swahili), or "Ingabire" (Rwandan) are spoken…

    Submitted 2 June, 2023; v1 submitted 31 May, 2023; originally announced June 2023.

    Comments: Accepted at Interspeech 2023 (Main Conference)

  12. arXiv:2305.17406  [pdf, other]

    cs.CL

    Enhancing Translation for Indigenous Languages: Experiments with Multilingual Models

    Authors: Atnafu Lambebo Tonja, Hellina Hailu Nigatu, Olga Kolesnikova, Grigori Sidorov, Alexander Gelbukh, Jugal Kalita

    Abstract: This paper describes CIC NLP's submission to the AmericasNLP 2023 Shared Task on machine translation systems for indigenous languages of the Americas. We present the system descriptions for three methods. We used two multilingual models, namely M2M-100 and mBART50, and one bilingual (one-to-one) -- Helsinki NLP Spanish-English translation model, and experimented with different transfer learning se…

    Submitted 27 May, 2023; originally announced May 2023.

    Comments: Accepted to Third Workshop on NLP for Indigenous Languages of the Americas

  13. arXiv:2305.17404  [pdf, other]

    cs.CL

    Parallel Corpus for Indigenous Language Translation: Spanish-Mazatec and Spanish-Mixtec

    Authors: Atnafu Lambebo Tonja, Christian Maldonado-Sifuentes, David Alejandro Mendoza Castillo, Olga Kolesnikova, Noé Castro-Sánchez, Grigori Sidorov, Alexander Gelbukh

    Abstract: In this paper, we present a parallel Spanish-Mazatec and Spanish-Mixtec corpus for machine translation (MT) tasks, where Mazatec and Mixtec are two indigenous Mexican languages. We evaluated the usability of the collected corpus using three different approaches: transformer, transfer learning, and fine-tuning pre-trained multilingual MT models. Fine-tuning the Facebook M2M100-48 model outperformed…

    Submitted 27 May, 2023; originally announced May 2023.

    Comments: Accepted to Third Workshop on NLP for Indigenous Languages of the Americas

  14. arXiv:2305.06897  [pdf, other]

    cs.CL cs.AI cs.IR

    AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

    Authors: Odunayo Ogundepo, Tajuddeen R. Gwadabe, Clara E. Rivera, Jonathan H. Clark, Sebastian Ruder, David Ifeoluwa Adelani, Bonaventure F. P. Dossou, Abdou Aziz DIOP, Claytone Sikasote, Gilles Hacheme, Happy Buzaaba, Ignatius Ezeani, Rooweither Mabuya, Salomey Osei, Chris Emezue, Albert Njoroge Kahira, Shamsuddeen H. Muhammad, Akintunde Oladipo, Abraham Toluwase Owodunni, Atnafu Lambebo Tonja, Iyanuoluwa Shode, Akari Asai, Tunde Oluwaseyi Ajayi, Clemencia Siro, Steven Arthur, et al. (27 additional authors not shown)

    Abstract: African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems -- those that retrieve answer content from other languages while serving people in their native language -- offer a means of filling this gap. To this end, we create…

    Submitted 11 May, 2023; originally announced May 2023.

  15. arXiv:2304.12155  [pdf, ps, other]

    cs.CL cs.LG

    The African Stopwords project: curating stopwords for African languages

    Authors: Chris Emezue, Hellina Nigatu, Cynthia Thinwa, Helper Zhou, Shamsuddeen Muhammad, Lerato Louis, Idris Abdulmumin, Samuel Oyerinde, Benjamin Ajibade, Olanrewaju Samuel, Oviawe Joshua, Emeka Onwuegbuzia, Handel Emezue, Ifeoluwatayo A. Ige, Atnafu Lambebo Tonja, Chiamaka Chukwuneke, Bonaventure F. P. Dossou, Naome A. Etori, Mbonu Chinedu Emmanuel, Oreen Yousuf, Kaosarat Aina, Davis David

    Abstract: Stopwords are fundamental in Natural Language Processing (NLP) techniques for information retrieval. One of the common tasks in preprocessing of text data is the removal of stopwords. Currently, while high-resource languages like English benefit from the availability of several stopwords, low-resource languages, such as those found in the African continent, have none that are standardized and avai…

    Submitted 21 March, 2023; originally announced April 2023.

    Comments: Accepted at the AfricaNLP workshop at ICLR2022

  16. arXiv:2304.09972  [pdf, other]

    cs.CL

    MasakhaNEWS: News Topic Classification for African languages

    Authors: David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, Sana Al-Azzawi, Blessing Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, Abraham Owodunni, Nnaemeka Obiefuna, Muhidin Mohamed, Shamsuddeen Hassan Muhammad, Teshome Mulugeta Ababu, Saheed Abdullahi Salahudeen, et al. (40 additional authors not shown)

    Abstract: African languages are severely under-represented in NLP research due to lack of datasets covering several NLP tasks. While there are individual language specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographical and typologically-diverse African…

    Submitted 20 September, 2023; v1 submitted 19 April, 2023; originally announced April 2023.

    Comments: Accepted to IJCNLP-AACL 2023 (main conference)

  17. arXiv:2304.06459  [pdf, other]

    cs.CL cs.AI

    Masakhane-Afrisenti at SemEval-2023 Task 12: Sentiment Analysis using Afro-centric Language Models and Adapters for Low-resource African Languages

    Authors: Israel Abebe Azime, Sana Sabah Al-Azzawi, Atnafu Lambebo Tonja, Iyanuoluwa Shode, Jesujoba Alabi, Ayodele Awokoya, Mardiyyah Oduwole, Tosin Adewumi, Samuel Fanijo, Oyinkansola Awosan, Oreen Yousuf

    Abstract: AfriSenti-SemEval Shared Task 12 of SemEval-2023. The task aims to perform monolingual sentiment classification (sub-task A) for 12 African languages, multilingual sentiment classification (sub-task B), and zero-shot sentiment classification (task C). For sub-task A, we conducted experiments using classical machine learning classifiers, Afro-centric language models, and language-specific models. F…

    Submitted 13 April, 2023; originally announced April 2023.

    Comments: SemEval 2023

  18. arXiv:2303.16985  [pdf, other]

    cs.CL cs.AI

    Adapting to the Low-Resource Double-Bind: Investigating Low-Compute Methods on Low-Resource African Languages

    Authors: Colin Leong, Herumb Shandilya, Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Joel Mathew, Abdul-Hakeem Omotayo, Oreen Yousuf, Zainab Akinjobi, Chris Chinenye Emezue, Shamsudeen Muhammad, Steven Kolawole, Younwoo Choi, Tosin Adewumi

    Abstract: Many natural language processing (NLP) tasks make use of massively pre-trained language models, which are computationally expensive. However, access to high computational resources added to the issue of data scarcity of African languages constitutes a real barrier to research experiments on these languages. In this work, we explore the applicability of low-compute approaches such as language adapt…

    Submitted 29 March, 2023; originally announced March 2023.

    Comments: Accepted to AfricaNLP workshop at ICLR2023

  19. arXiv:2303.14406  [pdf, other]

    cs.CL

    Natural Language Processing in Ethiopian Languages: Current State, Challenges, and Opportunities

    Authors: Atnafu Lambebo Tonja, Tadesse Destaw Belay, Israel Abebe Azime, Abinew Ali Ayele, Moges Ahmed Mehamed, Olga Kolesnikova, Seid Muhie Yimam

    Abstract: This survey delves into the current state of natural language processing (NLP) for four Ethiopian languages: Amharic, Afaan Oromo, Tigrinya, and Wolaytta. Through this paper, we identify key challenges and opportunities for NLP research in Ethiopia. Furthermore, we provide a centralized repository on GitHub that contains publicly available resources for various NLP tasks in these languages. This r…

    Submitted 25 March, 2023; originally announced March 2023.

    Comments: Accepted to Fourth workshop on Resources for African Indigenous Languages (RAIL), EACL2023

  20. arXiv:2211.14459  [pdf, other]

    cs.CL cs.AI

    Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts

    Authors: Atnafu Lambebo Tonja, Mesay Gemeda Yigezu, Olga Kolesnikova, Moein Shahiki Tash, Grigori Sidorov, Alexander Gelbukh

    Abstract: Using code-mixed data in natural language processing (NLP) research currently gets a lot of attention. Language identification of social media code-mixed text has been an interesting problem of study in recent years due to the advancement and influences of social media in communication. This paper presents the Instituto Politécnico Nacional, Centro de Investigación en Computación (CIC) team's syst…

    Submitted 25 November, 2022; originally announced November 2022.

  21. arXiv:2211.03263  [pdf, other]

    cs.CL cs.AI cs.LG

    AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages

    Authors: Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Oreen Yousuf, Salomey Osei, Abigail Oppong, Iyanuoluwa Shode, Oluwabusayo Olufunke Awoyomi, Chris Chinenye Emezue

    Abstract: In recent years, multilingual pre-trained language models have gained prominence due to their remarkable performance on numerous downstream Natural Language Processing tasks (NLP). However, pre-training these large multilingual language models requires a lot of training data, which is not available for African Languages. Active learning is a semi-supervised learning algorithm, in which a model con…

    Submitted 23 November, 2022; v1 submitted 6 November, 2022; originally announced November 2022.

    Comments: Third Workshop on Simple and Efficient Natural Language Processing, EMNLP 2022

  22. arXiv:2210.15224  [pdf, other]

    cs.CL

    The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation

    Authors: Tadesse Destaw Belay, Atnafu Lambebo Tonja, Olga Kolesnikova, Seid Muhie Yimam, Abinew Ali Ayele, Silesh Bogale Haile, Grigori Sidorov, Alexander Gelbukh

    Abstract: Machine translation (MT) is one of the main tasks in natural language processing whose objective is to translate texts automatically from one natural language to another. Nowadays, using deep neural networks for MT tasks has received great attention. These networks require lots of data to learn abstract representations of the input and store it in continuous vectors. This paper presents the first…

    Submitted 27 October, 2022; originally announced October 2022.