-
Machine Intelligence in Africa: a survey
Authors:
Allahsera Auguste Tapo,
Ali Traore,
Sidy Danioko,
Hamidou Tembine
Abstract:
In the last 5 years, the availability of large audio datasets in African countries has opened unlimited opportunities to build machine intelligence (MI) technologies that are closer to the people and speak, learn, understand, and do businesses in local languages, including for those who cannot read and write. Unfortunately, these audio datasets are not fully exploited by current MI tools, leaving…
▽ More
In the last 5 years, the availability of large audio datasets in African countries has opened unlimited opportunities to build machine intelligence (MI) technologies that are closer to the people and speak, learn, understand, and do businesses in local languages, including for those who cannot read and write. Unfortunately, these audio datasets are not fully exploited by current MI tools, leaving several Africans out of MI business opportunities. Additionally, many state-of-the-art MI models are not culture-aware, and the ethics of their adoption indexes are questionable. The lack thereof is a major drawback in many applications in Africa. This paper summarizes recent developments in machine intelligence in Africa from a multi-layer multiscale and culture-aware ethics perspective, showcasing MI use cases in 54 African countries through 400 articles on MI research, industry, government actions, as well as uses in art, music, the informal economy, and small businesses in Africa. The survey also opens discussions on the reliability of MI rankings and indexes in the African continent as well as algorithmic definitions of unclear terms used in MI.
△ Less
Submitted 3 February, 2024;
originally announced February 2024.
-
MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages
Authors:
Cheikh M. Bamba Dione,
David Adelani,
Peter Nabende,
Jesujoba Alabi,
Thapelo Sindane,
Happy Buzaaba,
Shamsuddeen Hassan Muhammad,
Chris Chinenye Emezue,
Perez Ogayo,
Anuoluwapo Aremu,
Catherine Gitau,
Derguene Mbaye,
Jonathan Mukiibi,
Blessing Sibanda,
Bonaventure F. P. Dossou,
Andiswa Bukula,
Rooweither Mabuya,
Allahsera Auguste Tapo,
Edwin Munkoh-Buabeng,
victoire Memdjokam Koagne,
Fatoumata Ouoba Kabore,
Amelia Taylor,
Godson Kalipe,
Tebogo Macucwa,
Vukosi Marivate
, et al. (19 additional authors not shown)
Abstract:
In this paper, we present MasakhaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the UD (universal dependencies) guidelines. We conducted extensive POS baseline experiments using conditional random field and several multilingual pre-trained language models. We applied various cross-l…
▽ More
In this paper, we present MasakhaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the UD (universal dependencies) guidelines. We conducted extensive POS baseline experiments using conditional random field and several multilingual pre-trained language models. We applied various cross-lingual transfer models trained with data available in UD. Evaluating on the MasakhaPOS dataset, we show that choosing the best transfer language(s) in both single-source and multi-source setups greatly improves the POS tagging performance of the target languages, in particular when combined with cross-lingual parameter-efficient fine-tuning methods. Crucially, transferring knowledge from a language that matches the language family and morphosyntactic properties seems more effective for POS tagging in unseen languages.
△ Less
Submitted 23 May, 2023;
originally announced May 2023.
-
MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition
Authors:
David Ifeoluwa Adelani,
Graham Neubig,
Sebastian Ruder,
Shruti Rijhwani,
Michael Beukman,
Chester Palen-Michel,
Constantine Lignos,
Jesujoba O. Alabi,
Shamsuddeen H. Muhammad,
Peter Nabende,
Cheikh M. Bamba Dione,
Andiswa Bukula,
Rooweither Mabuya,
Bonaventure F. P. Dossou,
Blessing Sibanda,
Happy Buzaaba,
Jonathan Mukiibi,
Godson Kalipe,
Derguene Mbaye,
Amelia Taylor,
Fatoumata Kabore,
Chris Chinenye Emezue,
Anuoluwapo Aremu,
Perez Ogayo,
Catherine Gitau
, et al. (20 additional authors not shown)
Abstract:
African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity r…
▽ More
African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity recognition (NER). We create the largest human-annotated NER dataset for 20 African languages, and we study the behavior of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, demonstrating that the choice of source language significantly affects performance. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points across 20 languages compared to using English. Our results highlight the need for benchmark datasets and models that cover typologically-diverse African languages.
△ Less
Submitted 15 November, 2022; v1 submitted 22 October, 2022;
originally announced October 2022.
-
A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation
Authors:
David Ifeoluwa Adelani,
Jesujoba Oluwadara Alabi,
Angela Fan,
Julia Kreutzer,
Xiaoyu Shen,
Machel Reid,
Dana Ruiter,
Dietrich Klakow,
Peter Nabende,
Ernie Chang,
Tajuddeen Gwadabe,
Freshia Sackey,
Bonaventure F. P. Dossou,
Chris Chinenye Emezue,
Colin Leong,
Michael Beukman,
Shamsuddeen Hassan Muhammad,
Guyo Dub Jarso,
Oreen Yousuf,
Andre Niyongabo Rubungo,
Gilles Hacheme,
Eric Peter Wairagala,
Muhammad Umair Nasir,
Benjamin Ayoade Ajibade,
Tunde Oluwaseyi Ajayi
, et al. (20 additional authors not shown)
Abstract:
Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out in these datasets. This is primarily because many widely spoken languages are not well represented on the web and therefore excluded from the large-scale crawls used to create datasets. Furthermore, downstream users of these models…
▽ More
Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out in these datasets. This is primarily because many widely spoken languages are not well represented on the web and therefore excluded from the large-scale crawls used to create datasets. Furthermore, downstream users of these models are restricted to the selection of languages originally chosen for pre-training. This work investigates how to optimally leverage existing pre-trained models to create low-resource translation systems for 16 African languages. We focus on two questions: 1) How can pre-trained models be used for languages not included in the initial pre-training? and 2) How can the resulting translation models effectively transfer to new domains? To answer these questions, we create a new African news corpus covering 16 languages, of which eight languages are not part of any existing evaluation dataset. We demonstrate that the most effective strategy for transferring both to additional languages and to additional domains is to fine-tune large pre-trained models on small quantities of high-quality translation data.
△ Less
Submitted 22 August, 2022; v1 submitted 4 May, 2022;
originally announced May 2022.
-
Domain-specific MT for Low-resource Languages: The case of Bambara-French
Authors:
Allahsera Auguste Tapo,
Michael Leventhal,
Sarah Luger,
Christopher M. Homan,
Marcos Zampieri
Abstract:
Translating to and from low-resource languages is a challenge for machine translation (MT) systems due to a lack of parallel data. In this paper we address the issue of domain-specific MT for Bambara, an under-resourced Mande language spoken in Mali. We present the first domain-specific parallel dataset for MT of Bambara into and from French. We discuss challenges in working with small quantities…
▽ More
Translating to and from low-resource languages is a challenge for machine translation (MT) systems due to a lack of parallel data. In this paper we address the issue of domain-specific MT for Bambara, an under-resourced Mande language spoken in Mali. We present the first domain-specific parallel dataset for MT of Bambara into and from French. We discuss challenges in working with small quantities of domain-specific data for a low-resource language and we present the results of machine learning experiments on this data.
△ Less
Submitted 31 March, 2021;
originally announced April 2021.
-
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Authors:
Julia Kreutzer,
Isaac Caswell,
Lisa Wang,
Ahsan Wahab,
Daan van Esch,
Nasanbayar Ulzii-Orshikh,
Allahsera Tapo,
Nishant Subramani,
Artem Sokolov,
Claytone Sikasote,
Monang Setyawan,
Supheakmungkol Sarin,
Sokhar Samb,
Benoît Sagot,
Clara Rivera,
Annette Rios,
Isabel Papadimitriou,
Salomey Osei,
Pedro Ortiz Suarez,
Iroro Orife,
Kelechi Ogueji,
Andre Niyongabo Rubungo,
Toan Q. Nguyen,
Mathias Müller,
André Müller
, et al. (27 additional authors not shown)
Abstract:
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have system…
▽ More
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
△ Less
Submitted 21 February, 2022; v1 submitted 22 March, 2021;
originally announced March 2021.
-
Neural Machine Translation for Extremely Low-Resource African Languages: A Case Study on Bambara
Authors:
Allahsera Auguste Tapo,
Bakary Coulibaly,
Sébastien Diarra,
Christopher Homan,
Julia Kreutzer,
Sarah Luger,
Arthur Nagashima,
Marcos Zampieri,
Michael Leventhal
Abstract:
Low-resource languages present unique challenges to (neural) machine translation. We discuss the case of Bambara, a Mande language for which training data is scarce and requires significant amounts of pre-processing. More than the linguistic situation of Bambara itself, the socio-cultural context within which Bambara speakers live poses challenges for automated processing of this language. In this…
▽ More
Low-resource languages present unique challenges to (neural) machine translation. We discuss the case of Bambara, a Mande language for which training data is scarce and requires significant amounts of pre-processing. More than the linguistic situation of Bambara itself, the socio-cultural context within which Bambara speakers live poses challenges for automated processing of this language. In this paper, we present the first parallel data set for machine translation of Bambara into and from English and French and the first benchmark results on machine translation to and from Bambara. We discuss challenges in working with low-resource languages and propose strategies to cope with data scarcity in low-resource machine translation (MT).
△ Less
Submitted 10 November, 2020;
originally announced November 2020.
-
Assessing Human Translations from French to Bambara for Machine Learning: a Pilot Study
Authors:
Michael Leventhal,
Allahsera Tapo,
Sarah Luger,
Marcos Zampieri,
Christopher M. Homan
Abstract:
We present novel methods for assessing the quality of human-translated aligned texts for learning machine translation models of under-resourced languages. Malian university students translated French texts, producing either written or oral translations to Bambara. Our results suggest that similar quality can be obtained from either written or spoken translations for certain kinds of texts. They al…
▽ More
We present novel methods for assessing the quality of human-translated aligned texts for learning machine translation models of under-resourced languages. Malian university students translated French texts, producing either written or oral translations to Bambara. Our results suggest that similar quality can be obtained from either written or spoken translations for certain kinds of texts. They also suggest specific instructions that human translators should be given in order to improve the quality of their work.
△ Less
Submitted 31 March, 2020;
originally announced April 2020.