Search | arXiv e-print repository

SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

Authors: Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, Ryandito Diandaru, Yuze Gao, Patrick Amadeus, Bin Wang, Jan Christian Blaise Cruz, Chenxi Whitehouse , et al. (36 additional authors not shown)

Abstract: Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due t… ▽ More Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA. △ Less

Submitted 8 July, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

Comments: https://github.com/SEACrowd

arXiv:2311.07171 [pdf, other]

calamanCy: A Tagalog Natural Language Processing Toolkit

Authors: Lester James V. Miranda

Abstract: We introduce calamanCy, an open-source toolkit for constructing natural language processing (NLP) pipelines for Tagalog. It is built on top of spaCy, enabling easy experimentation and integration with other frameworks. calamanCy addresses the development gap by providing a consistent API for building NLP applications and offering general-purpose multitask models with out-of-the-box support for dep… ▽ More We introduce calamanCy, an open-source toolkit for constructing natural language processing (NLP) pipelines for Tagalog. It is built on top of spaCy, enabling easy experimentation and integration with other frameworks. calamanCy addresses the development gap by providing a consistent API for building NLP applications and offering general-purpose multitask models with out-of-the-box support for dependency parsing, parts-of-speech (POS) tagging, and named entity recognition (NER). calamanCy aims to accelerate the progress of Tagalog NLP by consolidating disjointed resources in a unified framework. The calamanCy toolkit is available on GitHub: https://github.com/ljvmiranda921/calamanCy. △ Less

Submitted 13 November, 2023; originally announced November 2023.

Comments: To be published in The Third Workshop for NLP-OSS at EMNLP 2023

arXiv:2311.07161 [pdf, other]

Developing a Named Entity Recognition Dataset for Tagalog

Authors: Lester James V. Miranda

Abstract: We present the development of a Named Entity Recognition (NER) dataset for Tagalog. This corpus helps fill the resource gap present in Philippine languages today, where NER resources are scarce. The texts were obtained from a pretraining corpora containing news reports, and were labeled by native speakers in an iterative fashion. The resulting dataset contains ~7.8k documents across three entity t… ▽ More We present the development of a Named Entity Recognition (NER) dataset for Tagalog. This corpus helps fill the resource gap present in Philippine languages today, where NER resources are scarce. The texts were obtained from a pretraining corpora containing news reports, and were labeled by native speakers in an iterative fashion. The resulting dataset contains ~7.8k documents across three entity types: Person, Organization, and Location. The inter-annotator agreement, as measured by Cohen's $κ$, is 0.81. We also conducted extensive empirical evaluation of state-of-the-art methods across supervised and transfer learning settings. Finally, we released the data and processing code publicly to inspire future work on Tagalog NLP. △ Less

Submitted 13 November, 2023; originally announced November 2023.

Comments: To be published in The First Workshop for Southeast Asian Language Processing 2023 at IJCNLP-AACL

arXiv:1910.05571 [pdf, other]

Geomancer: An Open-Source Framework for Geospatial Feature Engineering

Authors: Lester James V. Miranda, Mark Steve Samson, Alfiero K. Orden II, Bianca S. Silmaro, Ram K. De Guzman III, Stephanie S. Sy

Abstract: This paper presents Geomancer, an open-source framework for geospatial feature engineering. It simplifies the acquisition of geospatial attributes for downstream, large-scale machine learning tasks. Geomancer leverages any geospatial dataset stored in a data warehouse, users need only to define the features (Spells) they want to create, and cast them on any spatial dataset. In addition, these feat… ▽ More This paper presents Geomancer, an open-source framework for geospatial feature engineering. It simplifies the acquisition of geospatial attributes for downstream, large-scale machine learning tasks. Geomancer leverages any geospatial dataset stored in a data warehouse, users need only to define the features (Spells) they want to create, and cast them on any spatial dataset. In addition, these features can be exported into a JSON file (SpellBook) for sharing and reproducibility. Geomancer has been useful to some of our production use-cases such as property value estimation, area valuation, and more. It is available on Github, and can be installed from PyPI. △ Less

Submitted 12 October, 2019; originally announced October 2019.

Showing 1–4 of 4 results for author: Miranda, L J V