SEACrowd: A Multilingual Multimodal Data Hub
and Benchmark Suite for Southeast Asian Languages

Holy Lovenia (★,1,2), Rahmad Mahendra (★,3,2), Salsabil Maulana Akbar (★,2), Lester James V. Miranda (★,4),
Jennifer Santoso (★,5), Elyanah Aco (★,6), Akhdan Fadhilah (★,7), Jonibek Mansurov (★,8), Joseph Marvin Imperial (★,9,10),
Onno P. Kampman (★,11), Joel Ruben Antony Moniz (★,6), Muhammad Ravi Shulthan Habibi (★,3,2), Frederikus Hudi (★,12,13),
Railey Montalan (★,1), Ryan Ignatius (6), Joanito Agili Lopo (14), William Nixon (15), Börje F. Karlsson (16), James Jaya (6),
Ryandito Diandaru (6), Yuze Gao (6), Patrick Amadeus (15), Bin Wang (6), Jan Christian Blaise Cruz (17), Chenxi Whitehouse (18),
Ivan Halim Parmonangan (19), Maria Khelli (15), Wenyu Zhang (6), Lucky Susanto (20), Reynard Adha Ryanda (21),
Sonny Lazuardi Hermawan (22), Dan John Velasco (17), Muhammad Dehan Al Kautsar (15), Willy Fitra Hendria (6),
Yasmin Moslem (23), Noah Flynn (24), Muhammad Farid Adilazuarda (8), Haochen Li (6), Johanes Lee (15), R. Damanhuri (25),
Shuo Sun (6), Muhammad Reza Qorib (26), Amirbek Djanibekov (8), Wei Qi Leong (1), Quyet V. Do (27), Niklas Muennighoff (28),
Tanrada Pansuwan (18), Ilham Firdausi Putra (6), Yan Xu (29,27), Ngee Chia Tai (1), Ayu Purwarianti (6,30),
Sebastian Ruder (31), William Tjhi (1), Peerat Limkonchotiwat (★,32), Alham Fikri Aji (★,8), Sedrick Keh (★,33),
Genta Indra Winata (★,2), Ruochen Zhang (★,34), Fajri Koto (★,8,2), Zheng-Xin Yong (★,34), Samuel Cahyawijaya (★,27,2)

1 AI Singapore; 2 IndoNLP; 3 Universitas Indonesia; 4 Allen Institute for Artificial Intelligence; 5 RevComm, Inc.;
6 Independent Researcher; 7 Tohoku University; 8 MBZUAI; 9 University of Bath; 10 National University Philippines;
11 MOH Office for Healthcare Transformation (MOHT); 12 NAIST; 13 Works Applications Lab; 14 Universitas Gadjah Mada;
15 Institut Teknologi Bandung; 16 Beijing Academy of Artificial Intelligence (BAAI); 17 Samsung Research Philippines;
18 University of Cambridge; 19 Queensland University of Technology; 20 Monash University Indonesia; 21 Imperial College London;
22 Independent Design Engineer; 23 Bering Lab; 24 Amazon; 25 Universitas Diponegoro; 26 NUS; 27 HKUST; 28 Contextual AI;
29 Huawei Noah’s Ark Lab; 30 Prosa.ai; 31 Cohere; 32 VISTEC; 33 Toyota Research Institute; 34 Brown University
★ Major contributors

Abstract

Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub (https://seacrowd.github.io/seacrowd-catalogue/) that fills the resource gap by providing standardized corpora (https://github.com/SEACrowd/seacrowd-datahub/) in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.


1 Introduction

Although the Southeast Asia (SEA) region is home to 1,300 indigenous languages (18% of the world’s languages) and 671 million people (8.75% of the world’s population), texts, images, and audio datasets from this region are significantly underrepresented in machine learning models. This deficiency negatively impacts the model quality for SEA languages. The language coverage of SEA languages in two common pre-training resources, Common Crawl (https://commoncrawl.github.io/cc-crawl-statistics/plots/languages) and C4 Xue et al. (2021), is extremely limited, with only 2.36% (in 11 languages) and 10.62% (in 11 languages), respectively. In modalities beyond text, the representation is even more limited. For instance, Common Voice, one of the largest multilingual speech corpora, includes only 6 SEA indigenous languages Conneau et al. (2021); Ardila et al. (2020), and LAION-5B, one of the largest multilingual vision-language (VL) corpora, includes only 12 SEA indigenous languages Schuhmann et al. (2022). While datasets for other SEA indigenous languages may exist, they are often scattered, insufficiently documented, or varied in quality and formatting, thereby making access and usage challenging Cahyawijaya et al. (2023a); Joshi et al. (2020); Aji et al. (2023).

In terms of evaluation, the scarcity of high-quality test sets for these languages also complicates evaluating models for SEA languages. Despite there being 1,300+ languages in the SEA region, prior works Winata et al. (2023); Cahyawijaya et al. (2021); Koto and Koto (2020); Zhang et al. (2024); Wang et al. (2024); Nguyen et al. (2023); Leong et al. (2023); Yong et al. (2023) have collectively evaluated fewer than 10 SEA languages. The actual performance of current models on most SEA languages remains largely unknown.

Moreover, the dominance of Anglocentric training data potentially results in cultural bias when generating texts, images, or audio in underrepresented SEA languages Søgaard (2022); Talat et al. (2022). Further, Durmus et al. (2023); AlKhamissi et al. (2024); Cahyawijaya et al. (2024a) have shown that the learned representations in large language models (LLMs) often fail to reflect local cultural values in SEA Koto et al. (2024); Liu et al. (2024); Adilazuarda et al. (2024). This raises concerns about the ability of current LLMs to generate natural, high-quality texts for this region. In addition, the discrepancy in language support creates language barriers in technological access and risks marginalizing minority groups who do not speak the dominant language.

Figure 1: Mapping between tasks, schemas, modalities, and language regions across 498 datasheets in SEACrowd.

In this work, we investigate the current AI progress for SEA languages by addressing the challenges of resources, evaluation, and generation quality. Our contributions are three-fold:

  • We bridge the resource gap by centralizing and standardizing ~500 corpora in nearly 1,000 SEA languages in SEACrowd, a comprehensive and standardized resource center, across 3 modalities: text, image, and audio.

  • We close the evaluation gap in SEA languages with the SEACrowd Benchmarks, which cover 38 SEA indigenous languages on 13 tasks across 3 modalities, providing insights into the performance of a diverse spectrum of AI models. Further, our study reveals that the generative outputs of existing LLMs resemble translationese more closely than natural data in 9 SEA languages.

  • We offer insights and strategies for the future development of AI in SEA, aiming to promote a sustainable and prosperous future through continuous AI advancements.

2 SEACrowd

SEACrowd represents the first comprehensive AI dataset collection initiative for SEA, developed through a collaborative effort among researchers and engineers primarily based in the SEA region. As addressed in §1, resource scarcity and the scattered nature of the data are crucial challenges in SEA. SEACrowd addresses these issues through two primary contributions: 1) consolidating datasheets to enhance data discoverability; and 2) standardizing dataloaders for easier use, especially in multiple dataset loading. We also follow data provenance practices (Longpre et al., 2023) to preserve the proprietary rights of dataset owners.

Consolidating datasheets

We invited contributors to submit datasheet forms Gebru et al. (2021) for publicly available datasets across all modalities including text, audio, and image in SEA languages and/or cultures. These datasheets include detailed information about each dataset, such as data subset(s), description, task, language, license, URL access, annotation method(s), annotation validation, relevant publications, publication venue, and data splits. For each submission, we manually verify and correct it as necessary to ensure datasheet accuracy.

Standardizing dataloaders

For each approved datasheet, we created a standardized dataloader wrapper to facilitate ready-to-use data access, since only 38.4% of the consolidated data sources were originally hosted on Hugging Face (https://huggingface.co/). To support diverse task types, we carefully designed the standardized seacrowd schema to support different data structures and modalities (see Appendix F). We also adhere to data provenance practices (Longpre et al., 2023) and document the relevant metadata (e.g., license) in the dataloaders. Furthermore, we engaged with data owners and successfully converted three private datasets into public ones.
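To make the idea of a standardized dataloader wrapper concrete, the following is a minimal sketch of mapping differently shaped sources onto one shared record layout. It does not use the seacrowd library; `TextExample`, `standardize`, the column names, and the toy rows are all hypothetical and only illustrate the pattern.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class TextExample:
    """Hypothetical unified record for single-label text classification."""
    id: str
    text: str
    label: str


def standardize(
    source_name: str,
    rows: Iterable[Dict[str, str]],
    text_key: str,
    label_key: str,
) -> Iterator[TextExample]:
    """Map source-specific column names onto the shared record layout."""
    for index, row in enumerate(rows):
        yield TextExample(
            id=f"{source_name}-{index}",
            text=row[text_key].strip(),
            label=row[label_key],
        )


# Two toy sources that store the same kind of data under different columns.
source_a = [{"sentence": "Makanannya enak sekali", "sentiment": "positive"}]
source_b = [{"review_body": " Dịch vụ rất tệ ", "polarity": "negative"}]

unified: List[TextExample] = [
    *standardize("source_a", source_a, "sentence", "sentiment"),
    *standardize("source_b", source_b, "review_body", "polarity"),
]

for example in unified:
    print(example)
```

Once every source yields the same record type, downstream code can iterate over many datasets with a single loop, which is the main convenience a shared schema provides when loading multiple datasets.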

These efforts have culminated in 498 datasheets in SEACrowd Catalogue and 399 dataloaders in SEACrowd Data Hub (§2.1). Notably, our centralized data repository covers ~1,000 SEA languages, underscoring the extensive linguistic diversity captured by SEACrowd. We elaborate on the SEACrowd dataset statistics in §2.2. SEACrowd’s contribution guidelines, progression details, and reviewing procedure are in Appendix C, D, and E.

2.1 SEACrowd Catalogue & Data Hub

SEACrowd comprises two interconnected platforms: SEACrowd Catalogue (also available in CSV format) and SEACrowd Data Hub. These platforms work in tandem to consolidate the datasheet submissions and provide a standardized pipeline for SEACrowd. Specifically, Catalogue houses the datasheets (metadata), while Data Hub stores the standardized dataloaders and the seacrowd library (all code is available under the Apache License 2.0) for the schemas and configurations (Appendix F). These systems share information on the datasheets and dataloaders, allowing users to seamlessly explore and utilize them.

Figure 2: Zero-shot model performance across NLU and NLG tasks in SEA languages. Panels: (a) NLU evaluation; (b) NLG evaluation.

Category        Model               Gini ↓
Commercial      GPT-4               0.155
Commercial      Command-R           0.184
English         Mistral             0.159
English         Llama3              0.131
English         Falcon              0.238
Multilingual    mT0                 0.131
Multilingual    BLOOMZ              0.228
Multilingual    BactrianX-Llama     0.163
Multilingual    AYA-23              0.183
Multilingual    AYA-101             0.095
SEA regional    SEA-LION            0.204
SEA regional    SeaLLM v2.5         0.116
SEA regional    Sailor              0.145
SEA country     Cendol-mT5          0.378
SEA country     Cendol-Llama2       0.267
SEA country     Merak v4            0.199
SEA country     WangchanX-Llama3    0.153
SEA country     Malaysian Llama3    0.179
Table 1: Language equity across baselines based on Gini coefficient weighted by population (τ = 0.5). Lower is better.

2.2 Datasets in SEACrowd

SEACrowd consolidates 498 datasheets with diverse tasks in SEA languages and provides standardized access through dataloaders to 399 of them. As shown in Figure 1, approximately 81% of the datasets in SEACrowd are textual data, with the remaining ~8% and ~11% being VL and speech, respectively. The complete list of SEA indigenous languages covered by SEACrowd and their mapping to the relevant SEA regions are provided in Appendix K. Around 53% of the datasets have a commercially permissive license.

A total of 83 tasks are provided in SEACrowd, with a breakdown of 66 in NLP (e.g., abusive language detection, intent classification, instruction tuning, and named entity recognition), 10 in VL (e.g., image-to-text generation, sign language recognition, and video captioning), and 7 in speech (e.g., automatic speech recognition, text-to-speech, and speech emotion recognition). These tasks are then standardized into 20 dataloader schemas described in Appendix F. Further discussion regarding resources in SEACrowd is in §5.1.

3 SEACrowd Benchmarks

To understand the capability of state-of-the-art models, we conduct comprehensive evaluations of existing LLMs, VLMs, and speech models from various architectures and training approaches. To construct a benchmark suite (https://github.com/SEACrowd/seacrowd-experiments), we select the subset of the datasets presented in §2.2 that have been manually annotated and/or validated. More details regarding the data subsets, baselines, and prompts used for the evaluations are given in Appendix G.1, G.2, and G.3.

3.1 Datasets

NLP

Our natural language understanding (NLU) benchmark consists of 131 data subsets and 7 tasks: sentiment analysis, topic classification, natural language inference (NLI), commonsense reasoning, exam-style multiple-choice question answering (QA), culture understanding, and reading comprehension. It covers English (eng) and 33 SEA indigenous languages.

We utilize 100 data subsets for the natural language generation (NLG) benchmark, which covers machine translation (MT) between English and SEA languages from both directions, summarization, as well as extractive or abstractive question answering, covering 27 SEA indigenous languages.

Speech

We employ 19 automatic speech recognition (ASR) data subsets to evaluate the capability of speech models in 15 SEA indigenous languages.

VL

We assess the models on image captioning using four data subsets in 4 SEA indigenous languages, i.e., Filipino (fil), Indonesian (ind), Thai (tha), and Vietnamese (vie). This disparity in the evaluation scale is due to the fact that only a few datasets in SEACrowd are VL datasets, and even fewer are annotated by humans.

3.2 Baselines

Complete details regarding the model architectures, model sizes, seen languages, corresponding publications, and other aspects are in Appendix G.2.

NLP

To evaluate the zero-shot performance of instruction-tuned LLMs on SEA languages, we benchmark two commercial baselines, i.e., GPT-4 OpenAI et al. (2024) and Command-R (https://docs.cohere.com/docs/command-r), and 17 open-source baselines, the majority of which have ~7B-13B parameters. We categorize the open-source baselines according to the language(s) coverage in pre-training and/or instruction tuning, i.e., 1) English: Llama3 Touvron et al. (2023), Mistral Jiang et al. (2023), and Falcon Almazrouei et al. (2023); 2) Multilingual: AYA-101, AYA-23 Üstün et al. (2024), mT0, BLOOMZ Muennighoff et al. (2022), and BactrianX-Llama Li et al. (2023a); 3) SEA regional: SEA-LION Singapore (2023), Sailor Dou et al. (2024), and SeaLLM Nguyen et al. (2023); and 4) SEA country-specific: Cendol-mT5, Cendol-Llama2 Cahyawijaya et al. (2024b), and Merak Ichsan (2023) from Indonesia, WangchanX-Llama3 Phatthiyaphaibun et al. (2024) from Thailand, and Malaysian-Llama3 (https://huggingface.co/mesolitica/malaysian-llama-3-8b-instruct-16k) from Malaysia.

Speech

We evaluate the zero-shot performance of state-of-the-art multilingual pre-trained speech models in transcribing speech in SEA languages. Specifically, we consider Whisper v3 Radford et al. (2023), MMS 1B Pratap et al. (2024), and Seamless M4T v2 Communication et al. (2023), which have shown proficiency in accurately transcribing multiple languages without fine-tuning. Additionally, we include models that are fine-tuned on specific language(s), SEA or English, based on 1) Wav2Vec2 XLSR Conneau et al. (2021) and 2) XLS-R Babu et al. (2021), known for their cross-lingual speech representation learning by pre-training on raw speech waveforms across diverse languages, with XLS-R offering broader language coverage, and 3) Whisper, which leverages weakly supervised pre-training on spectrograms of speech in diverse languages. The fine-tuned models we evaluate are: XLSR on ind, jav, sun; XLSR and Whisper on Indonesian (ind); XLSR and Whisper on Thai (tha); XLS-R on Tagalog (tgl); XLS-R on Burmese (mya); XLS-R and Whisper on Khmer (khm); and XLSR on English (eng). See Appendix G.2 for details.

VL

We consider state-of-the-art VLMs primarily trained on English pre-training and instruction-following data: LLaVA Liu et al. (2023b, a), InstructBLIP Dai et al. (2024), and Idefics2 Laurençon et al. (2024), and VLMs trained in a multilingual manner: mBLIP Geigle et al. (2023) and PaliGemma Gemma Team et al. (2024), to assess their image captioning ability in SEA languages.

3.3 Experimental Settings

We conduct all evaluations in a zero-shot fashion. We employ 3 prompt templates in English for each NLU task and 1 for each NLG task. We utilize the weighted F1 score to measure the model performance on NLU tasks and n-gram reference-based metrics, i.e., chrF++ Popović (2015, 2017) and ROUGE-L Lin (2004), on NLG tasks. As for VL, aside from a prompt template in English, we also use a prompt template in the respective SEA indigenous language per data subset. We report CIDEr Vedantam et al. (2015) for the image captioning task. For ASR, we use word error rate (WER) for languages with Latin script and character error rate (CER) for those with non-Latin script.
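As an illustration of the ASR metrics above, the sketch below computes WER and CER from a Levenshtein edit distance over words or characters. It is a simplified, assumed implementation rather than the evaluation code used in the benchmark: text normalization is omitted, characters are counted as Unicode code points, and the function names and example strings are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, unit: str) -> float:
    """Return WER (unit='word') or CER (unit='char') as a fraction."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        # Code points, so combining marks count as separate characters.
        ref, hyp = list(reference), list(hypothesis)
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Latin-script example scored with WER: one substituted word out of three.
print(error_rate("saya makan nasi", "saya makan roti", unit="word"))
# Character-level scoring, as used for non-Latin scripts.
print(error_rate("abcd", "abed", unit="char"))
```

Word-level scoring depends on whitespace tokenization, which is one reason a character-level metric is the more natural choice for scripts that do not separate words with spaces.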

4 Result & Analysis

Figure 3: Speech model error rate (%, lower is better) across existing ASR tasks in SEA languages.

4.1 State-of-the-Art Models on SEA languages

LLMs

Figures 2(a) and 2(b) illustrate the overall model performance of the LLM baselines in SEA languages for both NLU tasks and NLG tasks. In our NLU evaluation, AYA-101, a large multilingual instruction-tuned language model covering 101 languages, demonstrates the best zero-shot performance. It is followed by the commercial baselines, which achieve a median of ~0.6 weighted F1-score. Sailor and SeaLLM, models specifically trained with SEA languages, also display competitive performance. Similarly, mT0 exhibits strong generalization abilities due to its exposure to ~100 languages in pre-training, including those from the SEA region Muennighoff et al. (2022). In contrast, most English and SEA country-specific baselines perform less effectively, likely due to their narrow focus on English or a limited set of SEA languages, such as Indonesian languages for Cendol and Thai for WangchanX-Llama3. Similar and consistent trends are observed on the MT task, while the baselines’ poorer scores on abstractive/extractive QA and summarization indicate their ineffectiveness in producing acceptable outputs in SEA languages for these tasks, which is especially pronounced in the open-source baselines. Appendix G.4 describes the performance of LLMs per language.

To analyze the equality in model performance across SEA languages, following Khanuja et al. (2023), we utilize the Gini coefficient (originally used to measure income inequality Dorfman (1979)) weighted by demand and parameterized by τ. Here, τ = 1 corresponds to a demographic notion of demand, considering language population size, while τ = 0 does not take population size into account Blasi et al. (2022). Table 1 shows that models trained on more SEA languages, such as multilingual and SEA regional baselines, generally exhibit greater language equity. For instance, although Command-R and GPT-4 are competitive performance-wise against AYA-101 and mT0, AYA-101 and mT0 demonstrate higher equality across all SEA languages under study. This trend is consistent across different values of τ (see Appendix G.5).
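A demand-weighted Gini coefficient of this kind can be sketched as follows. This is an assumed formulation for illustration only: demand weights are taken proportional to population raised to the power τ, and the coefficient is the weighted mean absolute difference divided by twice the weighted mean; the exact definition in the cited works may differ. The function names, populations, and scores are hypothetical.

```python
from typing import List, Sequence


def demand_weights(populations: Sequence[float], tau: float) -> List[float]:
    """Normalized demand: population**tau, so tau=0 gives uniform weights."""
    powered = [p ** tau for p in populations]
    total = sum(powered)
    return [p / total for p in powered]


def weighted_gini(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted Gini: 0 means identical scores across all languages."""
    mean = sum(w * s for w, s in zip(weights, scores))
    if mean == 0:
        return 0.0
    mean_abs_diff = sum(
        wi * wj * abs(si - sj)
        for wi, si in zip(weights, scores)
        for wj, sj in zip(weights, scores)
    )
    return mean_abs_diff / (2 * mean)


# Toy values only: speaker populations (millions) and per-language scores.
populations = [200.0, 80.0, 5.0]
scores = [0.70, 0.65, 0.30]

for tau in (0.0, 0.5, 1.0):
    weights = demand_weights(populations, tau)
    print(tau, round(weighted_gini(scores, weights), 4))
```

Raising τ shifts weight toward languages with more speakers, so a model that performs poorly only on small languages looks more equitable at τ = 1 than at τ = 0.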

Figure 4: Existing VLMs produce subpar image captions in SEA languages.

Speech models

Figure 3 presents the off-the-shelf speech model performance on ASR across languages in SEA, measured by the error rate percentage. 9 of the 15 SEA languages in our speech evaluation belong to the Austronesian language family. The other 6 are khm and vie, which belong to Austro-Asiatic, cnh and mya belong to Sino-Tibetan, and tha and vie belong to the Kra-Dai language family. The multilingual pre-trained baselines have a competitive generalization capability across languages, although it varies by language. For instance, Whisper v3 demonstrates significantly higher effectiveness for national languages such as ind, zlm, fil, tha, and vie, while performing less optimally for other indigenous languages. Conversely, Seamless M4T v2 shows a more balanced performance across the languages. Regarding fine-tuned baselines, error rates decrease for their seen languages. The fine-tuned Whisper models, however, manage to better optimize for the target language while retaining their original capabilities in other SEA languages compared to their Wav2Vec2 XLSR and XLS-R counterparts, despite both having been pre-trained in a multilingual manner. This observation aligns with the findings of Rouditchenko et al. (2023), who find that the number of hours seen per language and language family during pre-training is predictive of how the models compare, in which Whisper’s pre-training data duration for these four language families exceeds that of XLSR.

VLMs

Figure 4 depicts the zero-shot performance of off-the-shelf VLMs on image captioning in SEA indigenous languages. Despite the capability of LLMs for zero-shot cross-lingual generalization Huang et al. (2021); Täckström et al. (2012); Neubig and Hu (2018); Artetxe et al. (2020), VLMs trained only in English (i.e., InstructBLIP, LLaVA, and Idefics2) fail to exhibit this capability, struggling to generate adequate image captions in SEA languages. Multilingual VL pre-training is crucial to achieving aligned multilingual representations Burns et al. (2020); Li et al. (2023b); Huang et al. (2021). For instance, PaliGemma and mBLIP generate better image captions in tha and fil when prompted in the relevant SEA languages.

(a) Avg. by models
Model               Natural outputs
SEA-LION            58.57%
AYA-23              43.57%
Sailor              37.86%
Cendol-Llama2       37.37%
Malaysian Llama3    36.90%
WangchanX-Llama3    30.24%
Falcon              29.52%
BactrianX-Llama     28.10%
SeaLLM              27.38%
Merak               26.19%
BLOOMZ              25.00%
Cendol-MT5          24.05%
Command-R           20.95%
mT0-XL              19.76%
Mistral             19.52%
GPT-4               16.67%
Llama3              14.05%
AYA-101             8.33%

(b) Avg. by languages
Language            Natural outputs
Indonesian (ind)    41.58%
Vietnamese (vie)    37.31%
Thai (tha)          34.21%
Khmer (khm)         29.21%
Lao (lao)           28.42%
Malay (zlm)         22.24%
Burmese (mya)       19.47%
Filipino (fil)      12.22%
English (eng)       8.95%
Table 2: Current LLMs are still incapable of generating natural texts in SEA languages. Note: as spoken in SEA regions, not worldwide.

Figure 5: The resource gap in SEA in terms of language coverage, annotation quality, and cultural relevance. Panels: (a) Language coverage; (b) Annotation quality; (c) Cultural relevance.

However, when prompted in eng, the performance of these multilingual baselines varies notably. PaliGemma’s performance collapses completely, while mBLIP’s performance shows both increases and decreases across different SEA languages. This raises the question of whether the multilingual VLMs can maintain consistent performance across different languages used in the instructions and the tasks. It highlights the need for further research into the mechanisms that drive these variations and how to achieve robust multilingual performance in VLMs across diverse linguistic contexts. Understanding these dynamics is crucial for improving VLMs’ generalization capabilities and ensuring equitable performance across all languages, despite most related works focusing on monolingual visual instruction tuning Liu et al. (2023b); Gong et al. (2023); Zhu et al. (2024).

4.2 Generation Quality in SEA Languages: Translationese vs. Natural Language

Classifying Translationese in SEA Languages

To analyze the generation quality of LLMs in SEA languages, we build a text classifier to discriminate between translationese and natural texts Riley et al. (2020). We construct a translationese classification training and testing dataset using 49 and 62 data subsets, respectively, covering approximately 39.9k and 51.5k sentences across English (eng) and 8 SEA languages: Indonesian (ind), Khmer (khm), Lao (lao), Burmese (mya), Filipino (fil), Thai (tha), Vietnamese (vie), and Malay (zlm). The training and test data are detailed in Appendix H.1.

We fine-tune a classifier from mDeBERTaV3 He et al. (2020, 2022) (https://huggingface.co/microsoft/mdeberta-v3-base) using these data and achieve 79.08% accuracy on the test set in predicting translationese across these 9 languages. The detailed results and ablation studies of our translationese classifier experiments are provided in Appendix H.2. This classifier enables us to assess the generation quality of LLMs by distinguishing between translationese and naturally occurring text, providing insights into the models’ performance in producing authentic language output.

Generation Quality of LLMs

We evaluate the generation quality of LLMs in 9 SEA languages by generating answers to natural, general, and safety questions from Sea-Bench Nguyen et al. (2023). As shown in Table 2(a), LLMs with extensive language coverage but less focus on SEA languages, e.g., AYA-101 Üstün et al. (2024), GPT-4 OpenAI et al. (2024), mT0 Muennighoff et al. (2023); Xue et al. (2021), and Llama3 AI@Meta (2024), tend to produce natural sentences less than 20% of the time. In contrast, models with narrower language coverage but a greater focus on SEA languages, such as Cendol-Llama2 Cahyawijaya et al. (2024b), Sailor Dou et al. (2024), AYA-23 Aryabumi et al. (2024), and SEA-LION Singapore (2023), generate natural sentences over 35% of the time.

However, even the LLM with the least translationese generation, SEA-LION, only produces natural SEA sentences 57.71% of the time, highlighting a significant quality gap in generating natural sentences in SEA languages. As displayed in Table 2(b), the translationese issue varies across SEA languages. Languages such as Tagalog (tgl), Burmese (mya), and Malay (zlm) have more severe translationese problems, with existing LLMs producing natural sentences only 11.58%, 19.47%, and 22.24% of the time, respectively. This underscores the need for further improvements in LLMs to more effectively address the linguistic diversity and complexity of SEA languages.
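The percentages of natural outputs discussed above are aggregates of per-sentence classifier decisions. The sketch below shows one simple way to compute such rates by model or by language; it pools all sentences in a group, which is an assumption, since the averaging scheme behind Table 2 is not specified here. The rows, model names, and field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def natural_rate(predictions: Iterable[Dict[str, str]], key: str) -> Dict[str, float]:
    """Percentage of outputs labeled 'natural', grouped by the given field."""
    counts = defaultdict(lambda: [0, 0])  # group -> [natural, total]
    for row in predictions:
        tally = counts[row[key]]
        tally[0] += int(row["label"] == "natural")
        tally[1] += 1
    return {group: 100.0 * n / total for group, (n, total) in counts.items()}


# Toy classifier outputs, one row per generated sentence.
predictions = [
    {"model": "model_a", "language": "ind", "label": "natural"},
    {"model": "model_a", "language": "ind", "label": "translationese"},
    {"model": "model_a", "language": "tha", "label": "translationese"},
    {"model": "model_b", "language": "ind", "label": "natural"},
    {"model": "model_b", "language": "tha", "label": "natural"},
]

print(natural_rate(predictions, key="model"))
print(natural_rate(predictions, key="language"))
```

Because the classifier itself is imperfect, rates computed this way are estimates and are best read comparatively across models or languages.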

5 Discussions

5.1 Resource Gaps in SEA

Coverage

SEACrowd covers 980 out of the 1,308 languages spoken in SEA (74.9%). Despite this high coverage, language representation in SEACrowd exhibits a very long-tail distribution, with over 700 languages having only 1 or 2 datasets, and only 23 languages having 20 datasets or more. These less represented languages typically exist only in the form of lexicons Asgari et al. (2020); List et al. (2022) or unlabeled data Leong et al. (2022); Kudugunta et al. (2024); Nguyen et al. (2024). Existing tasks in SEACrowd still cover only a small portion of languages. For instance, sentiment analysis data is available for only 22 languages, and named entity recognition (NER) data is available for just 17 languages. Furthermore, for modalities beyond text, SEA resources are extremely underrepresented. Approximately 90% of SEA indigenous languages lack both speech and VL datasets.

Figure 6: SEA language prioritization based on (top) current utility and (bottom) resource availability. The languages are ranked in descending order of the area of their missing potential.

Quality

78.7% of the datasets in SEACrowd are published in peer-reviewed venues, and most of the data has undergone external validation. The overall quality of the datasets in SEACrowd is depicted in Figure 5(b). We compile the reported data construction methods by the authors, considering both the data collection method (i.e., data source) and label annotation validation (i.e., quality control). Nearly 19% of the datasets in SEACrowd have machine-generated and machine-translated annotations, while more than 80% were obtained from online texts (e.g., web crawling) and expert generation. In terms of label annotation validation, 62.4% of the datasets have been fully manually checked, while the remaining portion is partially validated and automatically checked. Note that these statistics only provide an initial indication of dataset collection quality on the surface and do not necessarily reflect the exact quality. Only a few datasets (6%) in SEACrowd report their detailed quality metrics (e.g., inter-annotator agreement scores). A deeper investigation is required for future work.

Cultural Relevance

The resource gap in SEA extends to the cultural aspect, where misrepresentation can lead to offensive behaviors, e.g., cultural appropriation and stereotyping Evans et al. (2020); Glotov (2023). As a proxy of the cultural relevance of SEA datasets, we manually curated 259 data subsets used in SEACrowd evaluation based on their data source. Specifically, we categorize them by whether they are 1) translated from another language, 2) crawled from local sources, or 3) hand-crafted to capture cultural relevance. In Figure 5(c), approximately 70% lack cultural relevance, as many are machine-translated from English sources. About 20% are taken from local news, social media, or other local outlets, which potentially contain some culturally relevant data. Only the remaining 10% are designed to consider cultural relevance, derived from studies highlighting serious deficiencies in cultural understanding by LLMs for underrepresented languages Kabra et al. (2023); Koto et al. (2023a); Wibowo et al. (2023); Liu et al. (2024); Koto et al. (2024).

5.2 Conclusion & Future Work

Southeast Asia is home to highly diverse languages and cultures; the majority of its people do not use English as their primary language. The utility of English-first AI is limited for the majority of Southeast Asian users, especially in critical sectors like healthcare and education. Through SEACrowd, we have explored the AI landscape in SEA and bridged the gaps in resources, evaluation, and naturalness analysis of AI models in SEA languages. Further, our initiative has nurtured an open-source research community, which will actively continue to add and maintain datasheets and dataloaders, as well as drive AI research and developments in SEA.

Nonetheless, AI development in SEA requires concerted efforts by a range of stakeholders, who may prioritize differently when it comes to incorporating the region’s 1,300+ languages into AI models. Moving forward, our work suggests AI development in SEA should prioritize two key metrics: 1) potential utility and 2) resource equity (see https://github.com/SEACrowd/globalutility).

Potential utility

Potential utility is defined as the gap between current utility and ideal utility, in which model capability acts as a proxy for utility. Based on potential utility, developing the national languages (except for English and Chinese used in Singapore), i.e., Indonesian (ind), Burmese (mya), Vietnamese (vie), Thai (tha), Filipino (fil), Khmer (khm), Malay (zlm), and Lao (lao) in Figure 6, will unsurprisingly bring the biggest benefit. Among them, we identify notable gaps in the naturalness of Malay, Burmese, and Filipino AI-generated outputs (§4.2). Focused efforts in resource building for these languages may move the needle the most for utility. Beyond the national languages, developing local languages or dialects with large speaker bases, e.g., Javanese (jav), Sundanese (sun), and Hmong (hmn), is key.
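As a rough illustration of this definition, the sketch below computes a demand-weighted gap between ideal and current utility for each language. Weighting demand by speaker population raised to an exponent `tau` is an assumption borrowed from Blasi et al. (2022), not a formula confirmed by this section; the function name, language labels, utility scores, and speaker counts are hypothetical placeholders rather than reported results.

```python
from typing import Dict


def potential_utility(
    current_utility: Dict[str, float],
    speakers: Dict[str, float],
    tau: float = 1.0,
    ideal_utility: float = 1.0,
) -> Dict[str, float]:
    """Demand-weighted gap between ideal and current utility per language.

    Assumption: demand is proportional to speakers ** tau, so tau=1 weights
    languages by population and tau=0 weights every language equally.
    """
    weights = {lang: speakers[lang] ** tau for lang in current_utility}
    total = sum(weights.values())
    return {
        lang: (weights[lang] / total) * (ideal_utility - current_utility[lang])
        for lang in current_utility
    }


# Hypothetical placeholder inputs (not values reported in the paper).
toy_utility = {"lang_a": 0.8, "lang_b": 0.4, "lang_c": 0.1}
toy_speakers = {"lang_a": 100.0, "lang_b": 50.0, "lang_c": 10.0}

by_population = potential_utility(toy_utility, toy_speakers, tau=1.0)
by_language = potential_utility(toy_utility, toy_speakers, tau=0.0)
```

With these toy inputs, population weighting ranks `lang_b` first, whereas equal weighting ranks `lang_c` first, which is the kind of shift that different demand weightings can produce.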

Resource equity

Resource equity is defined as the gap between existing and ideal resource availability (Figure 6). We found that many local languages or dialects still fall short of the expected level of resources. These include Northeastern Thai (tts), Northern Thai (nod), Hmong Do (hmv), Southern Thai (sou), Cebuano (ceb), Ilocano (ilo), and others. Efforts to narrow these gaps would not only help preserve these languages but also ensure the continuation of the cultural heritage of the speakers of these languages. More details on SEA language prioritization for different weightings of demand can be found in Appendix I.
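The notion of a gap between existing and ideal resource availability can be sketched as follows. This is a minimal illustration only: taking the ideal allocation to be proportional to speaker counts is an assumption made here for concreteness, and the function names, language labels, and all numbers are hypothetical placeholders, not the paper's method or data.

```python
from typing import Dict


def ideal_by_speaker_share(
    speakers: Dict[str, float], total_resources: float
) -> Dict[str, float]:
    """Assumed ideal: resources split in proportion to speaker counts."""
    total_speakers = sum(speakers.values())
    return {
        lang: total_resources * count / total_speakers
        for lang, count in speakers.items()
    }


def resource_gap(
    existing: Dict[str, float], ideal: Dict[str, float]
) -> Dict[str, float]:
    """Shortfall of existing resources relative to the ideal, floored at zero."""
    return {
        lang: max(0.0, ideal[lang] - existing.get(lang, 0.0)) for lang in ideal
    }


# Hypothetical placeholder inputs (not values reported in the paper).
toy_speakers = {"lang_a": 60.0, "lang_b": 30.0, "lang_c": 10.0}
toy_datasets = {"lang_a": 90.0, "lang_b": 8.0, "lang_c": 2.0}

toy_ideal = ideal_by_speaker_share(toy_speakers, sum(toy_datasets.values()))
toy_gap = resource_gap(toy_datasets, toy_ideal)
```

In this toy setting, `lang_a` already exceeds its assumed ideal share, so its gap is zero, while `lang_b` and `lang_c` fall short.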

To improve these metrics, governments and industry leaders in the region should invest in R&D activities to improve regional language capability for both the national languages and local dialects. This could include funding for open data collection and collaborations with local communities to address the resource gap in local languages. This also requires long-term sustainable strategies, such as catalyzing profitable use cases based on inclusive AI models, promoting fair and responsible compensation schemes for data workers, and orchestrating win-win exemplar collaborations between data owners, AI, and application developers.

Acknowledgments

We would like to thank our amazing contributors: Joshua Spergel, Tiezheng Yu, Parinthapat Pengpun, Bin Wang, Ishan Jindal, Muhammad Satrio, Jipeng Zhang, Bhavish Pahwa, Haryo Akbarianto Wibowo, Hiroki Nomoto, Yohanes Sigit Purnomo W.P., Ahmad Fathan Hidayatullah, Bryan Wilie, Ruhiyah Faradishi Widiaputri, Rafif Rabbani, Fawwaz Mayda, Manoj Khatri, Supryadi Supryadi, Virach Sornlertlamvanich, Pavaris Ruangchutiphophan, Erland Hilman Fuadi, Mega Fransiska, Richardy Sapan, and Camilla Johnine Cosme for their hard work in submitting datasheets and implementing dataloaders for SEACrowd.

This work is supported by the National Research Foundation, Singapore under its AI Singapore Programme; PhD Fellowship Award, the Hong Kong University of Science and Technology; and PF20-43679 Hong Kong PhD Fellowship Scheme, Research Grant Council, Hong Kong. JMI is funded by National University Philippines and the UKRI Centre for Doctoral Training in Accountable, Responsible and Transparent AI [EP/S023437/1] of the University of Bath. In addition, we would like to express our gratitude to Cohere For AI for providing research grants that enabled us to perform our experiments using a commercial baseline, specifically Command-R.

Limitations

While our work covers nearly 1,000 SEA languages, many dialects, which are considered as belonging to a parent language, are missing from our evaluation benchmark. For instance, for the Malay language, only Standard Malay (zsm) is evaluated, but not other dialects such as Sarawak Malay (zlm-sar). Furthermore, the majority of our datasets also do not contain code-switched texts, which is a common linguistic phenomenon of SEA language usage (Aji et al., 2023). Moreover, the language coverage of different evaluation tasks varies significantly. For instance, NLP tasks cover 34 languages in total, whereas VL tasks only cover 4 languages.

Ethics Statement

In developing an evaluation benchmark for SEA languages, we have taken several steps to ensure ethical considerations are addressed comprehensively. First, the data used for this benchmark is sourced from publicly available resources, ensuring compliance with legal and ethical standards regarding data privacy. Where applicable, explicit consent was obtained from data contributors. Furthermore, all the datasets and resources utilized in this benchmark are used in accordance with their respective licenses. Second, our benchmark aims to be inclusive, representing a wide range of SEA languages, including those that are underrepresented in current linguistic resources. Lastly, our research process, including data collection, benchmark development, and evaluation methodologies, is entirely open-sourced and is documented transparently to enable reproducibility and accountability.

References

  • Adelani et al. (2022a) David Adelani, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P. Dossou, Chris Emezue, Colin Leong, Michael Beukman, Shamsuddeen Muhammad, Guyo Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Peter Wairagala, Muhammad Umair Nasir, Benjamin Ajibade, Tunde Ajayi, Yvonne Gitau, Jade Abbott, Mohamed Ahmed, Millicent Ochieng, Anuoluwapo Aremu, Perez Ogayo, Jonathan Mukiibi, Fatoumata Ouoba Kabore, Godson Kalipe, Derguene Mbaye, Allahsera Auguste Tapo, Victoire Memdjokam Koagne, Edwin Munkoh-Buabeng, Valencia Wagner, Idris Abdulmumin, Ayodele Awokoya, Happy Buzaaba, Blessing Sibanda, Andiswa Bukula, and Sam Manthalu. 2022a. A few thousand translations go a long way! leveraging pre-trained models for African news translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3053–3070, Seattle, United States. Association for Computational Linguistics.
  • Adelani et al. (2024) David Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba Alabi, Yanke Mao, Haonan Gao, and En-Shiun Lee. 2024. SIB-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 226–245, St. Julian’s, Malta. Association for Computational Linguistics.
  • Adelani et al. (2022b) David Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson Kalipe, Derguene Mbaye, Amelia Taylor, Fatoumata Kabore, Chris Chinenye Emezue, Anuoluwapo Aremu, Perez Ogayo, Catherine Gitau, Edwin Munkoh-Buabeng, Victoire Memdjokam Koagne, Allahsera Auguste Tapo, Tebogo Macucwa, Vukosi Marivate, Mboning Tchiaze Elvis, Tajuddeen Gwadabe, Tosin Adewumi, Orevaoghene Ahia, and Joyce Nakatumba-Nabende. 2022b. MasakhaNER 2.0: Africa-centric transfer learning for named entity recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4488–4508, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Adelani et al. (2021) David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D’souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen H. Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Aremu Anuoluwapo, Catherine Gitau, Derguene Mbaye, Jesujoba Alabi, Seid Muhie Yimam, Tajuddeen Rabiu Gwadabe, Ignatius Ezeani, Rubungo Andre Niyongabo, Jonathan Mukiibi, Verrah Otiende, Iroro Orife, Davis David, Samba Ngom, Tosin Adewumi, Paul Rayson, Mofetoluwa Adeyemi, Gerald Muriuki, Emmanuel Anebi, Chiamaka Chukwuneke, Nkiruka Odu, Eric Peter Wairagala, Samuel Oyerinde, Clemencia Siro, Tobius Saul Bateesa, Temilola Oloyede, Yvonne Wambui, Victor Akinode, Deborah Nabagereka, Maurice Katusiime, Ayodele Awokoya, Mouhamadane MBOUP, Dibora Gebreyohannes, Henok Tilaye, Kelechi Nwaike, Degaga Wolde, Abdoulaye Faye, Blessing Sibanda, Orevaoghene Ahia, Bonaventure F. P. Dossou, Kelechi Ogueji, Thierno Ibrahima DIOP, Abdoulaye Diallo, Adewale Akinfaderin, Tendai Marengereke, and Salomey Osei. 2021. MasakhaNER: Named entity recognition for African languages. Transactions of the Association for Computational Linguistics, 9:1116–1131.
  • Adelani et al. (2023) David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, Sana Al-azzawi, Blessing Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, Abraham Owodunni, Nnaemeka Obiefuna, Muhidin Mohamed, Shamsuddeen Hassan Muhammad, Teshome Mulugeta Ababu, Saheed Abdullahi Salahudeen, Mesay Gemeda Yigezu, Tajuddeen Gwadabe, Idris Abdulmumin, Mahlet Taye, Oluwabusayo Awoyomi, Iyanuoluwa Shode, Tolulope Adelani, Habiba Abdulganiyu, Abdul-Hakeem Omotayo, Adetola Adeeko, Abeeb Afolabi, Anuoluwapo Aremu, Olanrewaju Samuel, Clemencia Siro, Wangari Kimotho, Onyekachi Ogbu, Chinedu Mbonu, Chiamaka Chukwuneke, Samuel Fanijo, Jessica Ojo, Oyinkansola Awosan, Tadesse Kebede, Toadoum Sari Sakayo, Pamela Nyatsine, Freedmore Sidume, Oreen Yousuf, Mardiyyah Oduwole, Kanda Tshinu, Ussen Kimanuka, Thina Diko, Siyanda Nxakama, Sinodos Nigusse, Abdulmejid Johar, Shafie Mohamed, Fuad Mire Hassan, Moges Ahmed Mehamed, Evrard Ngabire, Jules Jules, Ivan Ssenkungu, and Pontus Stenetorp. 2023. MasakhaNEWS: News topic classification for African languages. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 144–159, Nusa Dua, Bali. Association for Computational Linguistics.
  • Adilazuarda et al. (2023) Muhammad Farid Adilazuarda, Samuel Cahyawijaya, and Ayu Purwarianti. 2023. The obscure limitation of modular multilingual language models. ICLR Tiny Papers 2023.
  • Adilazuarda et al. (2024) Muhammad Farid Adilazuarda, Sagnik Mukherjee, Pradhyumna Lavania, Siddhant Singh, Ashutosh Dwivedi, Alham Fikri Aji, Jacki O’Neill, Ashutosh Modi, and Monojit Choudhury. 2024. Towards measuring and modeling "culture" in LLMs: A survey. Preprint, arXiv:2403.15412.
  • AI@Meta (2024) AI@Meta. 2024. Llama 3 model card.
  • Aji et al. (2023) Alham Fikri Aji, Jessica Zosa Forde, Alyssa Marie Loo, Lintang Sutawika, Skyler Wang, Genta Indra Winata, Zheng-Xin Yong, Ruochen Zhang, A. Seza Doğruöz, Yin Lin Tan, and Jan Christian Blaise Cruz. 2023. Current status of NLP in south East Asia with insights from multilingualism and language diversity. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Tutorial Abstract, pages 8–13, Nusa Dua, Bali. Association for Computational Linguistics.
  • Aji et al. (2022) Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya, Ade Romadhony, Rahmad Mahendra, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Timothy Baldwin, Jey Han Lau, and Sebastian Ruder. 2022. One country, 700+ languages: NLP challenges for underrepresented languages and dialects in Indonesia. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7226–7249, Dublin, Ireland. Association for Computational Linguistics.
  • AlKhamissi et al. (2024) Badr AlKhamissi, Muhammad ElNokrashy, Mai AlKhamissi, and Mona Diab. 2024. Investigating cultural alignment of large language models. Preprint, arXiv:2402.13231.
  • Almazrouei et al. (2023) Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. Falcon-40B: an open large language model with state-of-the-art performance.
  • Ardila et al. (2020) Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. 2020. Common Voice: A massively-multilingual speech corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222, Marseille, France. European Language Resources Association.
  • Artetxe et al. (2020) Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, Online. Association for Computational Linguistics.
  • Aryabumi et al. (2024) Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Kelly Marchisio, Sebastian Ruder, Acyr Locatelli, Julia Kreutzer, Nick Frosst, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. 2024. Aya 23: Open weight releases to further multilingual progress. Preprint, arXiv:2405.15032.
  • Asai et al. (2023) Akari Asai, Sneha Kudugunta, Xinyan Velocity Yu, Terra Blevins, Hila B Gonen, Machel Reid, Yulia Tsvetkov, Sebastian Ruder, and Hannaneh Hajishirzi. 2023. BUFFET: Benchmarking large language models for cross-lingual few-shot transfer. Preprint, arXiv:2305.14857.
  • Asgari et al. (2020) Ehsaneddin Asgari, Fabienne Braune, Benjamin Roth, Christoph Ringlstetter, and Mohammad Mofrad. 2020. UniSent: Universal adaptable sentiment lexica for 1000+ languages. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4113–4120, Marseille, France. European Language Resources Association.
  • Astuti et al. (2023) Laksmita Widya Astuti, Yunita Sari, and Suprapto. 2023. Code-mixed sentiment analysis using transformer for Twitter social media data. International Journal of Advanced Computer Science and Applications, 14(10).
  • Babu et al. (2021) Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2021. XLS-R: Self-supervised cross-lingual speech representation learning at scale. Preprint, arXiv:2111.09296.
  • Bandarkar et al. (2023) Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. 2023. The Belebele benchmark: a parallel reading comprehension dataset in 122 language variants. arXiv preprint arXiv:2308.16884.
  • Blasi et al. (2022) Damian Blasi, Antonios Anastasopoulos, and Graham Neubig. 2022. Systematic inequalities in language technology performance across the world’s languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5486–5505, Dublin, Ireland. Association for Computational Linguistics.
  • Burns et al. (2020) Andrea Burns, Donghyun Kim, Derry Wijaya, Kate Saenko, and Bryan A Plummer. 2020. Learning to scale multilingual representations for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 197–213. Springer.
  • Cahyawijaya et al. (2022) Samuel Cahyawijaya, Alham Fikri Aji, Holy Lovenia, Genta Indra Winata, Bryan Wilie, Rahmad Mahendra, Fajri Koto, David Moeljadi, Karissa Vincentio, Ade Romadhony, and Ayu Purwarianti. 2022. NusaCrowd: A call for open and reproducible NLP research in Indonesian languages. Preprint, arXiv:2207.10524.
  • Cahyawijaya et al. (2024a) Samuel Cahyawijaya, Delong Chen, Yejin Bang, Leila Khalatbari, Bryan Wilie, Ziwei Ji, Etsuko Ishii, and Pascale Fung. 2024a. High-dimension human value representation in large language models. arXiv preprint arXiv:2404.07900.
  • Cahyawijaya et al. (2023a) Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Winata, Bryan Wilie, Fajri Koto, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Jennifer Santoso, David Moeljadi, Cahya Wirawan, Frederikus Hudi, Muhammad Satrio Wicaksono, Ivan Parmonangan, Ika Alfina, Ilham Firdausi Putra, Samsul Rahmadani, Yulianti Oenang, Ali Septiandri, James Jaya, Kaustubh Dhole, Arie Suryani, Rifki Afina Putri, Dan Su, Keith Stevens, Made Nindyatama Nityasya, Muhammad Adilazuarda, Ryan Hadiwijaya, Ryandito Diandaru, Tiezheng Yu, Vito Ghifari, Wenliang Dai, Yan Xu, Dyah Damapuspita, Haryo Wibowo, Cuk Tho, Ichwanul Karo Karo, Tirana Fatyanosa, Ziwei Ji, Graham Neubig, Timothy Baldwin, Sebastian Ruder, Pascale Fung, Herry Sujaini, Sakriani Sakti, and Ayu Purwarianti. 2023a. NusaCrowd: Open source initiative for Indonesian NLP resources. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13745–13818, Toronto, Canada. Association for Computational Linguistics.
  • Cahyawijaya et al. (2023b) Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Dea Adhista, Emmanuel Dave, Sarah Oktavianti, Salsabil Akbar, Jhonson Lee, Nuur Shadieq, Tjeng Wawan Cenggoro, Hanung Linuwih, Bryan Wilie, Galih Muridan, Genta Winata, David Moeljadi, Alham Fikri Aji, Ayu Purwarianti, and Pascale Fung. 2023b. NusaWrites: Constructing high-quality corpora for underrepresented and extremely low-resource languages. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 921–945, Nusa Dua, Bali. Association for Computational Linguistics.
  • Cahyawijaya et al. (2024b) Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Afina Putri, Emmanuel Dave, Jhonson Lee, Nuur Shadieq, Wawan Cenggoro, Salsabil Maulana Akbar, Muhammad Ihza Mahendra, Dea Annisayanti Putri, Bryan Wilie, Genta Indra Winata, Alham Fikri Aji, Ayu Purwarianti, and Pascale Fung. 2024b. Cendol: Open instruction-tuned generative large language models for Indonesian languages. Preprint, arXiv:2404.06138.
  • Cahyawijaya et al. (2021) Samuel Cahyawijaya, Genta Indra Winata, Bryan Wilie, Karissa Vincentio, Xiaohong Li, Adhiguna Kuncoro, Sebastian Ruder, Zhi Yuan Lim, Syafri Bahar, Masayu Khodra, Ayu Purwarianti, and Pascale Fung. 2021. IndoNLG: Benchmark and resources for evaluating Indonesian natural language generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8875–8898, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Catapang and Visperas (2023) Jasper Kyle Catapang and Moses Visperas. 2023. Emotion-based morality in Tagalog and English scenarios (EMoTES-3K): A parallel corpus for explaining (im)morality of actions. In Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages, pages 1–6, Tokyo, Japan. Association for Computational Linguistics.
  • Communication et al. (2023) Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-jussà, Onur Celebi, Maha Elbayad, Cynthia Gao, Francisco Guzmán, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, and Skyler Wang. 2023. Seamlessm4t: Massively multilingual & multimodal machine translation. Preprint, arXiv:2308.11596.
  • Conneau et al. (2021) Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2021. Unsupervised Cross-Lingual Representation Learning for Speech Recognition. In Proc. Interspeech 2021, pages 2426–2430.
  • Conneau et al. (2022) Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, and Melvin Johnson. 2022. XTREME-S: Evaluating Cross-lingual Speech Representations. In Proc. Interspeech 2022, pages 3248–3252.
  • Conneau et al. (2018) Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.
  • Costa-jussà et al. (2024) Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Jeff Wang, and N. L. L. B. Team. 2024. Scaling neural machine translation to 200 languages. Nature.
  • Dabre et al. (2022) Raj Dabre, Himani Shrotriya, Anoop Kunchukuttan, Ratish Puduppully, Mitesh Khapra, and Pratyush Kumar. 2022. IndicBART: A pre-trained model for indic natural language generation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1849–1863, Dublin, Ireland. Association for Computational Linguistics.
  • Dac Lai et al. (2023) Viet Dac Lai, Chien Van Nguyen, Nghia Trung Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan A Rossi, and Thien Huu Nguyen. 2023. Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. arXiv e-prints, pages arXiv–2307.
  • Dai et al. (2024) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2024. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36.
  • Dorfman (1979) Robert Dorfman. 1979. A formula for the Gini coefficient. The Review of Economics and Statistics, pages 146–149.
  • Dou et al. (2024) Longxu Dou, Qian Liu, Guangtao Zeng, Jia Guo, Jiahui Zhou, Wei Lu, and Min Lin. 2024. Sailor: Open language models for South-East Asia. Preprint, arXiv:2404.03608.
  • Dryer and Haspelmath (2013) Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online (v2020.3). Zenodo.
  • Durmus et al. (2023) Esin Durmus, Karina Nguyen, Thomas I Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, et al. 2023. Towards measuring the representation of subjective global opinions in language models. arXiv preprint arXiv:2306.16388.
  • Eberhard et al. (2021) David M. Eberhard, Gary F. Simons, and Charles D. Fennig. 2021. Ethnologue: Languages of the World. Twenty-fourth edition. Dallas, Texas: SIL International.
  • Ebrahimi et al. (2022) Abteen Ebrahimi, Manuel Mager, Arturo Oncevay, Vishrav Chaudhary, Luis Chiruzzo, Angela Fan, John Ortega, Ricardo Ramos, Annette Rios, Ivan Vladimir Meza Ruiz, Gustavo Giménez-Lugo, Elisabeth Mager, Graham Neubig, Alexis Palmer, Rolando Coto-Solano, Thang Vu, and Katharina Kann. 2022. AmericasNLI: Evaluating zero-shot natural language understanding of pretrained multilingual models in truly low-resource languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6279–6299, Dublin, Ireland. Association for Computational Linguistics.
  • Elias (2018) Alexander Elias. 2018. Lio and the Central Flores languages. Leiden: Leiden University Master thesis.
  • Evans et al. (2020) Leanne M Evans, Crystasany R Turner, and Kelly R Allen. 2020. "Good teachers" with "good intentions": Misappropriations of culturally responsive pedagogy. Journal of Urban Learning, Teaching, and Research, 15(1):51–73.
  • Federmann et al. (2022) Christian Federmann, Tom Kocmi, and Ying Xin. 2022. NTREX-128 – news test references for MT evaluation of 128 languages. In Proceedings of the First Workshop on Scaling Up Multilingual Evaluation, pages 21–24, Online. Association for Computational Linguistics.
  • Gebru et al. (2021) Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. Datasheets for datasets. Communications of the ACM, 64(12):86–92.
  • Geigle et al. (2023) Gregor Geigle, Abhay Jain, Radu Timofte, and Goran Glavaš. 2023. mBLIP: Efficient bootstrapping of multilingual vision-LLMs. arXiv, abs/2307.06930.
  • Gemma Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu hui Chen, Zafarali Ahmed, Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet, Oriol Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, and Kathleen Kenealy. 2024. Gemma: Open models based on gemini research and technology. Preprint, arXiv:2403.08295.
  • Glotov (2023) Sergei Glotov. 2023. Intercultural film literacy education against cultural misrepresentation: Finnish visual art teachers’ perspectives. Journal of Media Literacy Education, 15(1):31–43.
  • Gong et al. (2023) Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. 2023. Multimodal-GPT: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790.
  • Hammarström et al. (2024) Harald Hammarström, Robert Forkel, Martin Haspelmath, and Sebastian Bank. 2024. Glottolog 5.0. Leipzig: Max Planck Institute for Evolutionary Anthropology.
  • Hasan et al. (2021) Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar. 2021. XL-sum: Large-scale multilingual abstractive summarization for 44 languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4693–4703, Online. Association for Computational Linguistics.
  • He et al. (2022) Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2022. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. In The Eleventh International Conference on Learning Representations.
  • He et al. (2020) Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations.
  • Huang et al. (2021) Po-Yao Huang, Mandela Patrick, Junjie Hu, Graham Neubig, Florian Metze, and Alexander Hauptmann. 2021. Multilingual multimodal pre-training for zero-shot cross-lingual transfer of vision-language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2443–2459, Online. Association for Computational Linguistics.
  • Huynh et al. (2022) Tin Van Huynh, Kiet Van Nguyen, and Ngan Luu-Thuy Nguyen. 2022. ViNLI: A Vietnamese corpus for studies on open-domain natural language inference. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3858–3872, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
  • Ichsan (2023) Muhammad Ichsan. 2023. Merak-7B: The LLM for Bahasa Indonesia. Hugging Face Repository.
  • Imperial et al. (2019) Joseph Marvin Imperial, Jeyrome Orosco, Shiela Mae Mazo, and Lany Maceda. 2019. Sentiment analysis of typhoon related tweets using standard and bidirectional recurrent neural networks. arXiv preprint arXiv:1908.01765.
  • Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.
  • Jiang et al. (2022) Shengyi Jiang, Sihui Fu, Nankai Lin, and Yingwen Fu. 2022. Pretrained models and evaluation data for the Khmer language. Tsinghua Science and Technology, 27(4):709–718.
  • Joshi et al. (2020) Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.
  • Juan et al. (2015) Sarah Samson Juan, Laurent Besacier, Benjamin Lecouteux, and Mohamed Dyab. 2015. Using resources from a closely-related language to develop ASR for a very under-resourced language: A case study for Iban. In Proceedings of INTERSPEECH, Dresden, Germany.
  • Kabra et al. (2023) Anubha Kabra, Emmy Liu, Simran Khanuja, Alham Fikri Aji, Genta Winata, Samuel Cahyawijaya, Anuoluwapo Aremu, Perez Ogayo, and Graham Neubig. 2023. Multi-lingual and multi-cultural figurative language understanding. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8269–8284, Toronto, Canada. Association for Computational Linguistics.
  • Kakwani et al. (2020) Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. 2020. IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4948–4961, Online. Association for Computational Linguistics.
  • Karo et al. (2022) Ichwanul Muslim Karo Karo, Mohd Farhan Md Fudzee, Shahreen Kasim, and Azizul Azhar Ramli. 2022. Sentiment analysis in Karonese tweet using machine learning. Indonesian Journal of Electrical Engineering and Informatics (IJEEI), 10(1):219–231.
  • Khanuja et al. (2023) Simran Khanuja, Sebastian Ruder, and Partha Talukdar. 2023. Evaluating the diversity, equity, and inclusion of NLP technology: A case study for Indian languages. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1763–1777, Dubrovnik, Croatia. Association for Computational Linguistics.
  • Koto et al. (2023a) Fajri Koto, Nurul Aisyah, Haonan Li, and Timothy Baldwin. 2023a. Large language models only pass primary school exams in Indonesia: A comprehensive test on IndoMMLU. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore. Association for Computational Linguistics.
  • Koto et al. (2023b) Fajri Koto, Nurul Aisyah, Haonan Li, and Timothy Baldwin. 2023b. Large language models only pass primary school exams in Indonesia: A comprehensive test on IndoMMLU. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12359–12374, Singapore. Association for Computational Linguistics.
  • Koto et al. (2022) Fajri Koto, Timothy Baldwin, and Jey Han Lau. 2022. Cloze evaluation for deeper understanding of commonsense stories in Indonesian. In Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022), pages 8–16, Dublin, Ireland. Association for Computational Linguistics.
  • Koto and Koto (2020) Fajri Koto and Ikhwan Koto. 2020. Towards computational linguistics in Minangkabau language: Studies on sentiment analysis and machine translation. In Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation, pages 138–148, Hanoi, Vietnam. Association for Computational Linguistics.
  • Koto et al. (2024) Fajri Koto, Rahmad Mahendra, Nurul Aisyah, and Timothy Baldwin. 2024. IndoCulture: Exploring geographically-influenced cultural commonsense reasoning across eleven Indonesian provinces. Preprint, arXiv:2404.01854.
  • Kudugunta et al. (2024) Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. 2024. MADLAD-400: A multilingual and document-level large audited dataset. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA. Curran Associates Inc.
  • Kumar et al. (2022) Aman Kumar, Himani Shrotriya, Prachi Sahu, Amogh Mishra, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Mitesh M. Khapra, and Pratyush Kumar. 2022. IndicNLG benchmark: Multilingual datasets for diverse NLG tasks in Indic languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5363–5394, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Laurençon et al. (2024) Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. 2024. What matters when building vision-language models? Preprint, arXiv:2405.02246.
  • Le and Luu (2023) Thang Le and Anh Luu. 2023. A parallel corpus for Vietnamese central-northern dialect text transfer. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13839–13855, Singapore. Association for Computational Linguistics.
  • Leong et al. (2022) Colin Leong, Joshua Nemecek, Jacob Mansdorfer, Anna Filighera, Abraham Owodunni, and Daniel Whitenack. 2022. Bloom library: Multimodal datasets in 300+ languages for a variety of downstream tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8608–8621, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Leong et al. (2023) Wei Qi Leong, Jian Gang Ngui, Yosephine Susanto, Hamsawardhini Rengarajan, Kengatharaiyer Sarveswaran, and William Chandra Tjhi. 2023. BHASA: A holistic Southeast Asian linguistic and cultural evaluation suite for large language models. arXiv preprint arXiv:2309.06085.
  • Li et al. (2023a) Haonan Li, Fajri Koto, Minghao Wu, Alham Fikri Aji, and Timothy Baldwin. 2023a. Bactrian-X: A multilingual replicable instruction-following model with low-rank adaptation. arXiv preprint arXiv:2305.15011.
  • Li et al. (2023b) Zejun Li, Zhihao Fan, Jingjing Chen, Qi Zhang, Xuanjing Huang, and Zhongyu Wei. 2023b. Unifying cross-lingual and cross-modal modeling towards weakly supervised multilingual vision-language pre-training. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5939–5958, Toronto, Canada. Association for Computational Linguistics.
  • Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  • Lin et al. (2022) Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, and Xian Li. 2022. Few-shot learning with multilingual generative language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9019–9052, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • List et al. (2022) Johann-Mattis List, Robert Forkel, Simon J. Greenhill, Christoph Rzymski, Johannes Englisch, and Russell D. Gray. 2022. Lexibank, a public repository of standardized wordlists with computed phonological and lexical features. Scientific Data, 9(1):316.
  • Liu et al. (2024) Chen Cecilia Liu, Fajri Koto, Timothy Baldwin, and Iryna Gurevych. 2024. Are multilingual LLMs culturally-diverse reasoners? An investigation into multicultural proverbs and sayings. Preprint, arXiv:2309.08591.
  • Liu et al. (2023a) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023a. Improved baselines with visual instruction tuning.
  • Liu et al. (2023b) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023b. Visual instruction tuning. In NeurIPS.
  • Longpre et al. (2021) Shayne Longpre, Yi Lu, and Joachim Daiber. 2021. MKQA: A linguistically diverse benchmark for multilingual open domain question answering. Transactions of the Association for Computational Linguistics, 9:1389–1406.
  • Longpre et al. (2023) Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, et al. 2023. The data provenance initiative: A large scale audit of dataset licensing & attribution in AI. arXiv preprint arXiv:2310.16787.
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.
  • Mager et al. (2021) Manuel Mager, Arturo Oncevay, Annette Rios, Ivan Vladimir Meza Ruiz, Alexis Palmer, Graham Neubig, and Katharina Kann, editors. 2021. Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas. Association for Computational Linguistics, Online.
  • Mahendra et al. (2021) Rahmad Mahendra, Alham Fikri Aji, Samuel Louvan, Fahrurrozi Rahman, and Clara Vania. 2021. IndoNLI: A natural language inference dataset for Indonesian. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10511–10527, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Muennighoff et al. (2023) Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2023. Crosslingual generalization through multitask finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15991–16111, Toronto, Canada. Association for Computational Linguistics.
  • Muennighoff et al. (2022) Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.
  • Muzad and Rahutomo (2016) Aad Muzad and Faisal Rahutomo. 2016. Korpus berita daring bahasa indonesia dengan depth first focused crawling. Prosiding Sentrinov (Seminar Nasional Terapan Riset Inovatif), 2(1):11–20.
  • Neubig and Hu (2018) Graham Neubig and Junjie Hu. 2018. Rapid adaptation of neural machine translation to new languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 875–880, Brussels, Belgium. Association for Computational Linguistics.
  • Nguyen et al. (2020) Kiet Nguyen, Vu Nguyen, Anh Nguyen, and Ngan Nguyen. 2020. A Vietnamese dataset for evaluating machine reading comprehension. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2595–2605, Barcelona, Spain (Online). International Committee on Computational Linguistics.
  • Nguyen et al. (2024) Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. 2024. CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4226–4237, Torino, Italia. ELRA and ICCL.
  • Nguyen et al. (2023) Xuan-Phi Nguyen, Wenxuan Zhang, Li Xin, Mahani Aljunied, Weiwen Xu, Hou Pong Chan, Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, and Lidong Bing. 2023. SeaLLMs - large language models for Southeast Asia. Preprint, arXiv:2312.00738.
  • OpenAI et al. (2024) OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. 2024. GPT-4 technical report. Preprint, arXiv:2303.08774.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  • Palen-Michel and Lignos (2023) Chester Palen-Michel and Constantine Lignos. 2023. LR-sum: Summarization for less-resourced languages. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6829–6844, Toronto, Canada. Association for Computational Linguistics.
  • Phatthiyaphaibun et al. (2023) Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, and Can Udomcharoenchaikit. 2023. PyThaiNLP: Thai natural language processing in python. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 25–36, Singapore. Association for Computational Linguistics.
  • Phatthiyaphaibun et al. (2024) Wannaphong Phatthiyaphaibun, Surapon Nonesung, Patomporn Payoungkhamdee, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Jitkapat Sawatphol, Chompakorn Chaksangchaichot, Ekapol Chuangsuwanich, and Sarana Nutanong. 2024. WangchanLion and WangchanX MRC Eval. Preprint, arXiv:2403.16127.
  • Ponti et al. (2020) Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020. XCOPA: A multilingual dataset for causal commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2362–2376, Online. Association for Computational Linguistics.
  • Popović (2015) Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.
  • Popović (2017) Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.
  • Pratap et al. (2024) Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. 2024. Scaling speech technology to 1,000+ languages. Journal of Machine Learning Research, 25(97):1–52.
  • Project (2024) The Joshua Project. 2024. The Joshua Project.
  • Purwarianti and Crisdayanti (2019) Ayu Purwarianti and Ida Ayu Putu Ari Crisdayanti. 2019. Improving Bi-LSTM performance for Indonesian sentiment analysis using paragraph vector. In 2019 International Conference of Advanced Informatics: Concepts, Theory and Applications (ICAICTA), pages 1–5. IEEE.
  • Purwarianti et al. (2007) Ayu Purwarianti, Masatoshi Tsuchiya, and Seiichi Nakagawa. 2007. A machine learning approach for Indonesian question answering system. In Artificial Intelligence and Applications, pages 573–578.
  • Putra et al. (2024) I Made Suwija Putra, Daniel Siahaan, and Ahmad Saikhu. 2024. SNLI Indo: A recognizing textual entailment dataset in Indonesian derived from the Stanford Natural Language Inference dataset. Data in Brief, 52:109998.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.
  • Radford et al. (2023) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine Mcleavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 28492–28518. PMLR.
  • Riccosan and Saputra (2023) Riccosan and Karen Etania Saputra. 2023. Multilabel multiclass sentiment and emotion dataset from Indonesian mobile application review. Data in Brief, 50:109576.
  • Riley et al. (2020) Parker Riley, Isaac Caswell, Markus Freitag, and David Grangier. 2020. Translationese as a language in “multilingual” NMT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7737–7746, Online. Association for Computational Linguistics.
  • Rizqullah et al. (2023) Muhammad Razif Rizqullah, Ayu Purwarianti, and Alham Fikri Aji. 2023. QASiNa: Religious domain question answering using Sirah Nabawiyah. In 2023 10th International Conference on Advanced Informatics: Concept, Theory and Application (ICAICTA), pages 1–6. IEEE.
  • Rouditchenko et al. (2023) Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, and James Glass. 2023. Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages. In Proc. INTERSPEECH 2023, pages 2268–2272.
  • Ruder et al. (2023) Sebastian Ruder, Jonathan H Clark, Alexander Gutkin, Mihir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani, Parker Riley, Jean-Michel Sarr, Xinyi Wang, et al. 2023. XTREME-UP: A user-centric scarce-data benchmark for under-represented languages. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1856–1884.
  • Sanh et al. (2021) Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Stella Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M. Rush. 2021. Multitask prompted training enables zero-shot task generalization. Preprint, arXiv:2110.08207.
  • Sani et al. (2012) Auliya Sani, Sakriani Sakti, Graham Neubig, Tomoki Toda, Adi Mulyanto, and Satoshi Nakamura. 2012. Towards language preservation: Preliminary collection and vowel analysis of Indonesian ethnic speech data. In 2012 International Conference on Speech Database and Assessments, pages 118–122.
  • Schuhmann et al. (2022) Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294.
  • Setya and Mahendra (2018) Ken Nabila Setya and Rahmad Mahendra. 2018. Semi-supervised textual entailment on Indonesian Wikipedia data. In International Conference on Computational Linguistics and Intelligent Text Processing, pages 416–427. Springer.
  • Singapore (2023) AI Singapore. 2023. SEA-LION (Southeast Asian Languages In One Network): A family of large language models for Southeast Asia. https://github.com/aisingapore/sealion.
  • Singh et al. (2024) Shivalika Singh, Freddie Vargus, Daniel D’souza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura OMahony, et al. 2024. Aya dataset: An open-access collection for multilingual instruction tuning. arXiv preprint arXiv:2402.06619.
  • Søgaard (2022) Anders Søgaard. 2022. Should we ban English NLP for a year? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5254–5260, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Sutoyo et al. (2022) Rhio Sutoyo, Said Achmad, Andry Chowanda, Esther Widhi Andangsari, and Sani M. Isa. 2022. PRDECT-ID: Indonesian product reviews dataset for emotions classification tasks. Data in Brief, 44:108554.
  • Täckström et al. (2012) Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 477–487, Montréal, Canada. Association for Computational Linguistics.
  • Talat et al. (2022) Zeerak Talat, Aurélie Névéol, Stella Biderman, Miruna Clinciu, Manan Dey, Shayne Longpre, Sasha Luccioni, Maraim Masoud, Margaret Mitchell, Dragomir Radev, Shanya Sharma, Arjun Subramonian, Jaesung Tae, Samson Tan, Deepak Tunuguntla, and Oskar Van Der Wal. 2022. You reap what you sow: On the challenges of bias evaluation under multilingual settings. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, pages 26–41, virtual+Dublin. Association for Computational Linguistics.
  • Thapliyal et al. (2022) Ashish V. Thapliyal, Jordi Pont Tuset, Xi Chen, and Radu Soricut. 2022. Crossmodal-3600: A massively multilingual multimodal evaluation dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 715–729, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • Tran et al. (2021) Khanh Quoc Tran, Phap Ngoc Trinh, Khoa Nguyen-Anh Tran, An Tran-Hoai Le, Luan Van Ha, and Kiet Van Nguyen. 2021. An empirical investigation of online news classification on an open-domain, large-scale and high-quality dataset in Vietnamese. In New Trends in Intelligent Software Methodologies, Tools and Techniques, pages 367–379. IOS Press.
  • Üstün et al. (2024) Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, et al. 2024. Aya model: An instruction finetuned open-access multilingual language model. arXiv preprint arXiv:2402.07827.
  • Van Nguyen et al. (2022) Kiet Van Nguyen, Tin Van Huynh, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, and Ngan Luu-Thuy Nguyen. 2022. New Vietnamese corpus for machine reading comprehension of health news articles. ACM Trans. Asian Low-Resour. Lang. Inf. Process., 21(5).
  • Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575.
  • Wang et al. (2023) Bin Wang, Zhengyuan Liu, Xin Huang, Fangkai Jiao, Yang Ding, Ai Ti Aw, and Nancy F Chen. 2023. SeaEval for multilingual foundation models: From cross-lingual alignment to cultural reasoning. arXiv preprint arXiv:2309.04766.
  • Wang et al. (2024) Bin Wang, Zhengyuan Liu, Xin Huang, Fangkai Jiao, Yang Ding, Ai Ti Aw, and Nancy F. Chen. 2024. SeaEval for multilingual foundation models: From cross-lingual alignment to cultural reasoning. NAACL.
  • Wei et al. (2021) Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
  • Wibowo et al. (2023) Haryo Akbarianto Wibowo, Erland Hilman Fuadi, Made Nindyatama Nityasya, Radityo Eko Prasojo, and Alham Fikri Aji. 2023. COPAL-ID: Indonesian language reasoning with local culture and nuances. arXiv preprint arXiv:2311.01012.
  • Wilie et al. (2020) Bryan Wilie, Karissa Vincentio, Genta Indra Winata, Samuel Cahyawijaya, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, and Ayu Purwarianti. 2020. IndoNLU: Benchmark and resources for evaluating Indonesian natural language understanding. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 843–857, Suzhou, China. Association for Computational Linguistics.
  • Winata et al. (2023) Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, Ade Romadhony, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Pascale Fung, Timothy Baldwin, Jey Han Lau, Rico Sennrich, and Sebastian Ruder. 2023. NusaX: Multilingual parallel sentiment dataset for 10 Indonesian local languages. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 815–834, Dubrovnik, Croatia. Association for Computational Linguistics.
  • Winata et al. (2024) Genta Indra Winata, Ruochen Zhang, and David Ifeoluwa Adelani. 2024. MINERS: Multilingual language models as semantic retrievers. arXiv preprint arXiv:2406.07424.
  • Workshop et al. (2022) BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
  • Xue et al. (2021) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
  • Yong et al. (2023) Zheng Xin Yong, Ruochen Zhang, Jessica Forde, Skyler Wang, Arjun Subramonian, Holy Lovenia, Samuel Cahyawijaya, Genta Winata, Lintang Sutawika, Jan Christian Blaise Cruz, Yin Lin Tan, Long Phan, Rowena Garcia, Thamar Solorio, and Alham Aji. 2023. Prompting multilingual large language models to generate code-mixed texts: The case of South East Asian languages. In Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching, pages 43–63, Singapore. Association for Computational Linguistics.
  • Zhang et al. (2023a) Ruochen Zhang, Samuel Cahyawijaya, Jan Christian Blaise Cruz, Genta Winata, and Alham Aji. 2023a. Multilingual large language models are not (yet) code-switchers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12567–12582, Singapore. Association for Computational Linguistics.
  • Zhang et al. (2023b) Wenxuan Zhang, Mahani Aljunied, Chang Gao, Yew Ken Chia, and Lidong Bing. 2023b. M3Exam: A multilingual, multimodal, multilevel benchmark for examining large language models. In Advances in Neural Information Processing Systems, volume 36, pages 5484–5505. Curran Associates, Inc.
  • Zhang et al. (2024) Wenxuan Zhang, Mahani Aljunied, Chang Gao, Yew Ken Chia, and Lidong Bing. 2024. M3Exam: A multilingual, multimodal, multilevel benchmark for examining large language models. Advances in Neural Information Processing Systems, 36.
  • Zhu et al. (2024) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2024. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. ICLR.

Appendix A Summary of SEACrowd

Benchmark | # Languages | # Indigenous SEA Languages | # Datasets | # Tasks
--- | --- | --- | --- | ---
SEACrowd (ours) | 39 | 38 | 254 | 13 (11 text, 1 speech, 1 vision)
NusaCrowd (Cahyawijaya et al., 2023a) | 19 | 19 | 137 | 12 (11 text, 1 speech)
BUFFET (Asai et al., 2023) | 54 | N/A | 15 | 8 (8 text)
XTREME-UP (Ruder et al., 2023) | 88 | 11 | 269 | 9 (7 text, 1 speech, 1 vision)
Table 3: Benchmark comparison. The numbers in SEACrowd and NusaCrowd are the numbers of datasets included in the evaluation.

Addressing the resource gaps and challenges in AI development for Southeast Asian (SEA) languages is essential for our region’s sustainable and prosperous future. The lack of representation of SEA languages in machine learning pre-training models severely impacts their quality. Additionally, the scarcity of high-quality datasets and evaluation tools further hinders progress in AI for SEA languages. The dominance of English-centric training data introduces cultural biases and fails to capture the local values and nuances of SEA cultures. To overcome these obstacles, SEACrowd provides a comprehensive and standardized resource center, along with evaluation tasks, for nearly 1,000 SEA indigenous and non-indigenous languages across various modalities. SEACrowd closes the resource and evaluation gaps, enabling researchers and developers to improve the performance of AI models for SEA languages.

The journey does not end here. Concrete next steps are essential to drive AI advancement in Southeast Asia. Strategic investments in research and development, collaborations with local communities, and efforts toward language preservation are imperative. Governments, industry leaders, and stakeholders must prioritize the development of national and under-resourced local languages to ensure resource equity and promote inclusivity in AI technology. By taking bold actions, such as funding initiatives for data collection and model training, establishing partnerships with local communities, and focusing on language preservation, we can unlock the full potential of Southeast Asian languages. Not only will this spur economic growth, but it will also preserve the region’s rich cultural heritage.

In conclusion, developing AI for Southeast Asian languages is not a mere necessity but an opportunity to create a sustainable and prosperous future. By addressing resource gaps, accurately evaluating models, and fostering inclusive AI development, we can harness the power of SEA languages to drive long-term economic growth while preserving our region’s cultural diversity.

Appendix B Related Work

SEA data resources

LLM research efforts for SEA languages are limited by the lack of available datasets and benchmarks. To date, resources for SEA NLP tasks are concentrated on relatively higher-resource SEA indigenous languages, such as Indonesian (Mahendra et al., 2021; Wilie et al., 2020; Cahyawijaya et al., 2021, 2023a) and Vietnamese (Nguyen et al., 2020; Huynh et al., 2022; Le and Luu, 2023; Van Nguyen et al., 2022). NusaCrowd Cahyawijaya et al. (2023a) introduces the first multimodal benchmark for Indonesian languages, including text and speech. Ruder et al. (2023) introduce a multimodal benchmark covering 88 languages, 11 of which are indigenous to SEA.

Additionally, Asai et al. (2023) present an LLM benchmark for cross-lingual few-shot transfer, comprising 15 distinct tasks and 54 languages sourced from varied multilingual datasets. Furthermore, Dou et al. (2024) find that publicly available pre-training data for SEA languages suffer from quality issues such as textual duplicates and excessive occurrences of Unicode escapes. On the other hand, pre-trained LLMs specifically for SEA languages suffer from limited language coverage; for instance, Cendol Cahyawijaya et al. (2024b), Sailor Dou et al. (2024), SEA-LION (Singapore, 2023), and SeaLLMs Nguyen et al. (2023) have only covered up to 11 different SEA languages, including English and Chinese.

Open-source Community Initiatives in NLP

Open-source and open-science communities play a crucial role in engaging native speakers to curate large-scale multilingual NLP resources. In the past, collaborative efforts have been organized to collect data and train multilingual language models either on a global scale (Workshop et al., 2022; Singh et al., 2024; Üstün et al., 2024) or on a regional level, e.g., Masakhane for African languages (Adelani et al., 2021, 2022b, 2022a, 2023), AI4Bharat for Indian languages (Kakwani et al., 2020; Kumar et al., 2022; Dabre et al., 2022, inter alia), and AmericasNLP for Latin American languages Mager et al. (2021); Ebrahimi et al. (2022). In the SEA region, there have been community-based initiatives, e.g., IndoNLP, PyThaiNLP, and RojakNLP, to study NLP on Indonesian languages (Aji et al., 2022; Wilie et al., 2020; Cahyawijaya et al., 2021, 2023a), Thai language (Phatthiyaphaibun et al., 2023), and the code-switching phenomenon in SEA (Aji et al., 2023; Yong et al., 2023; Winata et al., 2024), respectively.

Submission Points Max points
Public datasheet 2+bonus 6
Dataloader 3 6 if difficult
Private datasheet 1 -
Access to private data 4+bonus 10 if high-quality
Datasheet review 1 1
Dataloader review 2 4 if difficult
Private datasheet review 0.5 -
Private data contact 1 5 if succeeds
Table 4: Amount of points obtained for contributions related to datasheet, dataloader, and private data.

Appendix C Contributing to SEACrowd

C.1 Open Contributions

We identify four tasks for open contribution in SEACrowd (landing page: https://github.com/SEACrowd). These tasks and the workflow of SEACrowd are heavily influenced by, and extend, NusaCrowd Cahyawijaya et al. (2023a, 2022), a collaborative effort to pool data resources for Indonesian NLP.

Figure 7: A glimpse of SEACrowd Catalogue.
  • Submitting Metadata for Existing Public Datasets. Contributors can submit detailed datasheets for existing datasets through the public datasheet form (https://form.jotform.com/team/232952680898069/seacrowd-sea-datasets). Contributors must provide important information such as the data license, size, language and dialect, and annotation method. Approved datasheets, as well as datasheets under review, are shown and indexed in a monitoring spreadsheet and the SEACrowd Catalogue (Figure 7).

  • Building a Dataloader. Starting from the datasheets approved in the previous task, contributors can further contribute by building a HuggingFace dataset loader to ensure that all datasets in SEACrowd are standardized in terms of formatting and usage. Contributors can follow the dataloader guide and examples available in the SEACrowd Data Hub (dataloader guide: https://github.com/SEACrowd/seacrowd-datahub/blob/master/DATALOADER.md). Dataloader maintainers and reviewers also monitor the self-assigned dataloader issues after 2 weeks of inactivity and ping contributors in case of a blocking impediment.

  • Identifying Private AI Datasets for SEA Languages, Cultures, and/or Regions. Unfortunately, the data from a number of prior works involving SEA languages are still not publicly available. This may be due to several different reasons, including (but not limited to): non-release contracts related to funding, inclusion of private and personally identifiable data, and the use of explicitly private data such as those used by for-profit companies.

    In this task, contributors can search for works that contain private data and fill out a corresponding record form (papers with private dataset form: https://form.jotform.com/team/232952680898069/seacrowd-paper-with-private-dataset). The SEACrowd team then attempts to contact the original data owners and negotiate the open-sourcing of their resources.

  • Opening a Private AI Dataset of SEA. If a contributor has previous work with closed data (or has been contacted by the SEACrowd team regarding closed-source data), they can decide to release their resources and register them in the collection via the public datasheet form. The resource will still be owned by the original contributor and is still tied to the contributor’s previous work, as SEACrowd simply catalogs it and records its now open-source license.

C.2 Measuring Contributions

Figure 8: The timeline of SEACrowd’s entire run.

To be considered as a co-author, 20 contribution points are required. (Submissions past the deadlines, see Appendix D.1, are still recorded, but contribution points are no longer given.) To monitor how many points the contributors have obtained, a contribution point tracker is provided and updated regularly. The purpose of the point system is not to create a barrier to collaboration but to reward rare and high-quality dataset entries. Table 4 describes the contribution points (contribution point guidelines: https://github.com/SEACrowd/seacrowd-datahub/blob/master/POINTS.md). A bonus of 1 point is given if the dataset modality is speech or vision. We also provide a bonus based on language rarity in terms of available resources, as defined by Joshi et al. (2020) (https://microsoft.github.io/linguisticdiversity/assets/lang2tax.txt): 1 point for languages in levels 1 and 2, and 2 points for languages in level 0 or absent from the list. For other contributions not mentioned in Table 4 (e.g., maintenance, design, experiments, and paper writing), the number of contribution points is adjusted to the volume and complexity of the relevant work.
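
The bonus rules above can be illustrated with a minimal sketch for a public datasheet submission. The base (2) and maximum (6) values come from Table 4. The function names, the treatment of levels 3 and above as earning no rarity bonus, and the assumption that each bonus is applied once per datasheet are ours and are not confirmed by the guidelines.

```python
from typing import Optional

BASE_POINTS = 2                           # Table 4: public datasheet base points
MAX_POINTS = 6                            # Table 4: public datasheet maximum
BONUS_MODALITIES = {"speech", "vision"}   # modalities that earn +1


def rarity_bonus(joshi_level: Optional[int]) -> int:
    """Language-rarity bonus; None means the language is absent from the list."""
    if joshi_level is None or joshi_level == 0:
        return 2
    if joshi_level in (1, 2):
        return 1
    return 0  # assumption: levels 3 and above earn no rarity bonus


def public_datasheet_points(modality: str, joshi_level: Optional[int]) -> int:
    """Hypothetical helper: base points plus bonuses, capped at the maximum."""
    modality_bonus = 1 if modality.lower() in BONUS_MODALITIES else 0
    total = BASE_POINTS + modality_bonus + rarity_bonus(joshi_level)
    return min(total, MAX_POINTS)
```

Under these assumptions, a speech dataset in a language absent from the Joshi et al. (2020) list would earn 2 + 1 + 2 = 5 points.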

Appendix D Progression of SEACrowd

D.1 Timeline

SEACrowd released the open call for contributions on 1 November 2023. The call lasted until 31 March 2024 for datasheet submissions, and until 15 May 2024 for both dataloader and private dataset submissions. SEACrowd contributors hold a biweekly discussion regarding the challenges they face while contributing, the next steps they should take to proceed, and/or experiment and research ideas for the paper. The detailed timeline can be seen in Figure 8.
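
As a small illustration of how these deadlines interact with the point system in Appendix C.2, the sketch below checks whether a submission date still earns points. The dictionary keys and function name are hypothetical, and treating the deadline day itself as eligible is an assumption.

```python
from datetime import date

OPEN_CALL_START = date(2023, 11, 1)

# Deadlines stated in D.1; the key names are hypothetical labels.
DEADLINES = {
    "datasheet": date(2024, 3, 31),
    "dataloader": date(2024, 5, 15),
    "private_dataset": date(2024, 5, 15),
}


def earns_points(kind: str, submitted_on: date) -> bool:
    """Return True if the submission falls inside the open-call window.

    Assumption: the deadline day is inclusive. Later submissions are
    still recorded but earn no contribution points.
    """
    return OPEN_CALL_START <= submitted_on <= DEADLINES[kind]
```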

D.2 Contribution Progress

Figure 9 shows the number of submissions for public datasheets, dataloader pull requests, and papers with private datasets in SEACrowd.

Appendix E Reviewing SEACrowd’s Submissions

We provide the complete reviewing guidelines in our Data Hub (reviewer SOP: https://github.com/SEACrowd/seacrowd-datahub/blob/master/REVIEWING.md).

E.1 Datasheet Reviewing

The datasheet reviewing standard operating procedure (SOP) ensures the integrity and completeness of datasets submitted to SEACrowd. It outlines procedures for verifying dataset availability, avoiding duplicates, and ensuring correctness and relevance to the SEA region. The SOP includes FAQs addressing common issues such as dataset duplicates and incorrect information, along with an approval checklist covering aspects like data availability, dataset splits, and licensing. Reviewers are instructed on how to handle various scenarios, including correcting errors and determining point allocation for multiple contributors. For instance, if the submitted datasheet has incorrect or missing information, the reviewer can either ask the contributor to fix it (with some guidance) or fix it themselves. Upon completion of the review, reviewers update the status, add notes and points, and await the generation of a GitHub issue for the approved datasheet.

Figure 9: Weekly status update of the cumulative number of submissions in SEACrowd.

E.2 Dataloader Reviewing

The dataloader reviewing SOP governs the review process for dataloaders in SEACrowd, ensuring adherence to the data structure and seacrowd schema and config standards. It specifies checks for metadata correctness, subset implementation, test script passing, and adherence to coding conventions. Additionally, it outlines dataloader config rules based on dataset types and provides guidelines for multilingual datasets. The SOP emphasizes the importance of reviewer collaboration, with each dataloader requiring two reviewers per submitted pull request, and outlines the approval and reviewer assignment process, either by allocation or by self-assignment based on availability and promptness.

Appendix F Schemas in SEACrowd

Schemas define and format the attributes of the dataset returned by a dataloader. For each dataloader, we implement 2 schema types: the source schema and the seacrowd schema. The source schema presents the dataset in a format similar to its original structure, while the seacrowd schema standardizes the data structure across similar tasks.

The following subsections define the seacrowd schemas in NLP (F.1), speech (F.2), and VL (F.3).

Subset ID Language Region # Samples
Sentiment Analysis → *_seacrowd_text
lazada_review_filipino fil Philippines 1001
gklmip_sentiment mya Myanmar 716
indolem_sentiment ind Indonesia 1011
id_sentiment_analysis ind Indonesia 10806
karonese_sentiment btx Indonesia 1000
wisesight_thai_sentiment tha Thailand 2671
wongnai_reviews tha Thailand 6203
typhoon_yolanda_tweets fil Philippines 153
smsa ind Indonesia 500
prdect_id_sentiment ind Indonesia 5400
id_sent_emo_mobile_apps_sentiment ind Indonesia 21696
shopee_reviews_tagalog fil Philippines 2250
nusatranslation_senti_abs abs Indonesia 500
nusatranslation_senti_btk btx Indonesia 1200
nusatranslation_senti_bew bew Indonesia 1200
nusatranslation_senti_bhp bhp Indonesia 500
nusatranslation_senti_jav jav Indonesia 1200
nusatranslation_senti_mad mad Indonesia 1200
nusatranslation_senti_mak mak Indonesia 1200
nusatranslation_senti_min min Indonesia 1200
nusatranslation_senti_mui mui Indonesia 500
nusatranslation_senti_rej rej Indonesia 500
nusatranslation_senti_sun sun Indonesia 1200
nusax_senti_ind ind Indonesia 400
nusax_senti_ace ace Indonesia 400
nusax_senti_jav jav Indonesia 400
nusax_senti_sun sun Indonesia 400
nusax_senti_min min Indonesia 400
nusax_senti_bug bug Indonesia 400
nusax_senti_bbc bbc Indonesia 400
nusax_senti_ban ban Indonesia 400
nusax_senti_nij nij Indonesia 400
nusax_senti_mad mad Indonesia 400
nusax_senti_bjn bjn Indonesia 400
nusax_senti_eng eng Non-indigenous 400
indonglish ind Indonesia 1011
Table 5: Sentiment analysis data subsets used in SEACrowd NLU evaluation.
Subset ID Language Region # Samples
NLI → *_seacrowd_pairs
indonli ind Indonesia 5183
wrete ind Indonesia 100
snli_indo ind Indonesia 9823
myxnli mya Myanmar 5010
xnli.tha tha Thailand 5010
xnli.vie vie Vietnam 5010
Table 6: NLI data subsets used in SEACrowd NLU evaluation.
Subset ID Language Region # Samples
Topic Classification → *_seacrowd_text
gklmip_newsclass khm Cambodia 1436
indonesian_news_dataset ind Indonesia 2627
uit_vion vie Vietnam 26000
sib_200_ace_Arab ace Indonesia 204
sib_200_ace_Latn ace Indonesia 204
sib_200_ban_Latn ban Indonesia 204
sib_200_bjn_Arab bjn Indonesia 204
sib_200_bjn_Latn bjn Indonesia 204
sib_200_bug_Latn bug Indonesia 204
sib_200_ceb_Latn ceb Philippines 204
sib_200_ilo_Latn ilo Philippines 204
sib_200_ind_Latn ind Indonesia 204
sib_200_jav_Latn jav Indonesia 204
sib_200_kac_Latn kac Myanmar 204
sib_200_khm_Khmr khm Cambodia 204
sib_200_lao_Laoo lao Laos 204
sib_200_lus_Latn lus Myanmar 204
sib_200_min_Arab min Indonesia 204
sib_200_min_Latn min Indonesia 204
sib_200_mya_Mymr mya Myanmar 204
sib_200_pag_Latn pag Philippines 204
sib_200_shn_Mymr shn Myanmar 204
sib_200_sun_Latn sun Indonesia 204
sib_200_tgl_Latn fil Philippines 204
sib_200_tha_Thai tha Thailand 204
sib_200_vie_Latn vie Non-indigenous 204
sib_200_war_Latn war Philippines 204
sib_200_zsm_Latn zsm Malaysia 204
nusaparagraph_topic_btk btx Indonesia 500
nusaparagraph_topic_bew bew Indonesia 800
nusaparagraph_topic_bug bug Indonesia 300
nusaparagraph_topic_jav jav Indonesia 800
nusaparagraph_topic_mad mad Indonesia 700
nusaparagraph_topic_mak mak Indonesia 700
nusaparagraph_topic_min min Indonesia 800
nusaparagraph_topic_mui mui Indonesia 400
nusaparagraph_topic_rej rej Indonesia 350
nusaparagraph_topic_sun sun Indonesia 900
Table 7: Topic classification data subsets used in SEACrowd NLU evaluation.
Subset ID Language Region # Samples
Commonsense Reasoning → *_seacrowd_text/qa
emotes_3k_tgl fil Philippines 2905
emotes_3k_eng eng Non-indigenous 2905
indo_story_cloze ind Indonesia 1135
xstorycloze_id ind Indonesia 1511
xstorycloze_my mya Myanmar 1511
Table 8: Commonsense reasoning data subsets used in SEACrowd NLU evaluation.
Subset ID Language Region # Samples
Standard Testing QA → *_seacrowd_qa
indommlu_ind ind Indonesia 14979
indommlu_ban ban Indonesia 14979
indommlu_mad mad Indonesia 14979
indommlu_mak mak Indonesia 14979
indommlu_sun sun Indonesia 14979
indommlu_jav jav Indonesia 14979
indommlu_bjn bjn Indonesia 14979
indommlu_abl abl Indonesia 14979
indommlu_nij nij Indonesia 14979
seaeval_cross_mmlu_ind ind Indonesia 150
seaeval_cross_mmlu_vie vie Vietnam 150
seaeval_cross_mmlu_zlm zsm Malaysia 150
seaeval_cross_mmlu_fil fil Philippines 150
seaeval_cross_logiqa_ind ind Indonesia 176
seaeval_cross_logiqa_vie vie Vietnam 176
seaeval_cross_logiqa_zlm zsm Malaysia 176
seaeval_cross_logiqa_fil fil Philippines 176
m3exam_jav jav Indonesia 371
m3exam_tha tha Thailand 2168
m3exam_vie vie Vietnam 1789
okapi_m_arc_ind ind Indonesia 1170
okapi_m_arc_vie vie Vietnam 1170
Cultural QA → *_seacrowd_qa
copal_colloquial ind Indonesia 559
xcopa_tha tha Thailand 500
xcopa_vie vie Vietnam 500
xcopa_ind ind Indonesia 500
seaeval_sg_eval_eng eng Non-indigenous 103
seaeval_ph_eval_eng eng Non-indigenous 100
mabl_ind ind Indonesia 1140
mabl_jav jav Indonesia 600
mabl_sun sun Indonesia 600
Reading Comprehension QA → *_seacrowd_qa
belebele_ceb_latn ceb Philippines 900
belebele_ilo_latn ilo Philippines 900
belebele_ind_latn ind Indonesia 900
belebele_jav_latn jav Indonesia 900
belebele_kac_latn kac Myanmar 900
belebele_khm_khmr khm Cambodia 900
belebele_lao_laoo lao Laos 900
belebele_mya_mymr mya Myanmar 900
belebele_shn_mymr shn Myanmar 900
belebele_sun_latn sun Indonesia 900
belebele_tgl_latn fil Philippines 900
belebele_tha_thai tha Thailand 900
belebele_vie_latn vie Vietnam 900
belebele_war_latn war Philippines 900
belebele_zsm_latn zsm Malaysia 900
Table 9: Multiple-choice QA data subsets used in SEACrowd NLU evaluation.
Subset ID Language Region # Samples
Extractive & Abstractive QA → *_seacrowd_qa
facqa ind Indonesia 311
iapp_squad tha Thailand 739
qasina ind Indonesia 500
mkqa_khm khm Cambodia 10000
mkqa_zsm zsm Malaysia 10000
mkqa_tha tha Thailand 10000
mkqa_vie vie Vietnam 10000
Table 10: Extractive and abstractive QA subsets used in SEACrowd NLG evaluation.
Subset ID Language Region # Samples
Summarization → *_seacrowd_t2t
lr_sum_ind ind Indonesia 500
lr_sum_vie vie Vietnam 1460
lr_sum_lao lao Laos 1496
lr_sum_tha tha Thailand 500
lr_sum_khm khm Cambodia 486
lr_sum_mya mya Myanmar 990
xl_sum_mya mya Myanmar 570
xl_sum_ind ind Indonesia 4780
xl_sum_tha tha Thailand 826
xl_sum_vie vie Vietnam 4013
Table 11: Summarization data subsets used in SEACrowd NLG evaluation.

F.1 NLP

  • Unlabeled text (SSP). This schema could be used for language modeling in self-supervised pre-training. It consists of (id, text), where id denotes a unique row identifier of the dataset and text denotes an input text.

  • Single-label text classification (TEXT). This schema could be used for sentiment analysis, emotion classification, legal classification, and others. It consists of (id, text, label), where id denotes a unique row identifier of the dataset, text denotes an input text, and label denotes a deterministic target variable.

  • Multi-label text classification (TEXT MULTI). This schema could be used for hate speech detection and aspect-based sentiment analysis. It consists of (id, text, labels), where id denotes a unique row identifier of the dataset, text denotes an input text, and labels denotes a list of deterministic target variables.

  • Text-to-text (T2T). This schema could be used for machine translation, summarization, and paraphrasing. It consists of (id, text_1, text_2, text_1_name, text_2_name), where id denotes a unique row identifier of the dataset, text_1 and text_2 denote an input text pair, and text_1_name and text_2_name denote the names of the input text pair (e.g., ind and jav for translation input text pairs, or document and summary for summarization input text pairs).

  • Sequence labeling (SEQ LABEL). This schema could be used for named entity recognition (NER), POS tagging, and others. It consists of (id, tokens, labels), where id denotes a unique row identifier of the dataset, tokens denotes a list of tokens of an input text, and labels denotes a list of targets for the tokens.

  • Question answering (QA). This schema could be used for extractive QA, multiple-choice QA, and others. It consists of (id, question_id, document_id, question, type, choices, context, answer), where id denotes a unique row identifier of the dataset, question_id denotes a unique identifier of the question, document_id denotes a unique identifier of the context document, question denotes an input question to be answered, type denotes the type of the QA task (e.g., extractive, multiple-choice, open-generative, closed-generative, etc.), choices denotes a list of answer choices (if required), context denotes a passage that serves as the background information of the question (if required), and answer denotes the gold answer to the question (if required).

  • Single-label text pair classification (PAIRS). This could be used for textual entailment and next-sentence prediction. It consists of (id, text_1, text_2, label), where id denotes a unique row identifier of the dataset, text_1 and text_2 denote an input text pair, and label denotes the target variable.

  • Single-label text pair classification with continuous values or regression (PAIRS SCORE). This could be used for answer grading and semantic textual similarity. It consists of (id, text_1, text_2, label), where id denotes a unique row identifier of the dataset, text_1 and text_2 denote an input text pair, and label denotes a target variable as a continuous value.

  • Multi-label text pair classification (PAIRS MULTI). This could be used for morphological inflection. It consists of (id, text_1, text_2, labels), where id denotes a unique row identifier of the dataset, text_1 and text_2 denote an input text pair, and labels denotes a list of target variables.

  • Knowledge base (KB). This schema could be used for constituency parsing, dependency parsing, coreference resolution, dialogue systems, and other tasks with complex structures. It consists of (id, passages, entities, events, coreferences, relations). Considering its intricate structure, we encourage readers to take a look at the implementation of the knowledge base schema.

  • Tree (TREE). This schema could be used for constituency parsing. It assumes a document with subnode elements and a tree hierarchy. It consists of (id, passage, nodes), where id denotes a unique row identifier of the dataset, passage denotes the passage for that particular id and consists of (id, type, text, offsets), and nodes denotes the nodes for that particular id, each consisting of (id, type, text, offsets, subnodes).

  • Conversational Chat (CHAT). This schema could be used for conversational chat and/or multi-turn conversation. It consists of (id, input, output, meta), where id denotes a unique row identifier of the dataset, input denotes a sequence whose elements consist of content (the input prompt) and role (the role of the entity providing the prompt), output denotes the answer to that input prompt, and meta denotes relevant details to allow some flexibility of the schema (if required).

  • End-to-end Task Oriented Dialogue (TOD). This schema could be used for end-to-end task-oriented dialogue. It consists of (dialogue_idx, dialogue), where dialogue_idx denotes a unique row identifier of the dialogue, and dialogue denotes core details such as the turn label, system utterance, turn idx, belief state (consisting of slots and act), user utterance, and system acts.
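
To make the relationship between a source-style row and a seacrowd schema concrete, the following minimal sketch lists the field names of several NLP schemas described above and converts a hypothetical parallel-text row into the T2T layout. The dictionary keys (e.g., text_multi), the helper names, and the choice to store id as a string are illustrative assumptions and do not reproduce the SEACrowd Data Hub implementation.

```python
from typing import Any, Dict, List

# Field names follow the schema descriptions above; the keys are hypothetical labels.
SEACROWD_FIELDS: Dict[str, List[str]] = {
    "ssp": ["id", "text"],
    "text": ["id", "text", "label"],
    "text_multi": ["id", "text", "labels"],
    "t2t": ["id", "text_1", "text_2", "text_1_name", "text_2_name"],
    "seq_label": ["id", "tokens", "labels"],
    "pairs": ["id", "text_1", "text_2", "label"],
}


def to_t2t(row_id: int, source_row: Dict[str, str],
           src_key: str, tgt_key: str) -> Dict[str, Any]:
    """Map a source-style row with two text columns onto the T2T fields."""
    return {
        "id": str(row_id),  # assumption: ids are stored as strings
        "text_1": source_row[src_key],
        "text_2": source_row[tgt_key],
        "text_1_name": src_key,
        "text_2_name": tgt_key,
    }


def conforms(record: Dict[str, Any], schema: str) -> bool:
    """Check that a record has exactly the fields listed for the schema."""
    return set(record) == set(SEACROWD_FIELDS[schema])


# Usage: a hypothetical Indonesian-Javanese translation pair.
source_row = {"ind": "Selamat pagi", "jav": "Sugeng enjing"}
record = to_t2t(0, source_row, "ind", "jav")
```

Because every T2T record shares the same five fields, evaluation code can consume translation, summarization, and paraphrasing datasets without dataset-specific handling.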

Subset ID Language Region # Samples
Image Captioning → *_seacrowd_imtext
xm3600_fil fil Philippines 2760
xm3600_id ind Indonesia 2775
xm3600_th tha Thailand 2798
xm3600_vi vie Vietnam 2855
Table 12: Image captioning data subsets used in SEACrowd VL evaluation.

F.2 Speech

  • Speech-text (SPTEXT). This could be used for speech recognition, text-to-speech (TTS) or speech synthesis, and speech-to-text translation. It consists of (id, path, audio, text, speaker_id, metadata), where id denotes a unique row identifier of the dataset, path denotes the file path to an input audio source, audio denotes the audio data loaded from the corresponding path, text denotes an input text, speaker_id denotes a unique identifier of the speaker, and metadata denotes relevant details such as the age and gender of the speaker (if required).

  • Speech-to-speech (S2S). This could be used for speech-to-speech translation. It consists of (id, path_1, audio_1, text_1, metadata_1, path_2, audio_2, text_2, metadata_2), where id denotes a unique row identifier of the dataset, path_1 and path_2 denote the file path to a respective input audio source, audio_1 and audio_2 denote the audio data loaded from the corresponding path, text_1 and text_2 denote input texts, and metadata_1 and metadata_2 denote relevant details such as the age of the speaker and their gender (if required).

  • Speech Classification (SPEECH). This schema could be used for speech classification, speech-language identification, and speech-emotion recognition, for single-label use only. It consists of (id, path, audio, speaker_id, labels, metadata), where id denotes a unique row identifier of the dataset, path denotes the file path to an input audio source, audio denotes the audio data loaded from the corresponding path, speaker_id denotes a unique identifier of the speaker, labels denotes the label of that particular speech (single-label only), and metadata denotes relevant details such as the age and gender of the speaker (if required).

  • Speech Classification for Multilabel (SPEECH MULTILABEL). This schema could be used for speech classification, speech-language identification, and speech-emotion recognition, for multi-label use only. It consists of (id, path, audio, speaker_id, labels, metadata), where id denotes a unique row identifier of the dataset, path denotes the file path to an input audio source, audio denotes the audio data loaded from the corresponding path, speaker_id denotes a unique identifier of the speaker, labels denotes the sequence of labels of that particular speech (multi-label only), and metadata denotes relevant details such as the age and gender of the speaker (if required).
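
The SPEECH and SPEECH MULTILABEL schemas share the same fields and differ only in how many labels a row may carry. The sketch below illustrates that distinction. Representing labels as a list in both cases, leaving audio unloaded, and the function name are assumptions made for illustration only.

```python
from typing import Any, Dict, List, Optional


def make_speech_record(row_id: int, path: str, speaker_id: str,
                       labels: List[str], multilabel: bool = False,
                       metadata: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a speech classification row with the six fields described above."""
    if not labels:
        raise ValueError("at least one label is required")
    if not multilabel and len(labels) != 1:
        # Assumption: the single-label schema carries exactly one label.
        raise ValueError("SPEECH rows must carry exactly one label")
    return {
        "id": str(row_id),
        "path": path,
        "audio": None,  # placeholder: a real loader would decode the file at `path`
        "speaker_id": speaker_id,
        "labels": list(labels),
        "metadata": metadata or {},
    }
```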

Subset ID Language Region # Samples
ASR → *_seacrowd_sptext
asr_ibsc iba Brunei 473
commonvoice_120_ind ind Indonesia 3647
commonvoice_120_tha tha Thailand 10964
commonvoice_120_cnh cnh Myanmar 763
commonvoice_120_vie vie Vietnam 1302
fleurs_ind ind Indonesia 687
fleurs_jav jav Indonesia 728
fleurs_tha tha Thailand 1021
fleurs_lao lao Laos 405
fleurs_mya mya Myanmar 880
fleurs_khm khm Cambodia 771
fleurs_vie vie Vietnam 857
fleurs_zlm zlm Malaysia 749
fleurs_fil fil Philippines 964
fleurs_ceb ceb Philippines 541
indspeech_newstra_ethnicsr_nooverlap_jav jav Indonesia 1000
indspeech_newstra_ethnicsr_nooverlap_sun sun Indonesia 1000
indspeech_newstra_ethnicsr_nooverlap_ban ban Indonesia 1000
indspeech_newstra_ethnicsr_nooverlap_btk btx Indonesia 1000
Table 13: ASR data subsets used in SEACrowd speech evaluation.
Subset ID (Eng → XX) Subset ID (XX → Eng) Language Region # Samples
MT (Eng ⇔ XX) → *_seacrowd_t2t
lio_and_central_flores_eng_ljl lio_and_central_flores_ljl_eng ljl Indonesia 1658
flores200_eng_Latn_ace_Latn flores200_ace_Latn_eng_Latn ace Indonesia 1012
flores200_eng_Latn_ban_Latn flores200_ban_Latn_eng_Latn ban Indonesia 1012
flores200_eng_Latn_bjn_Latn flores200_bjn_Latn_eng_Latn bjn Indonesia 1012
flores200_eng_Latn_bug_Latn flores200_bug_Latn_eng_Latn bug Indonesia 1012
flores200_eng_Latn_ceb_Latn flores200_ceb_Latn_eng_Latn ceb Philippines 1012
flores200_eng_Latn_ilo_Latn flores200_ilo_Latn_eng_Latn ilo Philippines 1012
flores200_eng_Latn_ind_Latn flores200_ind_Latn_eng_Latn ind Indonesia 1012
flores200_eng_Latn_jav_Latn flores200_jav_Latn_eng_Latn jav Indonesia 1012
flores200_eng_Latn_kac_Latn flores200_kac_Latn_eng_Latn kac Myanmar 1012
flores200_eng_Latn_khm_Khmr flores200_khm_Khmr_eng_Latn khm Cambodia 1012
flores200_eng_Latn_lao_Laoo flores200_lao_Laoo_eng_Latn lao Laos 1012
flores200_eng_Latn_lus_Latn flores200_lus_Latn_eng_Latn lus Myanmar 1012
flores200_eng_Latn_min_Latn flores200_min_Latn_eng_Latn min Indonesia 1012
flores200_eng_Latn_mya_Mymr flores200_mya_Mymr_eng_Latn mya Myanmar 1012
flores200_eng_Latn_pag_Latn flores200_pag_Latn_eng_Latn pag Philippines 1012
flores200_eng_Latn_shn_Mymr flores200_shn_Mymr_eng_Latn shn Myanmar 1012
flores200_eng_Latn_sun_Latn flores200_sun_Latn_eng_Latn sun Indonesia 1012
flores200_eng_Latn_tha_Thai flores200_tha_Thai_eng_Latn tha Thailand 1012
flores200_eng_Latn_vie_Latn flores200_vie_Latn_eng_Latn vie Vietnam 1012
flores200_eng_Latn_war_Latn flores200_war_Latn_eng_Latn war Philippines 1012
flores200_eng_Latn_zsm_Latn flores200_zsm_Latn_eng_Latn zsm Malaysia 1012
ntrex_128_eng-US_ind ntrex_128_ind_eng-US ind Indonesia 1997
ntrex_128_eng-US_mya ntrex_128_mya_eng-US mya Myanmar 1997
ntrex_128_eng-US_fil ntrex_128_fil_eng-US fil Philippines 1997
ntrex_128_eng-US_khm ntrex_128_khm_eng-US khm Cambodia 1997
ntrex_128_eng-US_lao ntrex_128_lao_eng-US lao Laos 1997
ntrex_128_eng-US_zlm ntrex_128_zlm_eng-US zsm Malaysia 1997
ntrex_128_eng-US_tha ntrex_128_tha_eng-US tha Thailand 1997
ntrex_128_eng-US_vie ntrex_128_vie_eng-US vie Vietnam 1997
ntrex_128_eng-US_hmv ntrex_128_hmv_eng-US hmv Vietnam 1997
nusax_mt_eng_ind - ind Indonesia 400
nusax_mt_eng_ace nusax_mt_ace_eng ace Indonesia 400
nusax_mt_eng_jav nusax_mt_jav_eng jav Indonesia 400
nusax_mt_eng_sun nusax_mt_sun_eng sun Indonesia 400
nusax_mt_eng_min nusax_mt_min_eng min Indonesia 400
nusax_mt_eng_bug nusax_mt_bug_eng bug Indonesia 400
nusax_mt_eng_bbc nusax_mt_bbc_eng bbc Indonesia 400
nusax_mt_eng_ban nusax_mt_ban_eng ban Indonesia 400
nusax_mt_eng_nij nusax_mt_nij_eng nij Indonesia 400
nusax_mt_eng_mad nusax_mt_mad_eng mad Indonesia 400
nusax_mt_eng_bjn nusax_mt_bjn_eng bjn Indonesia 400
Table 14: MT between English and SEA languages data subsets used in SEACrowd NLG evaluation.

F.3 VL

  • Image-text (IMTEXT). This schema could be used for image captioning, text-to-image generation, and vision-language pre-training. It consists of (id, text, image_paths, metadata), where id denotes a unique row identifier of the dataset, text denotes an input text, image_paths denotes a list of paths to the input image sources, and metadata denotes relevant details such as visual concepts and labels (if required).

  • General Image Classification (IMAGE). This schema could be used for image classification, both single-label and multi-label. It consists of (id, labels, image_path, metadata), where id denotes a unique row identifier of the dataset, labels denotes the label or labels of that particular image (single-label or multi-label), image_path denotes a list of paths to the input image sources, and metadata denotes relevant details such as visual concepts and labels (if required).

  • Image Question Answering (IMQA). This schema could be used for image/visual question answering. It consists of (id, question_id, document_id, questions, type, choices, context, answer, image_paths, meta), where id denotes a unique row identifier of the dataset, question_id denotes a unique identifier of the question, document_id denotes a unique identifier of the context document, questions denotes the input question to be answered, type denotes the type of the QA task (e.g., extractive, multiple-choice, open-generative, closed-generative, etc.), choices denotes a list of answer choices (if required), context denotes a passage that serves as the background information of the question (if required), answer denotes the gold answer to the question (if required), image_paths denotes a list of paths to the input image sources, and meta denotes relevant details to allow some flexibility of the schema (if required).

  • General Video-to-Text (VIDEO). This schema could be used for video-to-text retrieval and video captioning. It consists of (id, video_path, text, metadata), where id denotes a unique row identifier of the dataset, video_path denotes the file path to an input video source, text denotes the text associated with that particular frame/video, and metadata denotes relevant details such as the resolution, duration, and FPS of the video (if required).

Appendix G Supplementary Details for SEA Evaluation

Model τ = 0.01 τ = 0.2 τ = 0.5 τ = 0.7 τ = 1.0
Commercial
      GPT-4 0.199 0.192 0.155 0.118 0.066
      Command-R 0.201 0.198 0.185 0.168 0.126
English
      Mistral 0.161 0.160 0.159 0.162 0.150
      Llama3 0.138 0.137 0.131 0.129 0.113
      Falcon 0.274 0.272 0.238 0.250 0.211
Multilingual
      mT0 0.151 0.148 0.131 0.112 0.074
      BLOOMZ 0.238 0.236 0.228 0.217 0.167
      BactrianX-Llama 0.163 0.162 0.163 0.168 0.149
      AYA-23 0.183 0.182 0.183 0.179 0.135
      AYA-101 0.112 0.109 0.095 0.085 0.069
SEA regional
      SEA-LION 0.250 0.242 0.204 0.164 0.102
      SeaLLM v2.5 0.137 0.133 0.116 0.097 0.069
      Sailor 0.152 0.151 0.145 0.139 0.113
SEA country
      Cendol-mT5 0.407 0.404 0.378 0.328 0.200
      Cendol-Llama2 0.294 0.290 0.267 0.232 0.149
      Merak v4 0.209 0.207 0.199 0.190 0.155
      WangchanX-Llama3 0.163 0.161 0.153 0.150 0.131
      Malaysian Llama3 0.181 0.181 0.179 0.176 0.143
Table 15: Language equity across baselines based on the Gini coefficient weighted by population with different τ values. A lower Gini coefficient means higher equity.

G.1 Datasets

Tables 5, 6, 7, 8, and 9 provide the details of the data subsets used in the NLU evaluation. The sentiment analysis datasets are from NusaX Winata et al. (2023), NusaTranslation Cahyawijaya et al. (2023b), SentiTaglish (https://huggingface.co/datasets/ccosme/SentiTaglishProductsAndServices), SmSA Purwarianti and Crisdayanti (2019), PRDECT-ID Sutoyo et al. (2022), code-mixed Indonesian-English sentiment Astuti et al. (2023), Karonese tweet sentiment Karo et al. (2022), Typhoon Yolanda sentiment Imperial et al. (2019), GKLMIP Khmer sentiment Jiang et al. (2022), the Wisesight sentiment corpus (https://github.com/PyThaiNLP/wisesight-sentiment), Filipino-Tagalog product reviews sentiment (https://github.com/EricEchemane/Filipino-Tagalog-Product-Reviews-Sentiment-Analysis), and multilabel sentiment of Indonesian mobile app reviews Riccosan and Saputra (2023).

The topic classification datasets are from NusaParagraph Cahyawijaya et al. (2023b), UIT-ViON Tran et al. (2021), SIB-200 Adelani et al. (2024), GKLMIP Khmer news Jiang et al. (2022), and Indonesian news Muzad and Rahutomo (2016). The natural language inference datasets are from IndoNLI Mahendra et al. (2021), WreTe Setya and Mahendra (2018), SNLI Indo Putra et al. (2024), MyXNLI (https://huggingface.co/datasets/akhtet/myXNLI), and XNLI Conneau et al. (2018). The commonsense reasoning datasets are from XStoryCloze Lin et al. (2022), IndoCloze Koto et al. (2022), and EMoTES-3K Catapang and Visperas (2023).

The open-domain QA datasets are from IndoMMLU Koto et al. (2023b), SeaEval Wang et al. (2023), M3Exam Zhang et al. (2023b), and Okapi Dac Lai et al. (2023). The cultural QA datasets are from COPAL-ID Wibowo et al. (2023), XCOPA Ponti et al. (2020), SeaEval Wang et al. (2023), and Multilingual Fig-QA Kabra et al. (2023). The reading comprehension dataset is from Belebele Bandarkar et al. (2023).

Tables 10, 11, and 14 provide the details of the data subsets used in the NLG evaluation. The summarization datasets are from LR-Sum Palen-Michel and Lignos (2023) and XL-Sum Hasan et al. (2021). The machine translation datasets are from the Lio and the Central Flores corpus Elias (2018), Flores-200 Costa-jussà et al. (2024), and NTREX-128 Federmann et al. (2022). The question answering datasets are from FacQA Purwarianti et al. (2007), QASiNa Rizqullah et al. (2023), MKQA Longpre et al. (2021), and the Open Thai Wikipedia QA dataset (https://zenodo.org/records/4539916).

Tables 12 and 13 provide the details of the data subsets used in the VL and speech evaluation. The image captioning dataset is from XM3600 Thapliyal et al. (2022). The speech recognition datasets are from the INDspeech NEWSTRA Ethnic collection Sani et al. (2012), ASR Iban Juan et al. (2015), FLEURS Conneau et al. (2022), and Common Voice Ardila et al. (2020).

G.2 Baselines

Tables 20, 21, and 22 report the details of the baseline models used in the SEACrowd evaluation (§3). For each baseline model, we provide the model size, the base model it originates from, the languages seen in its training corpora, and the URL where the model can be downloaded. This work does not aim to collect and evaluate every SEA-trained LLM available online, as this is computationally expensive. Rather, we select publicly available models to serve as baselines for evaluating foundational capabilities on SEA languages through benchmarking on NLU, NLG, speech, and vision tasks aggregated via SEACrowd.

Across the models listed in the tables, we prioritized diversity in terms of scale, openness, and coverage of SEA languages. In NLP tasks, we covered five language model families for the main experiments, namely commercial, English-only, multilingual, regional, and country-specific models. Instruction-tuned LLMs demonstrate the ability to generalize to unseen tasks Wei et al. (2021); Sanh et al. (2021); Ouyang et al. (2022). When these LLMs are based on a multilingual foundation, they have shown proficiency in generalizing across multiple languages Muennighoff et al. (2022); Adilazuarda et al. (2023); Zhang et al. (2023a). For NLU, we compute the weighted F1-score and obtain the answers via log-likelihood for open-source baselines or string matching for commercial baselines.
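
To make the NLU scoring concrete, the following minimal sketch picks the label with the highest log-likelihood and then computes a support-weighted F1-score. The function names and the toy scores are assumptions for illustration; they stand in for model outputs and are not the SEACrowd evaluation code.

```python
from collections import Counter
from typing import Dict, List


def pick_label(loglik: Dict[str, float]) -> str:
    """Return the candidate label whose continuation has the highest log-likelihood."""
    return max(loglik, key=loglik.get)


def weighted_f1(gold: List[str], pred: List[str]) -> float:
    """Per-class F1 averaged with weights proportional to each class's gold support."""
    support = Counter(gold)
    total = 0.0
    for label, n_gold in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        n_pred = sum(1 for p in pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_gold
    return total / len(gold)


# Toy log-likelihoods (made up) for each label choice of three examples.
scores = [
    {"positive": -1.2, "negative": -3.4},
    {"positive": -2.9, "negative": -0.7},
    {"positive": -0.5, "negative": -0.9},
]
gold = ["positive", "negative", "negative"]
pred = [pick_label(s) for s in scores]
score = weighted_f1(gold, pred)
```

Weighting by gold support means frequent classes dominate the average, which matters for the imbalanced label distributions common in sentiment and topic datasets.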

For the speech benchmark, only two model families are available: multilingual models and models fine-tuned on specific SEA languages. For vision tasks, we covered English-only and one multilingual model. These models utilize a visual backbone pre-trained on image-text alignment, e.g., CLIP Radford et al. (2021), to project image features into the input space of an existing pre-trained LM. In summary, we mostly explored open models readily accessible on HuggingFace but also included commercial models such as GPT-4 and Whisper V3 for performance benchmarking, reproducibility, and extension by future works.

Model Hyperparameter Value
Logistic Regression max_iter 100
Logistic Regression C np.linspace(0.001, 10, 100)
Naive Bayes alpha np.linspace(0.001, 1, 50)
Naive Bayes distribution MultinomialNB
SVM C 1
SVM kernel ["rbf", "linear"]
Table 16: Hyper-parameters of classical models for Translationese prediction through grid search.

G.3 Prompts

Tables 23, 24, and 25 describe the handwritten prompt templates used in NLU, NLG, and VL evaluation (§3). For all tasks, we used a zero-shot prompting procedure to serve as the baseline setup. Due to the task complexity and distribution of workload from volunteer contributors with available computing resources, we limited the experiment procedure for some setups to ensure the acquisition of results in line with target release dates. For NLU, we explored three prompt styles for each dataset from core tasks, including commonsense reasoning, question-answering, and NLI. For more challenging tasks requiring more intensive computing power such as NLG and VL, we used only one uniform prompt style, but we also explored prompts translated into SEA languages, i.e., Filipino, Indonesian, Thai, and Vietnamese for VL.
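
As a rough sketch of how such a template might be instantiated, the snippet below fills the bracketed placeholders of the first sentiment template in Table 23. The helper name, the comma-joined rendering of [OPTIONS], and the example sentence are assumptions; the source does not specify how the options are serialized.

```python
from typing import List

# First sentiment analysis template from Table 23.
TEMPLATE = (
    "Classify the sentiment of the text below.\n"
    "[INPUT] => Sentiment ([OPTIONS]): [LABEL_CHOICE]"
)


def fill_prompt(template: str, text: str, options: List[str], label_choice: str = "") -> str:
    """Substitute the bracketed placeholders of a prompt template.

    [INPUT] is replaced last so that bracketed strings inside the input text
    are never mistaken for placeholders.
    """
    return (
        template.replace("[LABEL_CHOICE]", label_choice)
        .replace("[OPTIONS]", ", ".join(options))  # comma-joined: an assumption
        .replace("[INPUT]", text)
    )


options = ["positive", "negative", "neutral"]
# One filled prompt per candidate label, e.g. for log-likelihood scoring.
candidates = [fill_prompt(TEMPLATE, "Makanannya enak sekali.", options, o) for o in options]
```

Leaving `label_choice` empty yields the open prompt that a generative baseline would complete.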

Model 3-label HT vs. MT-Nat MT vs. HT-Nat Nat vs. HT-MT
LR (TF-IDF) 39.73 53.03 56.01 75.20
LR (BoW) 45.63 55.90 61.39 75.60
NB (TF-IDF) 33.43 49.53 50.55 73.05
NB (BoW) 33.70 49.10 50.64 71.26
SVM (TF-IDF) 39.55 52.63 55.10 76.40
SVM (BoW) 46.84 56.85 61.40 75.65
mDeBERTa 51.51 64.77 59.16 79.08
Table 17: Results of translationese classifier (accuracy) averaged across languages.

G.4 Evaluation Results

Tables 26 and 27 report the NLU and NLG results per language, respectively.

G.5 Language Equity Results

Table 15 presents the language equity of the LLMs used in the evaluation, measured by the Gini coefficient under different values of τ, which sets how the number of speakers of each language is weighted in the calculation.
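
The exact formulation is not spelled out in this appendix. One plausible reading, sketched below, weights each language by its speaker population raised to the power τ and computes the Gini coefficient from the resulting weighted Lorenz curve. The weighting scheme and the function name are assumptions, not the authors' implementation.

```python
from typing import Sequence


def weighted_gini(scores: Sequence[float], populations: Sequence[float], tau: float) -> float:
    """Gini coefficient of per-language scores, each weighted by population ** tau.

    With tau close to 0 every language counts roughly equally; with tau = 1
    languages count in proportion to their number of speakers (assumed scheme).
    """
    weights = [p ** tau for p in populations]
    pairs = sorted(zip(scores, weights))          # ascending by score
    total_w = sum(w for _, w in pairs)
    total_sw = sum(s * w for s, w in pairs)
    if total_w == 0 or total_sw == 0:
        return 0.0
    cum_sw = 0.0
    area = 0.0
    for s, w in pairs:
        prev_sw = cum_sw
        cum_sw += s * w
        # Trapezoid under the Lorenz curve contributed by this language.
        area += (w / total_w) * (prev_sw + cum_sw) / total_sw
    return 1.0 - area
```

Under this definition, identical scores across all languages give a coefficient of 0 regardless of τ, which matches the caption's reading that a lower value means higher equity.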

Country Affiliation Origin
Indonesia 16 31
Malaysia 0 1
Philippines 3 7
Singapore 13 2
Thailand 1 2
Vietnam 0 1
Australia 1 0
Brazil/Sweden 0 1
Canada 1 0
China 2 8
Egypt 0 1
Germany 0 2
Hong Kong 2 0
India 0 1
Ireland 1 0
Japan 3 0
The Netherlands 0 1
UAE 5 0
UK 4 0
USA 9 1
Uzbekistan 0 2
Table 18: The demographics of the authors based on affiliation country and origin country.

Appendix H Supplementary Details for Translationese Classifier

H.1 Training & Evaluation Data

We manually select and validate the text collection method of each data subset for training and evaluating the translationese classifier, in Tables 28 and 29, respectively. This validation is done by checking the relevant publication, domain, and annotation method. If the texts in the data subsets are a product of machine or human translation, we regard them as translationese. We label data subsets with human-generated texts as natural data.

H.2 Experiments

We aim to assess the capability of ML models to differentiate between human-generated/natural samples (Nat), human-translated samples (HT), and machine-translated samples (MT). Our approach involves training classifiers using classical ML techniques and fine-tuning mDeBERTa models to enhance learning. Furthermore, we experiment by combining two label classes into one to evaluate the predictive difficulty of distinguishing between these labels. This analysis provides valuable insights into the relative similarity of the samples across these categories. The following section provides a comprehensive overview of our methodology for this study.
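
The label-merging step can be sketched as a simple relabeling. The setup names below follow the column headers of Table 17, while the dictionary layout and function name are assumptions for illustration.

```python
from typing import Dict, List

# Each binary setup maps the three original labels (Nat, HT, MT) onto two classes.
MERGES: Dict[str, Dict[str, str]] = {
    "HT vs. MT-Nat": {"HT": "HT", "MT": "MT-Nat", "Nat": "MT-Nat"},
    "MT vs. HT-Nat": {"MT": "MT", "HT": "HT-Nat", "Nat": "HT-Nat"},
    "Nat vs. HT-MT": {"Nat": "Nat", "HT": "HT-MT", "MT": "HT-MT"},
}


def merge_labels(labels: List[str], setup: str) -> List[str]:
    """Relabel 3-class annotations for one of the binary setups."""
    mapping = MERGES[setup]
    return [mapping[label] for label in labels]


merged = merge_labels(["Nat", "HT", "MT"], "Nat vs. HT-MT")
```

Under "Nat vs. HT-MT", both translated classes collapse into one, so the classifier only has to separate natural text from translationese.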

Classical ML

We use three classical machine learning methods, 1) Logistic Regression (LR), 2) Naive Bayes (NB), and 3) Support Vector Machine (SVM), each with two feature types: TF-IDF and bag-of-words (BoW). We run grid search over the hyper-parameters in Table 16 to find the best setting for each method on the validation set, and report the results on the test set in Table 17.
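
For a sense of the search size, the grid in Table 16 can be enumerated with the standard library alone. The `linspace` helper is a pure-Python stand-in for `np.linspace`, and the dictionary layout is an assumption made for this sketch rather than the authors' configuration format.

```python
from itertools import product
from typing import Dict, List


def linspace(start: float, stop: float, num: int) -> List[float]:
    """Pure-Python stand-in for np.linspace with the endpoint included."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


# Search space transcribed from Table 16.
GRID: Dict[str, Dict[str, list]] = {
    "Logistic Regression": {"max_iter": [100], "C": linspace(0.001, 10, 100)},
    "Naive Bayes": {"alpha": linspace(0.001, 1, 50), "distribution": ["MultinomialNB"]},
    "SVM": {"C": [1], "kernel": ["rbf", "linear"]},
}


def candidates(model: str) -> List[dict]:
    """Expand one model's grid into a list of concrete hyper-parameter settings."""
    names = list(GRID[model])
    value_lists = [GRID[model][name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


n_settings = {model: len(candidates(model)) for model in GRID}
```

Each setting would then be fitted once per feature type (TF-IDF and BoW) and scored on the validation set.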

Encoder LM

We explore fine-tuning an encoder-only LM to develop a translationese classifier. We use the mDeBERTa-v3 base model (https://huggingface.co/microsoft/mdeberta-v3-base) He et al. (2020, 2022), a multilingual encoder-only LM, as our backbone. We train the model with the AdamW Loshchilov and Hutter (2019) optimizer using a learning rate of 1e-5, a batch size of 256, and 500 warm-up steps for a maximum of 10 epochs. We apply early stopping with a patience of 3 epochs based on the validation accuracy. We show the results in Table 17.
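
The stopping rule can be sketched independently of any training framework. The snippet assumes that "early stopping of 3 epochs" means training halts after 3 consecutive epochs without an improvement in validation accuracy; the function name and the accuracy values are made up for illustration.

```python
from typing import List


def stop_epoch(val_acc: List[float], patience: int = 3, max_epochs: int = 10) -> int:
    """Return the 1-based epoch at which training ends under early stopping."""
    best = float("-inf")
    since_best = 0
    for epoch, acc in enumerate(val_acc[:max_epochs], start=1):
        if acc > best:
            best, since_best = acc, 0   # new best: reset the patience counter
        else:
            since_best += 1
            if since_best >= patience:
                return epoch            # no improvement for `patience` epochs
    return min(len(val_acc), max_epochs)


# Toy validation curve: the best epoch is 2, so training stops at epoch 5.
stopped_at = stop_epoch([0.60, 0.65, 0.64, 0.63, 0.62, 0.70])
```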

No. Name C. Points
1 Holy Lovenia 549
2 Samuel Cahyawijaya 480
3 Rahmad Mahendra 317
4 Salsabil Maulana Akbar 243
5 Lester James V. Miranda 234
6 Zheng-Xin Yong 164
7 Jennifer Santoso 164
8 Elyanah Aco 158
9 Akhdan Fadhilah 157
10 Jonibek Mansurov 132
11 Fajri Koto 121
12 Joseph Marvin Imperial 118
13 Ruochen Zhang 114
14 Genta Indra Winata 108
15 Onno P. Kampman 107
16 Joel Ruben Antony Moniz 93
17 Muhammad Ravi Shulthan Habibi 92
18 Frederikus Hudi 83
19 Sedrick Keh 81
20 Alham Fikri Aji 80
21 Railey Montalan 78
22 Peerat Limkonchotiwat 72
23 Ryan Ignatius 56
24 Joanito Agili Lopo 50
25 William Nixon 50
26 Börje F. Karlsson 49
27 James Jaya 48
28 Ryandito Diandaru 48
29 Yuze Gao 48
30 William Tjhi 46
31 Patrick Amadeus 46
32 Bin Wang 44
33 Jan Christian Blaise Cruz 43
34 Chenxi Whitehouse 36
35 Ivan Halim Parmonangan 36
36 Maria Khelli 36
37 Sebastian Ruder 35
38 Wenyu Zhang 34
39 Lucky Susanto 33
40 Reynard Adha Ryanda 32
41 Sonny Lazuardi Hermawan 30
42 Dan John Velasco 29
43 Muhammad Dehan Al Kautsar 29
44 Willy Fitra Hendria 29
45 Yasmin Moslem 29
46 Noah Flynn 28
47 Muhammad Farid Adilazuarda 27
48 Haochen Li 27
49 Johanes Lee 27
50 R. Damanhuri 27
51 Shuo Sun 27
52 Muhammad Reza Qorib 26
53 Amirbek Djanibekov 25
54 Wei Qi Leong 25
55 Quyet V. Do 24
56 Niklas Muennighoff 24
57 Tanrada Pansuwan 22
58 Ilham Firdausi Putra 21
59 Yan Xu 21
60 Ayu Purwarianti 20
61 Ngee Chia Tai 20
Table 19: Co-authors ordered by their amount of contribution points.

Appendix I Supplementary Details for SEA Language Prioritization

Based on the results of the global utility metric Blasi et al. (2022), we provide the top-20 SEA indigenous languages to be prioritized based on their demand (i.e., the number of SEA language speakers) and either current utility (Figure 10) or resource availability (Figure 11) (https://github.com/SEACrowd/globalutility). The current utility, also known as the model capability, is measured relative to the model performance on eng, whereas the resource availability is measured relative to 500, which is approximately the number of Korean-language datasets available on HuggingFace. Korean is chosen as the pivot because Joshi et al. (2020) consider it a higher-resource language than most.

Appendix J Contributor Demographics

Table 18 describes the geographical distribution of the authors in SEACrowd.

Appendix K Languages Under Study

Tables 30-48 present the list of SEA indigenous languages covered by SEACrowd. Information regarding the ISO 639-3 code, language name, region, and population is obtained from Eberhard et al. (2021); Hammarström et al. (2024); Project (2024); Dryer and Haspelmath (2013) and Wikipedia (https://www.wikipedia.org/).

Appendix L Amount of Contributions by Co-Authors

Table 19 provides a list of co-authors sorted by their amount of contributions in SEACrowd. The full details of their contributions can be seen in our contribution tracking.

Model name Model size Backbone Seen langs URL
Commercial
GPT-4 N/A GPT-4 N/A https://openai.com/index/gpt-4/. We used turbo-2024-04-09 for NLU and gpt-4o-2024-05-13 for NLG.
Command-R 36B Command-R 2 SEA langs (vie, ind), 22 non-SEA langs https://cohere.com/blog/command-r
English
Mistral 7B Mistral N/A mistralai/Mistral-7B-Instruct-v0.3
Llama3 8B Llama3 N/A meta-llama/Meta-Llama-3-8B-Instruct
Falcon 7B Falcon 0 SEA langs (mainly English) tiiuae/falcon-7b-instruct
Multilingual
mT0 3B mT5 2 SEA langs (vie, ind), 43 non-SEA langs bigscience/mt0-xl
BLOOMZ 7B BLOOM 2 SEA langs (vie, ind), 43 non-SEA langs bigscience/bloomz-3b
BactrianX-Llama 7B Llama 6 SEA langs (ind, vie, khm, mya, tha, tgl), 46 non-SEA langs MBZUAI/bactrian-x-llama-7b-merged
AYA-23 8B Command 2 SEA langs (ind, vie), 21 non-SEA langs CohereForAI/aya-23-8B
AYA-101 13B T5 9 SEA langs (ind, vie, tha, zsm, mya, ceb, fil, jav, sun), 92 non-SEA langs CohereForAI/aya-101
SEA regional
SEA-LION 7B MPT 8 SEA langs (ind, vie, tha, tgl, zsm, khm, lao, mya), 3 non-SEA langs aisingapore/sea-lion-7b-instruct
SeaLLM v2.5 7B SeaLLM 8 SEA langs (ind, vie, tha, tgl, zsm, khm, lao, mya) SeaLLMs/SeaLLM-7B-v2.5
Sailor 7B Qwen 1.5 5 SEA langs (ind, vie, lao, zlm, tha), 2 non-SEA langs sail/Sailor-7B-Chat
SEA country
Cendol-mT5 3B mT5 1 SEA lang (ind), 18 local Indonesian langs indonlp/cendol-mt5-xl
Cendol-Llama2 7B Llama2 1 SEA lang (ind), 18 local Indonesian langs indonlp/cendol-llama2-7b
Merak v4 7B Llama2 1 SEA lang (ind) Ichsan2895/Merak-7B-v4
WangchanX-Llama3 8B Llama3 4 SEA langs (ind, vie, tha, mya) and 26 non-SEA langs airesearch/LLaMa3-8b-WangchanX-sft-Demo
Malaysian Llama3 8B Llama3 1 SEA lang (zlm) mesolitica/malaysian-llama-3-8b-instruct-16k
Table 20: LLMs used in SEACrowd NLU and NLG evaluation.
Model name Model size Backbone Seen langs URL
Multilingual
Whisper v3 1.54B Whisper v3 89 non-SEA & 9 SEA (ind, jav, lao, zlm, mya, tgl, tha, sun, vie) openai/whisper-large-v3
MMS 1B 1B MMS 993 non-SEA & 205 SEA (abp, ace, acn, agn, ahk, akb, alj, alp, amk, aoz, atb, atq, ayz, ban, bbc, bcl, bdg, bdq, bep, bgr, bhz, bkd, blt, blx, blz, bno, bpr, bps, bru, btd, bts, btx, bvz, bzi, ceb, cek, cfm, cgc, cmr, cnh, ctd, dbj, dnt, dnw, dtp, eip, frd, gbi, gor, had, hap, hil, hlt, hnn, hvn, iba, ifa, ifb, ifk, ifu, ify, ilo, ind, itv, jav, jmd, kac, kak, kdt, khg, khm, kje, kjg, klw, kmd, kml, knb, kne, kpq, kps, kqe, kqr, krj, krr, kvw, kxf, kxm, kyb, kyo, kyu, kzf, lao, law, lbw, lcp, lew, lex, lhu, lis, lje, ljp, llg, lnd, lsi, mad, mak, mbb, mbt, mej, mhx, mhy, min, mkn, mnb, mnw, mnx, mog, mqf, mqj, mqn, mrw, mtd, mtj, mvp, mwq, mwv, mya, myl, nfa, nia, nij, nlc, nlk, nod, npy, nst, obo, pag, pam, pce, pez, plw, pmf, ppk, prf, prk, prt, pse, ptu, pww, raw, rej, rgu, rhg, ril, rol, saj, sas, sbl, sda, sea, sgb, shn, sjm, slu, sml, sne, suc, sun, sxn, sya, sza, tbk, tbl, tby, tcz, tdj, tes, tgl, tha, tih, tlb, tnt, tom, tvw, twb, twe, twu, txa, txq, ubl, urk, ury, vie, war, wlo, xdy, xmm, xsb, xte, yka, yli, yva, zlm, zyp) facebook/mms-1b-all
Seamless M4T v2 2.3B Seamless 83 non-SEA & 9 SEA (ind, jav, khm, lao, mya, tgl, tha, vie, zlm) facebook/seamless-m4t-v2-large
Fine-tuned on specific language(s)
XLSR English 300M Wav2Vec2 46 non-SEA & 7 SEA (ceb, cnh, ind, lao, tam, tgl, vie) & fine-tuning language(s) jonatasgrosman/wav2vec2-large-xlsr-53-english
XLSR Ind-Jav-Sun indonesian-nlp/wav2vec2-indonesian-javanese-sundanese
XLSR Indonesian Galuh/wav2vec2-large-xlsr-indonesian
XLSR Thai wannaphong/wav2vec2-large-xlsr-53-th-cv8-newmm
XLS-R Tagalog sil-ai/wav2vec2-bloom-speech-tgl
XLS-R Burmese sil-ai/wav2vec2-bloom-speech-mya
XLS-R Khmer vitouphy/wav2vec2-xls-r-300m-khmer
Whisper Indonesian 1.54B Whisper 89 non-SEA & 9 SEA (ind, jav, lao, msa, mya, tgl, tha, sun, vie) cahya/whisper-large-id
Whisper Thai biodatlab/whisper-th-large-v3-combined
Whisper Khmer ksoky/whisper-large-khmer-asr
Table 21: Speech models used in SEACrowd speech evaluation.
Model name Model size Backbone Pre-training images URL
English
LLaVA 1.5 N/A N/A N/A N/A
LLaVA 1.6 7B Mistral-7B N/A liuhaotian/llava-v1.6-mistral-7b
Idefics2 8B Mistral-7B-v0.1 1.5B HuggingFaceM4/idefics2-8b
PaliGemma 2B Gemma-2B N/A google/paligemma-3b-pt-224
Multilingual
mBLIP N/A blip2-flan-t5-xl N/A Gregor/mblip-mt0-xl
Table 22: VLMs used in SEACrowd VL evaluation.
No. Prompt template
Sentiment Analysis
1 Classify the sentiment of the text below.\n[INPUT] => Sentiment ([OPTIONS]): [LABEL_CHOICE]
2 Predict the sentiment of the following text.\nText: [INPUT]\nAnswer with [OPTIONS]: [LABEL_CHOICE]
3 [INPUT]\nWhat would be the sentiment of the text above? [OPTIONS]? [LABEL_CHOICE]
Topic Classification
1 Classify the topic of the text below.\n[INPUT] => Topic ([OPTIONS]): [LABEL_CHOICE]
2 Predict the topic of the following text.\nText: [INPUT]\nAnswer with [OPTIONS]: [LABEL_CHOICE]
3 [INPUT]\nWhat would be the topic of the text above? [OPTIONS]? [LABEL_CHOICE]
Commonsense Reasoning → *_seacrowd_text
1 Classify the morality of the text below.\n[INPUT] => Morality ([OPTIONS]): [LABEL_CHOICE]
2 Predict the morality of the following text.\nText: [INPUT]\nAnswer with [OPTIONS]: [LABEL_CHOICE]
3 [INPUT]\nWhat would be the morality of the text above? [OPTIONS]? [LABEL_CHOICE]
Commonsense Reasoning → *_seacrowd_qa
1 Question: [QUESTION]\nWhat reply makes more sense to answer this question?\nChoices: [ANSWER_CHOICES]\nAnswer: [LABEL_CHOICE]
2 Based on the following question: "[QUESTION]" and choices: [ANSWER_CHOICES] the correct answer is: [LABEL_CHOICE]
3 Question: [QUESTION]\nChoices: [ANSWER_CHOICES]\nThe correct answer to the given question is: [LABEL_CHOICE]
All QAs
1 Refer to the passage below and answer the following question:\nPassage: [CONTEXT]\nQuestion: [QUESTION]\nChoices: [ANSWER_CHOICES]\nAnswer: [LABEL_CHOICE]
2 [CONTEXT]\nBased on the above text, [QUESTION]\nChoices: [ANSWER_CHOICES]\nAnswer: [LABEL_CHOICE]
3 [CONTEXT]\nQuestion: [QUESTION]\nChoices: [ANSWER_CHOICES]\nReferring to the passage above, the correct answer to the given question is: [LABEL_CHOICE]
NLI
1 Hypothesis: [INPUT_A]\nPremise: [INPUT_B]\nQuestion: What is the relation between the hypothesis and the premise? [OPTIONS]? [LABEL_CHOICE]
2 Given the following premise and hypothesis:\nHypothesis: [INPUT_A]\nPremise: [INPUT_B]\nDetermine the logical relationship (([OPTIONS])): [LABEL_CHOICE]
3 Choose the most appropriate relationship ([OPTIONS]) between the premise and hypothesis:\nRelationship between "[INPUT_B]" and "[INPUT_A]": [LABEL_CHOICE]
Table 23: Prompt templates used for NLU tasks.
No. Prompt template
Machine Translation (MT)
1 Translate the following text from [SOURCE] to [TARGET]. Give your translation directly.\nText: [INPUT]\nTranslation:
Summarization
1 Write a summary from the following text.\nText: [INPUT]\nSummary:
Abstractive & Extractive QA
1 Refer to the passage below and answer the following question:\nPassage: [CONTEXT]\nQuestion: [QUESTION]\nAnswer:
Table 24: Prompt templates used for NLG tasks.
Lang. Prompt template
Image Captioning
eng Caption the following image in [LANGUAGE].
fil Ilarawan ang sumusunod na larawan.
ind Deskripsikan gambar berikut.
Table 25: Prompt templates used for the image captioning task in VL evaluation.
abl abs ace ban bbc bew bhp bjn btx bug ceb eng fil ilo ind jav kac khm lao lus mad mak min mui mya nij pag rej shn sun tha vie war zsm Overall
GPT-4 63.3 39.0 39.3 60.3 7.1 68.5 2.8 60.4 27.8 40.4 85.6 52.1 55.9 69.5 60.7 59.7 30.8 66.4 51.8 70.0 37.1 44.3 57.9 71.8 47.6 40.2 79.4 34.0 21.7 58.5 59.6 56.1 84.9 61.6 51.9
Command-R 50.1 80.8 57.6 62.8 47.4 81.8 58.2 57.1 57.3 57.9 66.7 69.4 51.1 56.8 58.3 61.2 36.5 41.5 33.8 63.9 61.9 58.4 66.4 81.7 34.8 53.3 75.6 69.6 35.4 63.2 42.7 55.9 67.6 55.7 58.0
Mistral 36.7 53.6 46.4 49.6 33.0 59.3 44.3 44.6 44.3 48.8 53.5 69.2 48.4 49.1 52.5 46.7 33.2 29.8 30.7 56.1 45.7 44.8 51.2 62.6 27.4 40.1 69.2 48.6 31.9 48.3 40.8 45.2 54.4 49.6 46.8
Llama3 37.3 40.3 43.2 48.9 34.8 44.5 32.6 42.2 38.5 42.9 51.2 59.5 45.2 46.7 49.2 44.4 28.5 34.6 30.3 46.8 39.0 38.0 43.6 49.2 35.2 39.6 60.5 38.5 31.1 45.2 43.8 45.5 50.3 49.0 42.6
Falcon 21.1 63.2 13.3 19.0 23.0 37.9 62.1 15.6 31.9 15.7 19.5 43.7 25.1 18.8 30.8 27.0 14.2 10.2 12.7 15.0 30.3 32.3 23.6 37.0 18.0 23.0 18.8 36.0 14.1 28.2 15.9 18.8 19.1 17.4 25.1
mT0 37.6 63.6 43.7 51.2 37.0 66.1 38.4 43.6 41.3 50.3 62.5 49.4 41.0 59.0 47.2 56.0 40.9 57.5 61.2 57.0 46.7 45.8 52.6 68.8 45.9 40.9 62.6 47.8 47.0 58.8 41.8 41.4 61.4 49.4 50.5
BLOOMZ 25.6 66.5 28.4 34.2 35.8 53.9 48.0 30.4 36.3 33.3 30.9 51.7 28.9 27.8 44.7 38.2 23.1 18.9 23.6 28.1 37.8 34.5 39.9 60.2 23.0 34.6 33.1 42.2 19.8 41.3 25.9 34.8 32.1 34.3 35.3
BactrianX-Llama 24.9 48.6 21.2 28.5 26.9 33.4 45.9 22.8 31.4 22.7 27.9 45.6 32.0 24.3 38.3 30.0 19.9 17.0 20.7 21.0 30.0 28.8 26.2 35.7 22.8 27.2 26.5 29.2 20.5 30.2 24.5 27.1 28.3 31.5 28.6
AYA-23 43.3 21.2 26.9 35.0 24.3 31.2 16.8 30.9 25.1 26.5 36.0 50.8 33.5 32.7 46.8 36.9 20.5 15.1 22.0 27.4 31.0 31.7 27.3 35.5 23.7 37.3 32.6 22.8 20.8 34.9 32.7 44.8 37.1 47.9 31.3
AYA-101 42.5 64.3 71.2 65.2 58.8 68.2 43.3 63.5 52.7 60.7 71.7 62.8 52.8 65.0 54.2 62.6 43.1 62.2 67.8 71.8 56.9 49.0 69.3 70.2 51.5 57.2 75.7 52.9 53.8 67.2 49.5 48.0 70.5 56.4 59.8
SEA-LION 10.3 62.3 13.5 16.5 21.3 35.3 60.3 13.4 31.8 15.2 13.6 26.6 20.6 10.2 27.6 21.4 8.7 16.8 15.2 12.5 26.8 28.3 22.8 34.6 23.0 16.0 14.4 34.1 9.7 23.4 16.3 14.7 14.2 13.3 21.9
SeaLLM v2.5 50.7 55.1 34.5 43.4 36.3 53.9 53.2 45.8 45.8 37.7 47.6 42.5 52.6 44.7 53.4 49.8 27.4 42.6 50.3 45.8 48.7 49.8 46.8 58.4 41.0 39.1 55.7 47.8 28.7 50.1 49.0 54.5 55.4 60.6 47.0
Sailor 50.4 59.2 43.8 55.5 44.1 61.5 43.9 50.5 44.8 45.7 45.6 63.0 40.2 45.0 51.3 53.1 29.9 32.7 53.9 53.9 47.6 46.5 52.8 63.9 28.1 52.7 59.3 42.2 26.7 54.0 46.3 47.7 49.2 52.1 48.1
Cendol-mT5 15.0 98.5 38.3 42.3 84.7 99.4 95.6 33.3 92.6 68.6 14.1 38.7 23.8 12.2 33.4 50.5 10.4 20.3 15.3 9.6 76.5 70.2 65.2 99.6 16.6 52.6 12.8 98.9 7.2 56.6 26.4 14.7 15.1 15.9 44.8
Cendol-Llama2 17.5 80.0 30.8 33.5 60.6 49.3 73.4 27.9 45.1 32.3 18.7 36.8 21.4 17.8 37.4 35.1 14.7 13.2 15.9 15.0 46.3 38.1 37.1 51.6 19.9 40.3 17.7 47.7 16.5 38.5 20.6 17.3 18.5 18.4 32.5
Merak 37.0 68.6 37.7 48.3 36.4 66.1 60.1 41.4 50.4 47.8 42.4 59.6 37.9 39.7 48.5 48.4 27.9 24.2 28.0 44.3 51.7 51.0 50.5 70.3 27.2 40.0 58.6 57.9 28.6 50.8 29.3 35.3 43.7 47.1 45.2
WangchanX-Llama3 38.4 59.3 26.8 35.2 35.0 43.3 56.9 31.6 38.3 31.2 32.3 57.6 36.6 29.3 45.0 38.7 23.7 24.3 25.1 26.6 40.4 41.4 34.8 43.6 31.6 37.0 31.2 42.9 23.5 39.8 36.5 38.4 31.3 37.0 36.6
Malaysian Llama3 38.9 62.3 38.1 41.9 39.2 46.9 58.3 39.5 40.5 35.9 37.8 55.5 34.5 33.1 48.6 42.6 24.7 18.9 20.4 33.6 42.1 41.0 42.5 48.5 22.2 39.6 46.8 41.1 19.6 44.0 33.7 34.6 37.7 49.9 39.2
Overall 35.6 60.4 36.4 42.9 38.1 55.6 49.7 38.6 43.1 39.7 42.1 51.9 37.9 37.9 46.0 44.6 25.5 30.3 32.1 38.8 44.3 43.0 45.0 58.0 30.0 39.5 46.1 46.4 25.4 46.3 35.3 37.5 42.8 41.5 41.4
Table 26: NLU evaluation results in weighted F1-score per language.
ace ban bbc bjn bug ceb fil hmv ilo ind jav kac khm lao ljl lus mad min mya nij pag shn sun tha vie war zsm Overall
GPT-4 5.8 6.0 7.4 4.7 5.6 13.7 9.5 8.5 14.2 3.7 6.8 7.4 2.7 3.4 2.7 11.3 3.7 6.3 2.8 4.2 10.4 3.0 6.1 2.1 10.0 13.2 5.0 6.7
Command-R 19.6 26.1 16.4 30.0 16.0 44.3 52.5 16.8 29.4 57.9 32.6 8.8 8.7 14.2 6.0 19.5 17.2 31.6 9.5 18.4 20.4 8.9 27.5 24.3 46.8 34.4 50.1 25.5
Mistral 12.4 15.0 10.0 13.9 11.1 28.5 37.2 10.2 15.9 28.6 15.4 7.3 8.7 10.8 4.2 11.7 9.5 18.0 5.7 12.4 17.5 9.5 14.8 15.1 25.1 22.4 31.1 15.6
Llama3 11.0 12.3 8.1 13.8 7.6 25.1 33.2 7.6 18.4 21.9 17.0 4.8 6.5 5.8 3.2 9.6 8.5 16.4 4.5 9.5 11.8 6.3 15.1 9.6 21.7 20.5 25.2 13.2
Falcon 7.3 9.5 8.2 8.3 7.9 18.6 23.6 6.6 9.7 15.3 7.7 6.0 3.1 3.1 4.2 9.3 6.6 11.8 1.8 8.7 12.9 4.5 7.7 2.4 13.5 13.5 17.0 9.2
mT0 4.8 5.6 3.7 5.7 3.1 4.6 6.8 4.5 3.8 29.3 5.8 2.1 4.3 6.1 1.7 3.4 3.6 6.5 5.0 3.5 3.6 3.5 6.8 9.4 19.6 6.1 9.1 6.4
BLOOMZ 3.8 4.6 2.8 5.3 2.9 4.1 5.1 3.4 4.2 32.3 4.9 3.0 1.5 2.4 1.5 4.0 2.7 5.7 1.2 3.2 4.9 2.6 4.6 3.3 24.1 5.4 10.1 5.7
BactrianX-Llama 10.9 11.6 8.9 12.3 8.8 22.0 32.1 8.5 12.1 25.1 11.4 6.9 6.4 8.2 4.1 10.9 8.7 14.1 4.3 8.4 15.2 8.0 11.4 10.8 19.4 16.6 23.4 12.6
AYA-23 9.3 10.5 8.0 11.6 6.9 14.2 17.5 5.6 8.3 18.3 11.3 5.7 4.0 5.9 2.7 8.1 7.6 12.2 3.3 9.0 8.8 6.5 10.4 6.8 24.3 10.6 17.7 9.8
AYA-101 26.4 26.8 14.6 21.6 12.6 49.3 46.6 33.3 25.8 49.5 38.8 12.2 25.9 37.2 4.4 17.8 13.4 29.7 17.6 13.2 23.3 20.4 35.6 22.2 36.5 36.9 41.9 27.2
SEA-LION 7.2 8.1 6.5 9.3 5.8 12.5 17.1 4.9 7.0 13.9 7.9 5.3 7.0 9.6 2.0 7.6 6.0 9.5 4.8 6.6 8.4 4.9 8.0 5.9 21.2 10.3 14.1 8.6
SeaLLM v2.5 15.2 20.2 11.7 19.5 11.5 37.1 49.1 14.5 26.8 43.0 26.6 7.5 17.8 22.2 4.7 15.1 12.2 26.8 9.2 14.6 19.2 9.4 22.0 21.6 36.7 28.8 45.7 21.8
Sailor 19.2 24.5 15.3 23.1 14.6 29.0 39.7 8.6 13.5 46.8 30.6 7.1 12.5 24.4 6.2 10.5 16.0 28.8 5.8 19.1 16.5 9.0 26.7 22.0 41.1 21.5 49.9 21.6
Cendol-mT5 8.3 11.4 14.2 11.6 6.9 7.2 8.4 4.7 5.5 35.8 17.5 4.0 6.3 8.5 2.0 5.2 6.1 10.5 2.9 8.8 6.6 4.1 17.1 5.5 4.4 6.4 20.5 9.3
Cendol-Llama2 8.6 10.0 14.4 19.3 6.6 6.9 8.2 6.4 6.4 36.1 19.1 5.5 3.0 4.3 4.1 4.5 14.1 22.0 1.9 17.5 5.4 4.8 17.3 3.4 8.1 7.6 22.0 10.6
Merak 7.4 10.3 6.7 11.3 7.1 8.2 12.8 6.3 6.7 29.5 9.6 3.7 3.8 5.9 3.2 8.0 6.5 12.5 2.4 8.0 8.2 5.6 10.6 5.9 7.2 7.4 20.4 8.7
WangchanX-Llama3 19.8 24.4 14.3 28.9 13.4 42.2 48.6 12.7 29.4 50.1 29.4 7.7 18.1 19.7 6.0 17.6 15.6 30.0 10.4 18.1 22.4 13.9 28.0 25.1 39.2 35.5 45.4 24.7
Malaysian Llama3 15.2 17.3 12.3 22.2 11.1 19.7 24.0 8.7 12.6 38.6 19.4 7.2 6.7 9.0 5.9 10.6 12.4 23.5 4.2 14.3 13.9 8.3 19.0 14.2 17.3 15.6 44.4 15.8
Overall 11.8 14.1 10.2 15.1 8.9 21.5 26.2 9.5 13.9 32.0 17.3 6.2 8.2 11.2 3.8 10.3 9.5 17.6 5.4 11.0 12.8 7.4 16.0 11.7 23.1 17.4 27.4 14.1
Table 27: NLG evaluation results in ROUGE-L per language.
Lang. Subset Original Task Domain # Samples
Translationese
eng emotes_3k_eng_seacrowd_t2t Commonsense Reasoning Ethics 2000
eng aya_evaluation_suite_eng_seacrowd_t2t Instruction Tuning General 400
ind belebele_ind_latn_seacrowd_qa QA General 1969
ind parallel_asian_treebank_ind_eng_seacrowd_t2t Machine Translation News 31
ind aya_evaluation_suite_ind_seacrowd_t2t Instruction Tuning General 4
ind bactrian_x_id_seacrowd_t2t Instruction Tuning Mixed, Multi-domain, Wikipedia 1972
ind seaeval_cross_logiqa_ind_seacrowd_qa Commonsense Reasoning, QA Commentary, General, Multi-domain, Culture & heritage 16
ind seaeval_cross_mmlu_ind_seacrowd_qa Commonsense Reasoning, QA Commentary, General, Multi-domain, Culture & heritage 8
khm belebele_khm_khmr_seacrowd_qa QA General 399
khm khmer_alt_pos_seacrowd_seq_label POS Tagging News 1595
khm parallel_asian_treebank_khm_eng_seacrowd_t2t Machine Translation News 6
khm aya_evaluation_suite_khm_seacrowd_t2t Instruction Tuning General 8
khm bactrian_x_km_seacrowd_t2t Instruction Tuning Mixed, Multi-domain, Wikipedia 1992
lao belebele_lao_laoo_seacrowd_qa QA General 1969
lao parallel_asian_treebank_lao_eng_seacrowd_t2t Machine Translation News 31
lao aya_evaluation_suite_lao_seacrowd_t2t Instruction Tuning General 400
mya belebele_mya_mymr_seacrowd_qa QA General 1969
mya parallel_asian_treebank_mya_eng_seacrowd_t2t Machine Translation News 31
mya aya_evaluation_suite_mya_seacrowd_t2t Instruction Tuning General 8
mya bactrian_x_my_seacrowd_t2t Instruction Tuning Mixed, Multi-domain, Wikipedia 1992
fil belebele_tgl_latn_seacrowd_qa QA General 2000
fil bactrian_x_tl_seacrowd_t2t Instruction Tuning Mixed, Multi-domain, Wikipedia 2000
tha belebele_tha_thai_seacrowd_qa QA General 1969
tha parallel_asian_treebank_tha_eng_seacrowd_t2t Machine Translation News 31
tha aya_evaluation_suite_tha_seacrowd_t2t Instruction Tuning General 8
tha bactrian_x_th_seacrowd_t2t Instruction Tuning Mixed, Multi-domain, Wikipedia 1992
vie belebele_vie_latn_seacrowd_qa QA General 1969
vie parallel_asian_treebank_vie_eng_seacrowd_t2t Machine Translation News 31
vie aya_evaluation_suite_vie_seacrowd_t2t Instruction Tuning General 4
vie bactrian_x_vi_seacrowd_t2t Instruction Tuning Mixed, Multi-domain, Wikipedia 1972
vie seaeval_cross_logiqa_vie_seacrowd_qa Commonsense Reasoning, QA Commentary, General, Multi-domain, Culture & heritage 16
vie seaeval_cross_mmlu_vie_seacrowd_qa Commonsense Reasoning, QA Commentary, General, Multi-domain, Culture & heritage 8
zlm belebele_zsm_latn_seacrowd_qa QA General 1969
zlm parallel_asian_treebank_zlm_eng_seacrowd_t2t Machine Translation News 31
zlm aya_evaluation_suite_zsm_seacrowd_t2t Instruction Tuning General 400
zlm seaeval_cross_logiqa_zlm_seacrowd_qa Commonsense Reasoning, QA Commentary, General, Multi-domain, Culture & heritage 1056
zlm seaeval_cross_mmlu_zlm_seacrowd_qa Commonsense Reasoning, QA Commentary, General, Multi-domain, Culture & heritage 300
Natural
eng cosem_seacrowd_ssp Language Modeling Social media 2000
ind sea_bench_ind_seacrowd_t2t Instruction Tuning Commentary, General, Multi-domain, Culture & heritage 200
khm gklmip_newsclass_seacrowd_text Sentiment Analysis E-commerce 1436
khm sea_bench_khm_seacrowd_t2t Instruction Tuning Commentary, General, Multi-domain, Culture & heritage 160
lao sea_bench_lao_seacrowd_t2t Instruction Tuning Commentary, General, Multi-domain, Culture & heritage 160
mya gklmip_sentiment_seacrowd_text Sentiment Analysis E-commerce 716
mya sea_bench_mya_seacrowd_t2t Instruction Tuning Commentary, General, Multi-domain, Culture & heritage 160
fil sea_bench_tgl_seacrowd_t2t Instruction Tuning Commentary, General, Multi-domain, Culture & heritage 160
tha sea_bench_tha_seacrowd_t2t Instruction Tuning Commentary, General, Multi-domain, Culture & heritage 40
tha vistec_tp_th_21_seacrowd_seq_label NER Social media 1960
vie sea_bench_vie_seacrowd_t2t Instruction Tuning Commentary, General, Multi-domain, Culture & heritage 200
zlm sea_bench_zlm_seacrowd_t2t Instruction Tuning Commentary, General, Multi-domain, Culture & heritage 160
Table 28: Train data used in the translationese classifier experiment.
Lang. Subset Original Task Domain # Samples
Translationese
eng emotes_3k_eng_seacrowd_t2t Commonsense Reasoning Ethics 2000
eng aya_evaluation_suite_eng_seacrowd_t2t Instruction Tuning General 400
ind belebele_ind_latn_seacrowd_qa QA General 1969
ind parallel_asian_treebank_ind_eng_seacrowd_t2t MT News 31
ind aya_evaluation_suite_ind_seacrowd_t2t Instruction Tuning General 4
ind bactrian_x_id_seacrowd_t2t Instruction Tuning Mixed, Multi-domain, Wikipedia 1972
ind seaeval_cross_logiqa_ind_seacrowd_qa Commonsense Reasoning, QA Commentary, General, Multi-domain, Culture & heritage 16
ind seaeval_cross_mmlu_ind_seacrowd_qa Commonsense Reasoning, QA Commentary, General, Multi-domain, Culture & heritage 8
khm belebele_khm_khmr_seacrowd_qa QA General 399
khm khmer_alt_pos_seacrowd_seq_label POS Tagging News 1595
khm parallel_asian_treebank_khm_eng_seacrowd_t2t MT News 6
khm aya_evaluation_suite_khm_seacrowd_t2t Instruction Tuning General 8
khm bactrian_x_km_seacrowd_t2t Instruction Tuning Mixed, Multi-domain, Wikipedia 1992
lao belebele_lao_laoo_seacrowd_qa QA General 1969
lao parallel_asian_treebank_lao_eng_seacrowd_t2t MT News 31
lao aya_evaluation_suite_lao_seacrowd_t2t Instruction Tuning General 400
mya belebele_mya_mymr_seacrowd_qa QA General 1969
mya parallel_asian_treebank_mya_eng_seacrowd_t2t MT News 31
mya aya_evaluation_suite_mya_seacrowd_t2t Instruction Tuning General 8
mya bactrian_x_my_seacrowd_t2t Instruction Tuning Mixed, Multi-domain, Wikipedia 1992
fil belebele_tgl_latn_seacrowd_qa QA General 2000
fil bactrian_x_tl_seacrowd_t2t Instruction Tuning Mixed, Multi-domain, Wikipedia 2000
tha belebele_tha_thai_seacrowd_qa QA General 1969
tha parallel_asian_treebank_tha_eng_seacrowd_t2t MT News 31
tha aya_evaluation_suite_tha_seacrowd_t2t Instruction Tuning General 8
tha bactrian_x_th_seacrowd_t2t Instruction Tuning Mixed, Multi-domain, Wikipedia 1992
vie belebele_vie_latn_seacrowd_qa QA General 1969
vie parallel_asian_treebank_vie_eng_seacrowd_t2t MT News 31
vie aya_evaluation_suite_vie_seacrowd_t2t Instruction Tuning General 4
vie bactrian_x_vi_seacrowd_t2t Instruction Tuning Mixed, Multi-domain, Wikipedia 1972
vie seaeval_cross_logiqa_vie_seacrowd_qa Commonsense Reasoning, QA Commentary, General, Multi-domain, Culture & heritage 16
vie seaeval_cross_mmlu_vie_seacrowd_qa Commonsense Reasoning, QA Commentary, General, Multi-domain, Culture & heritage 8
zlm belebele_zsm_latn_seacrowd_qa QA General 1969
zlm parallel_asian_treebank_zlm_eng_seacrowd_t2t MT News 31
zlm aya_evaluation_suite_zsm_seacrowd_t2t Instruction Tuning General 400
zlm seaeval_cross_logiqa_zlm_seacrowd_qa Commonsense Reasoning, QA Commentary, General, Multi-domain, Culture & heritage 1056
zlm seaeval_cross_mmlu_zlm_seacrowd_qa Commonsense Reasoning, QA Commentary, General, Multi-domain, Culture & heritage 300
Natural
eng cosem_seacrowd_ssp Language Modeling Social media 2000
ind sea_bench_ind_seacrowd_t2t Instruction Tuning Commentary, General, Multi-domain, Culture & heritage 200
khm gklmip_newsclass_seacrowd_text Sentiment Analysis E-commerce 1436
khm sea_bench_khm_seacrowd_t2t Instruction Tuning Commentary, General, Multi-domain, Culture & heritage 160
lao sea_bench_lao_seacrowd_t2t Instruction Tuning Commentary, General, Multi-domain, Culture & heritage 160
mya gklmip_sentiment_seacrowd_text Sentiment Analysis E-commerce 716
mya sea_bench_mya_seacrowd_t2t Instruction Tuning Commentary, General, Multi-domain, Culture & heritage 160
fil sea_bench_tgl_seacrowd_t2t Instruction Tuning Commentary, General, Multi-domain, Culture & heritage 160
tha sea_bench_tha_seacrowd_t2t Instruction Tuning Commentary, General, Multi-domain, Culture & heritage 40
tha vistec_tp_th_21_seacrowd_seq_label NER Social media 1960
vie sea_bench_vie_seacrowd_t2t Instruction Tuning Commentary, General, Multi-domain, Culture & heritage 200
zlm sea_bench_zlm_seacrowd_t2t Instruction Tuning Commentary, General, Multi-domain, Culture & heritage 160
Table 29: Test data used in the translationese classifier experiment.
Figure 10: Top-20 SEA indigenous languages to be prioritized based on their potential demand and current utility. Panels: (a) τ = 0.01, (b) τ = 0.2, (c) τ = 0.5, (d) τ = 0.7, (e) τ = 1.0.
Figure 11: Top-20 SEA indigenous languages to be prioritized based on their potential demand and data availability. Panels: (a) τ = 0.01, (b) τ = 0.2, (c) τ = 0.5, (d) τ = 0.7, (e) τ = 1.0.
No. ISO 639-3 Language Region(s) Population
In SEACrowd
1 ind Indonesian Indonesia <1B
2 jav Javanese Indonesia <100M
3 vie Vietnamese Vietnam <100M
4 tha Thai Thailand, Cambodia <100M
5 fil Filipino Philippines <100M
6 mya Burmese Myanmar <100M
7 sun Sunda Indonesia <100M
8 tgl Tagalog Philippines <100M
9 khm Khmer Cambodia, Vietnam <100M
10 ceb Cebuano Philippines <100M
11 tts Northeastern Thai Thailand <100M
12 zlm Malay Malaysia <100M
13 zsm Standard Malay Malaysia, Brunei, Singapore <100M
Table 30: SEA indigenous languages with ≥10M speakers.
No. ISO 639-3 Language Region(s) Population
In SEACrowd
1 ilo Ilocano Philippines <10M
2 mad Madura Indonesia <10M
3 nod Northern Thai Laos, Thailand <10M
4 hil Hiligaynon Philippines <10M
5 min Minangkabau Indonesia <10M
6 bug Bugis Indonesia <10M
7 bew Betawi Indonesia <10M
8 sou Southern Thai Thailand <10M
9 lao Lao Cambodia, Laos <10M
10 hmv Hmong Dô Vietnam <10M
11 ace Aceh Indonesia <10M
12 bjn Banjar Indonesia <10M
13 ban Bali Indonesia <10M
14 shn Shan Myanmar, Thailand <10M
15 mui Musi Indonesia <10M
16 msi Sabah Malay Malaysia <10M
17 meo Kedah Malay Malaysia, Thailand <10M
18 pcc Giáy Vietnam <10M
19 war Waray-Waray Philippines <10M
20 mak Makasar Indonesia <10M
21 bcl Central Bikol Philippines <10M
22 xmm Manado Malay Indonesia <10M
23 sas Sasak Indonesia <10M
24 bbc Batak Toba Indonesia <10M
25 pam Kapampangan Philippines <10M
26 rki Rakhine Myanmar <10M
27 tyz Tày Vietnam <10M
28 abs Ambonese Malay Indonesia <10M
29 pse Central Malay Indonesia <10M
30 iba Iban Brunei, Indonesia, Malaysia <10M
31 kxm Northern Khmer Thailand <10M
32 khg Khams Tibetan Myanmar <10M
33 ksw S’gaw Karen Myanmar, Thailand <10M
34 btd Batak Dairi Indonesia <10M
35 bts Batak Simalungun Indonesia <10M
36 cbk Chavacano Philippines <10M
37 pag Pangasinan Philippines <10M
38 mtq Muong Vietnam <10M
39 btm Batak Mandailing Indonesia <10M
40 mdh Maguindanaon Philippines <10M
41 pmy Papuan Malay Indonesia <10M
42 gor Gorontalo Indonesia <10M
43 jax Jambi Malay Indonesia <10M
44 kjp Pwo Eastern Karen Myanmar, Thailand <10M
45 max North Moluccan Malay Indonesia <10M
46 mfa Pattani Malay Thailand <10M
Not in SEACrowd
47 mfp Makassar Indonesian Indonesia <10M
Table 31: SEA indigenous languages with <10M speakers.
No. ISO 639-3 Language Region(s) Population
In SEACrowd
1 nut Nung Vietnam <1M
2 kac Jingpho Myanmar <1M
3 tsg Tausug Philippines <1M
4 nij Ngaju Indonesia <1M
5 ljp Lampung Api Indonesia <1M
6 mqy Manggarai Indonesia <1M
7 mrw Maranao Philippines <1M
8 nia Nias Indonesia <1M
9 akb Batak Angkola Indonesia <1M
10 sda Toraja-Sa’dan Indonesia <1M
11 mnw Mon Myanmar, Thailand <1M
12 hni Hani Laos, Vietnam <1M
13 kjg Khmu Laos, Thailand, Vietnam <1M
14 aoz Uab Meto Indonesia <1M
15 blt Tai Dam Laos, Vietnam <1M
16 lus Mizo Chin Myanmar <1M
17 cps Capiznon Philippines <1M
18 btx Batak Karo Indonesia <1M
19 lis Lisu Myanmar <1M
20 msb Masbatenyo Philippines <1M
21 blk Pa’o Myanmar, Thailand <1M
22 tdd Tai Nüa Myanmar <1M
23 day Land Dayak Indonesia <1M
24 xdy Malayic Dayak Indonesia <1M
25 bhp Bima Indonesia <1M
26 ibg Ibanag Philippines <1M
27 zmi Negeri Sembilan Malay Malaysia <1M
28 mdr Mandar Indonesia <1M
29 kge Komering Indonesia <1M
30 bdr West Coast Bajau Malaysia <1M
31 kdt Kuay Cambodia, Laos, Thailand <1M
32 prk Parauk Wa Myanmar <1M
33 sgd Surigaonon Philippines <1M
34 tet Tetun East Timor, Indonesia <1M
35 bto Rinconada Bikol Philippines <1M
36 tdt Tetun Dili East Timor <1M
37 ium Iu Mien Laos, Vietnam <1M
38 krj Kinaray-a Philippines <1M
39 kyk Kamayo Philippines <1M
40 lew Ledo Kaili Indonesia <1M
41 mkn Kupang Malay Indonesia <1M
42 rej Rejang Indonesia <1M
43 mfb Bangka Indonesia <1M
44 rob Tae’ Indonesia <1M
45 lbw Tolaki Indonesia <1M
46 knx Kendayan Indonesia, Malaysia <1M
47 gay Gayo Indonesia <1M
48 mnb Muna Indonesia <1M
49 rbl Miraya Bikol Philippines <1M
50 smw Sumbawa Indonesia <1M
51 kxd Brunei Brunei <1M
52 khb Lü Laos, Myanmar <1M
53 lhu Lahu Laos, Myanmar <1M
54 twh Tai Dón Laos, Vietnam <1M
55 ysm Myanmar Sign Language Myanmar <1M
56 dtp Kadazan Dusun Malaysia <1M
57 fbl West Albay Bikol Philippines <1M
58 kvr Kerinci Indonesia <1M
59 pce Ruching Palaung Myanmar <1M
60 mry Mandaya Philippines <1M
61 nbe Konyak Naga Myanmar <1M
62 tcz Thado Chin Myanmar <1M
63 jra Jarai Cambodia, Vietnam <1M
64 xbr Kambera Indonesia <1M
65 mog Mongondow Indonesia <1M
66 pwo Pwo Western Karen Myanmar <1M
67 cja Western Cham Cambodia, Vietnam <1M
68 ahk Akha Laos, Myanmar, Thailand <1M
69 ssb Southern Sama Philippines <1M
70 sxn Sangir Indonesia <1M
Table 32: (1/2) SEA indigenous languages with <1M speakers.
No. ISO 639-3 Language Region(s) Population
In SEACrowd
71 btz Batak Alas-Kluet Indonesia <1M
72 ctd Tedim Chin Myanmar <1M
73 srv Southern Sorsoganon Philippines <1M
74 abl Lampung Nyo Indonesia <1M
75 dnw Western Dani Indonesia <1M
76 ktp Kaduo Laos <1M
77 slp Lamaholot Indonesia <1M
78 rad Rade Vietnam <1M
79 ski Sika Indonesia <1M
80 kpm Koho Vietnam <1M
81 bdq Bahnar Vietnam <1M
82 bdl Indonesian Bajau Indonesia <1M
83 bpr Koronadal Blaan Philippines <1M
84 ccp Chakma Myanmar <1M
85 kne Kankanaey Philippines <1M
86 kyu Western Kayah Myanmar <1M
87 mhy Ma’anyan Indonesia <1M
88 tnt Tontemboan Indonesia <1M
89 pll Shwe Palaung Myanmar <1M
90 daw Davawenyo Philippines <1M
91 cnh Hakha Chin Myanmar <1M
92 syb Central Subanen Philippines <1M
93 rbb Rumai Palaung Myanmar <1M
94 pmf Pamona Indonesia <1M
95 bln Southern Catanduanes Bikol Philippines <1M
96 itv Itawit Philippines <1M
97 pdu Kayan Myanmar <1M
98 mgm Mambae East Timor <1M
99 bhq Tukang Besi South Indonesia <1M
100 sly Selayar Indonesia <1M
101 mvp Duri Indonesia <1M
102 bgz Banggai Indonesia <1M
103 kjc Coastal Konjo Indonesia <1M
104 suc Western Subanon Philippines <1M
105 cyo Cuyonon Philippines <1M
106 khc Tukang Besi North Indonesia <1M
107 lhi Lahu Shi Myanmar <1M
108 mel Central Melanau Malaysia <1M
109 ibl Ibaloi Philippines <1M
110 end Ende Indonesia <1M
111 hvn Hawu Indonesia <1M
112 kkv Kangean Indonesia <1M
113 yka Yakan Philippines <1M
114 ljl Li’o Indonesia <1M
115 mkz Makasae East Timor <1M
116 bkd Binukid Philippines <1M
117 bkr Bakumpai Indonesia <1M
118 ekg Ekari Indonesia <1M
119 hnj Hmong Njua Laos, Thailand, Vietnam <1M
120 kak Kalanguya Philippines <1M
121 kkh Khün Myanmar <1M
122 lbx Lawangan Indonesia <1M
123 mhx Lhao Vo Myanmar <1M
124 mqj Mamasa Indonesia <1M
125 psp Filipino Sign Language Philippines <1M
126 tgn Tandaganon Philippines <1M
Not in SEACrowd
127 rhg Rohingya Myanmar <1M
128 pht Phu Thai Laos, Thailand, Vietnam <1M
129 tvn Tavoyan Myanmar <1M
130 osi Osing Indonesia <1M
131 ilp Iranun Philippines <1M
132 kzs Sugut Dusun Malaysia <1M
133 vkt Tenggarong Kutai Malay Indonesia <1M
134 phu Phuan Laos, Thailand <1M
135 csh Asho Chin Myanmar <1M
136 mlc Cao Lan Vietnam <1M
137 kjk Highland Konjo Indonesia <1M
138 liw Col Indonesia <1M
139 sss So Laos, Thailand <1M
140 dnv Danu Myanmar <1M
141 sdq Semandang Indonesia <1M
142 tjl Tai Laing Myanmar <1M
Table 33: (2/2) SEA indigenous languages with <1M speakers.
No. ISO 639-3 Language Region(s) Population
In SEACrowd
1 adr Adonara Indonesia <100K
2 sed Sedang Vietnam <100K
3 blf Buol Indonesia <100K
4 tbl Tboli Philippines <100K
5 hre Hre Vietnam <100K
6 rol Romblomanon Philippines <100K
7 akl Aklanon Philippines <100K
8 tdn Tondano Indonesia <100K
9 bps Sarangani Blaan Philippines <100K
10 kqr Kimaragang Malaysia <100K
11 sml Central Sama Philippines <100K
12 txs Tonsea Indonesia <100K
13 stb Northern Subanen Philippines <100K
14 bks Northern Sorsoganon Philippines <100K
15 kei Kei Indonesia <100K
16 klg Tagakaulo Philippines <100K
17 tld Talaud Indonesia <100K
18 atb Zaiwa Myanmar <100K
19 sse Balangingih Sama Philippines <100K
20 tes Tengger Indonesia <100K
21 tyr Tai Daeng Laos, Vietnam <100K
22 cia Cia-Cia Indonesia <100K
23 gbi Galela Indonesia <100K
24 otd Ot Danum Indonesia <100K
25 cts Northern Catanduanes Bikol Philippines <100K
26 loe Saluan Indonesia <100K
27 bno Bantoanon Philippines <100K
28 cmr Mro-Khimi Myanmar <100K
29 ubl Buhi’non Bikol Philippines <100K
30 cjm Eastern Cham Vietnam <100K
31 bkx Baikeno East Timor <100K
32 aaz Amarasi Indonesia <100K
33 bhw Biak Indonesia <100K
34 kqe Kalagan Philippines <100K
35 xnn Northern Kankanay Philippines <100K
36 xsb Sambal Philippines <100K
37 cfm Falam Chin Myanmar <100K
38 lbl Libon Bikol Philippines <100K
39 wlo Wolio Indonesia <100K
40 bth Biatah Bidayuh Indonesia, Malaysia <100K
41 kem Kemak East Timor, Indonesia <100K
42 raw Rawang Myanmar <100K
43 tft Ternate Indonesia <100K
44 zom Zo Myanmar <100K
45 cnk Khumi Chin Myanmar <100K
46 mqx Mamuju Indonesia <100K
47 msm Agusan Manobo Philippines <100K
48 nst Tangshang Naga Myanmar <100K
49 nxg Ngad’a Indonesia <100K
50 obo Obo Manobo Philippines <100K
51 pww Pwo Northern Karen Thailand <100K
52 sya Siang Indonesia <100K
53 tom Tombulu Indonesia <100K
54 xml Malaysian Sign Language Malaysia <100K
55 mbs Sarangani Manobo Philippines <100K
56 mwv Mentawai Indonesia <100K
57 msk Mansaka Philippines <100K
58 smk Bolinao Philippines <100K
59 bfn Bunak East Timor, Indonesia <100K
60 bgi Bagobo-Klata Philippines <100K
61 drg Rungus Malaysia <100K
62 kzf Da’a Kaili Indonesia <100K
63 wew Wejewa Indonesia <100K
64 rog Northern Roglai Vietnam <100K
65 ilk Bogkalot Philippines <100K
66 ktv Eastern Katu Vietnam <100K
67 dnt Mid Grand Valley Dani Indonesia <100K
68 frd Fordata Indonesia <100K
69 mbt Matigsalug Manobo Philippines <100K
70 nxe Nage Indonesia <100K
71 ptt Enrekang Indonesia <100K
Table 34: (1/5) SEA indigenous languages with <100K speakers.
No. ISO 639-3 Language Region(s) Population
In SEACrowd
72 tiy Teduray Philippines <100K
73 tjg Tunjung Indonesia <100K
74 wmm Maiwa Indonesia <100K
75 sdo Bukar-Sadong Bidayuh Indonesia, Malaysia <100K
76 kyp Kang Laos <100K
77 tvo Tidore Indonesia <100K
78 hos Ho Chi Minh City Sign Language Vietnam <100K
79 mhs Buru Indonesia <100K
80 sti Bulo Stieng Cambodia, Vietnam <100K
81 law Lauje Indonesia <100K
82 bgs Tagabawa Philippines <100K
83 sjm Mapun Philippines <100K
84 blr Blang Myanmar, Thailand <100K
85 rgs Southern Roglai Vietnam <100K
86 smr Simeulue Indonesia <100K
87 czt Zotung Chin Myanmar <100K
88 kvq Geba Karen Myanmar <100K
89 mtd Mualang Indonesia <100K
90 xxk Ke’o Indonesia <100K
91 tkd Tukudede East Timor <100K
92 kix Khiamniungan Naga Myanmar <100K
93 bsb Brunei Bisaya Brunei, Malaysia <100K
94 dao Daai Chin Myanmar <100K
95 ddg Fataluku East Timor <100K
96 mqn Moronene Indonesia <100K
97 ges Geser-Gorom Indonesia <100K
98 pho Phunoi Laos <100K
99 slm Pangutaran Sama Philippines <100K
100 hro Haroi Vietnam <100K
101 ivv Ivatan Philippines <100K
102 mrh Mara Chin Myanmar <100K
103 btw Butuanon Philippines <100K
104 cma Maa Vietnam <100K
105 sbl Botolan Sambal Philippines <100K
106 cmo Central Mnong Cambodia, Vietnam <100K
107 blz Balantak Indonesia <100K
108 tpu Tampuan Cambodia <100K
109 blj Bulungan Indonesia <100K
110 cgc Kagayanen Philippines <100K
111 clu Caluyanun Philippines <100K
112 cml Koneq-koneq Indonesia <100K
113 gad Gaddang Philippines <100K
114 hlt Matu Chin Myanmar <100K
115 ifk Tuwali Ifugao Philippines <100K
116 ifu Mayoyao Ifugao Philippines <100K
117 knb Lubuagan Kalinga Philippines <100K
118 ksx Kedang Indonesia <100K
119 lcf Lubu Indonesia <100K
120 lsi Lacid Myanmar <100K
121 mba Higaonon Philippines <100K
122 mng Eastern Mnong Vietnam <100K
123 mro Mru Myanmar <100K
124 mta Cotabato Manobo Philippines <100K
125 set Sentani Indonesia <100K
126 tmn Taman Indonesia <100K
127 twu Termanu Indonesia <100K
128 txm Tomini Indonesia <100K
129 ulm Ulumanda’ Indonesia <100K
130 wow Wawonii Indonesia <100K
131 sne Bau Bidayuh Indonesia, Malaysia <100K
132 tdf Talieng Laos <100K
133 lbo Laven Laos <100K
134 acn Ngochang Myanmar <100K
135 tlb Tobelo Indonesia <100K
136 ifa Amganad Ifugao Philippines <100K
137 itd Southern Tidung Indonesia, Malaysia <100K
138 pha Pa-Hng Vietnam <100K
139 atd Ata Manobo Philippines <100K
140 bru Eastern Bru Laos, Vietnam <100K
141 kzp Kaidipang Indonesia <100K
142 abx Inabaknon Philippines <100K
Table 35: (2/5) SEA indigenous languages with <100K speakers.
No. ISO 639-3 Language Region(s) Population
In SEACrowd
143 aol Alor Indonesia <100K
144 jmd Yamdena Indonesia <100K
145 laa Southern Subanen Philippines <100K
146 lmy Lamboya Indonesia <100K
147 txe Totoli Indonesia <100K
148 oyb Oy Laos <100K
149 mlf Mal Laos, Thailand <100K
150 lnd Lundayeh Brunei, Indonesia, Malaysia <100K
151 prh Porohanon Philippines <100K
152 brb Brao Cambodia, Laos, Vietnam <100K
153 lbn Rmeet Laos <100K
154 ilm Iranun Malaysia <100K
155 ptu Bambam Indonesia <100K
156 vkl Kulisusu Indonesia <100K
157 blw Balangao Philippines <100K
158 bsy Sabah Bisaya Malaysia <100K
159 krr Krung Cambodia <100K
160 dtb Labuk-Kinabatangan Kadazan Malaysia <100K
161 ayz Mai Brat Indonesia <100K
162 bac Badui Indonesia <100K
163 brv Western Bru Laos, Thailand <100K
164 bwp Mandobo Bawah Indonesia <100K
165 dna Upper Grand Valley Dani Indonesia <100K
166 dni Lower Grand Valley Dani Indonesia <100K
167 dtr Lotud Malaysia <100K
168 dun Dusun Deyah Indonesia <100K
169 kje Kisar Indonesia <100K
170 kli Kalumpang Indonesia <100K
171 kod Kodi Indonesia <100K
172 llg Lole Indonesia <100K
173 lrt Larantuka Malay Indonesia <100K
174 mnz Moni Indonesia <100K
175 pea Peranakan Indonesian Indonesia <100K
176 ppk Uma Indonesia <100K
177 prt Prai Laos, Thailand <100K
178 tmm Tai Thanh Vietnam <100K
179 tnw Tonsawang Indonesia <100K
180 twy Tawoyan Indonesia <100K
181 txq Tii Indonesia <100K
182 wlw Walak Indonesia <100K
183 skh Sikule Indonesia <100K
184 lbk Central Bontok Philippines <100K
185 cje Chru Vietnam <100K
186 hnn Hanunoo Philippines <100K
187 tlu Tulehu Indonesia <100K
188 wmh Waima’a East Timor <100K
189 hrk Haruku Indonesia <100K
190 lex Luang Indonesia <100K
191 puo Puoc Vietnam <100K
192 ren Rengao Vietnam <100K
193 alp Alune Indonesia <100K
194 bwe Bwe Karen Myanmar <100K
195 tlt Sou Nama Indonesia <100K
196 zyp Zyphe Chin Myanmar <100K
197 abz Abui Indonesia <100K
198 akg Anakalangu Indonesia <100K
199 had Hatam Indonesia <100K
200 htu Hitu Indonesia <100K
201 nlc Nalca Indonesia <100K
202 pac Pacoh Laos, Vietnam <100K
203 yog Yogad Philippines <100K
204 mxd Modang Indonesia <100K
205 jeh Jeh Laos, Vietnam <100K
206 kyn Northern Binukidnon Philippines <100K
207 phg Phuong Vietnam <100K
208 agn Agutaynen Philippines <100K
209 cnw Ngawn Chin Myanmar <100K
210 ila Ile Ape Indonesia <100K
211 krd Kairui-Midiki East Timor <100K
212 loa Loloda Indonesia <100K
213 mbb Western Bukidnon Manobo Philippines <100K
214 mwq Müün Chin Myanmar <100K
215 nxa Nauete East Timor <100K
216 prf Paranan Philippines <100K
Table 36: (3/5) SEA indigenous languages with <100K speakers.
No. ISO 639-3 Language Region(s) Population
In SEACrowd
217 snl Sangil Philippines <100K
218 tby Tabaru Indonesia <100K
219 tea Temiar Malaysia <100K
220 yli Angguruk Yali Indonesia <100K
221 mej Meyah Indonesia <100K
222 mbi Ilianen Manobo Philippines <100K
223 plw Brooke’s Point Palawano Philippines <100K
224 duu Drung Myanmar <100K
225 heg Helong Indonesia <100K
226 mzq Mori Atas Indonesia <100K
227 uhn Damal Indonesia <100K
228 xmz Mori Bawah Indonesia <100K
229 kjm Kháng Vietnam <100K
230 hal Salang Laos, Vietnam <100K
231 idt Idaté East Timor <100K
232 dok Dondo Indonesia <100K
233 gal Galolen East Timor, Indonesia <100K
234 ksc Southern Kalinga Philippines <100K
235 txa Tombonuo Malaysia <100K
236 ngt Kriang Laos <100K
237 kmk Limos Kalinga Philippines <100K
238 alo Larike-Wakasihu Indonesia <100K
239 yno Yong Thailand <100K
240 ril Riang Lang Myanmar <100K
241 atq Aralle-Tabulahan Indonesia <100K
242 cek Eastern Khumi Chin Myanmar <100K
243 cua Cua Vietnam <100K
244 mnx Sougb Indonesia <100K
245 mqs West Makian Indonesia <100K
246 nuf Nusu Myanmar <100K
247 plc Central Palawano Philippines <100K
248 plv Southwest Palawano Philippines <100K
249 rgu Rikou Indonesia <100K
250 szw Sawai Indonesia <100K
251 tdj Tajio Indonesia <100K
252 xkl Mainstream Kenyah Indonesia, Malaysia <100K
253 yin Riang Lai Myanmar <100K
254 lcl Lisela Indonesia <100K
255 lra Rara Bakati’ Indonesia, Malaysia <100K
256 bve Berau Malay Indonesia <100K
257 kml Tanudan Kalinga Philippines <100K
258 beu Blagar Indonesia <100K
259 xem Mateq Indonesia <100K
260 lev Western Pantar Indonesia <100K
261 ptn Patani Indonesia <100K
262 oog Ong Laos <100K
263 spr Saparua Indonesia <100K
264 amk Ambai Indonesia <100K
265 ifb Batad Ifugao Philippines <100K
266 aax Mandobo Atas Indonesia <100K
267 bep Behoa Indonesia <100K
268 bvy Baybayanon Philippines <100K
269 csy Siyin Chin Myanmar <100K
270 dbj Ida’an Malaysia <100K
271 emb Embaloh Indonesia <100K
272 iry Iraya Philippines <100K
273 jak Jakun Malaysia <100K
274 jaq Yaqay Indonesia <100K
275 kps Tehit Indonesia <100K
276 kvb Kubu Indonesia <100K
277 kxf Kawyaw Myanmar <100K
278 kyt Kayagar Indonesia <100K
279 lje Rampi Indonesia <100K
280 lur Loura Indonesia <100K
281 mbd Dibabawon Manobo Philippines <100K
282 mbf Baba Malay Singapore <100K
283 mky East Makian Indonesia <100K
284 mvd Mamboru Indonesia <100K
285 ndx Nduga Indonesia <100K
286 pez Eastern Penan Brunei, Malaysia <100K
287 ple Palu’e Indonesia <100K
288 sea Semai Malaysia <100K
289 ssq So’a Indonesia <100K
Table 37: (4/5) SEA indigenous languages with <100K speakers.
No. ISO 639-3 Language Region(s) Population
In SEACrowd
290 szb Ngalum Indonesia <100K
291 tbk Calamian Tagbanwa Philippines <100K
292 tbw Tagbanwa Philippines <100K
293 txx Tatana Malaysia <100K
294 wnk Wanukaka Indonesia <100K
295 yva Yawa Indonesia <100K
Not in SEACrowd
296 int Intha Myanmar <100K
297 loc Inonhan Philippines <100K
298 mqg Kota Bangun Kutai Malay Indonesia <100K
299 bfx Bantayanon Philippines <100K
300 tou Tho Vietnam <100K
301 ncq Northern Katang Laos <100K
302 bvu Bukit Malay Indonesia <100K
303 byd Benyadu’ Indonesia <100K
304 tsq Thai Sign Language Thailand <100K
305 nyw Nyaw Thailand <100K
306 rir Ribun Indonesia <100K
307 scg Sanggau Indonesia <100K
308 sct Southern Katang Laos <100K
309 stt Budeh Stieng Vietnam <100K
310 tco Taungyo Myanmar <100K
311 vkk Kaur Indonesia <100K
312 hab Hanoi Sign Language Vietnam <100K
313 djo Jangkang Indonesia <100K
314 sbx Seberuang Indonesia <100K
315 lso Laos Sign Language Laos <100K
316 sez Senthang Chin Myanmar <100K
317 soa Thai Song Thailand <100K
318 knl Keninjal Indonesia <100K
319 tth Upper Ta’oih Laos, Vietnam <100K
320 apg Ampanang Indonesia <100K
321 mnn Southern Mnong Vietnam <100K
322 pel Pekal Indonesia <100K
323 zkd Kadu Myanmar <100K
324 bkz Bungku Indonesia <100K
325 mkx Kinamiging Manobo Philippines <100K
326 bnu Bentong Indonesia <100K
327 kxy Kayong Vietnam <100K
328 mhp Balinese Malay Indonesia <100K
329 unz Unde Kaili Indonesia <100K
330 bld Bolango Indonesia <100K
331 kuf Western Katu Laos <100K
332 dnk Dengka Indonesia <100K
333 mvv Tagal Murut Indonesia, Malaysia <100K
334 skn Kolibugan Subanon Philippines <100K
335 szn Sula Indonesia <100K
336 cnb Uppu Chin Myanmar <100K
337 bhv Bahau Indonesia <100K
338 itt Maeng Itneg Philippines <100K
339 hji Haji Indonesia <100K
340 ghk Geko Karen Myanmar <100K
341 kvl Kayaw Myanmar <100K
342 tto Lower Ta’oih Laos <100K
343 bdb Basap Indonesia <100K
344 clj Laitu Chin Myanmar <100K
345 clt Lautu Chin Myanmar <100K
346 dup Duano Indonesia, Malaysia <100K
347 kyb Butbut Kalinga Philippines <100K
348 stg Trieng Vietnam <100K
349 cbw Kinabalian Philippines <100K
350 csv Sumtu Chin Myanmar <100K
351 riu Riung Indonesia <100K
352 srg Sulod Philippines <100K
353 ity Moyadan Itneg Philippines <100K
354 kkg Mabaka Valley Kalinga Philippines <100K
355 bne Bintauna Indonesia <100K
356 nlk Ninia Yali Indonesia <100K
357 hik Seit-Kaitetu Indonesia <100K
358 ksn Kasiguranin Philippines <100K
359 tsl Ts’ün-Lao Vietnam <100K
360 xao Khao Vietnam <100K
Table 38: (5/5) SEA indigenous languages with <100K speakers.
No. ISO 639-3 Language Region(s) Population
In SEACrowd
1 xte Ketengban Indonesia <10K
2 bna Bonerate Indonesia <10K
3 bku Buhid Philippines <10K
4 aws South Awyu Indonesia <10K
5 woo Manombai Indonesia <10K
6 asc Casuarina Coast Asmat Indonesia <10K
7 tih Timugon Murut Malaysia <10K
8 asl Asilulu Indonesia <10K
9 sgb Mag-antsi Ayta Philippines <10K
10 eky Eastern Kayah Myanmar, Thailand <10K
11 ify Keley-i Kallahan Philippines <10K
12 inl Indonesian Sign Language Indonesia <10K
13 kgq Kamoro Indonesia <10K
14 kht Khamti Myanmar <10K
15 kpq Korupun-Sela Indonesia <10K
16 kti North Muyu Indonesia <10K
17 lcp Western Lawa Thailand <10K
18 mtj Moskona Indonesia <10K
19 slu Selaru Indonesia <10K
20 tmw Temuan Malaysia <10K
21 txt Citak Indonesia <10K
22 whk Wahau Kenyah Indonesia <10K
23 txn West Tarangan Indonesia <10K
24 dro Daro-Matu Melanau Malaysia <10K
25 awu Central Awyu Indonesia <10K
26 itb Binongan Itneg Philippines <10K
27 lti Leti Indonesia <10K
28 saj Sahu Indonesia <10K
29 kvv Kola Indonesia <10K
30 kvu Yinbaw Myanmar <10K
31 akc Mpur Indonesia <10K
32 cns Central Asmat Indonesia <10K
33 crw Chrau Vietnam <10K
34 lwl Eastern Lawa Thailand <10K
35 lzn Lainong Naga Myanmar <10K
36 mrz Marind Indonesia <10K
37 row Dela-Oenale Indonesia <10K
38 sfe Eastern Subanen Philippines <10K
39 ttd Tutong Brunei <10K
40 iwo Morop Indonesia <10K
41 twb Tawbuid Philippines <10K
42 bhz Bada Indonesia <10K
43 pwm Molbog Malaysia, Philippines <10K
44 psa Asue Awyu Indonesia <10K
45 ebk Eastern Bontok Philippines <10K
46 tre East Tarangan Indonesia <10K
47 npy Napu Indonesia <10K
48 gdg Ga’dang Philippines <10K
49 gir Red Gelao Vietnam <10K
50 kll Kagan Kalagan Philippines <10K
51 lwt Lewotobi Indonesia <10K
52 moo Monom Vietnam <10K
53 pnp Pancana Indonesia <10K
54 tdr Todrah Vietnam <10K
55 weo Wemale Indonesia <10K
56 woi Kamang Indonesia <10K
57 wrp Waropen Indonesia <10K
58 lha Laha Vietnam <10K
59 kvo Dobel Indonesia <10K
60 mtg Una Indonesia <10K
61 inn Isinay Philippines <10K
62 ihp Iha Indonesia <10K
63 jka Kaera Indonesia <10K
64 myl Moma Indonesia <10K
65 mmn Minamanwa Philippines <10K
66 nxr Ninggerum Indonesia <10K
67 blx Mag-Indi Ayta Philippines <10K
68 duw Dusun Witu Indonesia <10K
69 kgw Karon Dori Indonesia <10K
70 kyo Klon Indonesia <10K
71 lbt Lachi Vietnam <10K
72 mli Malimpung Indonesia <10K
73 nfa Dhao Indonesia <10K
74 pdo Padoe Indonesia <10K
75 raz Rahambuu Indonesia <10K
76 tpg Kula Indonesia <10K
77 urk Urak Lawoi’ Thailand <10K
78 wad Wamesa Indonesia <10K
79 wod Wolani Indonesia <10K
80 wul Silimo Indonesia <10K
Table 39: (1/6) SEA indigenous languages with <10K speakers.
No. ISO 639-3 Language Region(s) Population
In SEACrowd
81 yac Pass Valley Yali Indonesia <10K
82 yoy Yoy Laos, Thailand <10K
83 and Ansus Indonesia <10K
84 mxn Moi Kelim Indonesia <10K
85 tlv Taliabu Indonesia <10K
86 bty Bobot Indonesia <10K
87 duq Dusun Malang Indonesia <10K
88 ums Pendau Indonesia <10K
89 vbb Southeast Babar Indonesia <10K
90 baj Barakai Indonesia <10K
91 bgr Bawm Chin Myanmar <10K
92 irr Ir Laos <10K
93 nbq Nggem Indonesia <10K
94 bqr Burusu Indonesia <10K
95 kvd Kui Indonesia <10K
96 bny Bintulu Malaysia <10K
97 rka Kraol Cambodia <10K
98 jah Jah Hut Malaysia <10K
99 kys Baram Kayan Malaysia <10K
100 smu Somray Cambodia <10K
101 sza Semelai Malaysia <10K
102 alk Alak Laos <10K
103 anl Anu-Khongso Chin Myanmar <10K
104 bei Bakati’ Indonesia <10K
105 irh Irarutu Indonesia <10K
106 kta Katua Vietnam <10K
107 kts South Muyu Indonesia <10K
108 kzi Kelabit Indonesia, Malaysia <10K
109 lmr Lamalera Indonesia <10K
110 mwt Moken Myanmar, Thailand <10K
111 ntx Tangkhul Naga Myanmar <10K
112 ror Rongga Indonesia <10K
113 sdu Sarudu Indonesia <10K
114 slz Ma’ya Indonesia <10K
115 sre Sara Bakati’ Indonesia <10K
116 tgb Tobilung Malaysia <10K
117 twe Teiwa Indonesia <10K
118 tyn Kombai Indonesia <10K
119 wah Watubela Indonesia <10K
120 nev Nyaheun Laos <10K
121 klz Kabola Indonesia <10K
122 awy Edera Awyu Indonesia <10K
123 abd Manide Philippines <10K
124 tnm Tabla Indonesia <10K
125 skb Saek Laos, Thailand <10K
126 kvw Wersing Indonesia <10K
127 xod Kokoda Indonesia <10K
128 bpq Banda Malay Indonesia <10K
129 bay Batuley Indonesia <10K
130 kgx Kamaru Indonesia <10K
131 khe Korowai Indonesia <10K
132 lkj Remun Malaysia <10K
133 pku Paku Indonesia <10K
134 saw Sawi Indonesia <10K
135 tcg Tamagario Indonesia <10K
136 pne Western Penan Malaysia <10K
137 xks Kumbewaha Indonesia <10K
138 pgu Pagu Indonesia <10K
139 tpo Tai Pao Laos, Vietnam <10K
140 zrs Mairasi Indonesia <10K
141 kzz Kalabra Indonesia <10K
142 bls Balaesang Indonesia <10K
143 kuv Kur Indonesia <10K
144 ree Rejang Kayan Malaysia <10K
145 abp Abellen Ayta Philippines <10K
146 adn Adang Indonesia <10K
147 ahh Aghu Indonesia <10K
148 bnd Banda Indonesia <10K
149 bnq Bantik Indonesia <10K
150 ckh Chak Myanmar <10K
151 due Umiray Dumaget Agta Philippines <10K
152 eip Lik Indonesia <10K
153 kgr Abun Indonesia <10K
154 kig Kimaghima Indonesia <10K
155 nsy Nasal Indonesia <10K
156 swt Sawila Indonesia <10K
157 tmg Ternateño Indonesia <10K
158 wms Wambon Indonesia <10K
159 mhe Mah Meri Malaysia <10K
160 bgl Bo Laos <10K
Table 40: (2/6) SEA indigenous languages with <10K speakers.
No. ISO 639-3 Language Region(s) Population
In SEACrowd
161 bpv Bian Marind Indonesia <10K
162 gzn Gane Indonesia <10K
163 dmr East Damar Indonesia <10K
164 obk Southern Bontok Philippines <10K
165 bzl Boano Indonesia <10K
166 hbu Habun East Timor <10K
167 zng Mang Vietnam <10K
168 gei Gebe Indonesia <10K
169 spb Sepa Indonesia <10K
170 agv Remontado Dumagat Philippines <10K
171 bzq Buli Indonesia <10K
172 brp Barapasi Indonesia <10K
173 cbl Bualkhaw Chin Myanmar <10K
174 grs Gresi Indonesia <10K
175 jmn Makuri Naga Myanmar <10K
176 kmt Kemtuik Indonesia <10K
177 kwe Kwerba Indonesia <10K
178 sko Seko Tengah Indonesia <10K
179 wrs Waris Indonesia <10K
180 kyi Kiput Malaysia <10K
181 nrm Narom Malaysia <10K
182 klw Tado Indonesia <10K
183 spu Sapuan Laos <10K
184 jei Yei Indonesia <10K
185 sqq Sou Laos <10K
186 awv Jair Awyu Indonesia <10K
187 bup Busoa Indonesia <10K
188 kkl Kosarek Yale Indonesia <10K
189 zka Kaimbulawa Indonesia <10K
190 kjr Kurudu Indonesia <10K
191 alj Alangan Philippines <10K
192 asy Yaosakor Asmat Indonesia <10K
193 dms Dampelas Indonesia <10K
194 enr Emem Indonesia <10K
195 hnu Hung Laos, Vietnam <10K
196 kwt Kwesten Indonesia <10K
197 kyj Karao Philippines <10K
198 lau Laba Indonesia <10K
199 ley Limola Indonesia <10K
200 mqf Momuna Indonesia <10K
201 mqo Modole Indonesia <10K
202 nir Nimboran Indonesia <10K
203 pmo Pom Indonesia <10K
204 sge Segai Indonesia <10K
205 szc Semaq Beri Malaysia <10K
206 tgt Central Tagbanwa Philippines <10K
207 tty Sikaritai Indonesia <10K
208 bgk Bit Laos <10K
209 grm Kota Marudu Talantang Malaysia <10K
210 srl Isirawa Indonesia <10K
211 wbw Woi Indonesia <10K
212 sib Sebop Malaysia <10K
213 bnb Bookan Murut Malaysia <10K
214 llm Lasalimu Indonesia <10K
215 rmm Roma Indonesia <10K
216 pcb Pear Cambodia <10K
217 abc Ambala Ayta Philippines <10K
218 nxx Nafri Indonesia <10K
219 lwh White Lachi Vietnam <10K
220 ury Orya Indonesia <10K
221 irx Kamberau Indonesia <10K
222 atk Ati Philippines <10K
223 bgb Bobongko Indonesia <10K
224 bvz Bauzi Indonesia <10K
225 bzp Kemberano Indonesia <10K
226 cbn Nyahkur Thailand <10K
227 dbf Edopi Indonesia <10K
228 eno Enggano Indonesia <10K
229 mkm Moklen Thailand <10K
230 nxl South Nuaulu Indonesia <10K
231 vko Kodeoha Indonesia <10K
232 wbb Wabo Indonesia <10K
233 yir North Awyu Indonesia <10K
234 zbc Central Berawan Malaysia <10K
235 bya Batak Philippines <10K
Table 41: (3/6) SEA indigenous languages with <10K speakers.
No. ISO 639-3 Language Region(s) Population
In SEACrowd
236 bdg Bonggi Malaysia <10K
237 fau Fayu Indonesia <10K
238 ilu Ili’uun Indonesia <10K
239 yet Yetfa Indonesia <10K
240 dmy Sowari Indonesia <10K
241 ddw Dawera-Daweloor Indonesia <10K
242 jhi Jehai Malaysia <10K
243 xmt Matbat Indonesia <10K
244 beg Belait Brunei <10K
245 ivb Ibatan Philippines <10K
246 oia Oirata Indonesia <10K
247 bkl Berik Indonesia <10K
248 duo Dupaninan Agta Philippines <10K
249 kdw Koneraw Indonesia <10K
250 msf Mekwei Indonesia <10K
251 nqm Ndom Indonesia <10K
252 sbg Moi Lemas Indonesia <10K
253 seu Serui-Laut Indonesia <10K
254 tve Te’un Indonesia <10K
255 tzn Tugun Indonesia <10K
256 wng Wanggom Indonesia <10K
257 bnj Bangon Philippines <10K
258 snv Sa’ban Indonesia, Malaysia <10K
259 bdw Baham Indonesia <10K
260 ran Riantana Indonesia <10K
261 rnn Roon Indonesia <10K
262 szp Suabo Indonesia <10K
263 zbe East Berawan Malaysia <10K
264 scb Chut Laos, Vietnam <10K
265 tvm Tela-Masbuar Indonesia <10K
266 udj Ujir Indonesia <10K
267 agy Southern Alta Philippines <10K
268 air Airoran Indonesia <10K
269 aqm Atohwaim Indonesia <10K
270 asi Buruwai Indonesia <10K
271 att Pamplona Atta Philippines <10K
272 bcd North Babar Indonesia <10K
273 bnf Masiwang Indonesia <10K
274 btq Batek Malaysia <10K
275 cth Thaiphum Chin Myanmar <10K
276 dem Dem Indonesia <10K
277 dmg Upper Kinabatangan Malaysia <10K
278 dnu Danau Myanmar <10K
279 etz Semimi Indonesia <10K
280 jbj Arandai Indonesia <10K
281 kbv Dla Indonesia <10K
282 kpu Kafoa Indonesia <10K
283 kvy Yintale Myanmar <10K
284 msg Moraid Indonesia <10K
285 nks North Asmat Indonesia <10K
286 pnx Phong-Kniang Laos <10K
287 sob Sobei Indonesia <10K
288 wgo Ambel Indonesia <10K
289 wno Wano Indonesia <10K
290 xse Sempan Indonesia <10K
291 zbw West Berawan Malaysia <10K
Not in SEACrowd
292 rbk Northern Bontok Philippines <10K
293 kvt Lahta Myanmar <10K
294 lbg Laopang Laos <10K
295 stu Samtao Myanmar <10K
296 kxk Zayein Myanmar <10K
297 iti Inlaud Itneg Philippines <10K
298 nqq Chen-Kayu Naga Myanmar <10K
299 pnc Pannei Indonesia <10K
300 zkn Kanan Myanmar <10K
301 mlz Malaynon Philippines <10K
302 khf Khuen Laos <10K
303 kkx Kohin Indonesia <10K
304 lmj West Lembata Indonesia <10K
305 dkr Kuijau Malaysia <10K
306 ebc Beginci Indonesia <10K
307 mtw Southern Binukidnon Philippines <10K
308 mqk Rajah Kabunsuwan Manobo Philippines <10K
309 csx Cambodian Sign Language Cambodia <10K
310 tis Masadiit Itneg Philippines <10K
311 csj Songlai Chin Myanmar <10K
312 mqc Mangole Indonesia <10K
313 bpz Bilba Indonesia <10K
314 lmf South Lembata Indonesia <10K
315 wha Sou Upaa Indonesia <10K
316 lkc Kucong Vietnam <10K
317 mqa Maba Indonesia <10K
318 lcq Luhu Indonesia <10K
319 mjb Makalero East Timor <10K
Table 42: (4/6) SEA indigenous languages with <10K speakers.
No. ISO 639-3 Language Region(s) Population
Not in SEACrowd
320 krv Kavet Cambodia <10K
321 cey Ekai Chin Myanmar <10K
322 kjt Phrae Pwo Karen Thailand <10K
323 kuk Kepo’ Indonesia <10K
324 put Putoh Indonesia <10K
325 rjg Rajong Indonesia <10K
326 sjb Sajau Basap Indonesia <10K
327 tkz Takua Vietnam <10K
328 amv Ambelau Indonesia <10K
329 wlh Welaun East Timor, Indonesia <10K
330 plz Paluan Murut Malaysia <10K
331 jkp Paku Karen Myanmar <10K
332 adb Atauran East Timor <10K
333 nea Eastern Ngad’a Indonesia <10K
334 ntd Northern Tidung Malaysia <10K
335 phh Phula Vietnam <10K
336 reb Rembong Indonesia <10K
337 skx Seko Padang Indonesia <10K
338 swu Suwawa Indonesia <10K
339 tgr Tareng Laos <10K
340 weu Rawngtu Chin Myanmar <10K
341 sau Saleman Indonesia <10K
342 thi Tai Long Laos <10K
343 low Tampias Lobu Malaysia <10K
344 npg Ponyo-Gongwang Naga Myanmar <10K
345 ukk Muak Sa-aak Myanmar <10K
346 tlq Tai Loi Laos, Myanmar <10K
347 hkn Mel-Khaonh Cambodia <10K
348 jkm Mobwa Karen Myanmar <10K
349 lmq Lamatuka Indonesia <10K
350 lvu Levuka Indonesia <10K
351 lwe Lewoeleng Indonesia <10K
352 rtc Rungtu Chin Myanmar <10K
353 ruu Lanas Lobu Malaysia <10K
354 tiu Adasen Philippines <10K
355 umn Paungnyuan Naga Myanmar <10K
356 lhh Laha Indonesia <10K
357 bjx Vanaw Kalinga Philippines <10K
358 bvt Bati Indonesia <10K
359 kqv Okolod Indonesia, Malaysia <10K
360 xkk Kachok Cambodia <10K
361 iwk I-wak Philippines <10K
362 lka Lakalei East Timor <10K
363 bzn Boano Indonesia <10K
364 sbr Sembakung Murut Indonesia, Malaysia <10K
365 bfg Busang Kayan Indonesia <10K
366 hap Hupla Indonesia <10K
367 kxi Keningau Murut Malaysia <10K
368 llq Lolak Indonesia <10K
369 roc Cacgia Roglai Vietnam <10K
370 sls Singapore Sign Language Singapore <10K
371 ste Liana-Seti Indonesia <10K
372 ulu Uma’ Lung Indonesia <10K
373 wli Waioli Indonesia <10K
374 wrx Wae Rana Indonesia <10K
375 xhv Khua Laos, Vietnam <10K
376 tdy Tadyawan Philippines <10K
377 zbt Batui Indonesia <10K
378 sws Seluwasan Indonesia <10K
379 pni Aoheng Indonesia <10K
380 tuj Tugutil Indonesia <10K
381 nps Nipsan Indonesia <10K
382 uan Kuan Laos <10K
383 vbk Southwestern Bontok Philippines <10K
384 dmv Dumpas Malaysia <10K
385 xko Kiorr Laos <10K
386 kve Kalabakan Murut Malaysia <10K
387 mcm Malaccan Portuguese Creole Malaysia <10K
388 ltu Latu Indonesia <10K
389 gef Gerai Indonesia <10K
390 cnc Côông Vietnam <10K
391 bpo Anasi Indonesia <10K
392 hld Halang Doan Laos, Vietnam <10K
393 nxk Kokak Naga Myanmar <10K
394 puj Punan Tubu Indonesia <10K
395 xkn Kayan River Kayan Indonesia <10K
396 ycp Chepya Laos <10K
397 lcs Lisabata-Nuniali Indonesia <10K
398 haf Haiphong Sign Language Vietnam <10K
399 slt Sila Laos, Vietnam <10K
Table 43: (5/6) SEA indigenous languages with <10K speakers.
No. ISO 639-3 Language Region(s) Population
Not in SEACrowd
400 kvh Komodo Indonesia <10K
401 apf Pahanan Agta Philippines <10K
402 bzb Andio Indonesia <10K
403 jal Yalahatan Indonesia <10K
404 mvr Marau Indonesia <10K
405 agz Mt. Iriga Agta Philippines <10K
406 dkk Dakka Indonesia <10K
407 gak Gamkonora Indonesia <10K
408 kmd Majukayang Kalinga Philippines <10K
409 mqp Manipa Indonesia <10K
410 pzn Jejara Naga Myanmar <10K
411 xkd Mendalam Kayan Indonesia <10K
412 xay Kayan Mahakam Indonesia <10K
413 xky Uma’ Lasan Indonesia, Malaysia <10K
414 mqq Minokok Malaysia <10K
415 neo Ná-Meo Vietnam <10K
416 tln Talondo’ Indonesia <10K
417 bqy Kata Kolok Indonesia <10K
418 mxr Murik Malaysia <10K
419 nty Mantsi Vietnam <10K
420 tev Teor Indonesia <10K
421 ttp Tombelala Indonesia <10K
422 ayt Magbukun Ayta Philippines <10K
423 ckn Kaang Chin Myanmar <10K
424 cno Con Laos <10K
425 goq Gorap Indonesia <10K
426 hov Hovongan Indonesia <10K
427 lpn Long Phuri Naga Myanmar <10K
428 nlq Lao Naga Myanmar <10K
429 nqy Akyaung Ari Naga Myanmar <10K
430 nuo Ngoaun Laos, Vietnam <10K
431 psg Penang Sign Language Malaysia <10K
432 ues Kioko Indonesia <10K
Table 44: (6/6) SEA indigenous languages with <10K speakers.
No. ISO 639-3 Language Region(s) Population
In SEACrowd
1 sow Sowanda Indonesia <1K
2 duv Duvle Indonesia <1K
3 hmu Hamap Indonesia <1K
4 ktt Ketum Indonesia <1K
5 mpz Mpi Thailand <1K
6 tvw Sedoa Indonesia <1K
7 syo Su’ung Cambodia <1K
8 mgk Mawes Indonesia <1K
9 mss West Masela Indonesia <1K
10 dij Dai Indonesia <1K
11 drn West Damar Indonesia <1K
12 lji Laiyolo Indonesia <1K
13 mth Munggui Indonesia <1K
14 psn Panasuan Indonesia <1K
15 ret Reta Indonesia <1K
16 twg Tereweng Indonesia <1K
17 bpg Bonggo Indonesia <1K
18 agt Central Cagayan Agta Philippines <1K
19 kvz Tsaukambo Indonesia <1K
20 skp Sekapan Malaysia <1K
21 bsm Busami Indonesia <1K
22 bzi Bisu Thailand <1K
23 kzm Kais Indonesia <1K
24 mhz Mor Indonesia <1K
25 nkj Nakai Indonesia <1K
26 pru Puragi Indonesia <1K
27 skv Skou Indonesia <1K
28 laq Qabiao Vietnam <1K
29 ssm Semnam Malaysia <1K
30 slg Selungai Murut Indonesia, Malaysia <1K
31 tpf Tarpia Indonesia <1K
32 vto Vitou Indonesia <1K
33 wsa Warembori Indonesia <1K
34 dgc Casiguran Dumagat Agta Philippines <1K
35 bfe Betaf Indonesia <1K
36 kgb Kawe Indonesia <1K
37 kwh Kowiai Indonesia <1K
38 ppm Papuma Indonesia <1K
39 tdi Tomadino Indonesia <1K
40 tmu Iau Indonesia <1K
41 uka Kaburi Indonesia <1K
42 bkn Bukitan Indonesia, Malaysia <1K
43 imr Imroing Indonesia <1K
44 tgq Tring Malaysia <1K
45 tlk Taloki Indonesia <1K
46 ert Eritai Indonesia <1K
47 lpe Lepki Indonesia <1K
48 vme East Masela Indonesia <1K
49 mxz Central Masela Indonesia <1K
50 aos Taikat Indonesia <1K
51 cog Chong Thailand <1K
52 dpp Papar Malaysia <1K
53 jet Manem Indonesia <1K
54 kag Kajaman Malaysia <1K
55 kgi Selangor Sign Language Malaysia <1K
56 kly Kalao Indonesia <1K
57 knd Konda Indonesia <1K
58 kuc Kwinsu Indonesia <1K
59 lvi Lavi Laos <1K
60 nbn Kuri Indonesia <1K
61 ner Yahadian Indonesia <1K
62 oni Onin Indonesia <1K
63 orz Ormu Indonesia <1K
64 pkt Maleng Laos, Vietnam <1K
65 rth Ratahan Indonesia <1K
66 sbt Kimki Indonesia <1K
67 tcm Tanahmerah Indonesia <1K
68 trt Tunggare Indonesia <1K
69 wtw Wotu Indonesia <1K
70 xkq Koroni Indonesia <1K
71 cwg Cheq Wong Malaysia <1K
72 bpp Kaure Indonesia <1K
73 isd Isnag Philippines <1K
74 pna Punan Bah-Biau Malaysia <1K
75 skz Sekar Indonesia <1K
76 thm Aheu Thailand <1K
77 toy Topoiyo Indonesia <1K
78 dbe Dabe Indonesia <1K
79 bvk Bukat Indonesia <1K
80 dei Demisa Indonesia <1K
Table 45: (1/3) SEA indigenous languages with <1K speakers.
No. ISO 639-3 Language Region(s) Population
In SEACrowd
81 jel Yelmek Indonesia <1K
82 nun Anong Myanmar <1K
83 opk Kopkaka Indonesia <1K
84 pas Papasena Indonesia <1K
85 tmj Samarokena Indonesia <1K
86 urn Uruangnirin Indonesia <1K
87 xau Kauwera Indonesia <1K
88 kdy Keijar Indonesia <1K
89 auu Auye Indonesia <1K
90 auw Awyi Indonesia <1K
91 flh Foau Indonesia <1K
92 gop Yeretuar Indonesia <1K
93 jau Yaur Indonesia <1K
94 lhn Lahanan Malaysia <1K
95 pee Taje Indonesia <1K
96 phq Phana’ Laos <1K
97 tnz Ten’edn Malaysia, Thailand <1K
98 wru Waru Indonesia <1K
99 sve Serili Indonesia <1K
100 bgv Warkay-Bipim Indonesia <1K
101 bhc Biga Indonesia <1K
102 bqb Bagusa Indonesia <1K
103 bsa Abinomn Indonesia <1K
104 ccm Malaccan Malay Creole Malaysia <1K
105 giq Green Gelao Vietnam <1K
106 kja Mlap Indonesia <1K
107 kzv Komyandaret Indonesia <1K
108 mrf Elseng Indonesia <1K
109 swr Saweru Indonesia <1K
110 tad Tause Indonesia <1K
111 tbp Diebroud Indonesia <1K
112 tmo Temoq Malaysia <1K
113 tyh O’du Laos, Vietnam <1K
114 wuy Wauyai Indonesia <1K
115 xwr Kwerba Mamberamo Indonesia <1K
116 rmh Murkim Indonesia <1K
117 tml Tamnim Citak Indonesia <1K
118 wet Perai Indonesia <1K
119 bqq Biritai Indonesia <1K
120 brs Baras Indonesia <1K
121 bzu Burmeso Indonesia <1K
122 emw Emplawas Indonesia <1K
123 kiq Kosare Indonesia <1K
124 kiy Kirikiri Indonesia <1K
125 kns Kensiu Malaysia, Thailand <1K
126 lcc Legenyem Indonesia <1K
127 mso Mombum Indonesia <1K
128 mvx Meoswar Indonesia <1K
129 sao Sause Indonesia <1K
130 snu Viid Indonesia <1K
131 tlg Tofanma Indonesia <1K
132 kgv Karas Indonesia <1K
133 lnh Lanoh Malaysia <1K
134 asz As Indonesia <1K
135 kbi Kaptiau Indonesia <1K
136 msl Molof Indonesia <1K
137 wfg Zorop Indonesia <1K
138 dmu Tebi Indonesia <1K
139 llk Lelak Malaysia <1K
140 tcq Kaiy Indonesia <1K
141 aqn Northern Alta Philippines <1K
142 bnv Beneraf Indonesia <1K
143 enc En Vietnam <1K
144 erw Erokwanas Indonesia <1K
145 jbr Jofotek-Bromnya Indonesia <1K
146 khh Kehu Indonesia <1K
147 khp Kapauri Indonesia <1K
148 kxn Kanowit-Tanjong Melanau Malaysia <1K
149 mmb Momina Indonesia <1K
150 nec Nedebang Indonesia <1K
151 nyl Nyeu Thailand <1K
152 rac Rasawa Indonesia <1K
153 tnu Tai Khang Laos <1K
154 wai Wares Indonesia <1K
155 yki Yoke Indonesia <1K
156 bed Bedoanas Indonesia <1K
157 mzt Mintil Malaysia <1K
158 agf Arguni Indonesia <1K
159 apx Aputai Indonesia <1K
160 kcd Ngkâlmpw Kanum Indonesia <1K
Table 46: (2/3) SEA indigenous languages with <1K speakers.
No. ISO 639-3 Language Region(s) Population
In SEACrowd
161 ugo Ugong Thailand <1K
162 wbe Waritai Indonesia <1K
163 mra Mlabri Laos, Thailand <1K
164 afz Obokuitai Indonesia <1K
165 mgf Maklew Indonesia <1K
166 ttn Towei Indonesia <1K
167 knq Kintaq Malaysia <1K
168 ulf Usku Indonesia <1K
169 awh Awbono Indonesia <1K
170 bti Burate Indonesia <1K
171 byl Bayono Indonesia <1K
172 diy Diuwe Indonesia <1K
173 kpi Kofei Indonesia <1K
174 krz Sota Kanum Indonesia <1K
175 kwr Kwer Indonesia <1K
176 tfo Tefaro Indonesia <1K
177 tkx Tangko Indonesia <1K
178 tti Tobati Indonesia <1K
Not in SEACrowd
179 lcd Lola Indonesia <1K
180 ors Orang Seletar Malaysia <1K
181 kpd Koba Indonesia <1K
182 trx Tringgus-Sembaan Bidayuh Malaysia <1K
183 kqt Klias River Kadazan Malaysia <1K
184 atp Pudtol Atta Philippines <1K
185 tcp Tawr Chin Myanmar <1K
186 kyd Karey Indonesia <1K
187 pyy Pyen Myanmar <1K
188 ttw Long Wat Malaysia <1K
189 xmx Salawati Indonesia <1K
190 ymn Sunum Indonesia <1K
191 wkd Mo Indonesia <1K
192 abf Abai Sungai Malaysia <1K
193 esy Eskayan Philippines <1K
194 kzb Kaibobo Indonesia <1K
195 njs Nisa Indonesia <1K
196 nni North Nuaulu Indonesia <1K
197 whu Wahau Kayan Indonesia <1K
198 xke Kereho Indonesia <1K
199 lce Sekak Indonesia <1K
200 sdx Sibu Melanau Malaysia <1K
201 bfk Ban Khor Sign Language Thailand <1K
202 kax Kao Indonesia <1K
203 srk Serudung Murut Malaysia <1K
204 pud Punan Aput Indonesia <1K
205 bgy Benggoi Indonesia <1K
206 kzd Kadai Indonesia <1K
207 kvp Kompane Indonesia <1K
208 auq Anus Indonesia <1K
209 azt Faire Atta Philippines <1K
210 hud Huaulu Indonesia <1K
211 lgh Laghuu Vietnam <1K
212 tip Trimuris Indonesia <1K
213 tyj Tai Yo Laos, Vietnam <1K
214 tys Tày Sa Pa Vietnam <1K
215 mqi Mariri Indonesia <1K
216 pdn Fedan Indonesia <1K
217 mnq Minriq Malaysia <1K
218 daz Dao Indonesia <1K
219 gnq Gana Malaysia <1K
220 lrn Lorang Indonesia <1K
221 bsu Bahonsuai Indonesia <1K
222 puc Punan Merap Indonesia <1K
223 rmx Romam Vietnam <1K
224 tyl Thu Lao Vietnam <1K
225 yrs Yarsun Indonesia <1K
226 atl Mt. Iraya Agta Philippines <1K
227 puf Punan Merah Indonesia <1K
228 umi Ukit Malaysia <1K
229 jvd Javindo Indonesia <1K
230 srt Sauri Indonesia <1K
Table 47: (3/3) SEA indigenous languages with <1K speakers.
No. ISO 639-3 Language Region(s) Population
In SEACrowd
1 mnu Mer Indonesia <100
2 itx Itik Indonesia <100
3 kxq Smärky Kanum Indonesia <100
4 lix Liabuku Indonesia <100
5 awr Awera Indonesia <100
6 bdx Budong-Budong Indonesia <100
7 ire Yeresiam Indonesia <100
8 tds Doutai Indonesia <100
9 mrx Dineor Indonesia <100
10 amq Amahai Indonesia <100
11 kzu Kayupulau Indonesia <100
12 mok Morori Indonesia <100
13 plh Paulohi Indonesia <100
14 sgu Salas Indonesia <100
15 aip Burumakok Indonesia <100
16 dbn Duriankere Indonesia <100
17 dul Inagta Alabat Philippines <100
18 moq Mor Indonesia <100
19 naa Namla Indonesia <100
20 mvs Massep Indonesia <100
21 aem Arem Laos, Vietnam <100
22 mqr Mander Indonesia <100
23 xkw Kembra Indonesia <100
24 kkb Kwerisa Indonesia <100
25 atz Arta Philippines <100
26 ibh Bih Vietnam <100
27 khd Bädi Kanum Indonesia <100
28 nul Nusa Laut Indonesia <100
29 scq Chung Cambodia <100
30 mqt Mok Myanmar, Thailand <10
31 btj Bacanese Malay Indonesia <10
32 wor Woria Indonesia <10
33 spi Saponi Indonesia <10
34 dsn Dusner Indonesia <10
35 lgi Lengilu Indonesia <10
36 btn Ratagnon Philippines <10
37 tni Tandia Indonesia <10
38 huw Hukumina Indonesia <10
39 kzl Kayeli Indonesia <10
40 sxm Samre Cambodia, Thailand <10
41 hpo Hpon Myanmar <10
42 mpy Mapia Indonesia <10
43 nil Nila Indonesia <10
44 sbo Sabüm Malaysia <10
45 srw Serua Indonesia <10
46 tas Tay Boi Vietnam <10
47 xbn Kenaboi Malaysia <10
48 xxt Tambora Indonesia <10
Not in SEACrowd
49 orn Orang Kanaq Malaysia <100
50 lva Makuva East Timor <100
51 spg Sihan Malaysia <100
52 ibu Ibu Indonesia <100
53 pnm Punan Batu Malaysia <100
54 csd Chiangmai Sign Language Thailand <100
55 ays Sorsogon Ayta Philippines <100
56 lio Liki Indonesia <100
57 pey Petjo Indonesia <100
58 hti Hoti Indonesia <100
59 huk Hulung Indonesia <100
60 ism Masimasi Indonesia <100
61 kzx Kamarian Indonesia <100
62 pns Ponosakan Indonesia <100
63 agk Katubung Agta Philippines <10
64 nae Naka’ela Indonesia <10
65 atm Ata Philippines <10
66 ihb Iha Based Pidgin Indonesia <10
67 tvy Timor Pidgin East Timor <10
68 duy Dicamay Agta Philippines <10
69 dyg Villa Viciosa Agta Philippines <10
70 lox Loun Indonesia <10
71 onx Onin Based Pidgin Indonesia <10
72 tcl Taman Myanmar <10
73 vms Moksela Indonesia <10
74 wea Wewaw Myanmar <10
Table 48: SEA indigenous languages with <100 speakers.