SEACrowd: A Multilingual Multimodal Data Hub
and Benchmark Suite for Southeast Asian Languages

Holy Lovenia^★,1,2 Rahmad Mahendra^★,3,2 Salsabil Maulana Akbar^★,2 Lester James V. Miranda^★,4
Jennifer Santoso^★,5 Elyanah Aco^★,6 Akhdan Fadhilah^★,7 Jonibek Mansurov^★,8 Joseph Marvin Imperial^★,9,10
Onno P. Kampman^★,11 Joel Ruben Antony Moniz^★,6 Muhammad Ravi Shulthan Habibi^★,3,2 Frederikus Hudi^★,12,13
Railey Montalan^★,1 Ryan Ignatius⁶ Joanito Agili Lopo¹⁴ William Nixon¹⁵ Börje F. Karlsson¹⁶ James Jaya⁶
Ryandito Diandaru⁶ Yuze Gao⁶ Patrick Amadeus¹⁵ Bin Wang⁶ Jan Christian Blaise Cruz¹⁷ Chenxi Whitehouse¹⁸
Ivan Halim Parmonangan¹⁹ Maria Khelli¹⁵ Wenyu Zhang⁶ Lucky Susanto²⁰ Reynard Adha Ryanda²¹
Sonny Lazuardi Hermawan²² Dan John Velasco¹⁷ Muhammad Dehan Al Kautsar¹⁵ Willy Fitra Hendria⁶
Yasmin Moslem²³ Noah Flynn²⁴ Muhammad Farid Adilazuarda⁸ Haochen Li⁶ Johanes Lee¹⁵ R. Damanhuri²⁵
Shuo Sun⁶ Muhammad Reza Qorib²⁶ Amirbek Djanibekov⁸ Wei Qi Leong¹ Quyet V. Do²⁷ Niklas Muennighoff²⁸
Tanrada Pansuwan¹⁸ Ilham Firdausi Putra⁶ Yan Xu^29,27 Ngee Chia Tai¹ Ayu Purwarianti^6,30
Sebastian Ruder³¹ William Tjhi¹ Peerat Limkonchotiwat^★,32 Alham Fikri Aji^★,8 Sedrick Keh^★,33
Genta Indra Winata^★,2 Ruochen Zhang^★,34 Fajri Koto^★,8,2 Zheng-Xin Yong^★,34 Samuel Cahyawijaya^★,27,2
¹AI Singapore ²IndoNLP ³Universitas Indonesia ⁴Allen Institute for Artificial Intelligence ⁵RevComm, Inc.
⁶Independent Researcher ⁷Tohoku University ⁸MBZUAI ⁹University of Bath ¹⁰National University Philippines
¹¹MOH Office for Healthcare Transformation (MOHT) ¹²NAIST ¹³Works Applications Lab ¹⁴Universitas Gadjah Mada
¹⁵Institut Teknologi Bandung ¹⁶Beijing Academy of Artificial Intelligence (BAAI) ¹⁷Samsung Research Philippines
¹⁸University of Cambridge ¹⁹Queensland University of Technology ²⁰Monash University Indonesia ²¹Imperial College London
²²Independent Design Engineer ²³Bering Lab ²⁴Amazon ²⁵Universitas Diponegoro ²⁶NUS ²⁷HKUST ²⁸Contextual AI
²⁹Huawei Noah’s Ark Lab ³⁰Prosa.ai ³¹Cohere ³²VISTEC ³³Toyota Research Institute ³⁴Brown University
^★Major contributors

Abstract

Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub (https://seacrowd.github.io/seacrowd-catalogue/) that fills the resource gap by providing standardized corpora (https://github.com/SEACrowd/seacrowd-datahub/) in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.


1 Introduction

Despite the Southeast Asia (SEA) region being home to 1,300 indigenous languages (18% of the world’s languages) and 671 million people (8.75% of the world’s population), the representation of texts, images, and audio datasets from this region is significantly lacking in machine learning models. This deficiency negatively impacts the model quality for SEA languages. The coverage of SEA languages in two common pre-training resources, Common Crawl (https://commoncrawl.github.io/cc-crawl-statistics/plots/languages) and C4 Xue et al. (2021), is extremely limited, with only 2.36% (in 11 languages) and 10.62% (in 11 languages), respectively. In modalities beyond text, the representation is even more limited. For instance, Common Voice, one of the largest multilingual speech corpora, includes only 6 SEA indigenous languages Conneau et al. (2021); Ardila et al. (2020), and LAION-5B, one of the largest multilingual vision-language (VL) corpora, includes only 12 SEA indigenous languages Schuhmann et al. (2022). While datasets for other SEA indigenous languages may exist, they are often scattered, insufficiently documented, or varied in quality and formatting, thereby making access and usage challenging Cahyawijaya et al. (2023a); Joshi et al. (2020); Aji et al. (2023).

In terms of evaluation, the sparse availability of high-quality test sets for these languages also complicates evaluating models for SEA languages. Despite there being 1,300+ languages in the SEA region, prior works Winata et al. (2023); Cahyawijaya et al. (2021); Koto and Koto (2020); Zhang et al. (2024); Wang et al. (2024); Nguyen et al. (2023); Leong et al. (2023); Yong et al. (2023) have only evaluated fewer than 10 SEA languages collectively. The actual performance of current models on most SEA languages remains largely unknown.

Moreover, the dominance of Anglocentric training data potentially results in cultural bias when generating texts, images, or audio in underrepresented SEA languages Søgaard (2022); Talat et al. (2022). Further, Durmus et al. (2023); AlKhamissi et al. (2024); Cahyawijaya et al. (2024a) have shown that the learned representations in large language models (LLMs) often fail to reflect local cultural values in SEA Koto et al. (2024); Liu et al. (2024); Adilazuarda et al. (2024). This raises concerns about the ability of current LLMs to generate natural, high-quality texts for this region. In addition, the discrepancy in language support creates language barriers in technological access and risks marginalizing minority groups who do not speak the dominant language.

Figure 1: Mapping between tasks, schemas, modalities, and language regions across 498 datasheets in SEACrowd.

In this work, we investigate the current AI progress for SEA languages by addressing the challenges of resources, evaluation, and generation quality. Our contributions are three-fold:

• We bridge the resource gap by centralizing and standardizing $\sim$ 500 corpora in nearly 1,000 SEA languages in SEACrowd, a comprehensive and standardized resource center, across 3 modalities: text, image, and audio.

• We close the evaluation gap in SEA languages with the SEACrowd Benchmarks, which cover 38 SEA indigenous languages on 13 tasks across 3 modalities, providing insights into the performance of a diverse spectrum of AI models. Further, our study reveals that the generative outputs of existing LLMs resemble translationese more closely than natural data in 9 SEA languages.

• We offer insights and strategies for the future development of AI in SEA, aiming to promote a sustainable and prosperous future through continuous AI advancements.

2 SEACrowd

SEACrowd represents the first comprehensive AI dataset collection initiative for SEA, developed through a collaborative effort among researchers and engineers primarily based in the SEA region. As addressed in §1, resource scarcity and the scattered nature of the data are crucial challenges in SEA. SEACrowd addresses these issues through two primary contributions: 1) consolidating datasheets to enhance data discoverability; and 2) standardizing dataloaders for easier use, especially in multiple dataset loading. We also follow data provenance practices (Longpre et al., 2023) to preserve the proprietary rights of dataset owners.

Consolidating datasheets

We invited contributors to submit datasheet forms Gebru et al. (2021) for publicly available datasets across all modalities including text, audio, and image in SEA languages and/or cultures. These datasheets include detailed information about each dataset, such as data subset(s), description, task, language, license, URL access, annotation method(s), annotation validation, relevant publications, publication venue, and data splits. For each submission, we manually verify and correct it as necessary to ensure datasheet accuracy.

Standardizing dataloaders

For each approved datasheet, we created a standardized dataloader wrapper to facilitate ready-to-use data access since only 38.4% of the consolidated data sources were originally hosted on Hugging Face (https://huggingface.co/). To support diverse task types, we carefully designed the standardized seacrowd schema to support different data structures and modalities (see Appendix F). We also adhere to data provenance practices (Longpre et al., 2023) and document the relevant metadata (e.g., license) in the dataloaders. Furthermore, we engaged with data owners and successfully converted three private datasets into public ones.
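To make the idea of a standardized schema concrete, the sketch below maps two toy sources with different column names and label encodings onto one shared record layout. It is only an illustration: the `TextExample` class, the `standardize` helper, and the toy rows are hypothetical and are not the seacrowd library's actual schema or API, which is described in Appendix F.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class TextExample:
    """Hypothetical unified record for single-label text classification."""
    id: str
    text: str
    label: str


def standardize(
    rows: Iterable[Dict[str, Any]],
    text_key: str,
    label_key: str,
    label_map: Callable[[Any], str],
) -> Iterator[TextExample]:
    """Map source-specific rows onto the shared (id, text, label) layout."""
    for index, row in enumerate(rows):
        yield TextExample(
            id=str(index),
            text=row[text_key],
            label=label_map(row[label_key]),
        )


# Two toy sources (invented for this sketch) with different conventions.
source_a = [{"sentence": "Filmnya bagus sekali", "sentiment": 1}]
source_b = [{"review": "Makanannya tidak enak", "polarity": "neg"}]

unified: List[TextExample] = []
unified += standardize(
    source_a, "sentence", "sentiment", {0: "negative", 1: "positive"}.__getitem__
)
unified += standardize(
    source_b, "review", "polarity", {"neg": "negative", "pos": "positive"}.__getitem__
)
```

Once every source is expressed in the same layout, downstream code can iterate over `unified` without knowing where each record came from, which is the practical benefit when loading many datasets at once.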

These efforts have culminated in 498 datasheets in SEACrowd Catalogue and 399 dataloaders in SEACrowd Data Hub (§2.1). Notably, our centralized data repository covers $\sim$ 1,000 SEA languages, underscoring the extensive linguistic diversity captured by SEACrowd. We elaborate on the SEACrowd dataset statistics in §2.2. SEACrowd’s contribution guidelines, progression details, and reviewing procedure are in Appendix C, D, and E.

2.1 SEACrowd Catalogue & Data Hub

SEACrowd comprises two interconnected platforms: SEACrowd Catalogue (also available in CSV format) and SEACrowd Data Hub. These platforms work in tandem to consolidate the datasheet submissions and provide a standardized pipeline for SEACrowd. Specifically, Catalogue houses the datasheets (metadata), while Data Hub stores the standardized dataloaders and the seacrowd library (all code is available under Apache License 2.0) for the schemas and configurations (Appendix F). These systems share information on the datasheets and dataloaders, allowing users to seamlessly explore and utilize them.

2.2 Datasets in SEACrowd

SEACrowd consolidates 498 datasheets with diverse tasks in SEA languages and provides standardized access through dataloaders to 399 of them. As shown in Figure 1, approximately 81% of the datasets in SEACrowd are textual data, with the remaining $\sim$ 8% and $\sim$ 11% being VL and speech, respectively. The complete list of SEA indigenous languages covered by SEACrowd and their mapping to the relevant SEA regions are provided in Appendix K. About 53% of the datasets have a commercially permissive license.

A total of 83 tasks are provided in SEACrowd, with a breakdown of 66 in NLP (e.g., abusive language detection, intent classification, instruction tuning, and named entity recognition), 10 in VL (e.g., image-to-text generation, sign language recognition, and video captioning), and 7 in speech (e.g., automatic speech recognition, text-to-speech, and speech emotion recognition). These tasks are then standardized into 20 dataloader schemas described in Appendix F. Further discussion regarding resources in SEACrowd is in §5.1.

3 SEACrowd Benchmarks

To understand the capability of state-of-the-art models, we conduct comprehensive evaluations of existing LLMs, VLMs, and speech models from various architectures and training approaches. To construct a benchmark suite (https://github.com/SEACrowd/seacrowd-experiments), we select, from the data presented in §2.2, a subset of datasets that have been manually annotated and/or validated. More details regarding the data subsets, baselines, and prompts used for the evaluations are given in Appendix G.1, G.2, and G.3.

3.1 Datasets

NLP

Our natural language understanding (NLU) benchmark consists of 131 data subsets and 7 tasks: sentiment analysis, topic classification, natural language inference (NLI), commonsense reasoning, exam-style multiple-choice question answering (QA), culture understanding, and reading comprehension. It covers English (eng) and 33 SEA indigenous languages.

We utilize 100 data subsets for the natural language generation (NLG) benchmark, which covers machine translation (MT) between English and SEA languages from both directions, summarization, as well as extractive or abstractive question answering, covering 27 SEA indigenous languages.

Speech

We employ 19 automatic speech recognition (ASR) data subsets to evaluate the capability of speech models in 15 SEA indigenous languages.

VL

We assess the models on image captioning using four data subsets in 4 SEA indigenous languages, i.e., Filipino (fil), Indonesian (ind), Thai (tha), and Vietnamese (vie). This disparity in the evaluation scale is due to the fact that only a few datasets in SEACrowd are VL datasets, and even fewer are annotated by humans.

3.2 Baselines

Complete details regarding the model architectures, model sizes, seen languages, corresponding publications, and other aspects are in Appendix G.2.

NLP

To evaluate the zero-shot performance of instruction-tuned LLMs on SEA languages, we benchmark two commercial, i.e., GPT-4 OpenAI et al. (2024) and Command-R (https://docs.cohere.com/docs/command-r), and 17 open-source baselines, the majority of which are $\sim$ 7B-13B parameters. We categorize the open-source baselines according to the language(s) coverage in pre-training and/or instruction tuning, i.e., 1) English: Llama3 Touvron et al. (2023), Mistral Jiang et al. (2023), and Falcon Almazrouei et al. (2023); 2) Multilingual: AYA-101, AYA-23 Üstün et al. (2024), mT0, BLOOMZ Muennighoff et al. (2022), and BactrianX-Llama Li et al. (2023a); 3) SEA regional: SEA-LION Singapore (2023), Sailor Dou et al. (2024), and SeaLLM Nguyen et al. (2023); and 4) SEA country-specific: Cendol-mT5, Cendol-Llama2 Cahyawijaya et al. (2024b), and Merak Ichsan (2023) from Indonesia, WangchanX-Llama3 Phatthiyaphaibun et al. (2024) from Thailand, and Malaysian-Llama3 (https://huggingface.co/mesolitica/malaysian-llama-3-8b-instruct-16k) from Malaysia.

Speech

We evaluate the zero-shot performance of state-of-the-art multilingual pre-trained speech models in transcribing speech in SEA languages. Specifically, we consider Whisper v3 Radford et al. (2023), MMS 1B Pratap et al. (2024), and Seamless M4T v2 Communication et al. (2023), which have shown proficiency in accurately transcribing multiple languages without fine-tuning. Additionally, we include models that are fine-tuned on specific language(s), SEA or English, based on 1) Wav2Vec2 XLSR Conneau et al. (2021) and 2) XLS-R Babu et al. (2021), known for their cross-lingual speech representation learning by pre-training on raw speech waveforms across diverse languages, with XLS-R offering broader language coverage, and 3) Whisper, which leverages weakly supervised pre-training on spectrograms of speech in diverse languages. The fine-tuned models evaluated are: XLSR on ind, jav, sun; XLSR and Whisper on Indonesian (ind); XLSR and Whisper on Thai (tha); XLS-R on Tagalog (tgl); XLS-R on Burmese (mya); XLS-R and Whisper on Khmer (khm); and XLSR on English (eng). See Appendix G.2 for details.

VL

We consider state-of-the-art VLMs primarily trained on English pre-training and instruction-following data: LLaVA Liu et al. (2023b, a), InstructBLIP Dai et al. (2024), and Idefics2 Laurençon et al. (2024), and VLMs trained in a multilingual manner: mBLIP Geigle et al. (2023) and PaliGemma Gemma Team et al. (2024), to assess their image captioning ability in SEA languages.

3.3 Experimental Settings

We conduct all evaluations in a zero-shot fashion. We employ 3 prompt templates in English for each NLU task and 1 for each NLG task. We utilize the weighted F1 score to measure the model performance on NLU tasks and n-gram reference-based metrics, i.e., chrF++ Popović (2015, 2017) and ROUGE-L Lin (2004), on NLG tasks. As for VL, aside from a prompt template in English, we also use a prompt template in the respective SEA indigenous language per data subset. We report CIDEr Vedantam et al. (2015) for the image captioning task. For ASR, we use word error rate (WER) for languages with Latin script and character error rate (CER) for those with non-Latin script.
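The script-dependent choice between WER and CER for ASR can be sketched as follows. This is a minimal, dependency-free illustration rather than the evaluation code used for the benchmark: the `LATIN_SCRIPT` set, the whitespace tokenization, and the function names are assumptions made for this example.

```python
from typing import List, Sequence

# Assumption for illustration: language codes treated as Latin-script.
LATIN_SCRIPT = {"eng", "ind", "jav", "sun", "zlm", "fil", "tgl", "vie"}


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (single-row DP)."""
    row = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, hyp_token in enumerate(hyp, start=1):
            current = min(
                row[j] + 1,                           # deletion
                row[j - 1] + 1,                       # insertion
                prev_diag + (ref_token != hyp_token)  # substitution or match
            )
            prev_diag, row[j] = row[j], current
    return row[-1]


def tokenize(text: str, lang: str) -> List[str]:
    """Words for Latin-script languages (WER), characters otherwise (CER)."""
    return text.split() if lang in LATIN_SCRIPT else list(text)


def error_rate(reference: str, hypothesis: str, lang: str) -> float:
    """Edit distance normalized by reference length, as WER or CER."""
    ref_tokens = tokenize(reference, lang)
    hyp_tokens = tokenize(hypothesis, lang)
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


wer_ind = error_rate("saya makan nasi", "saya makan roti", "ind")  # word level
cer_tha = error_rate("กขค", "กขง", "tha")                          # character level
```

Character-level scoring avoids depending on word segmentation, which is not marked by whitespace in scripts such as Thai, Khmer, Lao, and Burmese.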

4 Result & Analysis

4.1 State-of-the-Art Models on SEA languages

LLMs

Figure 2(a) and 2(b) illustrate the overall model performance of the LLM baselines in SEA languages for both NLU tasks and NLG tasks. In our NLU evaluation, AYA-101, a large multilingual instruction-tuned language model covering 101 languages, demonstrates the best zero-shot performance. It is followed by the commercial baselines, which achieve a median of $\sim$ 0.6 weighted F1-score. Sailor and SeaLLM, models specifically trained with SEA languages, also display competitive performance. Similarly, mT0 exhibits strong generalization abilities due to its exposure to $\sim$ 100 languages in pre-training, including those from the SEA region Muennighoff et al. (2022). In contrast, most English and SEA country-specific baselines perform less effectively, likely due to their narrow focus on English or a limited set of SEA languages, such as Indonesian languages for Cendol and Thai for WangchanX-Llama3. Similar and consistent trends are observed on the MT task, while the baselines’ poorer scores on abstractive/extractive QA and summarization indicate that they struggle to produce acceptable outputs in SEA languages for these tasks, a weakness that is especially pronounced in the open-source baselines. Appendix G.4 describes the performance of LLMs per language.

To analyze the equality in model performance across SEA languages, following Khanuja et al. (2023), we utilize the Gini coefficient—originally used to observe income equality Dorfman (1979)—weighted by demand and parameterized by $\tau$ . Here, $\tau=1$ corresponds to a demographic notion of demand, considering language population size, while $\tau=0$ does not take population size into account Blasi et al. (2022). Table 1 shows that models trained on more SEA languages, such as multilingual and SEA regional baselines, generally exhibit greater language equity. For instance, although Command-R and GPT-4 are competitive performance-wise against AYA-101 and mT0, AYA-101 and mT0 demonstrate higher equality across all SEA languages under study. This trend is consistent across different $\tau$ (see Appendix G.5).
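As a rough illustration of a demand-weighted Gini coefficient, the sketch below weights each language by its population raised to the power $\tau$ and applies the standard mean-absolute-difference form of the Gini coefficient. The exact formulation in Khanuja et al. (2023) may differ; the function names and the toy scores are assumptions made for this example.

```python
from typing import List, Sequence


def demand_weights(populations: Sequence[float], tau: float) -> List[float]:
    """Normalized demand: population ** tau (tau=0 gives uniform weights)."""
    raw = [population ** tau for population in populations]
    total = sum(raw)
    return [value / total for value in raw]


def weighted_gini(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted Gini: mean absolute difference divided by twice the mean."""
    mean = sum(w * s for w, s in zip(weights, scores))
    if mean == 0:
        return 0.0
    mean_abs_diff = sum(
        wi * wj * abs(si - sj)
        for wi, si in zip(weights, scores)
        for wj, sj in zip(weights, scores)
    )
    return mean_abs_diff / (2 * mean)


# Toy per-language scores and speaker counts (invented for illustration).
scores = [0.80, 0.60, 0.20]
populations = [200e6, 70e6, 1e6]

gini_linguistic = weighted_gini(scores, demand_weights(populations, tau=0))
gini_demographic = weighted_gini(scores, demand_weights(populations, tau=1))
```

A value of 0 means every language is served equally well and larger values mean performance is concentrated in fewer languages. With $\tau=1$, a low score on a small language barely moves the coefficient, whereas with $\tau=0$ it counts as much as any other language.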

Speech models

Figure 3 presents the off-the-shelf speech model performance on ASR across languages in SEA, measured by the error rate percentage. Nine of the 15 SEA languages in our speech evaluation belong to the Austronesian language family. Of the other six, khm and vie belong to Austro-Asiatic, cnh and mya to Sino-Tibetan, and tha and lao to the Kra-Dai language family. The multilingual pre-trained baselines have a competitive generalization capability across languages, although it varies by language. For instance, Whisper v3 demonstrates significantly higher effectiveness for national languages such as ind, zlm, fil, tha, and vie, while performing less well for other indigenous languages. Conversely, Seamless M4T v2 shows a more balanced performance across the languages. Regarding fine-tuned baselines, error rates decrease for their seen languages. The fine-tuned Whisper models, however, manage to better optimize for the target language while retaining their original capabilities in other SEA languages compared to their Wav2Vec2 XLSR and XLS-R counterparts, even though all of them were pre-trained in a multilingual manner. This observation aligns with the findings of Rouditchenko et al. (2023), who find that the number of hours seen per language and language family during pre-training is predictive of how the models compare; Whisper’s pre-training data duration for these four language families exceeds that of XLSR.

VLMs

Figure 4 depicts the zero-shot performance of off-the-shelf VLMs on image captioning in SEA indigenous languages. Despite the capability of LLMs for zero-shot cross-lingual generalization Huang et al. (2021); Täckström et al. (2012); Neubig and Hu (2018); Artetxe et al. (2020), VLMs trained only in English (i.e., InstructBLIP, LLaVA, and Idefics2) fail to exhibit this capability, struggling to generate adequate image captions in SEA languages. Multilingual VL pre-training is crucial to achieving aligned multilingual representations Burns et al. (2020); Li et al. (2023b); Huang et al. (2021). For instance, PaliGemma and mBLIP generate better image captions in tha and fil when prompted in the relevant SEA languages.

Model	Natural outputs
SEA-LION	58.57%
AYA-23	43.57%
Sailor	37.86%
Cendol-Llama2	37.37%
Malaysian Llama3	36.90%
WangchanX-Llama3	30.24%
Falcon	29.52%
BactrianX-Llama	28.10%
SeaLLM	27.38%
Merak	26.19%
BLOOMZ	25.00%
Cendol-MT5	24.05%
Command-R	20.95%
mT0-XL	19.76%
Mistral	19.52%
GPT-4	16.67%
Llama3	14.05%
AYA-101	8.33%

(a) Avg. by models

Language	Natural outputs
Indonesian (ind)	41.58%
Vietnamese (vie)	37.31%
Thai (tha)	34.21%
Khmer (khm)	29.21%
Lao (lao)	28.42%
Malay (zlm)	22.24%
Burmese (mya)	19.47%
Filipino (fil)	12.22%
English (eng)^†	8.95%

(b) Avg. by languages

Table 2: Current LLMs are still incapable of generating natural texts in SEA languages. ^†As spoken in SEA regions, not worldwide.

However, when prompted in eng, the performance of these multilingual baselines varies notably. PaliGemma’s performance collapses completely, while mBLIP’s performance shows both increases and decreases across different SEA languages. This raises the question of whether the multilingual VLMs can maintain consistent performance across different languages used in the instructions and the tasks. It highlights the need for further research into the mechanisms that drive these variations and how to achieve robust multilingual performance in VLMs across diverse linguistic contexts. Understanding these dynamics is crucial for improving VLMs’ generalization capabilities and ensuring equitable performance across all languages, despite most related works focusing on monolingual visual instruction tuning Liu et al. (2023b); Gong et al. (2023); Zhu et al. (2024).

4.2 Generation Quality in SEA Languages: Translationese vs. Natural Language

Classifying Translationese in SEA Languages

To analyze the generation quality of LLMs in SEA languages, we build a text classifier to discriminate between translationese and natural texts Riley et al. (2020). We construct a translationese classification training and testing dataset using 49 and 62 data subsets, respectively, covering approximately 39.9k and 51.5k sentences across English (eng) and 8 SEA languages: Indonesian (ind), Khmer (khm), Lao (lao), Burmese (mya), Filipino (fil), Thai (tha), Vietnamese (vie), and Malay (zlm). The training and test data are detailed in Appendix H.1.

We fine-tune a classifier from mDeBERTaV3 He et al. (2020, 2022) (https://huggingface.co/microsoft/mdeberta-v3-base) using these data and achieve 79.08% accuracy on the test set in predicting translationese across these 9 languages. The detailed results and ablation studies of our translationese classifier experiments are provided in Appendix H.2. This classifier enables us to assess the generation quality of LLMs by distinguishing between translationese and naturally occurring text, providing insights into the models’ performance in producing authentic language output.

Generation Quality of LLMs

We evaluate the generation quality of LLMs in 9 SEA languages by generating answers to natural, general, and safety questions from Sea-Bench Nguyen et al. (2023). As shown in Table 2(a), LLMs with extensive language coverage but less focus on SEA languages, e.g., AYA-101 Üstün et al. (2024), GPT-4 OpenAI et al. (2024), mT0 Muennighoff et al. (2023); Xue et al. (2021), and Llama3 AI@Meta (2024), tend to produce natural sentences less than 20% of the time. In contrast, models with narrower language coverage but a greater focus on SEA languages, such as Cendol-Llama2 Cahyawijaya et al. (2024b), Sailor Dou et al. (2024), AYA-23 Aryabumi et al. (2024), and SEA-LION Singapore (2023), generate natural sentences over 35% of the time.

However, even the LLM with the least translationese generation, SEA-LION, only produces natural SEA sentences 57.71% of the time, highlighting a significant quality gap in generating natural sentences in SEA languages. As displayed in Table 2(b), the translationese issue varies across SEA languages. Languages such as Filipino (fil), Burmese (mya), and Malay (zlm) have more severe translationese problems, with existing LLMs producing natural sentences only 11.58%, 19.47%, and 22.24% of the time, respectively. This underscores the need for further improvements in LLMs to more effectively address the linguistic diversity and complexity of SEA languages.
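The "natural outputs" percentages can be read as the share of generated sentences that the translationese classifier labels as natural, grouped by model or by language. The sketch below shows that aggregation on toy predictions; the record layout, the label strings, and the placeholder model names are assumptions for illustration and do not come from the released code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def natural_rate(predictions: Iterable[Dict[str, str]], key: str) -> Dict[str, float]:
    """Share of outputs labelled 'natural', grouped by `key`."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for prediction in predictions:
        natural_and_total = counts[prediction[key]]
        natural_and_total[0] += int(prediction["label"] == "natural")
        natural_and_total[1] += 1
    return {group: natural / total for group, (natural, total) in counts.items()}


# Toy classifier outputs (invented for illustration).
predictions = [
    {"model": "model_a", "lang": "ind", "label": "natural"},
    {"model": "model_a", "lang": "mya", "label": "translationese"},
    {"model": "model_b", "lang": "ind", "label": "translationese"},
    {"model": "model_b", "lang": "mya", "label": "translationese"},
]

by_model = natural_rate(predictions, key="model")
by_language = natural_rate(predictions, key="lang")
```

Because the classifier itself is imperfect (79.08% test accuracy, as reported above), rates computed this way are estimates rather than exact proportions.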

5 Discussions

5.1 Resource Gaps in SEA

Coverage

SEACrowd covers 980 out of the 1,308 languages spoken in SEA (74.9%). Despite this high coverage, language representation in SEACrowd exhibits a very long-tail distribution, with over 700 languages having only 1 or 2 datasets, and only 23 languages having 20 datasets or more. These less represented languages typically exist only in the form of lexicons Asgari et al. (2020); List et al. (2022) or unlabeled data Leong et al. (2022); Kudugunta et al. (2024); Nguyen et al. (2024). Existing tasks in SEACrowd still cover only a small portion of languages. For instance, sentiment analysis data is available for only 22 languages, and named entity recognition (NER) data is available for just 17 languages. Furthermore, for modalities beyond text, SEA resources are extremely underrepresented. Approximately 90% of SEA indigenous languages lack both speech and VL datasets.

Quality

78.7% of the datasets in SEACrowd are published in peer-reviewed venues, and most of the data has undergone external validation. The overall quality of the datasets in SEACrowd is depicted in Figure 5(b). We compile the reported data construction methods by the authors, considering both the data collection method (i.e., data source) and label annotation validation (i.e., quality control). Nearly 19% of the datasets in SEACrowd have machine-generated and machine-translated annotations, while more than 80% were obtained from online texts (e.g., web crawling) and expert generation. In terms of label annotation validation, 62.4% of the datasets have been fully manually checked, while the remaining portion is partially validated and automatically checked. Note that these statistics only provide an initial indication of dataset collection quality on the surface and do not necessarily reflect the exact quality. Only a few datasets (6%) in SEACrowd report their detailed quality metrics (e.g., inter-annotator agreement scores). A deeper investigation is required for future work.

Cultural Relevance

The resource gap in SEA extends to the cultural aspect, where misrepresentation can lead to offensive behaviors, e.g., cultural appropriation and stereotyping Evans et al. (2020); Glotov (2023). As a proxy of the cultural relevance of SEA datasets, we manually curated 259 data subsets used in SEACrowd evaluation based on their data source. Specifically, we categorize them by whether they are 1) translated from another language, 2) crawled from local sources, or 3) hand-crafted to capture cultural relevance. In Figure 5(c), approximately 70% lack cultural relevance, as many are machine-translated from English sources. About 20% are taken from local news, social media, or other local outlets, which potentially contain some culturally relevant data. Only the remaining 10% are designed to consider cultural relevance, derived from studies highlighting serious deficiencies in cultural understanding by LLMs for underrepresented languages Kabra et al. (2023); Koto et al. (2023a); Wibowo et al. (2023); Liu et al. (2024); Koto et al. (2024).

5.2 Conclusion & Future Work

Southeast Asia is home to highly diverse languages and cultures; the majority of its people do not use English as their primary language. The utility of English-first AI is limited for the majority of Southeast Asian users, especially in critical sectors like healthcare and education. Through SEACrowd, we have explored the AI landscape in SEA and bridged the gaps in resources, evaluation, and naturalness analysis of AI models in SEA languages. Further, our initiative has nurtured an open-source research community, which will actively continue to add and maintain datasheets and dataloaders, as well as drive AI research and developments in SEA.

Nonetheless, AI development in SEA requires concentrated efforts by a range of stakeholders, who may prioritize differently when it comes to incorporating the region’s 1,300+ languages into AI models. Moving forward, our work suggests AI development in SEA should prioritize two key metrics: 1) potential utility and 2) resource equity (https://github.com/SEACrowd/globalutility).

Potential utility

Potential utility is defined as the gap between current utility and ideal utility, in which model capability acts as a proxy for utility. Based on potential utility, unsurprisingly the development of the national languages (except for English and Chinese used in Singapore), i.e., Indonesian (ind), Burmese (mya), Vietnamese (vie), Thai (tha), Filipino (fil), Khmer (khm), Malay (zlm), and Lao (lao) in Figure 6, will bring the biggest benefit. Among them, we identify notable gaps in the naturalness of Malay, Burmese, and Filipino AI-generated outputs (§4.2). Focused efforts in resource building for these languages may move the needle the most for utility. Beyond the national languages, growing local languages or dialects with large speaker bases, e.g., Javanese (jav), Sundanese (sun), and Hmong (hmn), is key.

Resource equity

Resource equity is defined as the gap between existing and ideal resource availability (Figure 6). We found that many local languages or dialects still fall short of the expected level of resources. These include Northeastern Thai (tts), Northern Thai (nod), Hmong Do (hmv), Southern Thai (sou), Cebuano (ceb), Ilocano (ilo), and others. Efforts to narrow these gaps would not only help preserve these languages but also ensure the continuation of the cultural heritage of the speakers of these languages. More details on SEA language prioritization for different weightings of demand can be found in Appendix I.
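One possible reading of these gap-based metrics is sketched below: each language's gap is its demand share multiplied by the distance between its current score and an ideal score of 1, with demand proportional to population raised to $\tau$. This is an assumption-laden illustration and not the implementation in the globalutility repository; the placeholder language codes, populations, and scores are invented.

```python
from typing import Dict, List, Tuple


def demand_weighted_gap(
    populations: Dict[str, float],
    current: Dict[str, float],
    tau: float,
) -> Dict[str, float]:
    """Gap per language: demand share * (1 - current score in [0, 1])."""
    raw = {lang: population ** tau for lang, population in populations.items()}
    total = sum(raw.values())
    return {lang: (raw[lang] / total) * (1.0 - current[lang]) for lang in populations}


def prioritize(gaps: Dict[str, float]) -> List[Tuple[str, float]]:
    """Languages sorted from largest to smallest gap."""
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


# Placeholder languages with invented speaker counts and normalized scores.
populations = {"aaa": 100.0, "bbb": 10.0}
current_utility = {"aaa": 0.8, "bbb": 0.2}

demographic = prioritize(demand_weighted_gap(populations, current_utility, tau=1))
linguistic = prioritize(demand_weighted_gap(populations, current_utility, tau=0))
```

In this toy setting the larger language ranks first when demand follows population ($\tau=1$), while the smaller, poorly served language ranks first when every language counts equally ($\tau=0$). The same shape could be applied to resource equity by substituting a normalized measure of resource availability for the model score.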

To improve these metrics, governments and industry leaders in the region should invest in R&D activities to improve regional language capability for both the national languages and local dialects. This could include funding for open data collection and collaborations with local communities to address the resource gap in local languages. This also requires long-term sustainable strategies, such as catalyzing profitable use cases based on inclusive AI models, promoting fair and responsible compensation schemes for data workers, and orchestrating win-win exemplar collaborations between data owners, AI developers, and application developers.

Acknowledgments

We would like to thank our amazing contributors: Joshua Spergel, Tiezheng Yu, Parinthapat Pengpun, Bin Wang, Ishan Jindal, Muhammad Satrio, Jipeng Zhang, Bhavish Pahwa, Haryo Akbarianto Wibowo, Hiroki Nomoto, Yohanes Sigit Purnomo W.P., Ahmad Fathan Hidayatullah, Bryan Wilie, Ruhiyah Faradishi Widiaputri, Rafif Rabbani, Fawwaz Mayda, Manoj Khatri, Supryadi Supryadi, Virach Sornlertlamvanich, Pavaris Ruangchutiphophan, Erland Hilman Fuadi, Mega Fransiska, Richardy Sapan, and Camilla Johnine Cosme for their hard work in submitting datasheets and implementing dataloaders for SEACrowd.

This work is supported by the National Research Foundation, Singapore under its AI Singapore Programme; PhD Fellowship Award, the Hong Kong University of Science and Technology; and PF20-43679 Hong Kong PhD Fellowship Scheme, Research Grant Council, Hong Kong. JMI is funded by National University Philippines and the UKRI Centre for Doctoral Training in Accountable, Responsible and Transparent AI [EP/S023437/1] of the University of Bath. In addition, we would like to express our gratitude to Cohere For AI for providing research grants that enabled us to perform our experiments using a commercial baseline, specifically Command-R.

Limitations

While our work covers nearly 1,000 SEA languages, many dialects, which are considered to belong to a parent language, are missing from our evaluation benchmark. For instance, for the Malay language, only Standard Malay (zsm) is evaluated, but not other dialects such as Sarawak Malay (zlm-sar). Furthermore, the majority of our datasets do not contain code-switched texts, even though code-switching is a common linguistic phenomenon in SEA language usage (Aji et al., 2023). Moreover, the language coverage of different evaluation tasks varies significantly. For instance, NLP tasks cover 34 languages in total, whereas VL tasks only cover 4 languages.

Ethics Statement

In developing an evaluation benchmark for SEA languages, we have taken several steps to ensure ethical considerations are addressed comprehensively. First, the data used for this benchmark is sourced from publicly available resources, ensuring compliance with legal and ethical standards regarding data privacy. Where applicable, explicit consent was obtained from data contributors. Furthermore, all the datasets and resources utilized in this benchmark are used in accordance with their respective licenses. Second, our benchmark aims to be inclusive, representing a wide range of SEA languages, including those that are underrepresented in current linguistic resources. Lastly, our research process, including data collection, benchmark development, and evaluation methodologies, is entirely open-sourced and is documented transparently to enable reproducibility and accountability.

References

Adelani et al. (2022a) David Adelani, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P. Dossou, Chris Emezue, Colin Leong, Michael Beukman, Shamsuddeen Muhammad, Guyo Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Peter Wairagala, Muhammad Umair Nasir, Benjamin Ajibade, Tunde Ajayi, Yvonne Gitau, Jade Abbott, Mohamed Ahmed, Millicent Ochieng, Anuoluwapo Aremu, Perez Ogayo, Jonathan Mukiibi, Fatoumata Ouoba Kabore, Godson Kalipe, Derguene Mbaye, Allahsera Auguste Tapo, Victoire Memdjokam Koagne, Edwin Munkoh-Buabeng, Valencia Wagner, Idris Abdulmumin, Ayodele Awokoya, Happy Buzaaba, Blessing Sibanda, Andiswa Bukula, and Sam Manthalu. 2022a. A few thousand translations go a long way! leveraging pre-trained models for African news translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3053–3070, Seattle, United States. Association for Computational Linguistics.
Adelani et al. (2024) David Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba Alabi, Yanke Mao, Haonan Gao, and En-Shiun Lee. 2024. SIB-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 226–245, St. Julian’s, Malta. Association for Computational Linguistics.
Adelani et al. (2022b) David Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson Kalipe, Derguene Mbaye, Amelia Taylor, Fatoumata Kabore, Chris Chinenye Emezue, Anuoluwapo Aremu, Perez Ogayo, Catherine Gitau, Edwin Munkoh-Buabeng, Victoire Memdjokam Koagne, Allahsera Auguste Tapo, Tebogo Macucwa, Vukosi Marivate, Mboning Tchiaze Elvis, Tajuddeen Gwadabe, Tosin Adewumi, Orevaoghene Ahia, and Joyce Nakatumba-Nabende. 2022b. MasakhaNER 2.0: Africa-centric transfer learning for named entity recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4488–4508, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Adelani et al. (2021) David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D’souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen H. Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Aremu Anuoluwapo, Catherine Gitau, Derguene Mbaye, Jesujoba Alabi, Seid Muhie Yimam, Tajuddeen Rabiu Gwadabe, Ignatius Ezeani, Rubungo Andre Niyongabo, Jonathan Mukiibi, Verrah Otiende, Iroro Orife, Davis David, Samba Ngom, Tosin Adewumi, Paul Rayson, Mofetoluwa Adeyemi, Gerald Muriuki, Emmanuel Anebi, Chiamaka Chukwuneke, Nkiruka Odu, Eric Peter Wairagala, Samuel Oyerinde, Clemencia Siro, Tobius Saul Bateesa, Temilola Oloyede, Yvonne Wambui, Victor Akinode, Deborah Nabagereka, Maurice Katusiime, Ayodele Awokoya, Mouhamadane MBOUP, Dibora Gebreyohannes, Henok Tilaye, Kelechi Nwaike, Degaga Wolde, Abdoulaye Faye, Blessing Sibanda, Orevaoghene Ahia, Bonaventure F. P. Dossou, Kelechi Ogueji, Thierno Ibrahima DIOP, Abdoulaye Diallo, Adewale Akinfaderin, Tendai Marengereke, and Salomey Osei. 2021. MasakhaNER: Named entity recognition for African languages. Transactions of the Association for Computational Linguistics, 9:1116–1131.
Adelani et al. (2023) David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, Sana Al-azzawi, Blessing Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, Abraham Owodunni, Nnaemeka Obiefuna, Muhidin Mohamed, Shamsuddeen Hassan Muhammad, Teshome Mulugeta Ababu, Saheed Abdullahi Salahudeen, Mesay Gemeda Yigezu, Tajuddeen Gwadabe, Idris Abdulmumin, Mahlet Taye, Oluwabusayo Awoyomi, Iyanuoluwa Shode, Tolulope Adelani, Habiba Abdulganiyu, Abdul-Hakeem Omotayo, Adetola Adeeko, Abeeb Afolabi, Anuoluwapo Aremu, Olanrewaju Samuel, Clemencia Siro, Wangari Kimotho, Onyekachi Ogbu, Chinedu Mbonu, Chiamaka Chukwuneke, Samuel Fanijo, Jessica Ojo, Oyinkansola Awosan, Tadesse Kebede, Toadoum Sari Sakayo, Pamela Nyatsine, Freedmore Sidume, Oreen Yousuf, Mardiyyah Oduwole, Kanda Tshinu, Ussen Kimanuka, Thina Diko, Siyanda Nxakama, Sinodos Nigusse, Abdulmejid Johar, Shafie Mohamed, Fuad Mire Hassan, Moges Ahmed Mehamed, Evrard Ngabire, Jules Jules, Ivan Ssenkungu, and Pontus Stenetorp. 2023. MasakhaNEWS: News topic classification for African languages. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 144–159, Nusa Dua, Bali. Association for Computational Linguistics.
Adilazuarda et al. (2023) Muhammad Farid Adilazuarda, Samuel Cahyawijaya, and Ayu Purwarianti. 2023. The obscure limitation of modular multilingual language models. ICLR Tiny Papers 2023.
Adilazuarda et al. (2024) Muhammad Farid Adilazuarda, Sagnik Mukherjee, Pradhyumna Lavania, Siddhant Singh, Ashutosh Dwivedi, Alham Fikri Aji, Jacki O’Neill, Ashutosh Modi, and Monojit Choudhury. 2024. Towards measuring and modeling "culture" in llms: A survey. Preprint, arXiv:2403.15412.
AI@Meta (2024) AI@Meta. 2024. Llama 3 model card.
Aji et al. (2023) Alham Fikri Aji, Jessica Zosa Forde, Alyssa Marie Loo, Lintang Sutawika, Skyler Wang, Genta Indra Winata, Zheng-Xin Yong, Ruochen Zhang, A. Seza Doğruöz, Yin Lin Tan, and Jan Christian Blaise Cruz. 2023. Current status of NLP in South East Asia with insights from multilingualism and language diversity. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Tutorial Abstract, pages 8–13, Nusa Dua, Bali. Association for Computational Linguistics.
Aji et al. (2022) Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya, Ade Romadhony, Rahmad Mahendra, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Timothy Baldwin, Jey Han Lau, and Sebastian Ruder. 2022. One country, 700+ languages: NLP challenges for underrepresented languages and dialects in Indonesia. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7226–7249, Dublin, Ireland. Association for Computational Linguistics.
AlKhamissi et al. (2024) Badr AlKhamissi, Muhammad ElNokrashy, Mai AlKhamissi, and Mona Diab. 2024. Investigating cultural alignment of large language models. Preprint, arXiv:2402.13231.
Almazrouei et al. (2023) Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. Falcon-40B: an open large language model with state-of-the-art performance.
Ardila et al. (2020) Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. 2020. Common voice: A massively-multilingual speech corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222, Marseille, France. European Language Resources Association.
Artetxe et al. (2020) Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, Online. Association for Computational Linguistics.
Aryabumi et al. (2024) Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Kelly Marchisio, Sebastian Ruder, Acyr Locatelli, Julia Kreutzer, Nick Frosst, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. 2024. Aya 23: Open weight releases to further multilingual progress. Preprint, arXiv:2405.15032.
Asai et al. (2023) Akari Asai, Sneha Kudugunta, Xinyan Velocity Yu, Terra Blevins, Hila B Gonen, Machel Reid, Yulia Tsvetkov, Sebastian Ruder, and Hannaneh Hajishirzi. 2023. BUFFET: Benchmarking large language models for cross-lingual few-shot transfer. Preprint, arXiv:2305.14857.
Asgari et al. (2020) Ehsaneddin Asgari, Fabienne Braune, Benjamin Roth, Christoph Ringlstetter, and Mohammad Mofrad. 2020. UniSent: Universal adaptable sentiment lexica for 1000+ languages. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4113–4120, Marseille, France. European Language Resources Association.
Astuti et al. (2023) Laksmita Widya Astuti, Yunita Sari, and Suprapto. 2023. Code-mixed sentiment analysis using transformer for twitter social media data. International Journal of Advanced Computer Science and Applications, 14(10).
Babu et al. (2021) Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2021. Xls-r: Self-supervised cross-lingual speech representation learning at scale. Preprint, arXiv:2111.09296.
Bandarkar et al. (2023) Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. 2023. The belebele benchmark: a parallel reading comprehension dataset in 122 language variants. arXiv preprint arXiv:2308.16884.
Blasi et al. (2022) Damian Blasi, Antonios Anastasopoulos, and Graham Neubig. 2022. Systematic inequalities in language technology performance across the world’s languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5486–5505, Dublin, Ireland. Association for Computational Linguistics.
Burns et al. (2020) Andrea Burns, Donghyun Kim, Derry Wijaya, Kate Saenko, and Bryan A Plummer. 2020. Learning to scale multilingual representations for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 197–213. Springer.
Cahyawijaya et al. (2022) Samuel Cahyawijaya, Alham Fikri Aji, Holy Lovenia, Genta Indra Winata, Bryan Wilie, Rahmad Mahendra, Fajri Koto, David Moeljadi, Karissa Vincentio, Ade Romadhony, and Ayu Purwarianti. 2022. Nusacrowd: A call for open and reproducible nlp research in indonesian languages. Preprint, arXiv:2207.10524.
Cahyawijaya et al. (2024a) Samuel Cahyawijaya, Delong Chen, Yejin Bang, Leila Khalatbari, Bryan Wilie, Ziwei Ji, Etsuko Ishii, and Pascale Fung. 2024a. High-dimension human value representation in large language models. arXiv preprint arXiv:2404.07900.
Cahyawijaya et al. (2023a) Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Winata, Bryan Wilie, Fajri Koto, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Jennifer Santoso, David Moeljadi, Cahya Wirawan, Frederikus Hudi, Muhammad Satrio Wicaksono, Ivan Parmonangan, Ika Alfina, Ilham Firdausi Putra, Samsul Rahmadani, Yulianti Oenang, Ali Septiandri, James Jaya, Kaustubh Dhole, Arie Suryani, Rifki Afina Putri, Dan Su, Keith Stevens, Made Nindyatama Nityasya, Muhammad Adilazuarda, Ryan Hadiwijaya, Ryandito Diandaru, Tiezheng Yu, Vito Ghifari, Wenliang Dai, Yan Xu, Dyah Damapuspita, Haryo Wibowo, Cuk Tho, Ichwanul Karo Karo, Tirana Fatyanosa, Ziwei Ji, Graham Neubig, Timothy Baldwin, Sebastian Ruder, Pascale Fung, Herry Sujaini, Sakriani Sakti, and Ayu Purwarianti. 2023a. NusaCrowd: Open source initiative for Indonesian NLP resources. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13745–13818, Toronto, Canada. Association for Computational Linguistics.
Cahyawijaya et al. (2023b) Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Dea Adhista, Emmanuel Dave, Sarah Oktavianti, Salsabil Akbar, Jhonson Lee, Nuur Shadieq, Tjeng Wawan Cenggoro, Hanung Linuwih, Bryan Wilie, Galih Muridan, Genta Winata, David Moeljadi, Alham Fikri Aji, Ayu Purwarianti, and Pascale Fung. 2023b. NusaWrites: Constructing high-quality corpora for underrepresented and extremely low-resource languages. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 921–945, Nusa Dua, Bali. Association for Computational Linguistics.
Cahyawijaya et al. (2024b) Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Afina Putri, Emmanuel Dave, Jhonson Lee, Nuur Shadieq, Wawan Cenggoro, Salsabil Maulana Akbar, Muhammad Ihza Mahendra, Dea Annisayanti Putri, Bryan Wilie, Genta Indra Winata, Alham Fikri Aji, Ayu Purwarianti, and Pascale Fung. 2024b. Cendol: Open instruction-tuned generative large language models for indonesian languages. Preprint, arXiv:2404.06138.
Cahyawijaya et al. (2021) Samuel Cahyawijaya, Genta Indra Winata, Bryan Wilie, Karissa Vincentio, Xiaohong Li, Adhiguna Kuncoro, Sebastian Ruder, Zhi Yuan Lim, Syafri Bahar, Masayu Khodra, Ayu Purwarianti, and Pascale Fung. 2021. IndoNLG: Benchmark and resources for evaluating Indonesian natural language generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8875–8898, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Catapang and Visperas (2023) Jasper Kyle Catapang and Moses Visperas. 2023. Emotion-based morality in Tagalog and English scenarios (EMoTES-3K): A parallel corpus for explaining (im)morality of actions. In Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages, pages 1–6, Tokyo, Japan. Association for Computational Linguistics.
Communication et al. (2023) Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-jussà, Onur Celebi, Maha Elbayad, Cynthia Gao, Francisco Guzmán, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, and Skyler Wang. 2023. Seamlessm4t: Massively multilingual & multimodal machine translation. Preprint, arXiv:2308.11596.
Conneau et al. (2021) Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2021. Unsupervised Cross-Lingual Representation Learning for Speech Recognition. In Proc. Interspeech 2021, pages 2426–2430.
Conneau et al. (2022) Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, and Melvin Johnson. 2022. XTREME-S: Evaluating Cross-lingual Speech Representations. In Proc. Interspeech 2022, pages 3248–3252.
Conneau et al. (2018) Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.
Costa-jussà et al. (2024) Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Jeff Wang, and N. L. L. B. Team. 2024. Scaling neural machine translation to 200 languages. Nature.
Dabre et al. (2022) Raj Dabre, Himani Shrotriya, Anoop Kunchukuttan, Ratish Puduppully, Mitesh Khapra, and Pratyush Kumar. 2022. IndicBART: A pre-trained model for indic natural language generation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1849–1863, Dublin, Ireland. Association for Computational Linguistics.
Dac Lai et al. (2023) Viet Dac Lai, Chien Van Nguyen, Nghia Trung Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan A Rossi, and Thien Huu Nguyen. 2023. Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. arXiv e-prints, pages arXiv–2307.
Dai et al. (2024) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2024. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36.
Dorfman (1979) Robert Dorfman. 1979. A formula for the gini coefficient. The review of economics and statistics, pages 146–149.
Dou et al. (2024) Longxu Dou, Qian Liu, Guangtao Zeng, Jia Guo, Jiahui Zhou, Wei Lu, and Min Lin. 2024. Sailor: Open language models for south-east asia. Preprint, arXiv:2404.03608.
Dryer and Haspelmath (2013) Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online (v2020.3). Zenodo.
Durmus et al. (2023) Esin Durmus, Karina Nguyen, Thomas I Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, et al. 2023. Towards measuring the representation of subjective global opinions in language models. arXiv preprint arXiv:2306.16388.
Eberhard et al. (2021) David M. Eberhard, Gary F. Simons, and Charles D. Fennig. 2021. Ethnologue: Languages of the World. Twenty-fourth edition. Dallas, Texas: SIL International.
Ebrahimi et al. (2022) Abteen Ebrahimi, Manuel Mager, Arturo Oncevay, Vishrav Chaudhary, Luis Chiruzzo, Angela Fan, John Ortega, Ricardo Ramos, Annette Rios, Ivan Vladimir Meza Ruiz, Gustavo Giménez-Lugo, Elisabeth Mager, Graham Neubig, Alexis Palmer, Rolando Coto-Solano, Thang Vu, and Katharina Kann. 2022. AmericasNLI: Evaluating zero-shot natural language understanding of pretrained multilingual models in truly low-resource languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6279–6299, Dublin, Ireland. Association for Computational Linguistics.
Elias (2018) Alexander Elias. 2018. Lio and the central flores languages. Leiden: Leiden University Master thesis.
Evans et al. (2020) Leanne M Evans, Crystasany R Turner, and Kelly R Allen. 2020. "Good teachers" with "good intentions": Misappropriations of culturally responsive pedagogy. Journal of Urban Learning, Teaching, and Research, 15(1):51–73.
Federmann et al. (2022) Christian Federmann, Tom Kocmi, and Ying Xin. 2022. NTREX-128 – news test references for MT evaluation of 128 languages. In Proceedings of the First Workshop on Scaling Up Multilingual Evaluation, pages 21–24, Online. Association for Computational Linguistics.
Gebru et al. (2021) Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. Datasheets for datasets. Communications of the ACM, 64(12):86–92.
Geigle et al. (2023) Gregor Geigle, Abhay Jain, Radu Timofte, and Goran Glavaš. 2023. mblip: Efficient bootstrapping of multilingual vision-llms. arXiv, abs/2307.06930.
Gemma Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu hui Chen, Zafarali Ahmed, Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet, Oriol Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, and Kathleen Kenealy. 2024. Gemma: Open models based on gemini research and technology. Preprint, arXiv:2403.08295.
Glotov (2023) Sergei Glotov. 2023. Intercultural film literacy education against cultural misrepresentation: Finnish visual art teachers’ perspectives. Journal of Media Literacy Education, 15(1):31–43.
Gong et al. (2023) Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. 2023. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790.
Hammarström et al. (2024) Harald Hammarström, Robert Forkel, Martin Haspelmath, and Sebastian Bank. 2024. Glottolog 5.0. Leipzig: Max Planck Institute for Evolutionary Anthropology.
Hasan et al. (2021) Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar. 2021. XL-sum: Large-scale multilingual abstractive summarization for 44 languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4693–4703, Online. Association for Computational Linguistics.
He et al. (2022) Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2022. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. In The Eleventh International Conference on Learning Representations.
He et al. (2020) Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations.
Huang et al. (2021) Po-Yao Huang, Mandela Patrick, Junjie Hu, Graham Neubig, Florian Metze, and Alexander Hauptmann. 2021. Multilingual multimodal pre-training for zero-shot cross-lingual transfer of vision-language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2443–2459, Online. Association for Computational Linguistics.
Huynh et al. (2022) Tin Van Huynh, Kiet Van Nguyen, and Ngan Luu-Thuy Nguyen. 2022. ViNLI: A Vietnamese corpus for studies on open-domain natural language inference. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3858–3872, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Ichsan (2023) Muhammad Ichsan. 2023. Merak-7b: The llm for bahasa indonesia. Hugging Face Repository.
Imperial et al. (2019) Joseph Marvin Imperial, Jeyrome Orosco, Shiela Mae Mazo, and Lany Maceda. 2019. Sentiment analysis of typhoon related tweets using standard and bidirectional recurrent neural networks. arXiv preprint arXiv:1908.01765.
Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
Jiang et al. (2022) Shengyi Jiang, Sihui Fu, Nankai Lin, and Yingwen Fu. 2022. Pretrained models and evaluation data for the khmer language. Tsinghua Science and Technology, 27(4):709–718.
Joshi et al. (2020) Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.
Juan et al. (2015) Sarah Samson Juan, Laurent Besacier, Benjamin Lecouteux, and Mohamed Dyab. 2015. Using resources from a closely-related language to develop asr for a very under-resourced language: A case study for iban. In Proceedings of INTERSPEECH, Dresden, Germany.
Kabra et al. (2023) Anubha Kabra, Emmy Liu, Simran Khanuja, Alham Fikri Aji, Genta Winata, Samuel Cahyawijaya, Anuoluwapo Aremu, Perez Ogayo, and Graham Neubig. 2023. Multi-lingual and multi-cultural figurative language understanding. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8269–8284, Toronto, Canada. Association for Computational Linguistics.
Kakwani et al. (2020) Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. 2020. IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4948–4961, Online. Association for Computational Linguistics.
Karo et al. (2022) Ichwanul Muslim Karo Karo, Mohd Farhan Md Fudzee, Shahreen Kasim, and Azizul Azhar Ramli. 2022. Sentiment analysis in karonese tweet using machine learning. Indonesian Journal of Electrical Engineering and Informatics (IJEEI), 10(1):219–231.
Khanuja et al. (2023) Simran Khanuja, Sebastian Ruder, and Partha Talukdar. 2023. Evaluating the diversity, equity, and inclusion of NLP technology: A case study for Indian languages. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1763–1777, Dubrovnik, Croatia. Association for Computational Linguistics.
Koto et al. (2023a) Fajri Koto, Nurul Aisyah, Haonan Li, and Timothy Baldwin. 2023a. Large language models only pass primary school exams in Indonesia: A comprehensive test on IndoMMLU. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore. Association for Computational Linguistics.
Koto et al. (2023b) Fajri Koto, Nurul Aisyah, Haonan Li, and Timothy Baldwin. 2023b. Large language models only pass primary school exams in Indonesia: A comprehensive test on IndoMMLU. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12359–12374, Singapore. Association for Computational Linguistics.
Koto et al. (2022) Fajri Koto, Timothy Baldwin, and Jey Han Lau. 2022. Cloze evaluation for deeper understanding of commonsense stories in Indonesian. In Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022), pages 8–16, Dublin, Ireland. Association for Computational Linguistics.
Koto and Koto (2020) Fajri Koto and Ikhwan Koto. 2020. Towards computational linguistics in Minangkabau language: Studies on sentiment analysis and machine translation. In Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation, pages 138–148, Hanoi, Vietnam. Association for Computational Linguistics.
Koto et al. (2024) Fajri Koto, Rahmad Mahendra, Nurul Aisyah, and Timothy Baldwin. 2024. Indoculture: Exploring geographically-influenced cultural commonsense reasoning across eleven indonesian provinces. Preprint, arXiv:2404.01854.
Kudugunta et al. (2024) Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. 2024. Madlad-400: a multilingual and document-level large audited dataset. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA. Curran Associates Inc.
Kumar et al. (2022) Aman Kumar, Himani Shrotriya, Prachi Sahu, Amogh Mishra, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Mitesh M. Khapra, and Pratyush Kumar. 2022. IndicNLG benchmark: Multilingual datasets for diverse NLG tasks in Indic languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5363–5394, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Laurençon et al. (2024) Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. 2024. What matters when building vision-language models? Preprint, arXiv:2405.02246.
Le and Luu (2023) Thang Le and Anh Luu. 2023. A parallel corpus for Vietnamese central-northern dialect text transfer. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13839–13855, Singapore. Association for Computational Linguistics.
Leong et al. (2022) Colin Leong, Joshua Nemecek, Jacob Mansdorfer, Anna Filighera, Abraham Owodunni, and Daniel Whitenack. 2022. Bloom library: Multimodal datasets in 300+ languages for a variety of downstream tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8608–8621, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Leong et al. (2023) Wei Qi Leong, Jian Gang Ngui, Yosephine Susanto, Hamsawardhini Rengarajan, Kengatharaiyer Sarveswaran, and William Chandra Tjhi. 2023. Bhasa: A holistic southeast asian linguistic and cultural evaluation suite for large language models. arXiv preprint arXiv:2309.06085.
Li et al. (2023a) Haonan Li, Fajri Koto, Minghao Wu, Alham Fikri Aji, and Timothy Baldwin. 2023a. Bactrian-x: A multilingual replicable instruction-following model with low-rank adaptation. arXiv preprint arXiv:2305.15011.
Li et al. (2023b) Zejun Li, Zhihao Fan, Jingjing Chen, Qi Zhang, Xuanjing Huang, and Zhongyu Wei. 2023b. Unifying cross-lingual and cross-modal modeling towards weakly supervised multilingual vision-language pre-training. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5939–5958, Toronto, Canada. Association for Computational Linguistics.
Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Lin et al. (2022) Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, and Xian Li. 2022. Few-shot learning with multilingual generative language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9019–9052, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
List et al. (2022) Johann-Mattis List, Robert Forkel, Simon J. Greenhill, Christoph Rzymski, Johannes Englisch, and Russell D. Gray. 2022. Lexibank, a public repository of standardized wordlists with computed phonological and lexical features. Scientific Data, 9(1):316.
Liu et al. (2024) Chen Cecilia Liu, Fajri Koto, Timothy Baldwin, and Iryna Gurevych. 2024. Are multilingual llms culturally-diverse reasoners? an investigation into multicultural proverbs and sayings. Preprint, arXiv:2309.08591.
Liu et al. (2023a) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023a. Improved baselines with visual instruction tuning.
Liu et al. (2023b) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023b. Visual instruction tuning. In NeurIPS.
Longpre et al. (2021) Shayne Longpre, Yi Lu, and Joachim Daiber. 2021. MKQA: A linguistically diverse benchmark for multilingual open domain question answering. Transactions of the Association for Computational Linguistics, 9:1389–1406.
Longpre et al. (2023) Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, et al. 2023. The data provenance initiative: A large scale audit of dataset licensing & attribution in ai. arXiv preprint arXiv:2310.16787.
Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.
Mager et al. (2021) Manuel Mager, Arturo Oncevay, Annette Rios, Ivan Vladimir Meza Ruiz, Alexis Palmer, Graham Neubig, and Katharina Kann, editors. 2021. Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas. Association for Computational Linguistics, Online.
Mahendra et al. (2021) Rahmad Mahendra, Alham Fikri Aji, Samuel Louvan, Fahrurrozi Rahman, and Clara Vania. 2021. IndoNLI: A natural language inference dataset for Indonesian. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10511–10527, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Muennighoff et al. (2023) Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2023. Crosslingual generalization through multitask finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15991–16111, Toronto, Canada. Association for Computational Linguistics.
Muennighoff et al. (2022) Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.
Muzad and Rahutomo (2016) Aad Muzad and Faisal Rahutomo. 2016. Korpus berita daring bahasa indonesia dengan depth first focused crawling. Prosiding Sentrinov (Seminar Nasional Terapan Riset Inovatif), 2(1):11–20.
Neubig and Hu (2018) Graham Neubig and Junjie Hu. 2018. Rapid adaptation of neural machine translation to new languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 875–880, Brussels, Belgium. Association for Computational Linguistics.
Nguyen et al. (2020) Kiet Nguyen, Vu Nguyen, Anh Nguyen, and Ngan Nguyen. 2020. A Vietnamese dataset for evaluating machine reading comprehension. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2595–2605, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Nguyen et al. (2024) Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. 2024. CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4226–4237, Torino, Italia. ELRA and ICCL.
Nguyen et al. (2023) Xuan-Phi Nguyen, Wenxuan Zhang, Li Xin, Mahani Aljunied, Weiwen Xu, Hou Pong Chan, Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, and Lidong Bing. 2023. Seallms - large language models for southeast asia. Preprint, arXiv:2312.00738.
OpenAI et al. (2024) OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. 2024. Gpt-4 technical report. Preprint, arXiv:2303.08774.
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
Palen-Michel and Lignos (2023) Chester Palen-Michel and Constantine Lignos. 2023. LR-sum: Summarization for less-resourced languages. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6829–6844, Toronto, Canada. Association for Computational Linguistics.
Phatthiyaphaibun et al. (2023) Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, and Can Udomcharoenchaikit. 2023. PyThaiNLP: Thai natural language processing in python. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 25–36, Singapore. Association for Computational Linguistics.
Phatthiyaphaibun et al. (2024) Wannaphong Phatthiyaphaibun, Surapon Nonesung, Patomporn Payoungkhamdee, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Jitkapat Sawatphol, Chompakorn Chaksangchaichot, Ekapol Chuangsuwanich, and Sarana Nutanong. 2024. Wangchanlion and wangchanx mrc eval. Preprint, arXiv:2403.16127.
Ponti et al. (2020) Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020. XCOPA: A multilingual dataset for causal commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2362–2376, Online. Association for Computational Linguistics.
Popović (2015) Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.
Popović (2017) Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.
Pratap et al. (2024) Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. 2024. Scaling speech technology to 1,000+ languages. Journal of Machine Learning Research, 25(97):1–52.
Project (2024) The Joshua Project. 2024. The joshua project.
Purwarianti and Crisdayanti (2019) Ayu Purwarianti and Ida Ayu Putu Ari Crisdayanti. 2019. Improving bi-lstm performance for indonesian sentiment analysis using paragraph vector. In 2019 International Conference of Advanced Informatics: Concepts, Theory and Applications (ICAICTA), pages 1–5. IEEE.
Purwarianti et al. (2007) Ayu Purwarianti, Masatoshi Tsuchiya, and Seiichi Nakagawa. 2007. A machine learning approach for Indonesian question answering system. In Artificial Intelligence and Applications, pages 573–578.
Putra et al. (2024) I Made Suwija Putra, Daniel Siahaan, and Ahmad Saikhu. 2024. Snli indo: A recognizing textual entailment dataset in indonesian derived from the stanford natural language inference dataset. Data in Brief, 52:109998.
Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.
Radford et al. (2023) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine Mcleavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 28492–28518. PMLR.
Riccosan and Saputra (2023) Riccosan and Karen Etania Saputra. 2023. Multilabel multiclass sentiment and emotion dataset from indonesian mobile application review. Data in Brief, 50:109576.
Riley et al. (2020) Parker Riley, Isaac Caswell, Markus Freitag, and David Grangier. 2020. Translationese as a language in “multilingual” NMT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7737–7746, Online. Association for Computational Linguistics.
Rizqullah et al. (2023) Muhammad Razif Rizqullah, Ayu Purwarianti, and Alham Fikri Aji. 2023. Qasina: Religious domain question answering using sirah nabawiyah. In 2023 10th International Conference on Advanced Informatics: Concept, Theory and Application (ICAICTA), pages 1–6. IEEE.
Rouditchenko et al. (2023) Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, and James Glass. 2023. Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages. In Proc. INTERSPEECH 2023, pages 2268–2272.
Ruder et al. (2023) Sebastian Ruder, Jonathan H Clark, Alexander Gutkin, Mihir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani, Parker Riley, Jean-Michel Sarr, Xinyi Wang, et al. 2023. Xtreme-up: A user-centric scarce-data benchmark for under-represented languages. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1856–1884.
Sanh et al. (2021) Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Stella Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M. Rush. 2021. Multitask prompted training enables zero-shot task generalization. Preprint, arXiv:2110.08207.
Sani et al. (2012) Auliya Sani, Sakriani Sakti, Graham Neubig, Tomoki Toda, Adi Mulyanto, and Satoshi Nakamura. 2012. Towards language preservation: Preliminary collection and vowel analysis of indonesian ethnic speech data. In 2012 International Conference on Speech Database and Assessments, pages 118–122.
Schuhmann et al. (2022) Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294.
Setya and Mahendra (2018) Ken Nabila Setya and Rahmad Mahendra. 2018. Semi-supervised textual entailment on indonesian wikipedia data. In International Conference on Computational Linguistics and Intelligent Text Processing, pages 416–427. Springer.
Singapore (2023) AI Singapore. 2023. Sea-lion (southeast asian languages in one network): A family of large language models for southeast asia. https://github.com/aisingapore/sealion.
Singh et al. (2024) Shivalika Singh, Freddie Vargus, Daniel D’souza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura OMahony, et al. 2024. Aya dataset: An open-access collection for multilingual instruction tuning. arXiv preprint arXiv:2402.06619.
Søgaard (2022) Anders Søgaard. 2022. Should we ban English NLP for a year? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5254–5260, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Sutoyo et al. (2022) Rhio Sutoyo, Said Achmad, Andry Chowanda, Esther Widhi Andangsari, and Sani M. Isa. 2022. Prdect-id: Indonesian product reviews dataset for emotions classification tasks. Data in Brief, 44:108554.
Täckström et al. (2012) Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 477–487, Montréal, Canada. Association for Computational Linguistics.
Talat et al. (2022) Zeerak Talat, Aurélie Névéol, Stella Biderman, Miruna Clinciu, Manan Dey, Shayne Longpre, Sasha Luccioni, Maraim Masoud, Margaret Mitchell, Dragomir Radev, Shanya Sharma, Arjun Subramonian, Jaesung Tae, Samson Tan, Deepak Tunuguntla, and Oskar Van Der Wal. 2022. You reap what you sow: On the challenges of bias evaluation under multilingual settings. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, pages 26–41, virtual+Dublin. Association for Computational Linguistics.
Thapliyal et al. (2022) Ashish V. Thapliyal, Jordi Pont Tuset, Xi Chen, and Radu Soricut. 2022. Crossmodal-3600: A massively multilingual multimodal evaluation dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 715–729, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Tran et al. (2021) Khanh Quoc Tran, Phap Ngoc Trinh, Khoa Nguyen-Anh Tran, An Tran-Hoai Le, Luan Van Ha, and Kiet Van Nguyen. 2021. An empirical investigation of online news classification on an open-domain, large-scale and high-quality dataset in vietnamese. In New Trends in Intelligent Software Methodologies, Tools and Techniques, pages 367–379. IOS Press.
Üstün et al. (2024) Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, et al. 2024. Aya model: An instruction finetuned open-access multilingual language model. arXiv preprint arXiv:2402.07827.
Van Nguyen et al. (2022) Kiet Van Nguyen, Tin Van Huynh, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, and Ngan Luu-Thuy Nguyen. 2022. New vietnamese corpus for machine reading comprehension of health news articles. ACM Trans. Asian Low-Resour. Lang. Inf. Process., 21(5).
Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575.
Wang et al. (2023) Bin Wang, Zhengyuan Liu, Xin Huang, Fangkai Jiao, Yang Ding, Ai Ti Aw, and Nancy F Chen. 2023. Seaeval for multilingual foundation models: From cross-lingual alignment to cultural reasoning. arXiv preprint arXiv:2309.04766.
Wang et al. (2024) Bin Wang, Zhengyuan Liu, Xin Huang, Fangkai Jiao, Yang Ding, Ai Ti Aw, and Nancy F. Chen. 2024. Seaeval for multilingual foundation models: From cross-lingual alignment to cultural reasoning. NAACL.
Wei et al. (2021) Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
Wibowo et al. (2023) Haryo Akbarianto Wibowo, Erland Hilman Fuadi, Made Nindyatama Nityasya, Radityo Eko Prasojo, and Alham Fikri Aji. 2023. Copal-id: Indonesian language reasoning with local culture and nuances. arXiv preprint arXiv:2311.01012.
Wilie et al. (2020) Bryan Wilie, Karissa Vincentio, Genta Indra Winata, Samuel Cahyawijaya, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, and Ayu Purwarianti. 2020. IndoNLU: Benchmark and resources for evaluating Indonesian natural language understanding. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 843–857, Suzhou, China. Association for Computational Linguistics.
Winata et al. (2023) Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, Ade Romadhony, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Pascale Fung, Timothy Baldwin, Jey Han Lau, Rico Sennrich, and Sebastian Ruder. 2023. NusaX: Multilingual parallel sentiment dataset for 10 Indonesian local languages. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 815–834, Dubrovnik, Croatia. Association for Computational Linguistics.
Winata et al. (2024) Genta Indra Winata, Ruochen Zhang, and David Ifeoluwa Adelani. 2024. Miners: Multilingual language models as semantic retrievers. arXiv preprint arXiv:2406.07424.
Workshop et al. (2022) BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, et al. 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
Xue et al. (2021) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
Yong et al. (2023) Zheng Xin Yong, Ruochen Zhang, Jessica Forde, Skyler Wang, Arjun Subramonian, Holy Lovenia, Samuel Cahyawijaya, Genta Winata, Lintang Sutawika, Jan Christian Blaise Cruz, Yin Lin Tan, Long Phan, Rowena Garcia, Thamar Solorio, and Alham Aji. 2023. Prompting multilingual large language models to generate code-mixed texts: The case of South East Asian languages. In Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching, pages 43–63, Singapore. Association for Computational Linguistics.
Zhang et al. (2023a) Ruochen Zhang, Samuel Cahyawijaya, Jan Christian Blaise Cruz, Genta Winata, and Alham Aji. 2023a. Multilingual large language models are not (yet) code-switchers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12567–12582, Singapore. Association for Computational Linguistics.
Zhang et al. (2023b) Wenxuan Zhang, Mahani Aljunied, Chang Gao, Yew Ken Chia, and Lidong Bing. 2023b. M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models. In Advances in Neural Information Processing Systems, volume 36, pages 5484–5505. Curran Associates, Inc.
Zhang et al. (2024) Wenxuan Zhang, Mahani Aljunied, Chang Gao, Yew Ken Chia, and Lidong Bing. 2024. M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models. Advances in Neural Information Processing Systems, 36.
Zhu et al. (2024) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2024. Minigpt-4: Enhancing vision-language understanding with advanced large language models. ICLR.

Appendix A Summary of SEACrowd

Benchmark	# Languages	# Indigenous SEA Languages	# Datasets	# Tasks
SEACrowd (ours)^†	39	38	254	13 (11 text, 1 speech, 1 vision)
NusaCrowd^† Cahyawijaya et al. (2023a)	19	19	137	12 (11 text, 1 speech)
BUFFET Asai et al. (2023)	54	N/A	15	8 (8 text)
XTREME-UP Ruder et al. (2023)	88	11	269	9 (7 text, 1 speech, 1 vision)

Table 3: Benchmark comparison. ^†The numbers in SEACrowd and NusaCrowd are the numbers of datasets included in the evaluation.

Addressing the resource gaps and challenges in AI development for Southeast Asian (SEA) languages is essential for our region’s sustainable and prosperous future. SEA languages are underrepresented in the data used to pre-train machine learning models, which severely degrades model quality for these languages. Additionally, the scarcity of high-quality datasets and evaluation tools further hinders progress in AI for SEA languages. The dominance of English-centric training data introduces cultural biases and fails to capture the local values and nuances of SEA cultures. To overcome these obstacles, SEACrowd provides a comprehensive and standardized resource center, along with evaluation tasks, for nearly 1,000 SEA indigenous and non-indigenous languages across various modalities. SEACrowd closes the resource and evaluation gaps, enabling researchers and developers to improve the performance of AI models for SEA languages.

The journey does not end here. Concrete next steps are essential to drive AI advancement in Southeast Asia. Strategic investments in research and development, collaborations with local communities, and efforts toward language preservation are imperative. Governments, industry leaders, and stakeholders must prioritize the development of national and under-resourced local languages to ensure resource equity and promote inclusivity in AI technology. By taking bold actions, such as funding initiatives for data collection and model training, establishing partnerships with local communities, and focusing on language preservation, we can unlock the full potential of Southeast Asian languages. Not only will this spur economic growth, but it will also preserve the region’s rich cultural heritage.

In conclusion, developing AI for Southeast Asian languages is not a mere necessity but an opportunity to create a sustainable and prosperous future. By addressing resource gaps, accurately evaluating models, and fostering inclusive AI development, we can harness the power of SEA languages to drive long-term economic growth while preserving our region’s cultural diversity.

Appendix B Related Work

SEA data resources

LLM research efforts for SEA languages are limited by the lack of available datasets and benchmarks. To date, resources for SEA NLP tasks are concentrated on relatively higher-resource SEA indigenous languages, such as Indonesian (Mahendra et al., 2021; Wilie et al., 2020; Cahyawijaya et al., 2021, 2023a) and Vietnamese (Nguyen et al., 2020; Huynh et al., 2022; Le and Luu, 2023; Van Nguyen et al., 2022). NusaCrowd (Cahyawijaya et al., 2023a) introduces the first multimodal benchmark for Indonesian languages, including text and speech. Ruder et al. (2023) introduce a multimodal benchmark covering 88 languages, 11 of which are indigenous to SEA.

Additionally, Asai et al. (2023) present an LLM benchmark for cross-lingual few-shot transfer, comprising 15 distinct tasks and 54 languages sourced from varied multilingual datasets. Furthermore, Dou et al. (2024) find that publicly available pre-training data for SEA languages suffer from quality issues such as textual duplicates and excessive occurrences of Unicode escapes. On the other hand, pre-trained LLMs built specifically for SEA languages suffer from limited language coverage; for instance, Cendol (Cahyawijaya et al., 2024b), Sailor (Dou et al., 2024), SEA-LION (Singapore, 2023), and SeaLLMs (Nguyen et al., 2023) each cover at most 11 different SEA languages, including English and Chinese.

Open-source Community Initiatives in NLP

Open-source and open-science communities play a crucial role in engaging native speakers to curate large-scale multilingual NLP resources. In the past, collaborative efforts have been organized to collect data and train multilingual language models either on a global scale (Workshop et al., 2022; Singh et al., 2024; Üstün et al., 2024) or on a regional level, e.g., Masakhane for African languages (Adelani et al., 2021, 2022b, 2022a, 2023), AI4Bharat for Indian languages (Kakwani et al., 2020; Kumar et al., 2022; Dabre et al., 2022, inter alia), and AmericasNLP for Latin American languages (Mager et al., 2021; Ebrahimi et al., 2022). In the SEA region, there have been community-based initiatives, e.g., IndoNLP, PyThaiNLP, and RojakNLP, to study NLP on Indonesian languages (Aji et al., 2022; Wilie et al., 2020; Cahyawijaya et al., 2021, 2023a), the Thai language (Phatthiyaphaibun et al., 2023), and the code-switching phenomenon in SEA (Aji et al., 2023; Yong et al., 2023; Winata et al., 2024), respectively.

Submission	Points	Max points
Public datasheet	2+bonus	6
Dataloader	3	6 if difficult
Private datasheet	1	-
Access to private data	4+bonus	10 if high-quality
Datasheet review	1	1
Dataloader review	2	4 if difficult
Private datasheet review	0.5	-
Private data contact	1	5 if succeeds

Table 4: Amount of points obtained for contributions related to datasheet, dataloader, and private data.

Appendix C Contributing to SEACrowd

C.1 Open Contributions

We identify four tasks for open contribution in SEACrowd (landing page: https://github.com/SEACrowd). These tasks and the workflow of SEACrowd are heavily influenced by, and extend, NusaCrowd (Cahyawijaya et al., 2022, 2023a), a collaborative effort to pool data resources for Indonesian NLP.

•

Submitting Metadata for Existing Public Datasets. Contributors can submit detailed datasheets for existing datasets through the public datasheet form (https://form.jotform.com/team/232952680898069/seacrowd-sea-datasets). Contributors must provide key information such as the data license, size, language and dialect, and annotation method. Approved datasheets, as well as those still under review, are indexed in a monitoring spreadsheet and the SEACrowd Catalogue (Figure 7).
•

Building a Dataloader. Starting from a datasheet approved in the previous task, contributors can build a HuggingFace dataset loader to ensure that all datasets in SEACrowd are standardized in terms of formatting and usage. Contributors can follow the dataloader guide and examples available in the SEACrowd Data Hub (https://github.com/SEACrowd/seacrowd-datahub/blob/master/DATALOADER.md). Dataloader maintainers and reviewers check self-assigned dataloader issues after 2 weeks of inactivity and ping contributors in case of a blocking impediment.
•

Identifying Private AI Datasets for SEA Languages, Cultures, and/or Regions. Unfortunately, the datasets from a number of prior works involving SEA languages are still not publicly available. This may be due to several reasons, including (but not limited to) non-release contracts related to funding, the inclusion of private and personally identifiable data, and the use of explicitly private data such as those held by for-profit companies.

In this task, contributors can search for works that contain private data and fill out a corresponding record form (https://form.jotform.com/team/232952680898069/seacrowd-paper-with-private-dataset). The SEACrowd team then attempts to contact the original data owners and negotiate the open-sourcing of their resources.
•

Opening a Private AI Dataset of SEA. If a contributor has previous work with closed data (or has been contacted by the SEACrowd team regarding closed-source data), they can decide to release their resources and register them in the collection via the public datasheet form. The resource will still be owned by the original contributor and is still tied to the contributor’s previous work, as SEACrowd simply catalogs it and records its now open-source license.

C.2 Measuring Contributions

To be considered as a co-author, 20 contribution points are required. Submissions past the deadlines (see Appendix D.1) are still recorded, but contribution points are no longer given. To let contributors monitor how many points they have obtained, a contribution point tracker is provided and updated regularly. The purpose of the point system is not to create a barrier to collaboration but to reward rare and high-quality dataset entries. Table 4 describes the contribution points (guidelines: https://github.com/SEACrowd/seacrowd-datahub/blob/master/POINTS.md). A bonus of 1 point is given if the dataset modality is speech or vision. We also provide a bonus based on language rarity in terms of available resources, as defined by Joshi et al. (2020) (https://microsoft.github.io/linguisticdiversity/assets/lang2tax.txt): 1 point for languages in levels 1 and 2, and 2 points for languages in level 0 or absent from the list. For other contributions not mentioned in Table 4 (e.g., maintenance, design, experiments, and paper writing), the number of contribution points is adjusted to the amount and complexity of the relevant work.
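As a rough illustration of the bonus arithmetic described above, the following sketch computes the points for a single public datasheet. The function name, the assumption that the two bonuses simply add to the base score, and the cap at the "Max points" column of Table 4 are ours and are not confirmed by the text; the authoritative rules are in the contribution point guidelines.

```python
from typing import Optional

BASE_POINTS = 2  # "Public datasheet" row of Table 4
MAX_POINTS = 6   # "Max points" column of Table 4


def public_datasheet_points(modality: str, joshi_level: Optional[int]) -> int:
    """Hypothetical helper: points for one public datasheet.

    joshi_level is the language's level in the Joshi et al. (2020) taxonomy,
    or None if the language is absent from the list.
    """
    points = BASE_POINTS
    # Modality bonus: 1 point for speech or vision datasets.
    if modality in {"speech", "vision"}:
        points += 1
    # Language-rarity bonus: 2 points for level 0 or unlisted languages,
    # 1 point for levels 1 and 2, nothing for higher-resource levels.
    if joshi_level is None or joshi_level == 0:
        points += 2
    elif joshi_level in (1, 2):
        points += 1
    # Assumption: the total is capped at the maximum given in Table 4.
    return min(points, MAX_POINTS)
```

Under these assumptions, a speech dataset in an unlisted language would earn 2 + 1 + 2 = 5 points, while a text dataset in a level-5 language would earn the base 2 points.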

Appendix D Progression of SEACrowd

D.1 Timeline

SEACrowd released the open call for contributions on 1 November 2023. The call lasted until 31 March 2024 for datasheet submissions and until 15 May 2024 for both dataloader and private dataset submissions. SEACrowd contributors hold a biweekly discussion on the challenges they face while contributing, the next steps to take, and experiment and research ideas for the paper. The detailed timeline can be seen in Figure 8.

D.2 Contribution Progress

Figure 9 shows the number of submissions for public datasheets, dataloader pull requests, and papers with private datasets in SEACrowd.

Appendix E Reviewing SEACrowd’s Submissions

We provide the complete reviewing guidelines in our Data Hub (reviewer SOP: https://github.com/SEACrowd/seacrowd-datahub/blob/master/REVIEWING.md).

E.1 Datasheet Reviewing

The datasheet reviewing standard operating procedure (SOP) ensures the integrity and completeness of datasets submitted to SEACrowd. It outlines procedures for verifying dataset availability, avoiding duplicates, and ensuring correctness and relevance to the SEA region. The SOP includes FAQs addressing common issues such as dataset duplicates and incorrect information, along with an approval checklist covering aspects like data availability, dataset splits, and licensing. Reviewers are instructed on how to handle various scenarios, including correcting errors and determining point allocation for multiple contributors. For instance, if a submitted datasheet has incorrect or missing information, the reviewer can either ask the contributor to fix it (with some guidance) or fix it themselves. Upon completion of the review, reviewers update the status, add notes and points, and await the generation of a GitHub issue for the approved datasheet.

E.2 Dataloader Reviewing

The dataloader reviewing SOP governs the review process for dataloaders in SEACrowd, ensuring adherence to the data structure and seacrowd schema and config standards. It specifies checks for metadata correctness, subset implementation, test script passing, and adherence to coding conventions. Additionally, it outlines dataloader config rules based on dataset types and provides guidelines for multilingual datasets. The SOP emphasizes the importance of reviewer collaboration, with each dataloader requiring two reviewers per submitted pull request, and outlines the approval and reviewer assignment process, either by allocation or by self-assignment based on availability and promptness.

Appendix F Schemas in SEACrowd

Schemas define and format the attributes of the dataset returned by a dataloader. For each dataloader, we implement 2 schema types: the source schema and the seacrowd schema. The source schema presents the dataset in a format similar to its original structure, while the seacrowd schema standardizes the data structure across similar tasks.

The following subsections define the seacrowd schemas in NLP (F.1), speech (F.2), and VL (F.3).
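To make the distinction between the two schema types concrete, the sketch below converts one record from a made-up source layout into a standardized single-label text layout with the fields (id, text, label). The source column names (`review`, `rating`), the label mapping, the example sentence, and the `_source` config suffix are assumptions for illustration; only the `*_seacrowd_text` suffix appears in the tables of this appendix.

```python
def config_names(subset_id: str, schema: str) -> dict:
    """Build the two config names for a subset (the `_source` suffix is assumed)."""
    return {
        "source": f"{subset_id}_source",
        "seacrowd": f"{subset_id}_seacrowd_{schema}",
    }


def to_seacrowd_text(index: int, source_row: dict) -> dict:
    """Map a hypothetical source row onto the (id, text, label) layout."""
    # Hypothetical mapping from star ratings to sentiment labels.
    label = "positive" if source_row["rating"] >= 4 else "negative"
    return {"id": str(index), "text": source_row["review"], "label": label}


# The source schema keeps the dataset's original structure ...
source_row = {"review": "Mabilis ang delivery!", "rating": 5}
# ... while the seacrowd schema is shared across datasets of the same task.
seacrowd_row = to_seacrowd_text(0, source_row)
```

The point of the second layout is that downstream evaluation code can iterate over many sentiment datasets without knowing each dataset's original column names.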

Subset ID	Language	Region	# Samples
Sentiment Analysis $\rightarrow$ *_seacrowd_text
lazada_review_filipino	fil	Philippines	1001
gklmip_sentiment	mya	Myanmar	716
indolem_sentiment	ind	Indonesia	1011
id_sentiment_analysis	ind	Indonesia	10806
karonese_sentiment	btx	Indonesia	1000
wisesight_thai_sentiment	tha	Thailand	2671
wongnai_reviews	tha	Thailand	6203
typhoon_yolanda_tweets	fil	Philippines	153
smsa	ind	Indonesia	500
prdect_id_sentiment	ind	Indonesia	5400
id_sent_emo_mobile_apps_sentiment	ind	Indonesia	21696
shopee_reviews_tagalog	fil	Philippines	2250
nusatranslation_senti_abs	abs	Indonesia	500
nusatranslation_senti_btk	btx	Indonesia	1200
nusatranslation_senti_bew	bew	Indonesia	1200
nusatranslation_senti_bhp	bhp	Indonesia	500
nusatranslation_senti_jav	jav	Indonesia	1200
nusatranslation_senti_mad	mad	Indonesia	1200
nusatranslation_senti_mak	mak	Indonesia	1200
nusatranslation_senti_min	min	Indonesia	1200
nusatranslation_senti_mui	mui	Indonesia	500
nusatranslation_senti_rej	rej	Indonesia	500
nusatranslation_senti_sun	sun	Indonesia	1200
nusax_senti_ind	ind	Indonesia	400
nusax_senti_ace	ace	Indonesia	400
nusax_senti_jav	jav	Indonesia	400
nusax_senti_sun	sun	Indonesia	400
nusax_senti_min	min	Indonesia	400
nusax_senti_bug	bug	Indonesia	400
nusax_senti_bbc	bbc	Indonesia	400
nusax_senti_ban	ban	Indonesia	400
nusax_senti_nij	nij	Indonesia	400
nusax_senti_mad	mad	Indonesia	400
nusax_senti_bjn	bjn	Indonesia	400
nusax_senti_eng	eng	Non-indigenous	400
indonglish	ind	Indonesia	1011

Table 5: Sentiment analysis data subsets used in SEACrowd NLU evaluation.

Subset ID	Language	Region	# Samples
NLI $\rightarrow$ *_seacrowd_pairs
indonli	ind	Indonesia	5183
wrete	ind	Indonesia	100
snli_indo	ind	Indonesia	9823
myxnli	mya	Myanmar	5010
xnli.tha	tha	Thailand	5010
xnli.vie	vie	Vietnam	5010

Table 6: NLI data subsets used in SEACrowd NLU evaluation.

Subset ID	Language	Region	# Samples
Topic Classification $\rightarrow$ *_seacrowd_text
gklmip_newsclass	khm	Cambodia	1436
indonesian_news_dataset	ind	Indonesia	2627
uit_vion	vie	Vietnam	26000
sib_200_ace_Arab	ace	Indonesia	204
sib_200_ace_Latn	ace	Indonesia	204
sib_200_ban_Latn	ban	Indonesia	204
sib_200_bjn_Arab	bjn	Indonesia	204
sib_200_bjn_Latn	bjn	Indonesia	204
sib_200_bug_Latn	bug	Indonesia	204
sib_200_ceb_Latn	ceb	Philippines	204
sib_200_ilo_Latn	ilo	Philippines	204
sib_200_ind_Latn	ind	Indonesia	204
sib_200_jav_Latn	jav	Indonesia	204
sib_200_kac_Latn	kac	Myanmar	204
sib_200_khm_Khmr	khm	Cambodia	204
sib_200_lao_Laoo	lao	Laos	204
sib_200_lus_Latn	lus	Myanmar	204
sib_200_min_Arab	min	Indonesia	204
sib_200_min_Latn	min	Indonesia	204
sib_200_mya_Mymr	mya	Myanmar	204
sib_200_pag_Latn	pag	Philippines	204
sib_200_shn_Mymr	shn	Myanmar	204
sib_200_sun_Latn	sun	Indonesia	204
sib_200_tgl_Latn	fil	Philippines	204
sib_200_tha_Thai	tha	Thailand	204
sib_200_vie_Latn	vie	Non-indigenous	204
sib_200_war_Latn	war	Philippines	204
sib_200_zsm_Latn	zsm	Malaysia	204
nusaparagraph_topic_btk	btx	Indonesia	500
nusaparagraph_topic_bew	bew	Indonesia	800
nusaparagraph_topic_bug	bug	Indonesia	300
nusaparagraph_topic_jav	jav	Indonesia	800
nusaparagraph_topic_mad	mad	Indonesia	700
nusaparagraph_topic_mak	mak	Indonesia	700
nusaparagraph_topic_min	min	Indonesia	800
nusaparagraph_topic_mui	mui	Indonesia	400
nusaparagraph_topic_rej	rej	Indonesia	350
nusaparagraph_topic_sun	sun	Indonesia	900

Table 7: Topic classification data subsets used in SEACrowd NLU evaluation.

Subset ID	Language	Region	# Samples
Commonsense Reasoning $\rightarrow$ *_seacrowd_text/qa
emotes_3k_tgl	fil	Philippines	2905
emotes_3k_eng	eng	Non-indigenous	2905
indo_story_cloze	ind	Indonesia	1135
xstorycloze_id	ind	Indonesia	1511
xstorycloze_my	mya	Myanmar	1511

Table 8: Commonsense reasoning data subsets used in SEACrowd NLU evaluation.

Subset ID	Language	Region	# Samples
Standard Testing QA $\rightarrow$ *_seacrowd_qa
indommlu_ind	ind	Indonesia	14979
indommlu_ban	ban	Indonesia	14979
indommlu_mad	mad	Indonesia	14979
indommlu_mak	mak	Indonesia	14979
indommlu_sun	sun	Indonesia	14979
indommlu_jav	jav	Indonesia	14979
indommlu_bjn	bjn	Indonesia	14979
indommlu_abl	abl	Indonesia	14979
indommlu_nij	nij	Indonesia	14979
seaeval_cross_mmlu_ind	ind	Indonesia	150
seaeval_cross_mmlu_vie	vie	Vietnam	150
seaeval_cross_mmlu_zlm	zsm	Malaysia	150
seaeval_cross_mmlu_fil	fil	Philippines	150
seaeval_cross_logiqa_ind	ind	Indonesia	176
seaeval_cross_logiqa_vie	vie	Vietnam	176
seaeval_cross_logiqa_zlm	zsm	Malaysia	176
seaeval_cross_logiqa_fil	fil	Philippines	176
m3exam_jav	jav	Indonesia	371
m3exam_tha	tha	Thailand	2168
m3exam_vie	vie	Vietnam	1789
okapi_m_arc_ind	ind	Indonesia	1170
okapi_m_arc_vie	vie	Vietnam	1170
Cultural QA $\rightarrow$ *_seacrowd_qa
copal_colloquial	ind	Indonesia	559
xcopa_tha	tha	Thailand	500
xcopa_vie	vie	Vietnam	500
xcopa_ind	ind	Indonesia	500
seaeval_sg_eval_eng	eng	Non-indigenous	103
seaeval_ph_eval_eng	eng	Non-indigenous	100
mabl_ind	ind	Indonesia	1140
mabl_jav	jav	Indonesia	600
mabl_sun	sun	Indonesia	600
Reading Comprehension QA $\rightarrow$ *_seacrowd_qa
belebele_ceb_latn	ceb	Philippines	900
belebele_ilo_latn	ilo	Philippines	900
belebele_ind_latn	ind	Indonesia	900
belebele_jav_latn	jav	Indonesia	900
belebele_kac_latn	kac	Myanmar	900
belebele_khm_khmr	khm	Cambodia	900
belebele_lao_laoo	lao	Laos	900
belebele_mya_mymr	mya	Myanmar	900
belebele_shn_mymr	shn	Myanmar	900
belebele_sun_latn	sun	Indonesia	900
belebele_tgl_latn	fil	Philippines	900
belebele_tha_thai	tha	Thailand	900
belebele_vie_latn	vie	Vietnam	900
belebele_war_latn	war	Philippines	900
belebele_zsm_latn	zsm	Malaysia	900

Table 9: Multiple-choice QA data subsets used in SEACrowd NLU evaluation.

Subset ID	Language	Region	# Samples
Extractive & Abstractive QA $\rightarrow$ *_seacrowd_qa
facqa	ind	Indonesia	311
iapp_squad	tha	Thailand	739
qasina	ind	Indonesia	500
mkqa_khm	khm	Cambodia	10000
mkqa_zsm	zsm	Malaysia	10000
mkqa_tha	tha	Thailand	10000
mkqa_vie	vie	Vietnam	10000

Table 10: Extractive and abstractive QA subsets used in SEACrowd NLG evaluation.

Subset ID	Language	Region	# Samples
Summarization $\rightarrow$ *_seacrowd_t2t
lr_sum_ind	ind	Indonesia	500
lr_sum_vie	vie	Vietnam	1460
lr_sum_lao	lao	Laos	1496
lr_sum_tha	tha	Thailand	500
lr_sum_khm	khm	Cambodia	486
lr_sum_mya	mya	Myanmar	990
xl_sum_mya	mya	Myanmar	570
xl_sum_ind	ind	Indonesia	4780
xl_sum_tha	tha	Thailand	826
xl_sum_vie	vie	Vietnam	4013

Table 11: Summarization data subsets used in SEACrowd NLG evaluation.

F.1 NLP

•

Unlabeled text (SSP). This schema could be used for language modeling in self-supervised pre-training. It consists of (id, text), where id denotes a unique row identifier of the dataset and text denotes an input text.
•

Single-label text classification (TEXT). This schema could be used for sentiment analysis, emotion classification, legal classification, and others. It consists of (id, text, label), where id denotes a unique row identifier of the dataset, text denotes an input text, and label denotes a deterministic target variable.
•

Multi-label text classification (TEXT MULTI). This schema could be used for hate speech detection and aspect-based sentiment analysis. It consists of (id, text, labels), where id denotes a unique row identifier of the dataset, text denotes an input text, and labels denotes a list of deterministic target variables.
•

Text-to-text (T2T). This schema could be used for machine translation, summarization, and paraphrasing. It consists of (id, text_1, text_2, text_1_name, text_2_name), where id denotes a unique row identifier of the dataset, text_1 and text_2 denote an input text pair, and text_1_name and text_2_name denote the names of the input text pair (e.g., ind and jav for translation input text pairs, or document and summary for summarization input text pairs).
•

Sequence labeling (SEQ LABEL). This schema could be used for named entity recognition (NER), POS tagging, and others. It consists of (id, tokens, labels), where id denotes a unique row identifier of the dataset, tokens denotes a list of tokens of an input text, and labels denotes a list of targets for the tokens.
•

Question answering (QA). This schema could be used for extractive QA, multiple-choice QA, and others. It consists of (id, question_id, document_id, question, type, choices, context, answer), where id denotes a unique row identifier of the dataset, question_id denotes a unique identifier of the question, document_id denotes a unique identifier of the context document, question denotes an input question to be answered, type denotes the type of the QA task (e.g., extractive, multiple-choice, open-generative, closed-generative, etc.), choices denotes a list of answer choices (if required), context denotes a passage that serves as the background information of the question (if required), and answer denotes the gold answer to the question (if required).
•

Single-label text pair classification (PAIRS). This could be used for textual entailment and next-sentence prediction. It consists of (id, text_1, text_2, label), where id denotes a unique row identifier of the dataset, text_1 and text_2 denote an input text pair, and label denotes the target variable.
•

Single-label text pair classification with continuous values or regression (PAIRS SCORE). This could be used for answer grading and semantic textual similarity. It consists of (id, text_1, text_2, label), where id denotes a unique row identifier of the dataset, text_1 and text_2 denote an input text pair, and label denotes a target variable as a continuous value.
•

Multi-label text pair classification (PAIRS MULTI). This could be used for morphological inflection. It consists of (id, text_1, text_2, labels), where id denotes a unique row identifier of the dataset, text_1 and text_2 denote an input text pair, and labels denotes a list of target variables.
•

Knowledge base (KB). This schema could be used for constituency parsing, dependency parsing, coreference resolution, dialogue systems, and other tasks with complex structures. It consists of (id, passages, entities, events, coreferences, relations). Considering its intricate structure, we encourage readers to take a look at the implementation of the knowledge base schema.
•

Tree (TREE). This schema could be used for constituency parsing. It assumes a document with subnode elements and a tree hierarchy. It consists of (id, passage, nodes), where id denotes a unique row identifier of the dataset, passage denotes the passage for that id and consists of (id, type, text, offsets), and nodes denotes the nodes for that id, each consisting of (id, type, text, offsets, subnodes).
•

Conversational Chat (CHAT). This schema could be used for conversational chat and/or multi-turn conversation. It consists of (id, input, output, meta), where id denotes a unique row identifier of the dataset, input denotes a sequence of entries, each consisting of content (the input prompt) and role (the role of the entity issuing the prompt), output denotes the answer to that input prompt, and meta denotes relevant details to allow some flexibility of the schema (if required).
•

End-to-end Task Oriented Dialogue (TOD). This schema could be used for end-to-end task-oriented dialogue. It consists of (dialogue_idx, dialogue), where dialogue_idx denotes a unique row identifier of the dialogue and dialogue denotes core details such as the turn label, system utterance, turn index, belief state (consisting of slots and act), user utterance, and system acts.
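The field lists above can be collected into a small lookup table, which is handy for checking that a dataloader's output matches its declared schema. This is a sketch of ours, not the SEACrowd implementation: the dictionary keys are informal labels, the `conforms` helper and the example record are hypothetical, and only a subset of the schemas is shown.

```python
# Field names as listed in this section; the keys are informal schema labels.
NLP_SCHEMA_FIELDS = {
    "ssp": ("id", "text"),
    "text": ("id", "text", "label"),
    "text_multi": ("id", "text", "labels"),
    "t2t": ("id", "text_1", "text_2", "text_1_name", "text_2_name"),
    "seq_label": ("id", "tokens", "labels"),
    "qa": ("id", "question_id", "document_id", "question",
           "type", "choices", "context", "answer"),
    "pairs": ("id", "text_1", "text_2", "label"),
}


def conforms(record: dict, schema: str) -> bool:
    """Return True if the record has exactly the fields of the named schema."""
    return set(record) == set(NLP_SCHEMA_FIELDS[schema])


# A summarization pair expressed in the T2T layout.
t2t_record = {
    "id": "0",
    "text_1": "Long news article ...",
    "text_2": "Short summary ...",
    "text_1_name": "document",
    "text_2_name": "summary",
}
```

A check such as `conforms(t2t_record, "t2t")` only compares field names; verifying value types (for example, that `labels` is a list) would need an additional step.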

Subset ID	Language	Region	# Samples
Image Captioning $\rightarrow$ *_seacrowd_imtext
xm3600_fil	fil	Philippines	2760
xm3600_id	ind	Indonesia	2775
xm3600_th	tha	Thailand	2798
xm3600_vi	vie	Vietnam	2855

Table 12: Image captioning data subsets used in SEACrowd VL evaluation.

F.2 Speech

•

Speech-text (SPTEXT). This could be used for speech recognition, text-to-speech (TTS) or speech synthesis, and speech-to-text translation. It consists of (id, path, audio, text, speaker_id, metadata), where id denotes a unique row identifier of the dataset, path denotes the file path to an input audio source, audio denotes the audio data loaded from the corresponding path, text denotes an input text, speaker_id denotes a unique identifier of the speaker, and metadata denotes relevant details such as the age and gender of the speaker (if required).
•

Speech-to-speech (S2S). This could be used for speech-to-speech translation. It consists of (id, path_1, audio_1, text_1, metadata_1, path_2, audio_2, text_2, metadata_2), where id denotes a unique row identifier of the dataset, path_1 and path_2 denote the file path to a respective input audio source, audio_1 and audio_2 denote the audio data loaded from the corresponding path, text_1 and text_2 denote input texts, and metadata_1 and metadata_2 denote relevant details such as the age of the speaker and their gender (if required).
•

Speech Classification (SPEECH). This schema could be used for single-label speech classification, speech-language identification, and speech-emotion recognition. It consists of (id, path, audio, speaker_id, labels, metadata), where id denotes a unique row identifier of the dataset, path denotes the file path to an input audio source, audio denotes the audio data loaded from the corresponding path, speaker_id denotes a unique identifier of the speaker, labels denotes the single label of that particular speech, and metadata denotes relevant details such as the age and gender of the speaker (if required).
•

Speech Classification for Multilabel (SPEECH MULTILABEL). This schema could be used for multi-label speech classification, speech-language identification, and speech-emotion recognition. It consists of (id, path, audio, speaker_id, labels, metadata), where id denotes a unique row identifier of the dataset, path denotes the file path to an input audio source, audio denotes the audio data loaded from the corresponding path, speaker_id denotes a unique identifier of the speaker, labels denotes the sequence of labels of that particular speech, and metadata denotes relevant details such as the age and gender of the speaker (if required).
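As an illustration of the speech-text layout, the sketch below writes a short silent WAV file with Python's standard `wave` module and wraps it in a record with the fields (id, path, audio, text, speaker_id, metadata). The `array`/`sampling_rate` layout of the `audio` field, the helper name, and all values are assumptions made for the example; actual dataloaders may represent decoded audio differently.

```python
import os
import struct
import tempfile
import wave


def make_sptext_record(row_id, path, text, speaker_id, metadata=None):
    """Hypothetical helper: load a 16-bit mono WAV file into an SPTEXT-style record."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    # Decode little-endian 16-bit PCM samples.
    samples = list(struct.unpack(f"<{len(frames) // 2}h", frames))
    return {
        "id": row_id,
        "path": path,
        "audio": {"array": samples, "sampling_rate": rate},  # assumed layout
        "text": text,
        "speaker_id": speaker_id,
        "metadata": metadata or {},
    }


# Write 0.01 s of silence at 16 kHz so the example is self-contained.
wav_path = os.path.join(tempfile.mkdtemp(), "example.wav")
with wave.open(wav_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 160)

record = make_sptext_record(
    "0", wav_path, "selamat pagi", "spk_001",
    {"speaker_age": None, "speaker_gender": None},
)
```

Keeping both `path` and `audio` in the record lets a consumer either stream from disk or work directly with the decoded samples.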

Subset ID	Language	Region	# Samples
ASR $\rightarrow$ *_seacrowd_sptext
asr_ibsc	iba	Brunei	473
commonvoice_120_ind	ind	Indonesia	3647
commonvoice_120_tha	tha	Thailand	10964
commonvoice_120_cnh	cnh	Myanmar	763
commonvoice_120_vie	vie	Vietnam	1302
fleurs_ind	ind	Indonesia	687
fleurs_jav	jav	Indonesia	728
fleurs_tha	tha	Thailand	1021
fleurs_lao	lao	Laos	405
fleurs_mya	mya	Myanmar	880
fleurs_khm	khm	Cambodia	771
fleurs_vie	vie	Vietnam	857
fleurs_zlm	zlm	Malaysia	749
fleurs_fil	fil	Philippines	964
fleurs_ceb	ceb	Philippines	541
indspeech_newstra_ethnicsr_nooverlap_jav	jav	Indonesia	1000
indspeech_newstra_ethnicsr_nooverlap_sun	sun	Indonesia	1000
indspeech_newstra_ethnicsr_nooverlap_ban	ban	Indonesia	1000
indspeech_newstra_ethnicsr_nooverlap_btk	btx	Indonesia	1000

Table 13: ASR data subsets used in SEACrowd speech evaluation.

Subset ID (Eng $\rightarrow$ XX)	Subset ID (XX $\rightarrow$ Eng)	Language	Region	# Samples
MT (Eng $\Leftrightarrow$ XX) $\rightarrow$ *_seacrowd_t2t
lio_and_central_flores_eng_ljl	lio_and_central_flores_ljl_eng	ljl	Indonesia	1658
flores200_eng_Latn_ace_Latn	flores200_ace_Latn_eng_Latn	ace	Indonesia	1012
flores200_eng_Latn_ban_Latn	flores200_ban_Latn_eng_Latn	ban	Indonesia	1012
flores200_eng_Latn_bjn_Latn	flores200_bjn_Latn_eng_Latn	bjn	Indonesia	1012
flores200_eng_Latn_bug_Latn	flores200_bug_Latn_eng_Latn	bug	Indonesia	1012
flores200_eng_Latn_ceb_Latn	flores200_ceb_Latn_eng_Latn	ceb	Philippines	1012
flores200_eng_Latn_ilo_Latn	flores200_ilo_Latn_eng_Latn	ilo	Philippines	1012
flores200_eng_Latn_ind_Latn	flores200_ind_Latn_eng_Latn	ind	Indonesia	1012
flores200_eng_Latn_jav_Latn	flores200_jav_Latn_eng_Latn	jav	Indonesia	1012
flores200_eng_Latn_kac_Latn	flores200_kac_Latn_eng_Latn	kac	Myanmar	1012
flores200_eng_Latn_khm_Khmr	flores200_khm_Khmr_eng_Latn	khm	Cambodia	1012
flores200_eng_Latn_lao_Laoo	flores200_lao_Laoo_eng_Latn	lao	Laos	1012
flores200_eng_Latn_lus_Latn	flores200_lus_Latn_eng_Latn	lus	Myanmar	1012
flores200_eng_Latn_min_Latn	flores200_min_Latn_eng_Latn	min	Indonesia	1012
flores200_eng_Latn_mya_Mymr	flores200_mya_Mymr_eng_Latn	mya	Myanmar	1012
flores200_eng_Latn_pag_Latn	flores200_pag_Latn_eng_Latn	pag	Philippines	1012
flores200_eng_Latn_shn_Mymr	flores200_shn_Mymr_eng_Latn	shn	Myanmar	1012
flores200_eng_Latn_sun_Latn	flores200_sun_Latn_eng_Latn	sun	Indonesia	1012
flores200_eng_Latn_tha_Thai	flores200_tha_Thai_eng_Latn	tha	Thailand	1012
flores200_eng_Latn_vie_Latn	flores200_vie_Latn_eng_Latn	vie	Vietnam	1012
flores200_eng_Latn_war_Latn	flores200_war_Latn_eng_Latn	war	Philippines	1012
flores200_eng_Latn_zsm_Latn	flores200_zsm_Latn_eng_Latn	zsm	Malaysia	1012
ntrex_128_eng-US_ind	ntrex_128_ind_eng-US	ind	Indonesia	1997
ntrex_128_eng-US_mya	ntrex_128_mya_eng-US	mya	Myanmar	1997
ntrex_128_eng-US_fil	ntrex_128_fil_eng-US	fil	Philippines	1997
ntrex_128_eng-US_khm	ntrex_128_khm_eng-US	khm	Cambodia	1997
ntrex_128_eng-US_lao	ntrex_128_lao_eng-US	lao	Laos	1997
ntrex_128_eng-US_zlm	ntrex_128_zlm_eng-US	zsm	Malaysia	1997
ntrex_128_eng-US_tha	ntrex_128_tha_eng-US	tha	Thailand	1997
ntrex_128_eng-US_vie	ntrex_128_vie_eng-US	vie	Vietnam	1997
ntrex_128_eng-US_hmv	ntrex_128_hmv_eng-US	hmv	Vietnam	1997
nusax_mt_eng_ind	-	ind	Indonesia	400
nusax_mt_eng_ace	nusax_mt_ace_eng	ace	Indonesia	400
nusax_mt_eng_jav	nusax_mt_jav_eng	jav	Indonesia	400
nusax_mt_eng_sun	nusax_mt_sun_eng	sun	Indonesia	400
nusax_mt_eng_min	nusax_mt_min_eng	min	Indonesia	400
nusax_mt_eng_bug	nusax_mt_bug_eng	bug	Indonesia	400
nusax_mt_eng_bbc	nusax_mt_bbc_eng	bbc	Indonesia	400
nusax_mt_eng_ban	nusax_mt_ban_eng	ban	Indonesia	400
nusax_mt_eng_nij	nusax_mt_nij_eng	nij	Indonesia	400
nusax_mt_eng_mad	nusax_mt_mad_eng	mad	Indonesia	400
nusax_mt_eng_bjn	nusax_mt_bjn_eng	bjn	Indonesia	400

Table 14: MT between English and SEA languages data subsets used in SEACrowd NLG evaluation.

F.3 VL

•

Image-text (IMTEXT). This schema could be used for image captioning, text-to-image generation, and vision-language pre-training. It consists of (id, text, image_paths, metadata), where id denotes a unique row identifier of the dataset, text denotes an input text, image_paths denotes a list of paths to the input image sources, and metadata denotes relevant details such as visual concepts and labels (if required).
•

General Image Classification (IMAGE). This schema could be used for both single-label and multi-label image classification. It consists of (id, labels, image_path, metadata), where id denotes a unique row identifier of the dataset, labels denotes the label or labels of that particular image, image_path denotes a list of paths to the input image sources, and metadata denotes relevant details such as visual concepts and labels (if required).
•

Image Question Answering (IMQA). This schema could be used for image/visual question answering. It consists of (id, question_id, document_id, questions, type, choices, context, answer, image_paths, meta), where id denotes a unique row identifier of the dataset, question_id denotes a unique identifier of the question, document_id denotes a unique identifier of the context document, questions denotes the input question to be answered, type denotes the type of the QA task (e.g., extractive, multiple-choice, open-generative, closed-generative, etc.), choices denotes a list of answer choices (if required), context denotes a passage that serves as the background information of the question (if required), answer denotes the gold answer to the question (if required), image_paths denotes a list of paths to the input image sources, and meta denotes relevant details to allow some flexibility of the schema (if required).
•

General Video-to-Text (VIDEO). This schema could be used for video-to-text retrieval and video captioning. It consists of (id, video_path, text, metadata), where id denotes a unique row identifier of the dataset, video_path denotes the file path to an input video source, text denotes the text associated with that particular frame/video, and metadata denotes relevant details such as the resolution, duration, and FPS of the video (if required).

Appendix G Supplementary Details for SEA Evaluation

Model	$\tau=0.01$	$\tau=0.2$	$\tau=0.5$	$\tau=0.7$	$\tau=1.0$
Commercial
GPT-4	0.199	0.192	0.155	0.118	0.066
Command-R	0.201	0.198	0.185	0.168	0.126
English
Mistral	0.161	0.160	0.159	0.162	0.150
Llama3	0.138	0.137	0.131	0.129	0.113
Falcon	0.274	0.272	0.238	0.250	0.211
Multilingual
mT0	0.151	0.148	0.131	0.112	0.074
BLOOMZ	0.238	0.236	0.228	0.217	0.167
BactrianX-Llama	0.163	0.162	0.163	0.168	0.149
AYA-23	0.183	0.182	0.183	0.179	0.135
AYA-101	0.112	0.109	0.095	0.085	0.069
SEA regional
SEA-LION	0.250	0.242	0.204	0.164	0.102
SeaLLM v2.5	0.137	0.133	0.116	0.097	0.069
Sailor	0.152	0.151	0.145	0.139	0.113
SEA country
Cendol-mT5	0.407	0.404	0.378	0.328	0.200
Cendol-Llama2	0.294	0.290	0.267	0.232	0.149
Merak v4	0.209	0.207	0.199	0.190	0.155
WangchanX-Llama3	0.163	0.161	0.153	0.150	0.131
Malaysian Llama3	0.181	0.181	0.179	0.176	0.143

Table 15: Language equity across baselines based on the Gini coefficient weighted by population with different $\tau$ values. Lower Gini means higher equity.
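The exact weighting behind Table 15 is not spelled out in this appendix. As one plausible reading of the caption, the sketch below computes a standard weighted Gini coefficient over per-language scores, assuming that each language's weight is its speaker population raised to the power $\tau$ and then normalized. The function name and this weighting scheme are assumptions of ours, not the authors' confirmed procedure.

```python
def population_weighted_gini(scores, populations, tau):
    """Weighted Gini coefficient of per-language scores.

    Assumption: language l gets weight population_l ** tau, normalized to sum
    to 1, so tau = 0 weighs all languages equally and tau = 1 weighs them in
    proportion to speaker population.
    """
    raw = [p ** tau for p in populations]
    total = sum(raw)
    weights = [r / total for r in raw]
    mean = sum(w * s for w, s in zip(weights, scores))
    if mean == 0:
        return 0.0
    # Weighted mean absolute difference over all ordered pairs of languages.
    mad = sum(
        wi * wj * abs(si - sj)
        for wi, si in zip(weights, scores)
        for wj, sj in zip(weights, scores)
    )
    return mad / (2 * mean)
```

With this definition, identical scores across languages give a coefficient of 0, and increasing $\tau$ shifts the measure toward how evenly a model serves the most widely spoken languages.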

G.1 Datasets

Tables 5, 6, 7, 8, and 9 provide the details of the data subsets used in the NLU evaluation. The sentiment analysis datasets are originally from NusaX (Winata et al., 2023), NusaTranslation (Cahyawijaya et al., 2023b), SentiTaglish (https://huggingface.co/datasets/ccosme/SentiTaglishProductsAndServices), SmSA (Purwarianti and Crisdayanti, 2019), PRDECT-ID (Sutoyo et al., 2022), code-mixed Indonesian-English sentiment (Astuti et al., 2023), Karonese tweet sentiment (Karo et al., 2022), Typhoon Yolanda sentiment (Imperial et al., 2019), GKLMIP Khmer sentiment (Jiang et al., 2022), the Wisesight sentiment corpus (https://github.com/PyThaiNLP/wisesight-sentiment), Filipino-Tagalog product reviews sentiment (https://github.com/EricEchemane/Filipino-Tagalog-Product-Reviews-Sentiment-Analysis), and multilabel sentiment of Indonesian mobile app reviews (Riccosan and Saputra, 2023).

The topic classification datasets are originally from NusaParagraph (Cahyawijaya et al., 2023b), UIT-ViON (Tran et al., 2021), SIB-200 (Adelani et al., 2024), GKLMIP Khmer news (Jiang et al., 2022), and Indonesian news (Muzad and Rahutomo, 2016). The natural language inference datasets are originally from IndoNLI (Mahendra et al., 2021), WreTe (Setya and Mahendra, 2018), SNLI Indo (Putra et al., 2024), MyXNLI (https://huggingface.co/datasets/akhtet/myXNLI), and XNLI (Conneau et al., 2018). The commonsense reasoning datasets are originally from XStoryCloze (Lin et al., 2022), IndoCloze (Koto et al., 2022), and EMoTES-3K (Catapang and Visperas, 2023).

The open-domain QA datasets are originally from IndoMMLU (Koto et al., 2023b), SeaEval (Wang et al., 2023), M3Exam (Zhang et al., 2023b), and Okapi (Dac Lai et al., 2023). The cultural QA datasets are originally from COPAL-ID (Wibowo et al., 2023), XCOPA (Ponti et al., 2020), SeaEval (Wang et al., 2023), and Multilingual Fig-QA (Kabra et al., 2023). The reading comprehension dataset is originally from Belebele (Bandarkar et al., 2023).

Tables 10, 11, and 14 provide the details of the data subsets used in the NLG evaluation. The summarization datasets are originally from LR-Sum (Palen-Michel and Lignos, 2023) and XL-Sum (Hasan et al., 2021). The machine translation datasets are originally from Lio and the Central Flores corpus (Elias, 2018), Flores-200 (Costa-jussà et al., 2024), and NTREX-128 (Federmann et al., 2022). The question answering datasets are originally from FacQA (Purwarianti et al., 2007), QASiNa (Rizqullah et al., 2023), MKQA (Longpre et al., 2021), and the Open Thai Wikipedia QA dataset (https://zenodo.org/records/4539916).

Tables 12 and 13 provide the details of the data subsets used in the VL and speech evaluations. The image captioning dataset is drawn from XM3600 Thapliyal et al. (2022). The speech recognition datasets are drawn from the INDspeech NEWSTRA Ethnic collection Sani et al. (2012), ASR Iban Juan et al. (2015), FLEURS Conneau et al. (2022), and Common Voice Ardila et al. (2020).

G.2 Baselines

Tables 20, 21, and 22 report the details of the baseline models used in the SEACrowd evaluation (§3). For each baseline model, we provide the model size, the base model it originates from, the languages seen in its training corpora, and the URL from which it can be downloaded. This work does not aim to collect and evaluate every SEA-trained LLM available on the Internet, as doing so is computationally expensive. Rather, we initiate the exploration of selected publicly available models to serve as baselines for evaluating foundational capabilities on SEA languages through benchmarking on NLU, NLG, speech, and vision tasks aggregated via SEACrowd.

Across the models listed in the tables, we prioritized diversity in terms of scale, openness, and coverage of SEA languages. In NLP tasks, we covered five groups of language models for the main experiments, namely commercial, English-only, multilingual, SEA regional, and SEA country-specific models (Table 20). Instruction-tuned LLMs demonstrate the ability to generalize to unseen tasks Wei et al. (2021); Sanh et al. (2021); Ouyang et al. (2022). When these LLMs are based on a multilingual foundation, they have shown proficiency in generalizing across multiple languages Muennighoff et al. (2022); Adilazuarda et al. (2023); Zhang et al. (2023a). For NLU, we compute the weighted F1-score and obtain the answers via log-likelihood for open-source baselines or string matching for commercial baselines.
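To make the NLU scoring concrete, the following is a minimal sketch, not the evaluation code used in this work. It assumes that each label choice has already been scored with a log-likelihood by the model, and that a commercial model's free-text response is matched against the label names by case-insensitive substring search. The function names and all numbers are hypothetical.

```python
from collections import Counter


def pick_by_log_likelihood(option_scores):
    """Return the label whose continuation has the highest log-likelihood."""
    return max(option_scores, key=option_scores.get)


def pick_by_string_match(response, options):
    """Return the first label mentioned in a free-text response, else None.

    Assumption: case-insensitive substring matching; the exact matching
    rule used for commercial baselines is not specified in the text.
    """
    lowered = response.lower()
    for option in options:
        if option.lower() in lowered:
            return option
    return None


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true
    return total / len(y_true)


# Hypothetical log-likelihoods for three examples and three label choices.
scores = [
    {"positive": -1.2, "negative": -3.4, "neutral": -2.8},
    {"positive": -2.9, "negative": -0.7, "neutral": -2.1},
    {"positive": -1.9, "negative": -2.5, "neutral": -2.0},
]
gold = ["positive", "negative", "neutral"]
pred = [pick_by_log_likelihood(s) for s in scores]
print(pred, round(weighted_f1(gold, pred), 3))
```

Weighting each class by its support keeps frequent classes from being underrepresented in the aggregate score when label distributions are imbalanced.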

For the speech benchmark, only two model families are available: multilingual models and models fine-tuned on specific SEA languages. For vision tasks, we covered English-only and one multilingual model. These models utilize a visual backbone pre-trained on image-text alignment, e.g., CLIP Radford et al. (2021), to project image features into the input space of an existing pre-trained LM. In summary, we mostly explored open models readily accessible on HuggingFace but also included commercial models such as GPT-4 and Whisper V3 for performance benchmarking, reproducibility, and extension by future works.

Model	Hyperparameter	Value
Logistic Regression	max_iter	100
	C	np.linspace(0.001, 10, 100)
Naive Bayes	alpha	np.linspace(0.001, 1, 50)
	distribution	MultinomialNB
SVM	C	1
	kernel	["rbf", "linear"]

Table 16: Hyper-parameters of classical models for Translationese prediction through grid search.

G.3 Prompts

Tables 23, 24, and 25 describe the handwritten prompt templates used in the NLU, NLG, and VL evaluations (§3). For all tasks, we used zero-shot prompting as the baseline setup. Because of the task complexity and because the workload was distributed among volunteer contributors with available computing resources, we limited the experimental procedure for some setups so that results could be obtained by the target release dates. For NLU, we explored three prompt styles for each dataset from the core tasks, including commonsense reasoning, question answering, and NLI. For more compute-intensive tasks such as NLG and VL, we used only one uniform prompt style; for VL, we also explored prompts translated into SEA languages, i.e., Filipino, Indonesian, Thai, and Vietnamese.
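As an illustration of how such a template can be instantiated, the sketch below fills the first sentiment analysis template from Table 23 once per candidate label, producing one zero-shot prompt per label. The helper name, the comma-separated rendering of [OPTIONS], and the example sentence are assumptions made for illustration only.

```python
# First sentiment analysis template from Table 23.
TEMPLATE = (
    "Classify the sentiment of the text below.\n"
    "[INPUT] => Sentiment ([OPTIONS]): [LABEL_CHOICE]"
)


def fill_template(template, text, options, label):
    """Fill the placeholders of an NLU prompt template.

    The input text is substituted last so that placeholder-like strings
    inside the text itself are never re-substituted.
    """
    prompt = template.replace("[OPTIONS]", ", ".join(options))
    prompt = prompt.replace("[LABEL_CHOICE]", label)
    return prompt.replace("[INPUT]", text)


options = ["positive", "negative", "neutral"]
text = "Makanannya enak sekali."  # hypothetical Indonesian input

# One candidate prompt per label; each candidate can then be scored.
candidates = [fill_template(TEMPLATE, text, options, label) for label in options]
for candidate in candidates:
    print(repr(candidate))
```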

Model	3-label	HT vs. MT-Nat	MT vs. HT-Nat	Nat vs. HT-MT
LR (TF-IDF)	39.73	53.03	56.01	75.20
LR (BoW)	45.63	55.90	61.39	75.60
NB (TF-IDF)	33.43	49.53	50.55	73.05
NB (BoW)	33.70	49.10	50.64	71.26
SVM (TF-IDF)	39.55	52.63	55.10	76.40
SVM (BoW)	46.84	56.85	61.40	75.65
mDeBERTa	51.51	64.77	59.16	79.08

Table 17: Results of translationese classifier (accuracy) averaged across languages.

G.4 Evaluation Results

Tables 26 and 27 describe the NLU and NLG results per language.

G.5 Language Equity Results

Table 15 presents the language equity of LLMs used in the evaluation across different weights of the number of language speakers in the Gini coefficient calculation.
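For readers unfamiliar with the measure, the following is a minimal sketch of a weighted Gini coefficient over per-language scores. It is not the computation used to produce Table 15. In particular, deriving the weights as the number of speakers raised to an exponent `tau` is an assumption borrowed from the demand weighting of Blasi et al. (2022), and all scores and speaker counts are hypothetical.

```python
def speaker_weights(speakers, tau):
    """Weights n**tau: tau=0 treats languages equally, tau=1 weights by speakers.

    Assumption: this parameterization is illustrative, not taken from the text.
    """
    return [n ** tau for n in speakers]


def weighted_gini(values, weights):
    """Weighted Gini coefficient via the mean absolute difference.

    G = sum_ij w_i w_j |x_i - x_j| / (2 * (sum_i w_i)^2 * weighted_mean)
    Returns 0.0 for perfectly equal values.
    """
    total_w = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total_w
    if mean == 0:
        return 0.0
    diff = sum(
        wi * wj * abs(vi - vj)
        for vi, wi in zip(values, weights)
        for vj, wj in zip(values, weights)
    )
    return diff / (2 * total_w ** 2 * mean)


# Hypothetical per-language scores and speaker counts.
scores = [60.0, 40.0, 20.0]
speakers = [200e6, 80e6, 5e6]

for tau in (0.0, 0.5, 1.0):
    gini = weighted_gini(scores, speaker_weights(speakers, tau))
    print(f"tau={tau}: Gini={gini:.3f}")
```

A lower coefficient indicates that performance is spread more evenly across languages under the chosen weighting.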

Country	Affiliation	Origin
Indonesia	16	31
Malaysia	0	1
Philippines	3	7
Singapore	13	2
Thailand	1	2
Vietnam	0	1
Australia	1	0
Brazil/Sweden	0	1
Canada	1	0
China	2	8
Egypt	0	1
Germany	0	2
Hong Kong	2	0
India	0	1
Ireland	1	0
Japan	3	0
The Netherlands	0	1
UAE	5	0
UK	4	0
USA	9	1
Uzbekistan	0	2

Table 18: The demographics of the authors based on affiliation country and origin country.

Appendix H Supplementary Details for Translationese Classifier

H.1 Training & Evaluation Data

We manually select and validate the text collection method of each data subset for training and evaluating the translationese classifier, in Tables 28 and 29, respectively. This validation is done by checking the relevant publication, domain, and annotation method. If the texts in the data subsets are a product of machine or human translation, we regard them as translationese. We label data subsets with human-generated texts as natural data.

H.2 Experiments

We aim to assess the capability of ML models to differentiate between human-generated/natural samples (Nat), human-translated samples (HT), and machine-translated samples (MT). Our approach involves training classifiers using classical ML techniques and fine-tuning mDeBERTa models to enhance learning. Furthermore, we experiment by combining two label classes into one to evaluate the predictive difficulty of distinguishing between these labels. This analysis provides valuable insights into the relative similarity of the samples across these categories. The following section provides a comprehensive overview of our methodology for this study.
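The label-merging setups can be sketched as a simple relabeling step applied to the three-way labels before training and evaluation. The setup names below follow the column headers of Table 17; the dictionary layout and helper name are hypothetical.

```python
# Each setup maps the three original labels to the label set used in that run.
MERGE_SETUPS = {
    "3-label": {"Nat": "Nat", "HT": "HT", "MT": "MT"},
    "HT vs. MT-Nat": {"Nat": "MT-Nat", "HT": "HT", "MT": "MT-Nat"},
    "MT vs. HT-Nat": {"Nat": "HT-Nat", "HT": "HT-Nat", "MT": "MT"},
    "Nat vs. HT-MT": {"Nat": "Nat", "HT": "HT-MT", "MT": "HT-MT"},
}


def relabel(labels, setup):
    """Map three-way labels onto the (possibly merged) labels of a setup."""
    mapping = MERGE_SETUPS[setup]
    return [mapping[label] for label in labels]


labels = ["Nat", "HT", "MT", "HT"]  # hypothetical gold labels
for setup in MERGE_SETUPS:
    print(setup, relabel(labels, setup))
```

If a merged setup is much easier than the three-way setup, the two merged classes are likely close to each other and the remaining class is the distinctive one.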

Classical ML

We use three classical machine learning methods, 1) Logistic Regression (LR), 2) Naive Bayes (NB), and 3) Support Vector Machine (SVM), each with two different feature sets: TF-IDF and bag-of-words (BoW). We run hyper-parameter tuning with grid search over the values in Table 16 to find the best hyper-parameters for each method on the validation set, and report the results on the test set in Table 17.
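The selection procedure can be sketched in plain Python as follows. The grids mirror Table 16, with a small `linspace` helper standing in for `np.linspace`. The scoring function is a toy stand-in for training a classifier and measuring validation accuracy, so this is an illustration of the loop and not the code used in the experiments.

```python
from itertools import product


def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (like np.linspace)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


# Search spaces as listed in Table 16.
GRIDS = {
    "Logistic Regression": {"max_iter": [100], "C": linspace(0.001, 10, 100)},
    "Naive Bayes": {"alpha": linspace(0.001, 1, 50), "distribution": ["MultinomialNB"]},
    "SVM": {"C": [1], "kernel": ["rbf", "linear"]},
}


def grid_search(grid, validation_score):
    """Try every combination and keep the one with the best validation score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = validation_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    """Hypothetical scorer; a real run would fit a model and score it."""
    return 0.6 if params["kernel"] == "linear" else 0.5


best_params, best_score = grid_search(GRIDS["SVM"], toy_validation_score)
print(best_params, best_score)
```

Only the configuration selected on the validation set is then evaluated on the test set, which keeps the test set out of the tuning loop.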

Encoder LM

We explore fine-tuning an encoder-only LM to develop a translationese classifier. We use the mDeBERTa-v3-base model (https://huggingface.co/microsoft/mdeberta-v3-base) He et al. (2020, 2022), a multilingual encoder-only LM, as our backbone. We train the model with the AdamW Loshchilov and Hutter (2019) optimizer using a learning rate of 1e-5, a batch size of 256, and 500 warm-up steps for a maximum of 10 epochs. We apply early stopping with a patience of 3 epochs based on the validation accuracy. We show the results in Table 17.
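The training schedule described above can be sketched without any deep learning library. The sketch assumes a linear warm-up followed by a constant learning rate (the shape of the schedule after warm-up is not stated in the text) and interprets early stopping as a patience of 3 epochs without improvement in validation accuracy. The accuracy values are hypothetical.

```python
def warmup_lr(step, base_lr=1e-5, warmup_steps=500):
    """Linear warm-up to base_lr over warmup_steps, then constant.

    Assumption: the post-warm-up schedule is not specified in the text.
    """
    return base_lr * min(1.0, (step + 1) / warmup_steps)


def run_with_early_stopping(val_accuracies, max_epochs=10, patience=3):
    """Return (best_epoch, best_accuracy, epochs_run).

    Stops once validation accuracy has failed to improve for
    `patience` consecutive epochs, or after `max_epochs`.
    """
    best_acc, best_epoch, stale, epochs_run = float("-inf"), -1, 0, 0
    for epoch, acc in enumerate(val_accuracies[:max_epochs]):
        epochs_run = epoch + 1
        if acc > best_acc:
            best_acc, best_epoch, stale = acc, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_acc, epochs_run


# Hypothetical per-epoch validation accuracies.
history = [0.40, 0.47, 0.51, 0.50, 0.49, 0.505, 0.52]
print(run_with_early_stopping(history))
print(warmup_lr(0), warmup_lr(499), warmup_lr(5000))
```

In this toy run, training halts after the sixth epoch and the checkpoint from the third epoch (index 2) would be kept, even though a later epoch would have scored higher.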

No.	Name	C. Points
1	Holy Lovenia	549
2	Samuel Cahyawijaya	480
3	Rahmad Mahendra	317
4	Salsabil Maulana Akbar	243
5	Lester James V. Miranda	234
6	Zheng-Xin Yong	164
7	Jennifer Santoso	164
8	Elyanah Aco	158
9	Akhdan Fadhilah	157
10	Jonibek Mansurov	132
11	Fajri Koto	121
12	Joseph Marvin Imperial	118
13	Ruochen Zhang	114
14	Genta Indra Winata	108
15	Onno P. Kampman	107
16	Joel Ruben Antony Moniz	93
17	Muhammad Ravi Shulthan Habibi	92
18	Frederikus Hudi	83
19	Sedrick Keh	81
20	Alham Fikri Aji	80
21	Railey Montalan	78
22	Peerat Limkonchotiwat	72
23	Ryan Ignatius	56
24	Joanito Agili Lopo	50
25	William Nixon	50
26	Börje F. Karlsson	49
27	James Jaya	48
28	Ryandito Diandaru	48
29	Yuze Gao	48
30	William Tjhi	46
31	Patrick Amadeus	46
32	Bin Wang	44
33	Jan Christian Blaise Cruz	43
34	Chenxi Whitehouse	36
35	Ivan Halim Parmonangan	36
36	Maria Khelli	36
37	Sebastian Ruder	35
38	Wenyu Zhang	34
39	Lucky Susanto	33
40	Reynard Adha Ryanda	32
41	Sonny Lazuardi Hermawan	30
42	Dan John Velasco	29
43	Muhammad Dehan Al Kautsar	29
44	Willy Fitra Hendria	29
45	Yasmin Moslem	29
46	Noah Flynn	28
47	Muhammad Farid Adilazuarda	27
48	Haochen Li	27
49	Johanes Lee	27
50	R. Damanhuri	27
51	Shuo Sun	27
52	Muhammad Reza Qorib	26
53	Amirbek Djanibekov	25
54	Wei Qi Leong	25
55	Quyet V. Do	24
56	Niklas Muennighoff	24
57	Tanrada Pansuwan	22
58	Ilham Firdausi Putra	21
59	Yan Xu	21
60	Ayu Purwarianti	20
61	Ngee Chia Tai	20

Table 19: Co-authors ordered by their amount of contribution points.

Appendix I Supplementary Details for SEA Language Prioritization

Based on the results of the global utility metric Blasi et al. (2022), we provide the top-20 SEA indigenous languages to be prioritized based on their demand (i.e., the number of SEA language speakers) and current utility (Figure 10) or resource availability (Figure 11) (https://github.com/SEACrowd/globalutility). While the current utility, also known as the model capability, is relative to the model performance on eng, the resource availability is relative to 500, which is approximately the number of Korean-language datasets available on HuggingFace. Korean is chosen as the pivot because it is considered a higher-resource language than most by Joshi et al. (2020).
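A minimal sketch of these quantities follows, based on our reading of the global utility metric of Blasi et al. (2022), in which each language's utility is weighted by its share of demand. Capping resource availability at 1.0, ranking languages by demand share times the remaining utility gap, and every number and language code below are assumptions made for illustration and are not taken from the SEACrowd repository.

```python
def demand_shares(speakers, tau=1.0):
    """Demand share per language: n**tau normalized over all languages."""
    weights = {lang: n ** tau for lang, n in speakers.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


def resource_availability(n_datasets, pivot=500):
    """Dataset count relative to the pivot (about 500 for Korean).

    Assumption: values above the pivot are capped at 1.0.
    """
    return min(n_datasets / pivot, 1.0)


def global_utility(speakers, utility, tau=1.0):
    """Demand-weighted average utility across languages."""
    shares = demand_shares(speakers, tau)
    return sum(shares[lang] * utility[lang] for lang in speakers)


def prioritize(speakers, utility, tau=1.0):
    """Rank languages by demand share times the gap to full utility.

    Assumption: this ranking rule is a hypothetical illustration.
    """
    shares = demand_shares(speakers, tau)
    return sorted(speakers, key=lambda l: shares[l] * (1 - utility[l]), reverse=True)


# Hypothetical languages, speaker counts (millions), and dataset counts.
speakers = {"lang_a": 80.0, "lang_b": 20.0}
datasets = {"lang_a": 250, "lang_b": 50}
utility = {lang: resource_availability(n) for lang, n in datasets.items()}

print(utility)
print(global_utility(speakers, utility, tau=1.0))
print(global_utility(speakers, utility, tau=0.0))
print(prioritize(speakers, utility))
```

Setting `tau=1` emphasizes languages with many speakers, while `tau=0` treats every language equally regardless of population.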

Appendix J Contributor Demographics

Table 18 describes the geographical distribution of the authors in SEACrowd.

Appendix K Languages Under Study

Tables 30-48 present the list of SEA indigenous languages covered by SEACrowd. Information regarding the ISO 639-3 code, language name, region, and population is obtained from Eberhard et al. (2021); Hammarström et al. (2024); Project (2024); Dryer and Haspelmath (2013) and Wikipedia (https://www.wikipedia.org/).

Appendix L Amount of Contributions by Co-Authors

Table 19 provides a list of co-authors sorted by their amount of contributions in SEACrowd. The full details of their contributions can be seen in our contribution tracking.

Model name	Model size	Backbone	Seen langs	URL
Commercial
GPT-4	N/A	GPT-4	N/A	https://openai.com/index/gpt-4/. We used turbo-2024-04-09 for NLU and gpt-4o-2024-05-13 for NLG.
Command-R	36B	Command-R	2 SEA langs (vie, ind), 22 non-SEA langs	https://cohere.com/blog/command-r
English
Mistral	7B	Mistral	N/A	mistralai/Mistral-7B-Instruct-v0.3
Llama3	8B	Llama3	N/A	meta-llama/Meta-Llama-3-8B-Instruct
Falcon	7B	Falcon	0 SEA langs (mainly English)	tiiuae/falcon-7b-instruct
Multilingual
mT0	3B	mT5	2 SEA langs (vie, ind), 43 non-SEA langs	bigscience/mt0-xl
BLOOMZ	7B	BLOOM	2 SEA langs (vie, ind), 43 non-SEA langs	bigscience/bloomz-3b
BactrianX-Llama	7B	Llama	6 SEA langs (ind, vie, khm, mya, tha, tgl), 46 non-SEA langs	MBZUAI/bactrian-x-llama-7b-merged
AYA-23	8B	Command	2 SEA langs (ind, vie), 21 non-SEA langs	CohereForAI/aya-23-8B
AYA-101	13B	T5	9 SEA langs (ind, vie, tha, zsm, mya, ceb, fil, jav, sun), 92 non-SEA langs	CohereForAI/aya-101
SEA regional
SEA-LION	7B	MPT	8 SEA langs (ind, vie, tha, tgl, zsm, khm, lao, mya), 3 non-SEA langs	aisingapore/sea-lion-7b-instruct
SeaLLM v2.5	7B	SeaLLM	8 SEA langs (ind, vie, tha, tgl, zsm, khm, lao, mya)	SeaLLMs/SeaLLM-7B-v2.5
Sailor	7B	Qwen 1.5	5 SEA langs (ind, vie, lao, zlm, tha), 2 non-SEA langs	sail/Sailor-7B-Chat
SEA country
Cendol-mT5	3B	mT5	1 SEA lang (ind), 18 local Indonesian langs	indonlp/cendol-mt5-xl
Cendol-Llama2	7B	Llama2	1 SEA lang (ind), 18 local Indonesian langs	indonlp/cendol-llama2-7b
Merak v4	7B	Llama2	1 SEA lang (ind)	Ichsan2895/Merak-7B-v4
WangchanX-Llama3	8B	Llama3	4 SEA langs (ind, vie, tha, mya) and 26 non-SEA langs	airesearch/LLaMa3-8b-WangchanX-sft-Demo
Malaysian Llama3	8B	Llama3	1 SEA lang (zlm)	mesolitica/malaysian-llama-3-8b-instruct-16k

Table 20: LLMs used in SEACrowd NLU and NLG evaluation.

Model name	Model size	Backbone	Seen langs	URL
Multilingual
Whisper v3	1.54B	Whisper v3	89 non-SEA & 9 SEA (ind, jav, lao, zlm, mya, tgl, tha, sun, vie)	openai/whisper-large-v3
MMS 1B	1B	MMS	993 non-SEA & 205 SEA (abp, ace, acn, agn, ahk, akb, alj, alp, amk, aoz, atb, atq, ayz, ban, bbc, bcl, bdg, bdq, bep, bgr, bhz, bkd, blt, blx, blz, bno, bpr, bps, bru, btd, bts, btx, bvz, bzi, ceb, cek, cfm, cgc, cmr, cnh, ctd, dbj, dnt, dnw, dtp, eip, frd, gbi, gor, had, hap, hil, hlt, hnn, hvn, iba, ifa, ifb, ifk, ifu, ify, ilo, ind, itv, jav, jmd, kac, kak, kdt, khg, khm, kje, kjg, klw, kmd, kml, knb, kne, kpq, kps, kqe, kqr, krj, krr, kvw, kxf, kxm, kyb, kyo, kyu, kzf, lao, law, lbw, lcp, lew, lex, lhu, lis, lje, ljp, llg, lnd, lsi, mad, mak, mbb, mbt, mej, mhx, mhy, min, mkn, mnb, mnw, mnx, mog, mqf, mqj, mqn, mrw, mtd, mtj, mvp, mwq, mwv, mya, myl, nfa, nia, nij, nlc, nlk, nod, npy, nst, obo, pag, pam, pce, pez, plw, pmf, ppk, prf, prk, prt, pse, ptu, pww, raw, rej, rgu, rhg, ril, rol, saj, sas, sbl, sda, sea, sgb, shn, sjm, slu, sml, sne, suc, sun, sxn, sya, sza, tbk, tbl, tby, tcz, tdj, tes, tgl, tha, tih, tlb, tnt, tom, tvw, twb, twe, twu, txa, txq, ubl, urk, ury, vie, war, wlo, xdy, xmm, xsb, xte, yka, yli, yva, zlm, zyp)	facebook/mms-1b-all
Seamless M4T v2	2.3B	Seamless	83 non-SEA & 9 SEA (ind, jav, khm, lao, mya, tgl, tha, vie, zlm)	facebook/seamless-m4t-v2-large
Fine-tuned on specific language(s)
XLSR English	300M	Wav2Vec2	46 non-SEA & 7 SEA (ceb, cnh, ind, lao, tam, tgl, vie) & fine-tuning language(s)	jonatasgrosman/wav2vec2-large-xlsr-53-english
XLSR Ind-Jav-Sun				indonesian-nlp/wav2vec2-indonesian-javanese-sundanese
XLSR Indonesian				Galuh/wav2vec2-large-xlsr-indonesian
XLSR Thai				wannaphong/wav2vec2-large-xlsr-53-th-cv8-newmm
XLS-R Tagalog				sil-ai/wav2vec2-bloom-speech-tgl
XLS-R Burmese				sil-ai/wav2vec2-bloom-speech-mya
XLS-R Khmer				vitouphy/wav2vec2-xls-r-300m-khmer
Whisper Indonesian	1.54B	Whisper	89 non-SEA & 9 SEA (ind, jav, lao, msa, mya, tgl, tha, sun, vie)	cahya/whisper-large-id
Whisper Thai				biodatlab/whisper-th-large-v3-combined
Whisper Khmer				ksoky/whisper-large-khmer-asr

Table 21: Speech models used in SEACrowd speech evaluation.

Model name	Model size	Backbone	Pre-training images	URL
English
LLaVA 1.5	N/A	N/A	N/A	N/A
LLaVA 1.6	7B	Mistral-7B	N/A	liuhaotian/llava-v1.6-mistral-7b
Idefics2	8B	Mistral-7B-v0.1	1.5B	HuggingFaceM4/idefics2-8b
PaliGemma	2B	Gemma-2B	N/A	google/paligemma-3b-pt-224
Multilingual
mBLIP	N/A	blip2-flan-t5-xl	N/A	Gregor/mblip-mt0-xl

Table 22: VLMs used in SEACrowd VL evaluation.

No.	Prompt template
Sentiment Analysis
1	Classify the sentiment of the text below.\n[INPUT] => Sentiment ([OPTIONS]): [LABEL_CHOICE]
2	Predict the sentiment of the following text.\nText: [INPUT]\nAnswer with [OPTIONS]: [LABEL_CHOICE]
3	[INPUT]\nWhat would be the sentiment of the text above? [OPTIONS]? [LABEL_CHOICE]
Topic Classification
1	Classify the topic of the text below.\n[INPUT] => Topic ([OPTIONS]): [LABEL_CHOICE]
2	Predict the topic of the following text.\nText: [INPUT]\nAnswer with [OPTIONS]: [LABEL_CHOICE]
3	[INPUT]\nWhat would be the topic of the text above? [OPTIONS]? [LABEL_CHOICE]
Commonsense Reasoning → *_seacrowd_text
1	Classify the morality of the text below.\n[INPUT] => Morality ([OPTIONS]): [LABEL_CHOICE]
2	Predict the morality of the following text.\nText: [INPUT]\nAnswer with [OPTIONS]: [LABEL_CHOICE]
3	[INPUT]\nWhat would be the morality of the text above? [OPTIONS]? [LABEL_CHOICE]
Commonsense Reasoning → *_seacrowd_qa
1	Question: [QUESTION]\nWhat reply makes more sense to answer this question?\nChoices: [ANSWER_CHOICES]\nAnswer: [LABEL_CHOICE]
2	Based on the the following question: "[QUESTION]" and choices: [ANSWER_CHOICES] the correct answer is: [LABEL_CHOICE]
3	Question: [QUESTION]\nChoices: [ANSWER_CHOICES]\nThe correct answer to the given question is: [LABEL_CHOICE]
All QAs
1	Refer to the passage below and answer the following question:\nPassage: [CONTEXT]\nQuestion: [QUESTION]\nChoices: [ANSWER_CHOICES]\nAnswer: [LABEL_CHOICE]
2	[CONTEXT]\nBased on the above text, [QUESTION]\nChoices: [ANSWER_CHOICES]\nAnswer: [LABEL_CHOICE]
3	[CONTEXT]\nQuestion: [QUESTION]\nChoices:[ANSWER_CHOICES]\nReferring to the passage above, the correct answer to the given question is: [LABEL_CHOICE]
NLI
1	Hypothesis: [INPUT_A]\nPremise: [INPUT_B]\nQuestion: What is the relation between the hypothesis and the premise? [OPTIONS]? [LABEL_CHOICE]
2	Given the following premise and hypothesis:\nHypothesis: [INPUT_A]\nPremise: [INPUT_B]\nDetermine the logical relationship (([OPTIONS])): [LABEL_CHOICE]
3	Choose the most appropriate relationship ([OPTIONS]) between the premise and hypothesis:\nRelationship between "[INPUT_B]" and "[INPUT_A]": [LABEL_CHOICE]

Table 23: Prompt templates used for NLU tasks.

No.	Prompt template
Machine Translation (MT)
1	Translate the following text from [SOURCE] to [TARGET]. Give your translation directly.\nText: [INPUT]\nTranslation:
Summarization
1	Write a summary from the following text.\nText: [INPUT]\nSummary:
Abstractive & Extractive QA
1	Refer to the passage below and answer the following question:\nPassage: [CONTEXT]\nQuestion: [QUESTION]\nAnswer:

Table 24: Prompt templates used for NLG tasks.

Lang.	Prompt template
Image Captioning
eng	Caption the following image in [LANGUAGE].
fil	Ilarawan ang sumusunod na larawan.
ind	Deskripsikan gambar berikut.

Table 25: Prompt templates used for the image captioning task in VL evaluation.

	abl	abs	ace	ban	bbc	bew	bhp	bjn	btx	bug	ceb	eng	fil	ilo	ind	jav	kac	khm	lao	lus	mad	mak	min	mui	mya	nij	pag	rej	shn	sun	tha	vie	war	zsm	Overall
GPT-4	63.3	39.0	39.3	60.3	7.1	68.5	2.8	60.4	27.8	40.4	85.6	52.1	55.9	69.5	60.7	59.7	30.8	66.4	51.8	70.0	37.1	44.3	57.9	71.8	47.6	40.2	79.4	34.0	21.7	58.5	59.6	56.1	84.9	61.6	51.9
Command-R	50.1	80.8	57.6	62.8	47.4	81.8	58.2	57.1	57.3	57.9	66.7	69.4	51.1	56.8	58.3	61.2	36.5	41.5	33.8	63.9	61.9	58.4	66.4	81.7	34.8	53.3	75.6	69.6	35.4	63.2	42.7	55.9	67.6	55.7	58.0
Mistral	36.7	53.6	46.4	49.6	33.0	59.3	44.3	44.6	44.3	48.8	53.5	69.2	48.4	49.1	52.5	46.7	33.2	29.8	30.7	56.1	45.7	44.8	51.2	62.6	27.4	40.1	69.2	48.6	31.9	48.3	40.8	45.2	54.4	49.6	46.8
Llama3	37.3	40.3	43.2	48.9	34.8	44.5	32.6	42.2	38.5	42.9	51.2	59.5	45.2	46.7	49.2	44.4	28.5	34.6	30.3	46.8	39.0	38.0	43.6	49.2	35.2	39.6	60.5	38.5	31.1	45.2	43.8	45.5	50.3	49.0	42.6
Falcon	21.1	63.2	13.3	19.0	23.0	37.9	62.1	15.6	31.9	15.7	19.5	43.7	25.1	18.8	30.8	27.0	14.2	10.2	12.7	15.0	30.3	32.3	23.6	37.0	18.0	23.0	18.8	36.0	14.1	28.2	15.9	18.8	19.1	17.4	25.1
mT0	37.6	63.6	43.7	51.2	37.0	66.1	38.4	43.6	41.3	50.3	62.5	49.4	41.0	59.0	47.2	56.0	40.9	57.5	61.2	57.0	46.7	45.8	52.6	68.8	45.9	40.9	62.6	47.8	47.0	58.8	41.8	41.4	61.4	49.4	50.5
BLOOMZ	25.6	66.5	28.4	34.2	35.8	53.9	48.0	30.4	36.3	33.3	30.9	51.7	28.9	27.8	44.7	38.2	23.1	18.9	23.6	28.1	37.8	34.5	39.9	60.2	23.0	34.6	33.1	42.2	19.8	41.3	25.9	34.8	32.1	34.3	35.3
BactrianX-Llama	24.9	48.6	21.2	28.5	26.9	33.4	45.9	22.8	31.4	22.7	27.9	45.6	32.0	24.3	38.3	30.0	19.9	17.0	20.7	21.0	30.0	28.8	26.2	35.7	22.8	27.2	26.5	29.2	20.5	30.2	24.5	27.1	28.3	31.5	28.6
AYA-23	43.3	21.2	26.9	35.0	24.3	31.2	16.8	30.9	25.1	26.5	36.0	50.8	33.5	32.7	46.8	36.9	20.5	15.1	22.0	27.4	31.0	31.7	27.3	35.5	23.7	37.3	32.6	22.8	20.8	34.9	32.7	44.8	37.1	47.9	31.3
AYA-101	42.5	64.3	71.2	65.2	58.8	68.2	43.3	63.5	52.7	60.7	71.7	62.8	52.8	65.0	54.2	62.6	43.1	62.2	67.8	71.8	56.9	49.0	69.3	70.2	51.5	57.2	75.7	52.9	53.8	67.2	49.5	48.0	70.5	56.4	59.8
SEA-LION	10.3	62.3	13.5	16.5	21.3	35.3	60.3	13.4	31.8	15.2	13.6	26.6	20.6	10.2	27.6	21.4	8.7	16.8	15.2	12.5	26.8	28.3	22.8	34.6	23.0	16.0	14.4	34.1	9.7	23.4	16.3	14.7	14.2	13.3	21.9
SeaLLM v2.5	50.7	55.1	34.5	43.4	36.3	53.9	53.2	45.8	45.8	37.7	47.6	42.5	52.6	44.7	53.4	49.8	27.4	42.6	50.3	45.8	48.7	49.8	46.8	58.4	41.0	39.1	55.7	47.8	28.7	50.1	49.0	54.5	55.4	60.6	47.0
Sailor	50.4	59.2	43.8	55.5	44.1	61.5	43.9	50.5	44.8	45.7	45.6	63.0	40.2	45.0	51.3	53.1	29.9	32.7	53.9	53.9	47.6	46.5	52.8	63.9	28.1	52.7	59.3	42.2	26.7	54.0	46.3	47.7	49.2	52.1	48.1
Cendol-mT5	15.0	98.5	38.3	42.3	84.7	99.4	95.6	33.3	92.6	68.6	14.1	38.7	23.8	12.2	33.4	50.5	10.4	20.3	15.3	9.6	76.5	70.2	65.2	99.6	16.6	52.6	12.8	98.9	7.2	56.6	26.4	14.7	15.1	15.9	44.8
Cendol-Llama2	17.5	80.0	30.8	33.5	60.6	49.3	73.4	27.9	45.1	32.3	18.7	36.8	21.4	17.8	37.4	35.1	14.7	13.2	15.9	15.0	46.3	38.1	37.1	51.6	19.9	40.3	17.7	47.7	16.5	38.5	20.6	17.3	18.5	18.4	32.5
Merak	37.0	68.6	37.7	48.3	36.4	66.1	60.1	41.4	50.4	47.8	42.4	59.6	37.9	39.7	48.5	48.4	27.9	24.2	28.0	44.3	51.7	51.0	50.5	70.3	27.2	40.0	58.6	57.9	28.6	50.8	29.3	35.3	43.7	47.1	45.2
WangchanX-Llama3	38.4	59.3	26.8	35.2	35.0	43.3	56.9	31.6	38.3	31.2	32.3	57.6	36.6	29.3	45.0	38.7	23.7	24.3	25.1	26.6	40.4	41.4	34.8	43.6	31.6	37.0	31.2	42.9	23.5	39.8	36.5	38.4	31.3	37.0	36.6
Malaysian Llama3	38.9	62.3	38.1	41.9	39.2	46.9	58.3	39.5	40.5	35.9	37.8	55.5	34.5	33.1	48.6	42.6	24.7	18.9	20.4	33.6	42.1	41.0	42.5	48.5	22.2	39.6	46.8	41.1	19.6	44.0	33.7	34.6	37.7	49.9	39.2
Overall	35.6	60.4	36.4	42.9	38.1	55.6	49.7	38.6	43.1	39.7	42.1	51.9	37.9	37.9	46.0	44.6	25.5	30.3	32.1	38.8	44.3	43.0	45.0	58.0	30.0	39.5	46.1	46.4	25.4	46.3	35.3	37.5	42.8	41.5	41.4

Table 26: NLU evaluation results in weighted F1-score per language.

	ace	ban	bbc	bjn	bug	ceb	fil	hmv	ilo	ind	jav	kac	khm	lao	ljl	lus	mad	min	mya	nij	pag	shn	sun	tha	vie	war	zsm	Overall
GPT-4	5.8	6.0	7.4	4.7	5.6	13.7	9.5	8.5	14.2	3.7	6.8	7.4	2.7	3.4	2.7	11.3	3.7	6.3	2.8	4.2	10.4	3.0	6.1	2.1	10.0	13.2	5.0	6.7
Command-R	19.6	26.1	16.4	30.0	16.0	44.3	52.5	16.8	29.4	57.9	32.6	8.8	8.7	14.2	6.0	19.5	17.2	31.6	9.5	18.4	20.4	8.9	27.5	24.3	46.8	34.4	50.1	25.5
Mistral	12.4	15.0	10.0	13.9	11.1	28.5	37.2	10.2	15.9	28.6	15.4	7.3	8.7	10.8	4.2	11.7	9.5	18.0	5.7	12.4	17.5	9.5	14.8	15.1	25.1	22.4	31.1	15.6
Llama3	11.0	12.3	8.1	13.8	7.6	25.1	33.2	7.6	18.4	21.9	17.0	4.8	6.5	5.8	3.2	9.6	8.5	16.4	4.5	9.5	11.8	6.3	15.1	9.6	21.7	20.5	25.2	13.2
Falcon	7.3	9.5	8.2	8.3	7.9	18.6	23.6	6.6	9.7	15.3	7.7	6.0	3.1	3.1	4.2	9.3	6.6	11.8	1.8	8.7	12.9	4.5	7.7	2.4	13.5	13.5	17.0	9.2
mT0	4.8	5.6	3.7	5.7	3.1	4.6	6.8	4.5	3.8	29.3	5.8	2.1	4.3	6.1	1.7	3.4	3.6	6.5	5.0	3.5	3.6	3.5	6.8	9.4	19.6	6.1	9.1	6.4
BLOOMZ	3.8	4.6	2.8	5.3	2.9	4.1	5.1	3.4	4.2	32.3	4.9	3.0	1.5	2.4	1.5	4.0	2.7	5.7	1.2	3.2	4.9	2.6	4.6	3.3	24.1	5.4	10.1	5.7
BactrianX-Llama	10.9	11.6	8.9	12.3	8.8	22.0	32.1	8.5	12.1	25.1	11.4	6.9	6.4	8.2	4.1	10.9	8.7	14.1	4.3	8.4	15.2	8.0	11.4	10.8	19.4	16.6	23.4	12.6
AYA-23	9.3	10.5	8.0	11.6	6.9	14.2	17.5	5.6	8.3	18.3	11.3	5.7	4.0	5.9	2.7	8.1	7.6	12.2	3.3	9.0	8.8	6.5	10.4	6.8	24.3	10.6	17.7	9.8
AYA-101	26.4	26.8	14.6	21.6	12.6	49.3	46.6	33.3	25.8	49.5	38.8	12.2	25.9	37.2	4.4	17.8	13.4	29.7	17.6	13.2	23.3	20.4	35.6	22.2	36.5	36.9	41.9	27.2
SEA-LION	7.2	8.1	6.5	9.3	5.8	12.5	17.1	4.9	7.0	13.9	7.9	5.3	7.0	9.6	2.0	7.6	6.0	9.5	4.8	6.6	8.4	4.9	8.0	5.9	21.2	10.3	14.1	8.6
SeaLLM v2.5	15.2	20.2	11.7	19.5	11.5	37.1	49.1	14.5	26.8	43.0	26.6	7.5	17.8	22.2	4.7	15.1	12.2	26.8	9.2	14.6	19.2	9.4	22.0	21.6	36.7	28.8	45.7	21.8
Sailor	19.2	24.5	15.3	23.1	14.6	29.0	39.7	8.6	13.5	46.8	30.6	7.1	12.5	24.4	6.2	10.5	16.0	28.8	5.8	19.1	16.5	9.0	26.7	22.0	41.1	21.5	49.9	21.6
Cendol-mT5	8.3	11.4	14.2	11.6	6.9	7.2	8.4	4.7	5.5	35.8	17.5	4.0	6.3	8.5	2.0	5.2	6.1	10.5	2.9	8.8	6.6	4.1	17.1	5.5	4.4	6.4	20.5	9.3
Cendol-Llama2	8.6	10.0	14.4	19.3	6.6	6.9	8.2	6.4	6.4	36.1	19.1	5.5	3.0	4.3	4.1	4.5	14.1	22.0	1.9	17.5	5.4	4.8	17.3	3.4	8.1	7.6	22.0	10.6
Merak	7.4	10.3	6.7	11.3	7.1	8.2	12.8	6.3	6.7	29.5	9.6	3.7	3.8	5.9	3.2	8.0	6.5	12.5	2.4	8.0	8.2	5.6	10.6	5.9	7.2	7.4	20.4	8.7
WangchanX-Llama3	19.8	24.4	14.3	28.9	13.4	42.2	48.6	12.7	29.4	50.1	29.4	7.7	18.1	19.7	6.0	17.6	15.6	30.0	10.4	18.1	22.4	13.9	28.0	25.1	39.2	35.5	45.4	24.7
Malaysian Llama3	15.2	17.3	12.3	22.2	11.1	19.7	24.0	8.7	12.6	38.6	19.4	7.2	6.7	9.0	5.9	10.6	12.4	23.5	4.2	14.3	13.9	8.3	19.0	14.2	17.3	15.6	44.4	15.8
Overall	11.8	14.1	10.2	15.1	8.9	21.5	26.2	9.5	13.9	32.0	17.3	6.2	8.2	11.2	3.8	10.3	9.5	17.6	5.4	11.0	12.8	7.4	16.0	11.7	23.1	17.4	27.4	14.1

Table 27: NLG evaluation results in ROUGE-L per language.

Lang.	Subset	Original Task	Domain	# Samples
Translationese
eng	emotes_3k_eng_seacrowd_t2t	Commonsense Reasoning	Ethics	2000
eng	aya_evaluation_suite_eng_seacrowd_t2t	Instruction Tuning	General	400
ind	belebele_ind_latn_seacrowd_qa	QA	General	1969
ind	parallel_asian_treebank_ind_eng_seacrowd_t2t	Machine Translation	News	31
ind	aya_evaluation_suite_ind_seacrowd_t2t	Instruction Tuning	General	4
ind	bactrian_x_id_seacrowd_t2t	Instruction Tuning	Mixed, Multi-domain, Wikipedia	1972
ind	seaeval_cross_logiqa_ind_seacrowd_qa	Commonsense Reasoning, QA	Commentary, General, Multi-domain, Culture & heritage	16
ind	seaeval_cross_mmlu_ind_seacrowd_qa	Commonsense Reasoning, QA	Commentary, General, Multi-domain, Culture & heritage	8
khm	belebele_khm_khmr_seacrowd_qa	QA	General	399
khm	khmer_alt_pos_seacrowd_seq_label	POS Tagging	News	1595
khm	parallel_asian_treebank_khm_eng_seacrowd_t2t	Machine Translation	News	6
khm	aya_evaluation_suite_khm_seacrowd_t2t	Instruction Tuning	General	8
khm	bactrian_x_km_seacrowd_t2t	Instruction Tuning	Mixed, Multi-domain, Wikipedia	1992
lao	belebele_lao_laoo_seacrowd_qa	QA	General	1969
lao	parallel_asian_treebank_lao_eng_seacrowd_t2t	Machine Translation	News	31
lao	aya_evaluation_suite_lao_seacrowd_t2t	Instruction Tuning	General	400
mya	belebele_mya_mymr_seacrowd_qa	QA	General	1969
mya	parallel_asian_treebank_mya_eng_seacrowd_t2t	Machine Translation	News	31
mya	aya_evaluation_suite_mya_seacrowd_t2t	Instruction Tuning	General	8
mya	bactrian_x_my_seacrowd_t2t	Instruction Tuning	Mixed, Multi-domain, Wikipedia	1992
fil	belebele_tgl_latn_seacrowd_qa	QA	General	2000
fil	bactrian_x_tl_seacrowd_t2t	Instruction Tuning	Mixed, Multi-domain, Wikipedia	2000
tha	belebele_tha_thai_seacrowd_qa	QA	General	1969
tha	parallel_asian_treebank_tha_eng_seacrowd_t2t	Machine Translation	News	31
tha	aya_evaluation_suite_tha_seacrowd_t2t	Instruction Tuning	General	8
tha	bactrian_x_th_seacrowd_t2t	Instruction Tuning	Mixed, Multi-domain, Wikipedia	1992
vie	belebele_vie_latn_seacrowd_qa	QA	General	1969
vie	parallel_asian_treebank_vie_eng_seacrowd_t2t	Machine Translation	News	31
vie	aya_evaluation_suite_vie_seacrowd_t2t	Instruction Tuning	General	4
vie	bactrian_x_vi_seacrowd_t2t	Instruction Tuning	Mixed, Multi-domain, Wikipedia	1972
vie	seaeval_cross_logiqa_vie_seacrowd_qa	Commonsense Reasoning, QA	Commentary, General, Multi-domain, Culture & heritage	16
vie	seaeval_cross_mmlu_vie_seacrowd_qa	Commonsense Reasoning, QA	Commentary, General, Multi-domain, Culture & heritage	8
zlm	belebele_zsm_latn_seacrowd_qa	QA	General	1969
zlm	parallel_asian_treebank_zlm_eng_seacrowd_t2t	Machine Translation	News	31
zlm	aya_evaluation_suite_zsm_seacrowd_t2t	Instruction Tuning	General	400
zlm	seaeval_cross_logiqa_zlm_seacrowd_qa	Commonsense Reasoning, QA	Commentary, General, Multi-domain, Culture & heritage	1056
zlm	seaeval_cross_mmlu_zlm_seacrowd_qa	Commonsense Reasoning, QA	Commentary, General, Multi-domain, Culture & heritage	300
Natural
eng	cosem_seacrowd_ssp	Language Modeling	Social media	2000
ind	sea_bench_ind_seacrowd_t2t	Instruction Tuning	Commentary, General, Multi-domain, Culture & heritage	200
khm	gklmip_newsclass_seacrowd_text	Sentiment Analysis	E-commerce	1436
khm	sea_bench_khm_seacrowd_t2t	Instruction Tuning	Commentary, General, Multi-domain, Culture & heritage	160
lao	sea_bench_lao_seacrowd_t2t	Instruction Tuning	Commentary, General, Multi-domain, Culture & heritage	160
mya	gklmip_sentiment_seacrowd_text	Sentiment Analysis	E-commerce	716
mya	sea_bench_mya_seacrowd_t2t	Instruction Tuning	Commentary, General, Multi-domain, Culture & heritage	160
fil	sea_bench_tgl_seacrowd_t2t	Instruction Tuning	Commentary, General, Multi-domain, Culture & heritage	160
tha	sea_bench_tha_seacrowd_t2t	Instruction Tuning	Commentary, General, Multi-domain, Culture & heritage	40
tha	vistec_tp_th_21_seacrowd_seq_label	NER	Social media	1960
vie	sea_bench_vie_seacrowd_t2t	Instruction Tuning	Commentary, General, Multi-domain, Culture & heritage	200
zlm	sea_bench_zlm_seacrowd_t2t	Instruction Tuning	Commentary, General, Multi-domain, Culture & heritage	160

Table 28: Train data used in the translationese classifier experiment.

Lang.	Subset	Original Task	Domain	# Samples
Translationese
eng	emotes_3k_eng_seacrowd_t2t	Commonsense Reasoning	Ethics	2000
eng	aya_evaluation_suite_eng_seacrowd_t2t	Instruction Tuning	General	400
ind	belebele_ind_latn_seacrowd_qa	QA	General	1969
ind	parallel_asian_treebank_ind_eng_seacrowd_t2t	MT	News	31
ind	aya_evaluation_suite_ind_seacrowd_t2t	Instruction Tuning	General	4
ind	bactrian_x_id_seacrowd_t2t	Instruction Tuning	Mixed, Multi-domain, Wikipedia	1972
ind	seaeval_cross_logiqa_ind_seacrowd_qa	Commonsense Reasoning, QA	Commentary, General, Multi-domain, Culture & heritage	16
ind	seaeval_cross_mmlu_ind_seacrowd_qa	Commonsense Reasoning, QA	Commentary, General, Multi-domain, Culture & heritage	8
khm	belebele_khm_khmr_seacrowd_qa	QA	General	399
khm	khmer_alt_pos_seacrowd_seq_label	POS Tagging	News	1595
khm	parallel_asian_treebank_khm_eng_seacrowd_t2t	MT	News	6
khm	aya_evaluation_suite_khm_seacrowd_t2t	Instruction Tuning	General	8
khm	bactrian_x_km_seacrowd_t2t	Instruction Tuning	Mixed, Multi-domain, Wikipedia	1992
lao	belebele_lao_laoo_seacrowd_qa	QA	General	1969
lao	parallel_asian_treebank_lao_eng_seacrowd_t2t	MT	News	31
lao	aya_evaluation_suite_lao_seacrowd_t2t	Instruction Tuning	General	400
mya	belebele_mya_mymr_seacrowd_qa	QA	General	1969
mya	parallel_asian_treebank_mya_eng_seacrowd_t2t	MT	News	31
mya	aya_evaluation_suite_mya_seacrowd_t2t	Instruction Tuning	General	8
mya	bactrian_x_my_seacrowd_t2t	Instruction Tuning	Mixed, Multi-domain, Wikipedia	1992
fil	belebele_tgl_latn_seacrowd_qa	QA	General	2000
fil	bactrian_x_tl_seacrowd_t2t	Instruction Tuning	Mixed, Multi-domain, Wikipedia	2000
tha	belebele_tha_thai_seacrowd_qa	QA	General	1969
tha	parallel_asian_treebank_tha_eng_seacrowd_t2t	MT	News	31
tha	aya_evaluation_suite_tha_seacrowd_t2t	Instruction Tuning	General	8
tha	bactrian_x_th_seacrowd_t2t	Instruction Tuning	Mixed, Multi-domain, Wikipedia	1992
vie	belebele_vie_latn_seacrowd_qa	QA	General	1969
vie	parallel_asian_treebank_vie_eng_seacrowd_t2t	MT	News	31
vie	aya_evaluation_suite_vie_seacrowd_t2t	Instruction Tuning	General	4
vie	bactrian_x_vi_seacrowd_t2t	Instruction Tuning	Mixed, Multi-domain, Wikipedia	1972
vie	seaeval_cross_logiqa_vie_seacrowd_qa	Commonsense Reasoning, QA	Commentary, General, Multi-domain, Culture & heritage	16
vie	seaeval_cross_mmlu_vie_seacrowd_qa	Commonsense Reasoning, QA	Commentary, General, Multi-domain, Culture & heritage	8
zlm	belebele_zsm_latn_seacrowd_qa	QA	General	1969
zlm	parallel_asian_treebank_zlm_eng_seacrowd_t2t	MT	News	31
zlm	aya_evaluation_suite_zsm_seacrowd_t2t	Instruction Tuning	General	400
zlm	seaeval_cross_logiqa_zlm_seacrowd_qa	Commonsense Reasoning, QA	Commentary, General, Multi-domain, Culture & heritage	1056
zlm	seaeval_cross_mmlu_zlm_seacrowd_qa	Commonsense Reasoning, QA	Commentary, General, Multi-domain, Culture & heritage	300
Natural
eng	cosem_seacrowd_ssp	Language Modeling	Social media	2000
ind	sea_bench_ind_seacrowd_t2t	Instruction Tuning	Commentary, General, Multi-domain, Culture & heritage	200
khm	gklmip_newsclass_seacrowd_text	Sentiment Analysis	E-commerce	1436
khm	sea_bench_khm_seacrowd_t2t	Instruction Tuning	Commentary, General, Multi-domain, Culture & heritage	160
lao	sea_bench_lao_seacrowd_t2t	Instruction Tuning	Commentary, General, Multi-domain, Culture & heritage	160
mya	gklmip_sentiment_seacrowd_text	Sentiment Analysis	E-commerce	716
mya	sea_bench_mya_seacrowd_t2t	Instruction Tuning	Commentary, General, Multi-domain, Culture & heritage	160
fil	sea_bench_tgl_seacrowd_t2t	Instruction Tuning	Commentary, General, Multi-domain, Culture & heritage	160
tha	sea_bench_tha_seacrowd_t2t	Instruction Tuning	Commentary, General, Multi-domain, Culture & heritage	40
tha	vistec_tp_th_21_seacrowd_seq_label	NER	Social media	1960
vie	sea_bench_vie_seacrowd_t2t	Instruction Tuning	Commentary, General, Multi-domain, Culture & heritage	200
zlm	sea_bench_zlm_seacrowd_t2t	Instruction Tuning	Commentary, General, Multi-domain, Culture & heritage	160

Table 29: Test data used in the translationese classifier experiment.

No.	ISO 639-3	Language	Region(s)	Population
In SEACrowd
1	ind	Indonesian	Indonesia	<1B
2	jav	Javanese	Indonesia	<100M
3	vie	Vietnamese	Vietnam	<100M
4	tha	Thai	Thailand, Cambodia	<100M
5	fil	Filipino	Philippines	<100M
6	mya	Burmese	Myanmar	<100M
7	sun	Sunda	Indonesia	<100M
8	tgl	Tagalog	Philippines	<100M
9	khm	Khmer	Cambodia, Vietnam	<100M
10	ceb	Cebuano	Philippines	<100M
11	tts	Northeastern Thai	Thailand	<100M
12	zlm	Malay	Malaysia	<100M
13	zsm	Standard Malay	Malaysia, Brunei, Singapore	<100M

Table 30: SEA indigenous languages with ≥10M speakers.

No.	ISO 639-3	Language	Region(s)	Population
In SEACrowd
1	ilo	Ilocano	Philippines	<10M
2	mad	Madura	Indonesia	<10M
3	nod	Northern Thai	Laos, Thailand	<10M
4	hil	Hiligaynon	Philippines	<10M
5	min	Minangkabau	Indonesia	<10M
6	bug	Bugis	Indonesia	<10M
7	bew	Betawi	Indonesia	<10M
8	sou	Southern Thai	Thailand	<10M
9	lao	Lao	Cambodia, Laos	<10M
10	hmv	Hmong Dô	Vietnam	<10M
11	ace	Aceh	Indonesia	<10M
12	bjn	Banjar	Indonesia	<10M
13	ban	Bali	Indonesia	<10M
14	shn	Shan	Myanmar, Thailand	<10M
15	mui	Musi	Indonesia	<10M
16	msi	Sabah Malay	Malaysia	<10M
17	meo	Kedah Malay	Malaysia, Thailand	<10M
18	pcc	Giáy	Vietnam	<10M
19	war	Waray-Waray	Philippines	<10M
20	mak	Makasar	Indonesia	<10M
21	bcl	Central Bikol	Philippines	<10M
22	xmm	Manado Malay	Indonesia	<10M
23	sas	Sasak	Indonesia	<10M
24	bbc	Batak Toba	Indonesia	<10M
25	pam	Kapampangan	Philippines	<10M
26	rki	Rakhine	Myanmar	<10M
27	tyz	Tày	Vietnam	<10M
28	abs	Ambonese Malay	Indonesia	<10M
29	pse	Central Malay	Indonesia	<10M
30	iba	Iban	Brunei, Indonesia, Malaysia	<10M
31	kxm	Northern Khmer	Thailand	<10M
32	khg	Khams Tibetan	Myanmar	<10M
33	ksw	S’gaw Karen	Myanmar, Thailand	<10M
34	btd	Batak Dairi	Indonesia	<10M
35	bts	Batak Simalungun	Indonesia	<10M
36	cbk	Chavacano	Philippines	<10M
37	pag	Pangasinan	Philippines	<10M
38	mtq	Muong	Vietnam	<10M
39	btm	Batak Mandailing	Indonesia	<10M
40	mdh	Maguindanaon	Philippines	<10M
41	pmy	Papuan Malay	Indonesia	<10M
42	gor	Gorontalo	Indonesia	<10M
43	jax	Jambi Malay	Indonesia	<10M
44	kjp	Pwo Eastern Karen	Myanmar, Thailand	<10M
45	max	North Moluccan Malay	Indonesia	<10M
46	mfa	Pattani Malay	Thailand	<10M
Not in SEACrowd
47	mfp	Makassar Indonesian	Indonesia	<10M

Table 31: SEA indigenous languages with <10M speakers.

No.	ISO 639-3	Language	Region(s)	Population
In SEACrowd
1	nut	Nung	Vietnam	<1M
2	kac	Jingpho	Myanmar	<1M
3	tsg	Tausug	Philippines	<1M
4	nij	Ngaju	Indonesia	<1M
5	ljp	Lampung Api	Indonesia	<1M
6	mqy	Manggarai	Indonesia	<1M
7	mrw	Maranao	Philippines	<1M
8	nia	Nias	Indonesia	<1M
9	akb	Batak Angkola	Indonesia	<1M
10	sda	Toraja-Sa’dan	Indonesia	<1M
11	mnw	Mon	Myanmar, Thailand	<1M
12	hni	Hani	Laos, Vietnam	<1M
13	kjg	Khmu	Laos, Thailand, Vietnam	<1M
14	aoz	Uab Meto	Indonesia	<1M
15	blt	Tai Dam	Laos, Vietnam	<1M
16	lus	Mizo Chin	Myanmar	<1M
17	cps	Capiznon	Philippines	<1M
18	btx	Batak Karo	Indonesia	<1M
19	lis	Lisu	Myanmar	<1M
20	msb	Masbatenyo	Philippines	<1M
21	blk	Pa’o	Myanmar, Thailand	<1M
22	tdd	Tai Nüa	Myanmar	<1M
23	day	Land Dayak	Indonesia	<1M
24	xdy	Malayic Dayak	Indonesia	<1M
25	bhp	Bima	Indonesia	<1M
26	ibg	Ibanag	Philippines	<1M
27	zmi	Negeri Sembilan Malay	Malaysia	<1M
28	mdr	Mandar	Indonesia	<1M
29	kge	Komering	Indonesia	<1M
30	bdr	West Coast Bajau	Malaysia	<1M
31	kdt	Kuay	Cambodia, Laos, Thailand	<1M
32	prk	Parauk Wa	Myanmar	<1M
33	sgd	Surigaonon	Philippines	<1M
34	tet	Tetun	East Timor, Indonesia	<1M
35	bto	Rinconada Bikol	Philippines	<1M
36	tdt	Tetun Dili	East Timor	<1M
37	ium	Iu Mien	Laos, Vietnam	<1M
38	krj	Kinaray-a	Philippines	<1M
39	kyk	Kamayo	Philippines	<1M
40	lew	Ledo Kaili	Indonesia	<1M
41	mkn	Kupang Malay	Indonesia	<1M
42	rej	Rejang	Indonesia	<1M
43	mfb	Bangka	Indonesia	<1M
44	rob	Tae’	Indonesia	<1M
45	lbw	Tolaki	Indonesia	<1M
46	knx	Kendayan	Indonesia, Malaysia	<1M
47	gay	Gayo	Indonesia	<1M
48	mnb	Muna	Indonesia	<1M
49	rbl	Miraya Bikol	Philippines	<1M
50	smw	Sumbawa	Indonesia	<1M
51	kxd	Brunei	Brunei	<1M
52	khb	Lü	Laos, Myanmar	<1M
53	lhu	Lahu	Laos, Myanmar	<1M
54	twh	Tai Dón	Laos, Vietnam	<1M
55	ysm	Myanmar Sign Language	Myanmar	<1M
56	dtp	Kadazan Dusun	Malaysia	<1M
57	fbl	West Albay Bikol	Philippines	<1M
58	kvr	Kerinci	Indonesia	<1M
59	pce	Ruching Palaung	Myanmar	<1M
60	mry	Mandaya	Philippines	<1M
61	nbe	Konyak Naga	Myanmar	<1M
62	tcz	Thado Chin	Myanmar	<1M
63	jra	Jarai	Cambodia, Vietnam	<1M
64	xbr	Kambera	Indonesia	<1M
65	mog	Mongondow	Indonesia	<1M
66	pwo	Pwo Western Karen	Myanmar	<1M
67	cja	Western Cham	Cambodia, Vietnam	<1M
68	ahk	Akha	Laos, Myanmar, Thailand	<1M
69	ssb	Southern Sama	Philippines	<1M
70	sxn	Sangir	Indonesia	<1M

Table 32: (1/2) SEA indigenous languages with <1M speakers.

No.	ISO 639-3	Language	Region(s)	Population
In SEACrowd
71	btz	Batak Alas-Kluet	Indonesia	<1M
72	ctd	Tedim Chin	Myanmar	<1M
73	srv	Southern Sorsoganon	Philippines	<1M
74	abl	Lampung Nyo	Indonesia	<1M
75	dnw	Western Dani	Indonesia	<1M
76	ktp	Kaduo	Laos	<1M
77	slp	Lamaholot	Indonesia	<1M
78	rad	Rade	Vietnam	<1M
79	ski	Sika	Indonesia	<1M
80	kpm	Koho	Vietnam	<1M
81	bdq	Bahnar	Vietnam	<1M
82	bdl	Indonesian Bajau	Indonesia	<1M
83	bpr	Koronadal Blaan	Philippines	<1M
84	ccp	Chakma	Myanmar	<1M
85	kne	Kankanaey	Philippines	<1M
86	kyu	Western Kayah	Myanmar	<1M
87	mhy	Ma’anyan	Indonesia	<1M
88	tnt	Tontemboan	Indonesia	<1M
89	pll	Shwe Palaung	Myanmar	<1M
90	daw	Davawenyo	Philippines	<1M
91	cnh	Hakha Chin	Myanmar	<1M
92	syb	Central Subanen	Philippines	<1M
93	rbb	Rumai Palaung	Myanmar	<1M
94	pmf	Pamona	Indonesia	<1M
95	bln	Southern Catanduanes Bikol	Philippines	<1M
96	itv	Itawit	Philippines	<1M
97	pdu	Kayan	Myanmar	<1M
98	mgm	Mambae	East Timor	<1M
99	bhq	Tukang Besi South	Indonesia	<1M
100	sly	Selayar	Indonesia	<1M
101	mvp	Duri	Indonesia	<1M
102	bgz	Banggai	Indonesia	<1M
103	kjc	Coastal Konjo	Indonesia	<1M
104	suc	Western Subanon	Philippines	<1M
105	cyo	Cuyonon	Philippines	<1M
106	khc	Tukang Besi North	Indonesia	<1M
107	lhi	Lahu Shi	Myanmar	<1M
108	mel	Central Melanau	Malaysia	<1M
109	ibl	Ibaloi	Philippines	<1M
110	end	Ende	Indonesia	<1M
111	hvn	Hawu	Indonesia	<1M
112	kkv	Kangean	Indonesia	<1M
113	yka	Yakan	Philippines	<1M
114	ljl	Li’o	Indonesia	<1M
115	mkz	Makasae	East Timor	<1M
116	bkd	Binukid	Philippines	<1M
117	bkr	Bakumpai	Indonesia	<1M
118	ekg	Ekari	Indonesia	<1M
119	hnj	Hmong Njua	Laos, Thailand, Vietnam	<1M
120	kak	Kalanguya	Philippines	<1M
121	kkh	Khün	Myanmar	<1M
122	lbx	Lawangan	Indonesia	<1M
123	mhx	Lhao Vo	Myanmar	<1M
124	mqj	Mamasa	Indonesia	<1M
125	psp	Filipino Sign Language	Philippines	<1M
126	tgn	Tandaganon	Philippines	<1M
Not in SEACrowd
127	rhg	Rohingya	Myanmar	<1M
128	pht	Phu Thai	Laos, Thailand, Vietnam	<1M
129	tvn	Tavoyan	Myanmar	<1M
130	osi	Osing	Indonesia	<1M
131	ilp	Iranun	Philippines	<1M
132	kzs	Sugut Dusun	Malaysia	<1M
133	vkt	Tenggarong Kutai Malay	Indonesia	<1M
134	phu	Phuan	Laos, Thailand	<1M
135	csh	Asho Chin	Myanmar	<1M
136	mlc	Cao Lan	Vietnam	<1M
137	kjk	Highland Konjo	Indonesia	<1M
138	liw	Col	Indonesia	<1M
139	sss	So	Laos, Thailand	<1M
140	dnv	Danu	Myanmar	<1M
141	sdq	Semandang	Indonesia	<1M
142	tjl	Tai Laing	Myanmar	<1M

Table 33: (2/2) SEA indigenous languages with <1M speakers.

No.	ISO 639-3	Language	Region(s)	Population
In SEACrowd
1	adr	Adonara	Indonesia	<100K
2	sed	Sedang	Vietnam	<100K
3	blf	Buol	Indonesia	<100K
4	tbl	Tboli	Philippines	<100K
5	hre	Hre	Vietnam	<100K
6	rol	Romblomanon	Philippines	<100K
7	akl	Aklanon	Philippines	<100K
8	tdn	Tondano	Indonesia	<100K
9	bps	Sarangani Blaan	Philippines	<100K
10	kqr	Kimaragang	Malaysia	<100K
11	sml	Central Sama	Philippines	<100K
12	txs	Tonsea	Indonesia	<100K
13	stb	Northern Subanen	Philippines	<100K
14	bks	Northern Sorsoganon	Philippines	<100K
15	kei	Kei	Indonesia	<100K
16	klg	Tagakaulo	Philippines	<100K
17	tld	Talaud	Indonesia	<100K
18	atb	Zaiwa	Myanmar	<100K
19	sse	Balangingih Sama	Philippines	<100K
20	tes	Tengger	Indonesia	<100K
21	tyr	Tai Daeng	Laos, Vietnam	<100K
22	cia	Cia-Cia	Indonesia	<100K
23	gbi	Galela	Indonesia	<100K
24	otd	Ot Danum	Indonesia	<100K
25	cts	Northern Catanduanes Bikol	Philippines	<100K
26	loe	Saluan	Indonesia	<100K
27	bno	Bantoanon	Philippines	<100K
28	cmr	Mro-Khimi	Myanmar	<100K
29	ubl	Buhi’non Bikol	Philippines	<100K
30	cjm	Eastern Cham	Vietnam	<100K
31	bkx	Baikeno	East Timor	<100K
32	aaz	Amarasi	Indonesia	<100K
33	bhw	Biak	Indonesia	<100K
34	kqe	Kalagan	Philippines	<100K
35	xnn	Northern Kankanay	Philippines	<100K
36	xsb	Sambal	Philippines	<100K
37	cfm	Falam Chin	Myanmar	<100K
38	lbl	Libon Bikol	Philippines	<100K
39	wlo	Wolio	Indonesia	<100K
40	bth	Biatah Bidayuh	Indonesia, Malaysia	<100K
41	kem	Kemak	East Timor, Indonesia	<100K
42	raw	Rawang	Myanmar	<100K
43	tft	Ternate	Indonesia	<100K
44	zom	Zo	Myanmar	<100K
45	cnk	Khumi Chin	Myanmar	<100K
46	mqx	Mamuju	Indonesia	<100K
47	msm	Agusan Manobo	Philippines	<100K
48	nst	Tangshang Naga	Myanmar	<100K
49	nxg	Ngad’a	Indonesia	<100K
50	obo	Obo Manobo	Philippines	<100K
51	pww	Pwo Northern Karen	Thailand	<100K
52	sya	Siang	Indonesia	<100K
53	tom	Tombulu	Indonesia	<100K
54	xml	Malaysian Sign Language	Malaysia	<100K
55	mbs	Sarangani Manobo	Philippines	<100K
56	mwv	Mentawai	Indonesia	<100K
57	msk	Mansaka	Philippines	<100K
58	smk	Bolinao	Philippines	<100K
59	bfn	Bunak	East Timor, Indonesia	<100K
60	bgi	Bagobo-Klata	Philippines	<100K
61	drg	Rungus	Malaysia	<100K
62	kzf	Da’a Kaili	Indonesia	<100K
63	wew	Wejewa	Indonesia	<100K
64	rog	Northern Roglai	Vietnam	<100K
65	ilk	Bogkalot	Philippines	<100K
66	ktv	Eastern Katu	Vietnam	<100K
67	dnt	Mid Grand Valley Dani	Indonesia	<100K
68	frd	Fordata	Indonesia	<100K
69	mbt	Matigsalug Manobo	Philippines	<100K
70	nxe	Nage	Indonesia	<100K
71	ptt	Enrekang	Indonesia	<100K

Table 34: (1/5) SEA indigenous languages with <100K speakers.

No.	ISO 639-3	Language	Region(s)	Population
In SEACrowd
72	tiy	Teduray	Philippines	<100K
73	tjg	Tunjung	Indonesia	<100K
74	wmm	Maiwa	Indonesia	<100K
75	sdo	Bukar-Sadong Bidayuh	Indonesia, Malaysia	<100K
76	kyp	Kang	Laos	<100K
77	tvo	Tidore	Indonesia	<100K
78	hos	Ho Chi Minh City Sign Language	Vietnam	<100K
79	mhs	Buru	Indonesia	<100K
80	sti	Bulo Stieng	Cambodia, Vietnam	<100K
81	law	Lauje	Indonesia	<100K
82	bgs	Tagabawa	Philippines	<100K
83	sjm	Mapun	Philippines	<100K
84	blr	Blang	Myanmar, Thailand	<100K
85	rgs	Southern Roglai	Vietnam	<100K
86	smr	Simeulue	Indonesia	<100K
87	czt	Zotung Chin	Myanmar	<100K
88	kvq	Geba Karen	Myanmar	<100K
89	mtd	Mualang	Indonesia	<100K
90	xxk	Ke’o	Indonesia	<100K
91	tkd	Tukudede	East Timor	<100K
92	kix	Khiamniungan Naga	Myanmar	<100K
93	bsb	Brunei Bisaya	Brunei, Malaysia	<100K
94	dao	Daai Chin	Myanmar	<100K
95	ddg	Fataluku	East Timor	<100K
96	mqn	Moronene	Indonesia	<100K
97	ges	Geser-Gorom	Indonesia	<100K
98	pho	Phunoi	Laos	<100K
99	slm	Pangutaran Sama	Philippines	<100K
100	hro	Haroi	Vietnam	<100K
101	ivv	Ivatan	Philippines	<100K
102	mrh	Mara Chin	Myanmar	<100K
103	btw	Butuanon	Philippines	<100K
104	cma	Maa	Vietnam	<100K
105	sbl	Botolan Sambal	Philippines	<100K
106	cmo	Central Mnong	Cambodia, Vietnam	<100K
107	blz	Balantak	Indonesia	<100K
108	tpu	Tampuan	Cambodia	<100K
109	blj	Bulungan	Indonesia	<100K
110	cgc	Kagayanen	Philippines	<100K
111	clu	Caluyanun	Philippines	<100K
112	cml	Koneq-koneq	Indonesia	<100K
113	gad	Gaddang	Philippines	<100K
114	hlt	Matu Chin	Myanmar	<100K
115	ifk	Tuwali Ifugao	Philippines	<100K
116	ifu	Mayoyao Ifugao	Philippines	<100K
117	knb	Lubuagan Kalinga	Philippines	<100K
118	ksx	Kedang	Indonesia	<100K
119	lcf	Lubu	Indonesia	<100K
120	lsi	Lacid	Myanmar	<100K
121	mba	Higaonon	Philippines	<100K
122	mng	Eastern Mnong	Vietnam	<100K
123	mro	Mru	Myanmar	<100K
124	mta	Cotabato Manobo	Philippines	<100K
125	set	Sentani	Indonesia	<100K
126	tmn	Taman	Indonesia	<100K
127	twu	Termanu	Indonesia	<100K
128	txm	Tomini	Indonesia	<100K
129	ulm	Ulumanda’	Indonesia	<100K
130	wow	Wawonii	Indonesia	<100K
131	sne	Bau Bidayuh	Indonesia, Malaysia	<100K
132	tdf	Talieng	Laos	<100K
133	lbo	Laven	Laos	<100K
134	acn	Ngochang	Myanmar	<100K
135	tlb	Tobelo	Indonesia	<100K
136	ifa	Amganad Ifugao	Philippines	<100K
137	itd	Southern Tidung	Indonesia, Malaysia	<100K
138	pha	Pa-Hng	Vietnam	<100K
139	atd	Ata Manobo	Philippines	<100K
140	bru	Eastern Bru	Laos, Vietnam	<100K
141	kzp	Kaidipang	Indonesia	<100K
142	abx	Inabaknon	Philippines	<100K

Table 35: (2/5) SEA indigenous languages with <100K speakers.

No.	ISO 639-3	Language	Region(s)	Population
In SEACrowd
143	aol	Alor	Indonesia	<100K
144	jmd	Yamdena	Indonesia	<100K
145	laa	Southern Subanen	Philippines	<100K
146	lmy	Lamboya	Indonesia	<100K
147	txe	Totoli	Indonesia	<100K
148	oyb	Oy	Laos	<100K
149	mlf	Mal	Laos, Thailand	<100K
150	lnd	Lundayeh	Brunei, Indonesia, Malaysia	<100K
151	prh	Porohanon	Philippines	<100K
152	brb	Brao	Cambodia, Laos, Vietnam	<100K
153	lbn	Rmeet	Laos	<100K
154	ilm	Iranun	Malaysia	<100K
155	ptu	Bambam	Indonesia	<100K
156	vkl	Kulisusu	Indonesia	<100K
157	blw	Balangao	Philippines	<100K
158	bsy	Sabah Bisaya	Malaysia	<100K
159	krr	Krung	Cambodia	<100K
160	dtb	Labuk-Kinabatangan Kadazan	Malaysia	<100K
161	ayz	Mai Brat	Indonesia	<100K
162	bac	Badui	Indonesia	<100K
163	brv	Western Bru	Laos, Thailand	<100K
164	bwp	Mandobo Bawah	Indonesia	<100K
165	dna	Upper Grand Valley Dani	Indonesia	<100K
166	dni	Lower Grand Valley Dani	Indonesia	<100K
167	dtr	Lotud	Malaysia	<100K
168	dun	Dusun Deyah	Indonesia	<100K
169	kje	Kisar	Indonesia	<100K
170	kli	Kalumpang	Indonesia	<100K
171	kod	Kodi	Indonesia	<100K
172	llg	Lole	Indonesia	<100K
173	lrt	Larantuka Malay	Indonesia	<100K
174	mnz	Moni	Indonesia	<100K
175	pea	Peranakan Indonesian	Indonesia	<100K
176	ppk	Uma	Indonesia	<100K
177	prt	Prai	Laos, Thailand	<100K
178	tmm	Tai Thanh	Vietnam	<100K
179	tnw	Tonsawang	Indonesia	<100K
180	twy	Tawoyan	Indonesia	<100K
181	txq	Tii	Indonesia	<100K
182	wlw	Walak	Indonesia	<100K
183	skh	Sikule	Indonesia	<100K
184	lbk	Central Bontok	Philippines	<100K
185	cje	Chru	Vietnam	<100K
186	hnn	Hanunoo	Philippines	<100K
187	tlu	Tulehu	Indonesia	<100K
188	wmh	Waima’a	East Timor	<100K
189	hrk	Haruku	Indonesia	<100K
190	lex	Luang	Indonesia	<100K
191	puo	Puoc	Vietnam	<100K
192	ren	Rengao	Vietnam	<100K
193	alp	Alune	Indonesia	<100K
194	bwe	Bwe Karen	Myanmar	<100K
195	tlt	Sou Nama	Indonesia	<100K
196	zyp	Zyphe Chin	Myanmar	<100K
197	abz	Abui	Indonesia	<100K
198	akg	Anakalangu	Indonesia	<100K
199	had	Hatam	Indonesia	<100K
200	htu	Hitu	Indonesia	<100K
201	nlc	Nalca	Indonesia	<100K
202	pac	Pacoh	Laos, Vietnam	<100K
203	yog	Yogad	Philippines	<100K
204	mxd	Modang	Indonesia	<100K
205	jeh	Jeh	Laos, Vietnam	<100K
206	kyn	Northern Binukidnon	Philippines	<100K
207	phg	Phuong	Vietnam	<100K
208	agn	Agutaynen	Philippines	<100K
209	cnw	Ngawn Chin	Myanmar	<100K
210	ila	Ile Ape	Indonesia	<100K
211	krd	Kairui-Midiki	East Timor	<100K
212	loa	Loloda	Indonesia	<100K
213	mbb	Western Bukidnon Manobo	Philippines	<100K
214	mwq	Müün Chin	Myanmar	<100K
215	nxa	Nauete	East Timor	<100K
216	prf	Paranan	Philippines	<100K

Table 36: (3/5) SEA indigenous languages with <100K speakers.

No.	ISO 639-3	Language	Region(s)	Population
In SEACrowd
217	snl	Sangil	Philippines	<100K
218	tby	Tabaru	Indonesia	<100K
219	tea	Temiar	Malaysia	<100K
220	yli	Angguruk Yali	Indonesia	<100K
221	mej	Meyah	Indonesia	<100K
222	mbi	Ilianen Manobo	Philippines	<100K
223	plw	Brooke’s Point Palawano	Philippines	<100K
224	duu	Drung	Myanmar	<100K
225	heg	Helong	Indonesia	<100K
226	mzq	Mori Atas	Indonesia	<100K
227	uhn	Damal	Indonesia	<100K
228	xmz	Mori Bawah	Indonesia	<100K
229	kjm	Kháng	Vietnam	<100K
230	hal	Salang	Laos, Vietnam	<100K
231	idt	Idaté	East Timor	<100K
232	dok	Dondo	Indonesia	<100K
233	gal	Galolen	East Timor, Indonesia	<100K
234	ksc	Southern Kalinga	Philippines	<100K
235	txa	Tombonuo	Malaysia	<100K
236	ngt	Kriang	Laos	<100K
237	kmk	Limos Kalinga	Philippines	<100K
238	alo	Larike-Wakasihu	Indonesia	<100K
239	yno	Yong	Thailand	<100K
240	ril	Riang Lang	Myanmar	<100K
241	atq	Aralle-Tabulahan	Indonesia	<100K
242	cek	Eastern Khumi Chin	Myanmar	<100K
243	cua	Cua	Vietnam	<100K
244	mnx	Sougb	Indonesia	<100K
245	mqs	West Makian	Indonesia	<100K
246	nuf	Nusu	Myanmar	<100K
247	plc	Central Palawano	Philippines	<100K
248	plv	Southwest Palawano	Philippines	<100K
249	rgu	Rikou	Indonesia	<100K
250	szw	Sawai	Indonesia	<100K
251	tdj	Tajio	Indonesia	<100K
252	xkl	Mainstream Kenyah	Indonesia, Malaysia	<100K
253	yin	Riang Lai	Myanmar	<100K
254	lcl	Lisela	Indonesia	<100K
255	lra	Rara Bakati’	Indonesia, Malaysia	<100K
256	bve	Berau Malay	Indonesia	<100K
257	kml	Tanudan Kalinga	Philippines	<100K
258	beu	Blagar	Indonesia	<100K
259	xem	Mateq	Indonesia	<100K
260	lev	Western Pantar	Indonesia	<100K
261	ptn	Patani	Indonesia	<100K
262	oog	Ong	Laos	<100K
263	spr	Saparua	Indonesia	<100K
264	amk	Ambai	Indonesia	<100K
265	ifb	Batad Ifugao	Philippines	<100K
266	aax	Mandobo Atas	Indonesia	<100K
267	bep	Behoa	Indonesia	<100K
268	bvy	Baybayanon	Philippines	<100K
269	csy	Siyin Chin	Myanmar	<100K
270	dbj	Ida’an	Malaysia	<100K
271	emb	Embaloh	Indonesia	<100K
272	iry	Iraya	Philippines	<100K
273	jak	Jakun	Malaysia	<100K
274	jaq	Yaqay	Indonesia	<100K
275	kps	Tehit	Indonesia	<100K
276	kvb	Kubu	Indonesia	<100K
277	kxf	Kawyaw	Myanmar	<100K
278	kyt	Kayagar	Indonesia	<100K
279	lje	Rampi	Indonesia	<100K
280	lur	Loura	Indonesia	<100K
281	mbd	Dibabawon Manobo	Philippines	<100K
282	mbf	Baba Malay	Singapore	<100K
283	mky	East Makian	Indonesia	<100K
284	mvd	Mamboru	Indonesia	<100K
285	ndx	Nduga	Indonesia	<100K
286	pez	Eastern Penan	Brunei, Malaysia	<100K
287	ple	Palu’e	Indonesia	<100K
288	sea	Semai	Malaysia	<100K
289	ssq	So’a	Indonesia	<100K

Table 37: (4/5) SEA indigenous languages with <100K speakers.

No.	ISO 639-3	Language	Region(s)	Population
In SEACrowd
290	szb	Ngalum	Indonesia	<100K
291	tbk	Calamian Tagbanwa	Philippines	<100K
292	tbw	Tagbanwa	Philippines	<100K
293	txx	Tatana	Malaysia	<100K
294	wnk	Wanukaka	Indonesia	<100K
295	yva	Yawa	Indonesia	<100K
Not in SEACrowd
296	int	Intha	Myanmar	<100K
297	loc	Inonhan	Philippines	<100K
298	mqg	Kota Bangun Kutai Malay	Indonesia	<100K
299	bfx	Bantayanon	Philippines	<100K
300	tou	Tho	Vietnam	<100K
301	ncq	Northern Katang	Laos	<100K
302	bvu	Bukit Malay	Indonesia	<100K
303	byd	Benyadu’	Indonesia	<100K
304	tsq	Thai Sign Language	Thailand	<100K
305	nyw	Nyaw	Thailand	<100K
306	rir	Ribun	Indonesia	<100K
307	scg	Sanggau	Indonesia	<100K
308	sct	Southern Katang	Laos	<100K
309	stt	Budeh Stieng	Vietnam	<100K
310	tco	Taungyo	Myanmar	<100K
311	vkk	Kaur	Indonesia	<100K
312	hab	Hanoi Sign Language	Vietnam	<100K
313	djo	Jangkang	Indonesia	<100K
314	sbx	Seberuang	Indonesia	<100K
315	lso	Laos Sign Language	Laos	<100K
316	sez	Senthang Chin	Myanmar	<100K
317	soa	Thai Song	Thailand	<100K
318	knl	Keninjal	Indonesia	<100K
319	tth	Upper Ta’oih	Laos, Vietnam	<100K
320	apg	Ampanang	Indonesia	<100K
321	mnn	Southern Mnong	Vietnam	<100K
322	pel	Pekal	Indonesia	<100K
323	zkd	Kadu	Myanmar	<100K
324	bkz	Bungku	Indonesia	<100K
325	mkx	Kinamiging Manobo	Philippines	<100K
326	bnu	Bentong	Indonesia	<100K
327	kxy	Kayong	Vietnam	<100K
328	mhp	Balinese Malay	Indonesia	<100K
329	unz	Unde Kaili	Indonesia	<100K
330	bld	Bolango	Indonesia	<100K
331	kuf	Western Katu	Laos	<100K
332	dnk	Dengka	Indonesia	<100K
333	mvv	Tagal Murut	Indonesia, Malaysia	<100K
334	skn	Kolibugan Subanon	Philippines	<100K
335	szn	Sula	Indonesia	<100K
336	cnb	Uppu Chin	Myanmar	<100K
337	bhv	Bahau	Indonesia	<100K
338	itt	Maeng Itneg	Philippines	<100K
339	hji	Haji	Indonesia	<100K
340	ghk	Geko Karen	Myanmar	<100K
341	kvl	Kayaw	Myanmar	<100K
342	tto	Lower Ta’oih	Laos	<100K
343	bdb	Basap	Indonesia	<100K
344	clj	Laitu Chin	Myanmar	<100K
345	clt	Lautu Chin	Myanmar	<100K
346	dup	Duano	Indonesia, Malaysia	<100K
347	kyb	Butbut Kalinga	Philippines	<100K
348	stg	Trieng	Vietnam	<100K
349	cbw	Kinabalian	Philippines	<100K
350	csv	Sumtu Chin	Myanmar	<100K
351	riu	Riung	Indonesia	<100K
352	srg	Sulod	Philippines	<100K
353	ity	Moyadan Itneg	Philippines	<100K
354	kkg	Mabaka Valley Kalinga	Philippines	<100K
355	bne	Bintauna	Indonesia	<100K
356	nlk	Ninia Yali	Indonesia	<100K
357	hik	Seit-Kaitetu	Indonesia	<100K
358	ksn	Kasiguranin	Philippines	<100K
359	tsl	Ts’ün-Lao	Vietnam	<100K
360	xao	Khao	Vietnam	<100K

Table 38: (5/5) SEA indigenous languages with <100K speakers.

No.	ISO 639-3	Language	Region(s)	Population
In SEACrowd
1	xte	Ketengban	Indonesia	<10K
2	bna	Bonerate	Indonesia	<10K
3	bku	Buhid	Philippines	<10K
4	aws	South Awyu	Indonesia	<10K
5	woo	Manombai	Indonesia	<10K
6	asc	Casuarina Coast Asmat	Indonesia	<10K
7	tih	Timugon Murut	Malaysia	<10K
8	asl	Asilulu	Indonesia	<10K
9	sgb	Mag-antsi Ayta	Philippines	<10K
10	eky	Eastern Kayah	Myanmar, Thailand	<10K
11	ify	Keley-i Kallahan	Philippines	<10K
12	inl	Indonesian Sign Language	Indonesia	<10K
13	kgq	Kamoro	Indonesia	<10K
14	kht	Khamti	Myanmar	<10K
15	kpq	Korupun-Sela	Indonesia	<10K
16	kti	North Muyu	Indonesia	<10K
17	lcp	Western Lawa	Thailand	<10K
18	mtj	Moskona	Indonesia	<10K
19	slu	Selaru	Indonesia	<10K
20	tmw	Temuan	Malaysia	<10K
21	txt	Citak	Indonesia	<10K
22	whk	Wahau Kenyah	Indonesia	<10K
23	txn	West Tarangan	Indonesia	<10K
24	dro	Daro-Matu Melanau	Malaysia	<10K
25	awu	Central Awyu	Indonesia	<10K
26	itb	Binongan Itneg	Philippines	<10K
27	lti	Leti	Indonesia	<10K
28	saj	Sahu	Indonesia	<10K
29	kvv	Kola	Indonesia	<10K
30	kvu	Yinbaw	Myanmar	<10K
31	akc	Mpur	Indonesia	<10K
32	cns	Central Asmat	Indonesia	<10K
33	crw	Chrau	Vietnam	<10K
34	lwl	Eastern Lawa	Thailand	<10K
35	lzn	Lainong Naga	Myanmar	<10K
36	mrz	Marind	Indonesia	<10K
37	row	Dela-Oenale	Indonesia	<10K
38	sfe	Eastern Subanen	Philippines	<10K
39	ttd	Tutong	Brunei	<10K
40	iwo	Morop	Indonesia	<10K
41	twb	Tawbuid	Philippines	<10K
42	bhz	Bada	Indonesia	<10K
43	pwm	Molbog	Malaysia, Philippines	<10K
44	psa	Asue Awyu	Indonesia	<10K
45	ebk	Eastern Bontok	Philippines	<10K
46	tre	East Tarangan	Indonesia	<10K
47	npy	Napu	Indonesia	<10K
48	gdg	Ga’dang	Philippines	<10K
49	gir	Red Gelao	Vietnam	<10K
50	kll	Kagan Kalagan	Philippines	<10K
51	lwt	Lewotobi	Indonesia	<10K
52	moo	Monom	Vietnam	<10K
53	pnp	Pancana	Indonesia	<10K
54	tdr	Todrah	Vietnam	<10K
55	weo	Wemale	Indonesia	<10K
56	woi	Kamang	Indonesia	<10K
57	wrp	Waropen	Indonesia	<10K
58	lha	Laha	Vietnam	<10K
59	kvo	Dobel	Indonesia	<10K
60	mtg	Una	Indonesia	<10K
61	inn	Isinay	Philippines	<10K
62	ihp	Iha	Indonesia	<10K
63	jka	Kaera	Indonesia	<10K
64	myl	Moma	Indonesia	<10K
65	mmn	Minamanwa	Philippines	<10K
66	nxr	Ninggerum	Indonesia	<10K
67	blx	Mag-Indi Ayta	Philippines	<10K
68	duw	Dusun Witu	Indonesia	<10K
69	kgw	Karon Dori	Indonesia	<10K
70	kyo	Klon	Indonesia	<10K
71	lbt	Lachi	Vietnam	<10K
72	mli	Malimpung	Indonesia	<10K
73	nfa	Dhao	Indonesia	<10K
74	pdo	Padoe	Indonesia	<10K
75	raz	Rahambuu	Indonesia	<10K
76	tpg	Kula	Indonesia	<10K
77	urk	Urak Lawoi’	Thailand	<10K
78	wad	Wamesa	Indonesia	<10K
79	wod	Wolani	Indonesia	<10K
80	wul	Silimo	Indonesia	<10K

Table 39: (1/6) SEA indigenous languages with <10K speakers.

No.	ISO 639-3	Language	Region(s)	Population
In SEACrowd
81	yac	Pass Valley Yali	Indonesia	<10K
82	yoy	Yoy	Laos, Thailand	<10K
83	and	Ansus	Indonesia	<10K
84	mxn	Moi Kelim	Indonesia	<10K
85	tlv	Taliabu	Indonesia	<10K
86	bty	Bobot	Indonesia	<10K
87	duq	Dusun Malang	Indonesia	<10K
88	ums	Pendau	Indonesia	<10K
89	vbb	Southeast Babar	Indonesia	<10K
90	baj	Barakai	Indonesia	<10K
91	bgr	Bawm Chin	Myanmar	<10K
92	irr	Ir	Laos	<10K
93	nbq	Nggem	Indonesia	<10K
94	bqr	Burusu	Indonesia	<10K
95	kvd	Kui	Indonesia	<10K
96	bny	Bintulu	Malaysia	<10K
97	rka	Kraol	Cambodia	<10K
98	jah	Jah Hut	Malaysia	<10K
99	kys	Baram Kayan	Malaysia	<10K
100	smu	Somray	Cambodia	<10K
101	sza	Semelai	Malaysia	<10K
102	alk	Alak	Laos	<10K
103	anl	Anu-Khongso Chin	Myanmar	<10K
104	bei	Bakati’	Indonesia	<10K
105	irh	Irarutu	Indonesia	<10K
106	kta	Katua	Vietnam	<10K
107	kts	South Muyu	Indonesia	<10K
108	kzi	Kelabit	Indonesia, Malaysia	<10K
109	lmr	Lamalera	Indonesia	<10K
110	mwt	Moken	Myanmar, Thailand	<10K
111	ntx	Tangkhul Naga	Myanmar	<10K
112	ror	Rongga	Indonesia	<10K
113	sdu	Sarudu	Indonesia	<10K
114	slz	Ma’ya	Indonesia	<10K
115	sre	Sara Bakati’	Indonesia	<10K
116	tgb	Tobilung	Malaysia	<10K
117	twe	Teiwa	Indonesia	<10K
118	tyn	Kombai	Indonesia	<10K
119	wah	Watubela	Indonesia	<10K
120	nev	Nyaheun	Laos	<10K
121	klz	Kabola	Indonesia	<10K
122	awy	Edera Awyu	Indonesia	<10K
123	abd	Manide	Philippines	<10K
124	tnm	Tabla	Indonesia	<10K
125	skb	Saek	Laos, Thailand	<10K
126	kvw	Wersing	Indonesia	<10K
127	xod	Kokoda	Indonesia	<10K
128	bpq	Banda Malay	Indonesia	<10K
129	bay	Batuley	Indonesia	<10K
130	kgx	Kamaru	Indonesia	<10K
131	khe	Korowai	Indonesia	<10K
132	lkj	Remun	Malaysia	<10K
133	pku	Paku	Indonesia	<10K
134	saw	Sawi	Indonesia	<10K
135	tcg	Tamagario	Indonesia	<10K
136	pne	Western Penan	Malaysia	<10K
137	xks	Kumbewaha	Indonesia	<10K
138	pgu	Pagu	Indonesia	<10K
139	tpo	Tai Pao	Laos, Vietnam	<10K
140	zrs	Mairasi	Indonesia	<10K
141	kzz	Kalabra	Indonesia	<10K
142	bls	Balaesang	Indonesia	<10K
143	kuv	Kur	Indonesia	<10K
144	ree	Rejang Kayan	Malaysia	<10K
145	abp	Abellen Ayta	Philippines	<10K
146	adn	Adang	Indonesia	<10K
147	ahh	Aghu	Indonesia	<10K
148	bnd	Banda	Indonesia	<10K
149	bnq	Bantik	Indonesia	<10K
150	ckh	Chak	Myanmar	<10K
151	due	Umiray Dumaget Agta	Philippines	<10K
152	eip	Lik	Indonesia	<10K
153	kgr	Abun	Indonesia	<10K
154	kig	Kimaghima	Indonesia	<10K
155	nsy	Nasal	Indonesia	<10K
156	swt	Sawila	Indonesia	<10K
157	tmg	Ternateño	Indonesia	<10K
158	wms	Wambon	Indonesia	<10K
159	mhe	Mah Meri	Malaysia	<10K
160	bgl	Bo	Laos	<10K

Table 40: (2/6) SEA indigenous languages with <10K speakers.

No.	ISO 639-3	Language	Region(s)	Population
In SEACrowd
161	bpv	Bian Marind	Indonesia	<10K
162	gzn	Gane	Indonesia	<10K
163	dmr	East Damar	Indonesia	<10K
164	obk	Southern Bontok	Philippines	<10K
165	bzl	Boano	Indonesia	<10K
166	hbu	Habun	East Timor	<10K
167	zng	Mang	Vietnam	<10K
168	gei	Gebe	Indonesia	<10K
169	spb	Sepa	Indonesia	<10K
170	agv	Remontado Dumagat	Philippines	<10K
171	bzq	Buli	Indonesia	<10K
172	brp	Barapasi	Indonesia	<10K
173	cbl	Bualkhaw Chin	Myanmar	<10K
174	grs	Gresi	Indonesia	<10K
175	jmn	Makuri Naga	Myanmar	<10K
176	kmt	Kemtuik	Indonesia	<10K
177	kwe	Kwerba	Indonesia	<10K
178	sko	Seko Tengah	Indonesia	<10K
179	wrs	Waris	Indonesia	<10K
180	kyi	Kiput	Malaysia	<10K
181	nrm	Narom	Malaysia	<10K
182	klw	Tado	Indonesia	<10K
183	spu	Sapuan	Laos	<10K
184	jei	Yei	Indonesia	<10K
185	sqq	Sou	Laos	<10K
186	awv	Jair Awyu	Indonesia	<10K
187	bup	Busoa	Indonesia	<10K
188	kkl	Kosarek Yale	Indonesia	<10K
189	zka	Kaimbulawa	Indonesia	<10K
190	kjr	Kurudu	Indonesia	<10K
191	alj	Alangan	Philippines	<10K
192	asy	Yaosakor Asmat	Indonesia	<10K
193	dms	Dampelas	Indonesia	<10K
194	enr	Emem	Indonesia	<10K
195	hnu	Hung	Laos, Vietnam	<10K
196	kwt	Kwesten	Indonesia	<10K
197	kyj	Karao	Philippines	<10K
198	lau	Laba	Indonesia	<10K
199	ley	Limola	Indonesia	<10K
200	mqf	Momuna	Indonesia	<10K
201	mqo	Modole	Indonesia	<10K
202	nir	Nimboran	Indonesia	<10K
203	pmo	Pom	Indonesia	<10K
204	sge	Segai	Indonesia	<10K
205	szc	Semaq Beri	Malaysia	<10K
206	tgt	Central Tagbanwa	Philippines	<10K
207	tty	Sikaritai	Indonesia	<10K
208	bgk	Bit	Laos	<10K
209	grm	Kota Marudu Talantang	Malaysia	<10K
210	srl	Isirawa	Indonesia	<10K
211	wbw	Woi	Indonesia	<10K
212	sib	Sebop	Malaysia	<10K
213	bnb	Bookan Murut	Malaysia	<10K
214	llm	Lasalimu	Indonesia	<10K
215	rmm	Roma	Indonesia	<10K
216	pcb	Pear	Cambodia	<10K
217	abc	Ambala Ayta	Philippines	<10K
218	nxx	Nafri	Indonesia	<10K
219	lwh	White Lachi	Vietnam	<10K
220	ury	Orya	Indonesia	<10K
221	irx	Kamberau	Indonesia	<10K
222	atk	Ati	Philippines	<10K
223	bgb	Bobongko	Indonesia	<10K
224	bvz	Bauzi	Indonesia	<10K
225	bzp	Kemberano	Indonesia	<10K
226	cbn	Nyahkur	Thailand	<10K
227	dbf	Edopi	Indonesia	<10K
228	eno	Enggano	Indonesia	<10K
229	mkm	Moklen	Thailand	<10K
230	nxl	South Nuaulu	Indonesia	<10K
231	vko	Kodeoha	Indonesia	<10K
232	wbb	Wabo	Indonesia	<10K
233	yir	North Awyu	Indonesia	<10K
234	zbc	Central Berawan	Malaysia	<10K
235	bya	Batak	Philippines	<10K

Table 41: (3/6) SEA indigenous languages with <10K speakers.

No.	ISO 639-3	Language	Region(s)	Population
In SEACrowd
236	bdg	Bonggi	Malaysia	<10K
237	fau	Fayu	Indonesia	<10K
238	ilu	Ili’uun	Indonesia	<10K
239	yet	Yetfa	Indonesia	<10K
240	dmy	Sowari	Indonesia	<10K
241	ddw	Dawera-Daweloor	Indonesia	<10K
242	jhi	Jehai	Malaysia	<10K
243	xmt	Matbat	Indonesia	<10K
244	beg	Belait	Brunei	<10K
245	ivb	Ibatan	Philippines	<10K
246	oia	Oirata	Indonesia	<10K
247	bkl	Berik	Indonesia	<10K
248	duo	Dupaninan Agta	Philippines	<10K
249	kdw	Koneraw	Indonesia	<10K
250	msf	Mekwei	Indonesia	<10K
251	nqm	Ndom	Indonesia	<10K
252	sbg	Moi Lemas	Indonesia	<10K
253	seu	Serui-Laut	Indonesia	<10K
254	tve	Te’un	Indonesia	<10K
255	tzn	Tugun	Indonesia	<10K
256	wng	Wanggom	Indonesia	<10K
257	bnj	Bangon	Philippines	<10K
258	snv	Sa’ban	Indonesia, Malaysia	<10K
259	bdw	Baham	Indonesia	<10K
260	ran	Riantana	Indonesia	<10K
261	rnn	Roon	Indonesia	<10K
262	szp	Suabo	Indonesia	<10K
263	zbe	East Berawan	Malaysia	<10K
264	scb	Chut	Laos, Vietnam	<10K
265	tvm	Tela-Masbuar	Indonesia	<10K
266	udj	Ujir	Indonesia	<10K
267	agy	Southern Alta	Philippines	<10K
268	air	Airoran	Indonesia	<10K
269	aqm	Atohwaim	Indonesia	<10K
270	asi	Buruwai	Indonesia	<10K
271	att	Pamplona Atta	Philippines	<10K
272	bcd	North Babar	Indonesia	<10K
273	bnf	Masiwang	Indonesia	<10K
274	btq	Batek	Malaysia	<10K
275	cth	Thaiphum Chin	Myanmar	<10K
276	dem	Dem	Indonesia	<10K
277	dmg	Upper Kinabatangan	Malaysia	<10K
278	dnu	Danau	Myanmar	<10K
279	etz	Semimi	Indonesia	<10K
280	jbj	Arandai	Indonesia	<10K
281	kbv	Dla	Indonesia	<10K
282	kpu	Kafoa	Indonesia	<10K
283	kvy	Yintale	Myanmar	<10K
284	msg	Moraid	Indonesia	<10K
285	nks	North Asmat	Indonesia	<10K
286	pnx	Phong-Kniang	Laos	<10K
287	sob	Sobei	Indonesia	<10K
288	wgo	Ambel	Indonesia	<10K
289	wno	Wano	Indonesia	<10K
290	xse	Sempan	Indonesia	<10K
291	zbw	West Berawan	Malaysia	<10K
Not in SEACrowd
292	rbk	Northern Bontok	Philippines	<10K
293	kvt	Lahta	Myanmar	<10K
294	lbg	Laopang	Laos	<10K
295	stu	Samtao	Myanmar	<10K
296	kxk	Zayein	Myanmar	<10K
297	iti	Inlaud Itneg	Philippines	<10K
298	nqq	Chen-Kayu Naga	Myanmar	<10K
299	pnc	Pannei	Indonesia	<10K
300	zkn	Kanan	Myanmar	<10K
301	mlz	Malaynon	Philippines	<10K
302	khf	Khuen	Laos	<10K
303	kkx	Kohin	Indonesia	<10K
304	lmj	West Lembata	Indonesia	<10K
305	dkr	Kuijau	Malaysia	<10K
306	ebc	Beginci	Indonesia	<10K
307	mtw	Southern Binukidnon	Philippines	<10K
308	mqk	Rajah Kabunsuwan Manobo	Philippines	<10K
309	csx	Cambodian Sign Language	Cambodia	<10K
310	tis	Masadiit Itneg	Philippines	<10K
311	csj	Songlai Chin	Myanmar	<10K
312	mqc	Mangole	Indonesia	<10K
313	bpz	Bilba	Indonesia	<10K
314	lmf	South Lembata	Indonesia	<10K
315	wha	Sou Upaa	Indonesia	<10K
316	lkc	Kucong	Vietnam	<10K
317	mqa	Maba	Indonesia	<10K
318	lcq	Luhu	Indonesia	<10K
319	mjb	Makalero	East Timor	<10K

Table 42: (4/6) SEA indigenous languages with <10K speakers.

No.	ISO 639-3	Language	Region(s)	Population
Not in SEACrowd
320	krv	Kavet	Cambodia	<10K
321	cey	Ekai Chin	Myanmar	<10K
322	kjt	Phrae Pwo Karen	Thailand	<10K
323	kuk	Kepo’	Indonesia	<10K
324	put	Putoh	Indonesia	<10K
325	rjg	Rajong	Indonesia	<10K
326	sjb	Sajau Basap	Indonesia	<10K
327	tkz	Takua	Vietnam	<10K
328	amv	Ambelau	Indonesia	<10K
329	wlh	Welaun	East Timor, Indonesia	<10K
330	plz	Paluan Murut	Malaysia	<10K
331	jkp	Paku Karen	Myanmar	<10K
332	adb	Atauran	East Timor	<10K
333	nea	Eastern Ngad’a	Indonesia	<10K
334	ntd	Northern Tidung	Malaysia	<10K
335	phh	Phula	Vietnam	<10K
336	reb	Rembong	Indonesia	<10K
337	skx	Seko Padang	Indonesia	<10K
338	swu	Suwawa	Indonesia	<10K
339	tgr	Tareng	Laos	<10K
340	weu	Rawngtu Chin	Myanmar	<10K
341	sau	Saleman	Indonesia	<10K
342	thi	Tai Long	Laos	<10K
343	low	Tampias Lobu	Malaysia	<10K
344	npg	Ponyo-Gongwang Naga	Myanmar	<10K
345	ukk	Muak Sa-aak	Myanmar	<10K
346	tlq	Tai Loi	Laos, Myanmar	<10K
347	hkn	Mel-Khaonh	Cambodia	<10K
348	jkm	Mobwa Karen	Myanmar	<10K
349	lmq	Lamatuka	Indonesia	<10K
350	lvu	Levuka	Indonesia	<10K
351	lwe	Lewoeleng	Indonesia	<10K
352	rtc	Rungtu Chin	Myanmar	<10K
353	ruu	Lanas Lobu	Malaysia	<10K
354	tiu	Adasen	Philippines	<10K
355	umn	Paungnyuan Naga	Myanmar	<10K
356	lhh	Laha	Indonesia	<10K
357	bjx	Vanaw Kalinga	Philippines	<10K
358	bvt	Bati	Indonesia	<10K
359	kqv	Okolod	Indonesia, Malaysia	<10K
360	xkk	Kachok	Cambodia	<10K
361	iwk	I-wak	Philippines	<10K
362	lka	Lakalei	East Timor	<10K
363	bzn	Boano	Indonesia	<10K
364	sbr	Sembakung Murut	Indonesia, Malaysia	<10K
365	bfg	Busang Kayan	Indonesia	<10K
366	hap	Hupla	Indonesia	<10K
367	kxi	Keningau Murut	Malaysia	<10K
368	llq	Lolak	Indonesia	<10K
369	roc	Cacgia Roglai	Vietnam	<10K
370	sls	Singapore Sign Language	Singapore	<10K
371	ste	Liana-Seti	Indonesia	<10K
372	ulu	Uma’ Lung	Indonesia	<10K
373	wli	Waioli	Indonesia	<10K
374	wrx	Wae Rana	Indonesia	<10K
375	xhv	Khua	Laos, Vietnam	<10K
376	tdy	Tadyawan	Philippines	<10K
377	zbt	Batui	Indonesia	<10K
378	sws	Seluwasan	Indonesia	<10K
379	pni	Aoheng	Indonesia	<10K
380	tuj	Tugutil	Indonesia	<10K
381	nps	Nipsan	Indonesia	<10K
382	uan	Kuan	Laos	<10K
383	vbk	Southwestern Bontok	Philippines	<10K
384	dmv	Dumpas	Malaysia	<10K
385	xko	Kiorr	Laos	<10K
386	kve	Kalabakan Murut	Malaysia	<10K
387	mcm	Malaccan Portuguese Creole	Malaysia	<10K
388	ltu	Latu	Indonesia	<10K
389	gef	Gerai	Indonesia	<10K
390	cnc	Côông	Vietnam	<10K
391	bpo	Anasi	Indonesia	<10K
392	hld	Halang Doan	Laos, Vietnam	<10K
393	nxk	Kokak Naga	Myanmar	<10K
394	puj	Punan Tubu	Indonesia	<10K
395	xkn	Kayan River Kayan	Indonesia	<10K
396	ycp	Chepya	Laos	<10K
397	lcs	Lisabata-Nuniali	Indonesia	<10K
398	haf	Haiphong Sign Language	Vietnam	<10K
399	slt	Sila	Laos, Vietnam	<10K

Table 43: (5/6) SEA indigenous languages with <10K speakers.

No.	ISO 639-3	Language	Region(s)	Population
Not in SEACrowd
400	kvh	Komodo	Indonesia	<10K
401	apf	Pahanan Agta	Philippines	<10K
402	bzb	Andio	Indonesia	<10K
403	jal	Yalahatan	Indonesia	<10K
404	mvr	Marau	Indonesia	<10K
405	agz	Mt. Iriga Agta	Philippines	<10K
406	dkk	Dakka	Indonesia	<10K
407	gak	Gamkonora	Indonesia	<10K
408	kmd	Majukayang Kalinga	Philippines	<10K
409	mqp	Manipa	Indonesia	<10K
410	pzn	Jejara Naga	Myanmar	<10K
411	xkd	Mendalam Kayan	Indonesia	<10K
412	xay	Kayan Mahakam	Indonesia	<10K
413	xky	Uma’ Lasan	Indonesia, Malaysia	<10K
414	mqq	Minokok	Malaysia	<10K
415	neo	Ná-Meo	Vietnam	<10K
416	tln	Talondo’	Indonesia	<10K
417	bqy	Kata Kolok	Indonesia	<10K
418	mxr	Murik	Malaysia	<10K
419	nty	Mantsi	Vietnam	<10K
420	tev	Teor	Indonesia	<10K
421	ttp	Tombelala	Indonesia	<10K
422	ayt	Magbukun Ayta	Philippines	<10K
423	ckn	Kaang Chin	Myanmar	<10K
424	cno	Con	Laos	<10K
425	goq	Gorap	Indonesia	<10K
426	hov	Hovongan	Indonesia	<10K
427	lpn	Long Phuri Naga	Myanmar	<10K
428	nlq	Lao Naga	Myanmar	<10K
429	nqy	Akyaung Ari Naga	Myanmar	<10K
430	nuo	Ngoaun	Laos, Vietnam	<10K
431	psg	Penang Sign Language	Malaysia	<10K
432	ues	Kioko	Indonesia	<10K

Table 44: (6/6) SEA indigenous languages with <10K speakers.

No.	ISO 639-3	Language	Region(s)	Population
In SEACrowd
1	sow	Sowanda	Indonesia	<1K
2	duv	Duvle	Indonesia	<1K
3	hmu	Hamap	Indonesia	<1K
4	ktt	Ketum	Indonesia	<1K
5	mpz	Mpi	Thailand	<1K
6	tvw	Sedoa	Indonesia	<1K
7	syo	Su’ung	Cambodia	<1K
8	mgk	Mawes	Indonesia	<1K
9	mss	West Masela	Indonesia	<1K
10	dij	Dai	Indonesia	<1K
11	drn	West Damar	Indonesia	<1K
12	lji	Laiyolo	Indonesia	<1K
13	mth	Munggui	Indonesia	<1K
14	psn	Panasuan	Indonesia	<1K
15	ret	Reta	Indonesia	<1K
16	twg	Tereweng	Indonesia	<1K
17	bpg	Bonggo	Indonesia	<1K
18	agt	Central Cagayan Agta	Philippines	<1K
19	kvz	Tsaukambo	Indonesia	<1K
20	skp	Sekapan	Malaysia	<1K
21	bsm	Busami	Indonesia	<1K
22	bzi	Bisu	Thailand	<1K
23	kzm	Kais	Indonesia	<1K
24	mhz	Mor	Indonesia	<1K
25	nkj	Nakai	Indonesia	<1K
26	pru	Puragi	Indonesia	<1K
27	skv	Skou	Indonesia	<1K
28	laq	Qabiao	Vietnam	<1K
29	ssm	Semnam	Malaysia	<1K
30	slg	Selungai Murut	Indonesia, Malaysia	<1K
31	tpf	Tarpia	Indonesia	<1K
32	vto	Vitou	Indonesia	<1K
33	wsa	Warembori	Indonesia	<1K
34	dgc	Casiguran Dumagat Agta	Philippines	<1K
35	bfe	Betaf	Indonesia	<1K
36	kgb	Kawe	Indonesia	<1K
37	kwh	Kowiai	Indonesia	<1K
38	ppm	Papuma	Indonesia	<1K
39	tdi	Tomadino	Indonesia	<1K
40	tmu	Iau	Indonesia	<1K
41	uka	Kaburi	Indonesia	<1K
42	bkn	Bukitan	Indonesia, Malaysia	<1K
43	imr	Imroing	Indonesia	<1K
44	tgq	Tring	Malaysia	<1K
45	tlk	Taloki	Indonesia	<1K
46	ert	Eritai	Indonesia	<1K
47	lpe	Lepki	Indonesia	<1K
48	vme	East Masela	Indonesia	<1K
49	mxz	Central Masela	Indonesia	<1K
50	aos	Taikat	Indonesia	<1K
51	cog	Chong	Thailand	<1K
52	dpp	Papar	Malaysia	<1K
53	jet	Manem	Indonesia	<1K
54	kag	Kajaman	Malaysia	<1K
55	kgi	Selangor Sign Language	Malaysia	<1K
56	kly	Kalao	Indonesia	<1K
57	knd	Konda	Indonesia	<1K
58	kuc	Kwinsu	Indonesia	<1K
59	lvi	Lavi	Laos	<1K
60	nbn	Kuri	Indonesia	<1K
61	ner	Yahadian	Indonesia	<1K
62	oni	Onin	Indonesia	<1K
63	orz	Ormu	Indonesia	<1K
64	pkt	Maleng	Laos, Vietnam	<1K
65	rth	Ratahan	Indonesia	<1K
66	sbt	Kimki	Indonesia	<1K
67	tcm	Tanahmerah	Indonesia	<1K
68	trt	Tunggare	Indonesia	<1K
69	wtw	Wotu	Indonesia	<1K
70	xkq	Koroni	Indonesia	<1K
71	cwg	Cheq Wong	Malaysia	<1K
72	bpp	Kaure	Indonesia	<1K
73	isd	Isnag	Philippines	<1K
74	pna	Punan Bah-Biau	Malaysia	<1K
75	skz	Sekar	Indonesia	<1K
76	thm	Aheu	Thailand	<1K
77	toy	Topoiyo	Indonesia	<1K
78	dbe	Dabe	Indonesia	<1K
79	bvk	Bukat	Indonesia	<1K
80	dei	Demisa	Indonesia	<1K

Table 45: (1/3) SEA indigenous languages with <1K speakers.

No.	ISO 639-3	Language	Region(s)	Population
In SEACrowd
81	jel	Yelmek	Indonesia	<1K
82	nun	Anong	Myanmar	<1K
83	opk	Kopkaka	Indonesia	<1K
84	pas	Papasena	Indonesia	<1K
85	tmj	Samarokena	Indonesia	<1K
86	urn	Uruangnirin	Indonesia	<1K
87	xau	Kauwera	Indonesia	<1K
88	kdy	Keijar	Indonesia	<1K
89	auu	Auye	Indonesia	<1K
90	auw	Awyi	Indonesia	<1K
91	flh	Foau	Indonesia	<1K
92	gop	Yeretuar	Indonesia	<1K
93	jau	Yaur	Indonesia	<1K
94	lhn	Lahanan	Malaysia	<1K
95	pee	Taje	Indonesia	<1K
96	phq	Phana’	Laos	<1K
97	tnz	Ten’edn	Malaysia, Thailand	<1K
98	wru	Waru	Indonesia	<1K
99	sve	Serili	Indonesia	<1K
100	bgv	Warkay-Bipim	Indonesia	<1K
101	bhc	Biga	Indonesia	<1K
102	bqb	Bagusa	Indonesia	<1K
103	bsa	Abinomn	Indonesia	<1K
104	ccm	Malaccan Malay Creole	Malaysia	<1K
105	giq	Green Gelao	Vietnam	<1K
106	kja	Mlap	Indonesia	<1K
107	kzv	Komyandaret	Indonesia	<1K
108	mrf	Elseng	Indonesia	<1K
109	swr	Saweru	Indonesia	<1K
110	tad	Tause	Indonesia	<1K
111	tbp	Diebroud	Indonesia	<1K
112	tmo	Temoq	Malaysia	<1K
113	tyh	O’du	Laos, Vietnam	<1K
114	wuy	Wauyai	Indonesia	<1K
115	xwr	Kwerba Mamberamo	Indonesia	<1K
116	rmh	Murkim	Indonesia	<1K
117	tml	Tamnim Citak	Indonesia	<1K
118	wet	Perai	Indonesia	<1K
119	bqq	Biritai	Indonesia	<1K
120	brs	Baras	Indonesia	<1K
121	bzu	Burmeso	Indonesia	<1K
122	emw	Emplawas	Indonesia	<1K
123	kiq	Kosare	Indonesia	<1K
124	kiy	Kirikiri	Indonesia	<1K
125	kns	Kensiu	Malaysia, Thailand	<1K
126	lcc	Legenyem	Indonesia	<1K
127	mso	Mombum	Indonesia	<1K
128	mvx	Meoswar	Indonesia	<1K
129	sao	Sause	Indonesia	<1K
130	snu	Viid	Indonesia	<1K
131	tlg	Tofanma	Indonesia	<1K
132	kgv	Karas	Indonesia	<1K
133	lnh	Lanoh	Malaysia	<1K
134	asz	As	Indonesia	<1K
135	kbi	Kaptiau	Indonesia	<1K
136	msl	Molof	Indonesia	<1K
137	wfg	Zorop	Indonesia	<1K
138	dmu	Tebi	Indonesia	<1K
139	llk	Lelak	Malaysia	<1K
140	tcq	Kaiy	Indonesia	<1K
141	aqn	Northern Alta	Philippines	<1K
142	bnv	Beneraf	Indonesia	<1K
143	enc	En	Vietnam	<1K
144	erw	Erokwanas	Indonesia	<1K
145	jbr	Jofotek-Bromnya	Indonesia	<1K
146	khh	Kehu	Indonesia	<1K
147	khp	Kapauri	Indonesia	<1K
148	kxn	Kanowit-Tanjong Melanau	Malaysia	<1K
149	mmb	Momina	Indonesia	<1K
150	nec	Nedebang	Indonesia	<1K
151	nyl	Nyeu	Thailand	<1K
152	rac	Rasawa	Indonesia	<1K
153	tnu	Tai Khang	Laos	<1K
154	wai	Wares	Indonesia	<1K
155	yki	Yoke	Indonesia	<1K
156	bed	Bedoanas	Indonesia	<1K
157	mzt	Mintil	Malaysia	<1K
158	agf	Arguni	Indonesia	<1K
159	apx	Aputai	Indonesia	<1K
160	kcd	Ngkâlmpw Kanum	Indonesia	<1K

Table 46: (2/3) SEA indigenous languages with <1K speakers.

No.	ISO 639-3	Language	Region(s)	Population
In SEACrowd
161	ugo	Ugong	Thailand	<1K
162	wbe	Waritai	Indonesia	<1K
163	mra	Mlabri	Laos, Thailand	<1K
164	afz	Obokuitai	Indonesia	<1K
165	mgf	Maklew	Indonesia	<1K
166	ttn	Towei	Indonesia	<1K
167	knq	Kintaq	Malaysia	<1K
168	ulf	Usku	Indonesia	<1K
169	awh	Awbono	Indonesia	<1K
170	bti	Burate	Indonesia	<1K
171	byl	Bayono	Indonesia	<1K
172	diy	Diuwe	Indonesia	<1K
173	kpi	Kofei	Indonesia	<1K
174	krz	Sota Kanum	Indonesia	<1K
175	kwr	Kwer	Indonesia	<1K
176	tfo	Tefaro	Indonesia	<1K
177	tkx	Tangko	Indonesia	<1K
178	tti	Tobati	Indonesia	<1K
Not in SEACrowd
179	lcd	Lola	Indonesia	<1K
180	ors	Orang Seletar	Malaysia	<1K
181	kpd	Koba	Indonesia	<1K
182	trx	Tringgus-Sembaan Bidayuh	Malaysia	<1K
183	kqt	Klias River Kadazan	Malaysia	<1K
184	atp	Pudtol Atta	Philippines	<1K
185	tcp	Tawr Chin	Myanmar	<1K
186	kyd	Karey	Indonesia	<1K
187	pyy	Pyen	Myanmar	<1K
188	ttw	Long Wat	Malaysia	<1K
189	xmx	Salawati	Indonesia	<1K
190	ymn	Sunum	Indonesia	<1K
191	wkd	Mo	Indonesia	<1K
192	abf	Abai Sungai	Malaysia	<1K
193	esy	Eskayan	Philippines	<1K
194	kzb	Kaibobo	Indonesia	<1K
195	njs	Nisa	Indonesia	<1K
196	nni	North Nuaulu	Indonesia	<1K
197	whu	Wahau Kayan	Indonesia	<1K
198	xke	Kereho	Indonesia	<1K
199	lce	Sekak	Indonesia	<1K
200	sdx	Sibu Melanau	Malaysia	<1K
201	bfk	Ban Khor Sign Language	Thailand	<1K
202	kax	Kao	Indonesia	<1K
203	srk	Serudung Murut	Malaysia	<1K
204	pud	Punan Aput	Indonesia	<1K
205	bgy	Benggoi	Indonesia	<1K
206	kzd	Kadai	Indonesia	<1K
207	kvp	Kompane	Indonesia	<1K
208	auq	Anus	Indonesia	<1K
209	azt	Faire Atta	Philippines	<1K
210	hud	Huaulu	Indonesia	<1K
211	lgh	Laghuu	Vietnam	<1K
212	tip	Trimuris	Indonesia	<1K
213	tyj	Tai Yo	Laos, Vietnam	<1K
214	tys	Tày Sa Pa	Vietnam	<1K
215	mqi	Mariri	Indonesia	<1K
216	pdn	Fedan	Indonesia	<1K
217	mnq	Minriq	Malaysia	<1K
218	daz	Dao	Indonesia	<1K
219	gnq	Gana	Malaysia	<1K
220	lrn	Lorang	Indonesia	<1K
221	bsu	Bahonsuai	Indonesia	<1K
222	puc	Punan Merap	Indonesia	<1K
223	rmx	Romam	Vietnam	<1K
224	tyl	Thu Lao	Vietnam	<1K
225	yrs	Yarsun	Indonesia	<1K
226	atl	Mt. Iraya Agta	Philippines	<1K
227	puf	Punan Merah	Indonesia	<1K
228	umi	Ukit	Malaysia	<1K
229	jvd	Javindo	Indonesia	<1K
230	srt	Sauri	Indonesia	<1K

Table 47: (3/3) SEA indigenous languages with <1K speakers.

No.	ISO 639-3	Language	Region(s)	Population
In SEACrowd
1	mnu	Mer	Indonesia	<100
2	itx	Itik	Indonesia	<100
3	kxq	Smärky Kanum	Indonesia	<100
4	lix	Liabuku	Indonesia	<100
5	awr	Awera	Indonesia	<100
6	bdx	Budong-Budong	Indonesia	<100
7	ire	Yeresiam	Indonesia	<100
8	tds	Doutai	Indonesia	<100
9	mrx	Dineor	Indonesia	<100
10	amq	Amahai	Indonesia	<100
11	kzu	Kayupulau	Indonesia	<100
12	mok	Morori	Indonesia	<100
13	plh	Paulohi	Indonesia	<100
14	sgu	Salas	Indonesia	<100
15	aip	Burumakok	Indonesia	<100
16	dbn	Duriankere	Indonesia	<100
17	dul	Inagta Alabat	Philippines	<100
18	moq	Mor	Indonesia	<100
19	naa	Namla	Indonesia	<100
20	mvs	Massep	Indonesia	<100
21	aem	Arem	Laos, Vietnam	<100
22	mqr	Mander	Indonesia	<100
23	xkw	Kembra	Indonesia	<100
24	kkb	Kwerisa	Indonesia	<100
25	atz	Arta	Philippines	<100
26	ibh	Bih	Vietnam	<100
27	khd	Bädi Kanum	Indonesia	<100
28	nul	Nusa Laut	Indonesia	<100
29	scq	Chung	Cambodia	<100
30	mqt	Mok	Myanmar, Thailand	<10
31	btj	Bacanese Malay	Indonesia	<10
32	wor	Woria	Indonesia	<10
33	spi	Saponi	Indonesia	<10
34	dsn	Dusner	Indonesia	<10
35	lgi	Lengilu	Indonesia	<10
36	btn	Ratagnon	Philippines	<10
37	tni	Tandia	Indonesia	<10
38	huw	Hukumina	Indonesia	<10
39	kzl	Kayeli	Indonesia	<10
40	sxm	Samre	Cambodia, Thailand	<10
41	hpo	Hpon	Myanmar	<10
42	mpy	Mapia	Indonesia	<10
43	nil	Nila	Indonesia	<10
44	sbo	Sabüm	Malaysia	<10
45	srw	Serua	Indonesia	<10
46	tas	Tay Boi	Vietnam	<10
47	xbn	Kenaboi	Malaysia	<10
48	xxt	Tambora	Indonesia	<10
Not in SEACrowd
49	orn	Orang Kanaq	Malaysia	<100
50	lva	Makuva	East Timor	<100
51	spg	Sihan	Malaysia	<100
52	ibu	Ibu	Indonesia	<100
53	pnm	Punan Batu	Malaysia	<100
54	csd	Chiangmai Sign Language	Thailand	<100
55	ays	Sorsogon Ayta	Philippines	<100
56	lio	Liki	Indonesia	<100
57	pey	Petjo	Indonesia	<100
58	hti	Hoti	Indonesia	<100
59	huk	Hulung	Indonesia	<100
60	ism	Masimasi	Indonesia	<100
61	kzx	Kamarian	Indonesia	<100
62	pns	Ponosakan	Indonesia	<100
63	agk	Katubung Agta	Philippines	<10
64	nae	Naka’ela	Indonesia	<10
65	atm	Ata	Philippines	<10
66	ihb	Iha Based Pidgin	Indonesia	<10
67	tvy	Timor Pidgin	East Timor	<10
68	duy	Dicamay Agta	Philippines	<10
69	dyg	Villa Viciosa Agta	Philippines	<10
70	lox	Loun	Indonesia	<10
71	onx	Onin Based Pidgin	Indonesia	<10
72	tcl	Taman	Myanmar	<10
73	vms	Moksela	Indonesia	<10
74	wea	Wewaw	Myanmar	<10

Table 48: SEA indigenous languages with <100 speakers.

Model	Gini $\downarrow$
Commercial
GPT-4	0.155
Command-R	0.184
English
Mistral	0.159
Llama3	0.131
Falcon	0.238
Multilingual
mT0	0.131
BLOOMZ	0.228
BactrianX-Llama	0.163
AYA-23	0.183
AYA-101	0.095
SEA regional
SEA-LION	0.204
SeaLLM v2.5	0.116
Sailor	0.145
SEA country
Cendol-mT5	0.378
Cendol-Llama2	0.267
Merak v4	0.199
WangchanX-Llama3	0.153
Malaysian Llama3	0.179
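
For readers who want to see how a Gini value like those in the table above is computed, the following is a minimal sketch. It assumes each model has one non-negative score per language and applies the standard sorted-rank form of the Gini coefficient. The function name `gini` and the example scores are hypothetical, and this section does not state the exact formulation behind the reported numbers, so treat it as an illustration rather than the authors' procedure.

```python
from typing import Sequence


def gini(scores: Sequence[float]) -> float:
    """Gini coefficient of non-negative scores (0 = perfectly equal).

    Uses the sorted-rank identity
        G = 2 * sum_i(i * x_i) / (n * sum_i(x_i)) - (n + 1) / n
    with x_1 <= ... <= x_n and ranks i starting at 1.
    """
    values = sorted(scores)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        # No data, or all scores are zero: treat as no measurable inequality.
        return 0.0
    if values[0] < 0:
        raise ValueError("Gini coefficient is undefined for negative scores")
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical per-language scores for two imaginary models.
even_model = [0.5, 0.5, 0.5, 0.5]    # identical score on every language
skewed_model = [0.0, 0.0, 0.0, 1.0]  # only one language is served

print(gini(even_model))
print(gini(skewed_model))
```

A lower value means scores are spread more evenly across languages, which matches the downward arrow in the table header.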
