SEACrowd: A Multilingual Multimodal Data Hub
and Benchmark Suite for Southeast Asian Languages
Abstract
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub (https://seacrowd.github.io/seacrowd-catalogue/) that fills the resource gap by providing standardized corpora (https://github.com/SEACrowd/seacrowd-datahub/) in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.
Holy Lovenia★,1,2 Rahmad Mahendra★,3,2 Salsabil Maulana Akbar★,2 Lester James V. Miranda★,4 Jennifer Santoso★,5 Elyanah Aco★,6 Akhdan Fadhilah★,7 Jonibek Mansurov★,8 Joseph Marvin Imperial★,9,10 Onno P. Kampman★,11 Joel Ruben Antony Moniz★,6 Muhammad Ravi Shulthan Habibi★,3,2 Frederikus Hudi★,12,13 Railey Montalan★,1 Ryan Ignatius6 Joanito Agili Lopo14 William Nixon15 Börje F. Karlsson16 James Jaya6 Ryandito Diandaru6 Yuze Gao6 Patrick Amadeus15 Bin Wang6 Jan Christian Blaise Cruz17 Chenxi Whitehouse18 Ivan Halim Parmonangan19 Maria Khelli15 Wenyu Zhang6 Lucky Susanto20 Reynard Adha Ryanda21 Sonny Lazuardi Hermawan22 Dan John Velasco17 Muhammad Dehan Al Kautsar15 Willy Fitra Hendria6 Yasmin Moslem23 Noah Flynn24 Muhammad Farid Adilazuarda8 Haochen Li6 Johanes Lee15 R. Damanhuri25 Shuo Sun6 Muhammad Reza Qorib26 Amirbek Djanibekov8 Wei Qi Leong1 Quyet V. Do27 Niklas Muennighoff28 Tanrada Pansuwan18 Ilham Firdausi Putra6 Yan Xu29,27 Ngee Chia Tai1 Ayu Purwarianti6,30 Sebastian Ruder31 William Tjhi1 Peerat Limkonchotiwat★,32 Alham Fikri Aji★,8 Sedrick Keh★,33 Genta Indra Winata★,2 Ruochen Zhang★,34 Fajri Koto★,8,2 Zheng-Xin Yong★,34 Samuel Cahyawijaya★,27,2 1AI Singapore 2IndoNLP 3Universitas Indonesia 4Allen Institute for Artificial Intelligence 5RevComm, Inc. 6Independent Researcher 7Tohoku University 8MBZUAI 9University of Bath 10National University Philippines 11MOH Office for Healthcare Transformation (MOHT) 12NAIST 13Works Applications Lab 14Universitas Gadjah Mada 15Institut Teknologi Bandung 16Beijing Academy of Artificial Intelligence (BAAI) 17Samsung Research Philippines 18University of Cambridge 19Queensland University of Technology 20Monash University Indonesia 21Imperial College London 22Independent Design Engineer 23Bering Lab 24Amazon 25Universitas Diponegoro 26NUS 27HKUST 28Contextual AI 29Huawei Noah’s Ark Lab 30Prosa.ai 31Cohere 32VISTEC 33Toyota Research Institute 34Brown University ★Major contributors
1 Introduction
Despite the Southeast Asia (SEA) region being home to 1,300 indigenous languages (18% of the world’s languages) and 671 million people (8.75% of the world’s population), the representation of texts, images, and audio datasets from this region is significantly lacking in machine learning models. This deficiency negatively impacts the model quality for SEA languages. The coverage of SEA languages in two common pre-training resources, Common Crawl (https://commoncrawl.github.io/cc-crawl-statistics/plots/languages) and C4 Xue et al. (2021), is extremely limited, with only 2.36% (in 11 languages) and 10.62% (in 11 languages), respectively. In modalities beyond text, the representation is even more limited. For instance, Common Voice, one of the largest multilingual speech corpora, includes only 6 SEA indigenous languages Conneau et al. (2021); Ardila et al. (2020), and LAION-5B, one of the largest multilingual vision-language (VL) corpora, includes only 12 SEA indigenous languages Schuhmann et al. (2022). While datasets for other SEA indigenous languages may exist, they are often scattered, insufficiently documented, or varied in quality and formatting, thereby making access and usage challenging Cahyawijaya et al. (2023a); Joshi et al. (2020); Aji et al. (2023).
In terms of evaluation, the sparse availability of high-quality test sets for these languages also complicates evaluating models for SEA languages. Despite there being 1,300+ languages in the SEA region, prior works Winata et al. (2023); Cahyawijaya et al. (2021); Koto and Koto (2020); Zhang et al. (2024); Wang et al. (2024); Nguyen et al. (2023); Leong et al. (2023); Yong et al. (2023) have only evaluated fewer than 10 SEA languages collectively. The actual performance of current models on most SEA languages remains largely unknown.
Moreover, the dominance of Anglocentric training data potentially results in cultural bias when generating texts, images, or audio in underrepresented SEA languages Søgaard (2022); Talat et al. (2022). Further, Durmus et al. (2023); AlKhamissi et al. (2024); Cahyawijaya et al. (2024a) have shown that the learned representations in large language models (LLMs) often fail to reflect local cultural values in SEA Koto et al. (2024); Liu et al. (2024); Adilazuarda et al. (2024). This raises concerns about the ability of current LLMs to generate natural, high-quality texts for this region. In addition, the discrepancy in language support creates language barriers in technological access and risks marginalizing minority groups who do not speak the dominant language.
![Refer to caption](extracted/5712155/latex/figures/datasets_in_seacrowd.png)
In this work, we investigate the current AI progress for SEA languages by addressing the challenges of resources, evaluation, and generation quality. Our contributions are three-fold:
- We bridge the resource gap by centralizing and standardizing 500 corpora in nearly 1,000 SEA languages in SEACrowd, a comprehensive and standardized resource center, across 3 modalities: text, image, and audio.
- We close the evaluation gap in SEA languages with the SEACrowd Benchmarks, which cover 38 SEA indigenous languages on 13 tasks across 3 modalities, providing insights into the performance of a diverse spectrum of AI models. Further, our study reveals that the generative outputs of existing LLMs exhibit a closer resemblance to translationese rather than natural data in 9 SEA languages.
- We offer insights and strategies for the future development of AI in SEA, aiming to promote a sustainable and prosperous future through continuous AI advancements.
2 SEACrowd
SEACrowd represents the first comprehensive AI dataset collection initiative for SEA, developed through a collaborative effort among researchers and engineers primarily based in the SEA region. As addressed in §1, resource scarcity and the scattered nature of the data are crucial challenges in SEA. SEACrowd addresses these issues through two primary contributions: 1) consolidating datasheets to enhance data discoverability; and 2) standardizing dataloaders for easier use, especially in multiple dataset loading. We also follow data provenance practices (Longpre et al., 2023) to preserve the proprietary rights of dataset owners.
Consolidating datasheets
We invited contributors to submit datasheet forms Gebru et al. (2021) for publicly available datasets across all modalities including text, audio, and image in SEA languages and/or cultures. These datasheets include detailed information about each dataset, such as data subset(s), description, task, language, license, URL access, annotation method(s), annotation validation, relevant publications, publication venue, and data splits. For each submission, we manually verify and correct it as necessary to ensure datasheet accuracy.
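As an illustration of the datasheet fields listed above, the following is a minimal sketch of how one such record could be represented and checked before manual review. The class name `Datasheet`, its attribute names, the choice of required fields, and the example values are all hypothetical; they are not taken from the SEACrowd codebase.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Datasheet:
    """Hypothetical record mirroring the datasheet fields named in the text."""
    name: str
    subsets: List[str]
    description: str
    tasks: List[str]
    languages: List[str]            # e.g. ISO 639-3 codes such as "ind"
    license: str
    url: str
    annotation_methods: List[str]
    annotation_validation: str
    publications: List[str] = field(default_factory=list)
    venue: str = ""
    splits: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return required fields that were left empty (assumed requirement set)."""
        required = ["name", "description", "tasks", "languages", "license", "url"]
        return [f for f in required if not getattr(self, f)]


# Toy submission with an empty license, to be flagged for correction.
sheet = Datasheet(
    name="example_sentiment",
    subsets=["default"],
    description="Toy entry for illustration only",
    tasks=["sentiment analysis"],
    languages=["ind"],
    license="",
    url="https://example.org/data",
    annotation_methods=["human"],
    annotation_validation="fully manually checked",
)
flagged = sheet.missing_fields()
```

A check of this kind would only catch empty fields; verifying that the stated license or URL is actually correct still needs the manual review described above.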
Standardizing dataloaders
For each approved datasheet, we created a standardized dataloader wrapper to facilitate ready-to-use data access since only 38.4% of the consolidated data sources were originally hosted on Hugging Face (https://huggingface.co/). To support diverse task types, we carefully designed the standardized seacrowd schema to support different data structures and modalities (see Appendix F). We also adhere to data provenance practices (Longpre et al., 2023) and document the relevant metadata (e.g., license) in the dataloaders. Furthermore, we engaged with data owners and successfully converted three private datasets into public ones.
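To make the idea of a standardized schema concrete, here is a minimal sketch of mapping two differently structured sources into one shared record layout. The function `to_unified_text`, the keys `id`, `text`, and `label`, and the toy records are assumptions made for illustration; the actual seacrowd schemas are described in Appendix F and may differ.

```python
from typing import Any, Dict, Iterable, List


def to_unified_text(
    records: Iterable[Dict[str, Any]],
    source: str,
    text_key: str,
    label_key: str,
) -> List[Dict[str, str]]:
    """Map source-specific records to a shared {id, text, label} layout."""
    unified = []
    for idx, rec in enumerate(records):
        unified.append(
            {
                "id": f"{source}-{idx}",             # stable, source-prefixed id
                "text": str(rec[text_key]).strip(),  # normalise whitespace
                "label": str(rec[label_key]).lower(),
            }
        )
    return unified


# Two toy sources with different field names (made-up examples).
corpus_a = [{"sentence": "Makanannya enak sekali ", "polarity": "Positive"}]
corpus_b = [{"review_text": "Pelayanan lambat", "sentiment": "negative"}]

merged = (
    to_unified_text(corpus_a, "corpus_a", "sentence", "polarity")
    + to_unified_text(corpus_b, "corpus_b", "review_text", "sentiment")
)
```

Once every source exposes the same keys, several datasets can be concatenated and evaluated with a single loop, which is the practical benefit of standardization for multi-dataset loading.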
These efforts have culminated in 498 datasheets in SEACrowd Catalogue and 399 dataloaders in SEACrowd Data Hub (§2.1). Notably, our centralized data repository covers 1,000 SEA languages, underscoring the extensive linguistic diversity captured by SEACrowd. We elaborate on the SEACrowd dataset statistics in §2.2. SEACrowd’s contribution guidelines, progression details, and reviewing procedure are in Appendix C, D, and E.
2.1 SEACrowd Catalogue & Data Hub
SEACrowd comprises two interconnected platforms: SEACrowd Catalogue (also available in CSV format) and SEACrowd Data Hub. These platforms work in tandem to consolidate the datasheet submissions and provide a standardized pipeline for SEACrowd. Specifically, Catalogue houses the datasheets (metadata), while Data Hub stores the standardized dataloaders and the seacrowd library (all code is available under Apache License 2.0) for the schemas and configurations (Appendix F). These systems share information on the datasheets and dataloaders, allowing users to seamlessly explore and utilize them.
![Refer to caption](x1.png)
![Refer to caption](x2.png)
![Refer to caption](x3.png)
![Refer to caption](x4.png)
![Refer to caption](x5.png)
Model | Gini |
Commercial | |
GPT-4 | 0.155 |
Command-R | 0.184 |
English | |
Mistral | 0.159 |
Llama3 | 0.131 |
Falcon | 0.238 |
Multilingual | |
mT0 | 0.131 |
BLOOMZ | 0.228 |
BactrianX-Llama | 0.163 |
AYA-23 | 0.183 |
AYA-101 | 0.095 |
SEA regional | |
SEA-LION | 0.204 |
SeaLLM v2.5 | 0.116 |
Sailor | 0.145 |
SEA country | |
Cendol-mT5 | 0.378 |
Cendol-Llama2 | 0.267 |
Merak v4 | 0.199 |
WangchanX-Llama3 | 0.153 |
Malaysian Llama3 | 0.179 |
2.2 Datasets in SEACrowd
SEACrowd consolidates 498 datasheets with diverse tasks in SEA languages and provides standardized access through dataloaders to 399 of them. As shown in Figure 1, approximately 81% of the datasets in SEACrowd are textual data, with the remaining 8% and 11% being VL and speech, respectively. The complete list of SEA indigenous languages covered by SEACrowd and their mapping to the relevant SEA regions are provided in Appendix K. Around 53% of the datasets have a commercially permissive license.
A total of 83 tasks are provided in SEACrowd with a breakdown of 66 in NLP (e.g., abusive language detection, intent classification, instruction tuning, named entity recognition, etc.), 10 in VL (image-to-text generation, sign language recognition, video captioning, etc.), and 7 in speech (e.g., automatic speech recognition, text-to-speech, speech emotion recognition, and others). These tasks are then standardized into 20 dataloader schemas described in Appendix F. Further discussion regarding resources in SEACrowd is in §5.1.
3 SEACrowd Benchmarks
To understand the capability of state-of-the-art models, we conduct comprehensive evaluations of existing LLMs, VLMs, and speech models from various architectures and training approaches. To construct a benchmark suite (https://github.com/SEACrowd/seacrowd-experiments), we select the subset of the datasets presented in §2.2 that has been manually annotated and/or validated. More details regarding the data subsets, baselines, and prompts used for the evaluations are given in Appendix G.1, G.2, and G.3.
3.1 Datasets
NLP
Our natural language understanding (NLU) benchmark consists of 131 data subsets and 7 tasks: sentiment analysis, topic classification, natural language inference (NLI), commonsense reasoning, exam-style multiple-choice question answering (QA), culture understanding, and reading comprehension. It covers English (eng) and 33 SEA indigenous languages.
We utilize 100 data subsets for the natural language generation (NLG) benchmark, which covers machine translation (MT) between English and SEA languages from both directions, summarization, as well as extractive or abstractive question answering, covering 27 SEA indigenous languages.
Speech
We employ 19 automatic speech recognition (ASR) data subsets to evaluate the capability of speech models in 15 SEA indigenous languages.
VL
We assess the models on image captioning using four data subsets in 4 SEA indigenous languages, i.e., Filipino (fil), Indonesian (ind), Thai (tha), and Vietnamese (vie). This disparity in the evaluation scale is due to the fact that only a few datasets in SEACrowd are VL datasets, and even fewer are annotated by humans.
3.2 Baselines
Complete details regarding the model architectures, model sizes, seen languages, corresponding publications, and other aspects are in Appendix G.2.
NLP
To evaluate the zero-shot performance of instruction-tuned LLMs on SEA languages, we benchmark two commercial baselines, i.e., GPT-4 OpenAI et al. (2024) and Command-R (https://docs.cohere.com/docs/command-r), and 17 open-source baselines, the majority of which have 7B-13B parameters. We categorize the open-source baselines according to the language(s) coverage in pre-training and/or instruction tuning, i.e., 1) English: Llama3 Touvron et al. (2023), Mistral Jiang et al. (2023), and Falcon Almazrouei et al. (2023); 2) Multilingual: AYA-101, AYA-23 Üstün et al. (2024), mT0, BLOOMZ Muennighoff et al. (2022), and BactrianX-Llama Li et al. (2023a); 3) SEA regional: SEA-LION Singapore (2023), Sailor Dou et al. (2024), and SeaLLM Nguyen et al. (2023); and 4) SEA country-specific: Cendol-mT5, Cendol-Llama2 Cahyawijaya et al. (2024b), and Merak Ichsan (2023) from Indonesia, WangchanX-Llama3 Phatthiyaphaibun et al. (2024) from Thailand, and Malaysian-Llama3 (https://huggingface.co/mesolitica/malaysian-llama-3-8b-instruct-16k) from Malaysia.
Speech
We evaluate the zero-shot performance of state-of-the-art multilingual pre-trained speech models in transcribing speech in SEA languages. Specifically, we consider Whisper v3 Radford et al. (2023), MMS 1B Pratap et al. (2024), and Seamless M4T v2 Communication et al. (2023), which have shown proficiency in accurately transcribing multiple languages without fine-tuning. Additionally, we include models that are fine-tuned on specific language(s), SEA or English, based on 1) Wav2Vec2 XLSR Conneau et al. (2021) and 2) XLS-R Babu et al. (2021), known for their cross-lingual speech representation learning by pre-training on raw speech waveforms across diverse languages, with XLS-R offering broader language coverage, and 3) Whisper, which leverages weakly supervised pre-training on spectrograms of speech in diverse languages. The following fine-tuned models are evaluated: XLSR on ind, jav, and sun; XLSR and Whisper on Indonesian (ind); XLSR and Whisper on Thai (tha); XLS-R on Tagalog (tgl); XLS-R on Burmese (mya); XLS-R and Whisper on Khmer (khm); and XLSR on English (eng). See Appendix G.2 for details.
VL
We consider state-of-the-art VLMs primarily trained on English pre-training and instruction-following data: LLaVA Liu et al. (2023b, a), InstructBLIP Dai et al. (2024), and Idefics2 Laurençon et al. (2024), and VLMs trained in a multilingual manner: mBLIP Geigle et al. (2023) and PaliGemma Gemma Team et al. (2024), to assess their image captioning ability in SEA languages.
3.3 Experimental Settings
We conduct all evaluations in a zero-shot fashion. We employ 3 prompt templates in English for each NLU task and 1 for each NLG task. We utilize the weighted F1 score to measure the model performance on NLU tasks and n-gram reference-based metrics, i.e., chrF++ Popović (2015, 2017) and ROUGE-L Lin (2004), on NLG tasks. As for VL, aside from a prompt template in English, we also use a prompt template in the respective SEA indigenous language per data subset. We report CIDEr Vedantam et al. (2015) for the image captioning task. For ASR, we use word error rate (WER) for languages with Latin script and character error rate (CER) for those with non-Latin script.
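For readers less familiar with the metrics named above, the following minimal sketch shows how word error rate, character error rate, and support-weighted F1 can be computed from scratch. It is an illustrative reference under simple assumptions (whitespace tokenization for WER, no text normalization), not the evaluation code used for the benchmark; the function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """WER when unit == "word" (whitespace tokens), otherwise CER."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        ref, hyp = list(reference), list(hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def weighted_f1(gold: List[str], pred: List[str]) -> float:
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(gold)
    total = 0.0
    for label, n in support.items():
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        total += n * (2 * tp / denom if denom else 0.0)
    return total / len(gold)


wer = error_rate("saya suka makan nasi", "saya suka nasi", unit="word")
cer = error_rate("abc", "abd", unit="char")
f1 = weighted_f1(["a", "a", "b"], ["a", "b", "b"])
```

Character-level scoring avoids the need for word segmentation, which is one reason CER is a natural choice for scripts that do not separate words with spaces.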
4 Result & Analysis
![Refer to caption](x6.png)
4.1 State-of-the-Art Models on SEA languages
LLMs
Figures 2(a) and 2(b) illustrate the overall model performance of the LLM baselines in SEA languages for both NLU tasks and NLG tasks. In our NLU evaluation, AYA-101, a large multilingual instruction-tuned language model covering 101 languages, demonstrates the best zero-shot performance. It is followed by the commercial baselines, which achieve a median weighted F1-score of 0.6. Sailor and SeaLLM, models specifically trained with SEA languages, also display competitive performance. Similarly, mT0 exhibits strong generalization abilities due to its exposure to 100 languages in pre-training, including those from the SEA region Muennighoff et al. (2022). In contrast, most English and SEA country-specific baselines perform less effectively, likely due to their narrow focus on English or a limited set of SEA languages, such as Indonesian languages for Cendol and Thai for WangchanX-Llama3. Similar and consistent trends are observed on the MT task, while the baselines’ poorer scores on abstractive/extractive QA and summarization indicate their ineffectiveness in producing acceptable outputs in SEA languages for these tasks, which is especially pronounced in the open-source baselines. Appendix G.4 describes the performance of LLMs per language.
To analyze the equality in model performance across SEA languages, following Khanuja et al. (2023), we utilize the Gini coefficient (originally used to observe income equality Dorfman (1979)), weighted by demand and parameterized by τ. Here, τ = 1 corresponds to a demographic notion of demand, considering language population size, while τ = 0 does not take population size into account Blasi et al. (2022). Table 1 shows that models trained on more SEA languages, such as multilingual and SEA regional baselines, generally exhibit greater language equity. For instance, although Command-R and GPT-4 are competitive performance-wise against AYA-101 and mT0, AYA-101 and mT0 demonstrate higher equality across all SEA languages under study. This trend is consistent across different values of τ (see Appendix G.5).
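A demand-weighted Gini coefficient of the kind described above can be sketched as follows. This is a generic Lorenz-curve implementation with optional weights, written under the assumption that the demand weight of a language is its population raised to a demand exponent (named `tau` in the code); it is not the authors' exact formulation, and the helper names are hypothetical.

```python
from typing import List, Optional, Sequence


def demand_weights(populations: Sequence[float], tau: float) -> List[float]:
    """Assumed demand model: weight = population ** tau (tau = 0 gives equal weights)."""
    return [p ** tau for p in populations]


def gini(values: Sequence[float], weights: Optional[Sequence[float]] = None) -> float:
    """Weighted Gini coefficient: 0 is perfect equality, values near 1 are highly unequal."""
    if weights is None:
        weights = [1.0] * len(values)
    pairs = sorted(zip(values, weights))          # ascending by value
    total_w = sum(w for _, w in pairs)
    total_vw = sum(v * w for v, w in pairs)
    if total_vw == 0:
        return 0.0                                # all values are zero
    cum_w = cum_vw = area = 0.0
    for v, w in pairs:
        prev_w, prev_vw = cum_w, cum_vw
        cum_w += w / total_w
        cum_vw += v * w / total_vw
        # Trapezoid under the Lorenz curve for this segment.
        area += (cum_w - prev_w) * (cum_vw + prev_vw) / 2
    return 1.0 - 2.0 * area


equal = gini([0.5, 0.5, 0.5])
skewed = gini([0.0, 0.0, 0.0, 1.0])
flat_weights = demand_weights([10.0, 1000.0], tau=0.0)
```

With per-language scores as `values`, a lower coefficient indicates that a model performs more evenly across languages, which is how the entries in Table 1 can be read.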
![Refer to caption](x7.png)
Speech models
Figure 3 presents the off-the-shelf speech model performance on ASR across languages in SEA, measured by the error rate percentage. 9 of the 15 SEA languages in our speech evaluation belong to the Austronesian language family. The other 6 are khm and vie, which belong to Austro-Asiatic, cnh and mya belong to Sino-Tibetan, and tha and vie belong to the Kra-Dai language family. The multilingual pre-trained baselines have a competitive generalization capability across languages, although it varies by language. For instance, Whisper v3 demonstrates significantly higher effectiveness for national languages such as ind, zlm, fil, tha, and vie, while performing less optimally for other indigenous languages. Conversely, Seamless M4T v2 shows a more balanced performance across the languages. Regarding fine-tuned baselines, error rates decrease for their seen languages. The fine-tuned Whisper models, however, manage to better optimize for the target language while retaining their original capabilities in other SEA languages compared to their Wav2Vec2 XLSR and XLS-R counterparts, despite both having been pre-trained in a multilingual manner. This observation aligns with the findings of Rouditchenko et al. (2023), who find that the number of hours seen per language and language family during pre-training is predictive of how the models compare, in which Whisper’s pre-training data duration for these four language families exceeds that of XLSR.
VLMs
Figure 4 depicts the zero-shot performance of off-the-shelf VLMs on image captioning in SEA indigenous languages. Despite the capability of LLMs for zero-shot cross-lingual generalization Huang et al. (2021); Täckström et al. (2012); Neubig and Hu (2018); Artetxe et al. (2020), VLMs trained only in English (i.e., InstructBLIP, LLaVA, and Idefics2) fail to exhibit this capability, struggling to generate adequate image captions in SEA languages. Multilingual VL pre-training is crucial to achieving aligned multilingual representations Burns et al. (2020); Li et al. (2023b); Huang et al. (2021). For instance, PaliGemma and mBLIP generate better image captions in tha and fil when prompted in the relevant SEA languages.
Model | Natural outputs |
SEA-LION | 58.57% |
AYA-23 | 43.57% |
Sailor | 37.86% |
Cendol-Llama2 | 37.37% |
Malaysian Llama3 | 36.90% |
WangchanX-Llama3 | 30.24% |
Falcon | 29.52% |
BactrianX-Llama | 28.10% |
SeaLLM | 27.38% |
Merak | 26.19% |
BLOOMZ | 25.00% |
Cendol-mT5 | 24.05% |
Command-R | 20.95% |
mT0-XL | 19.76% |
Mistral | 19.52% |
GPT-4 | 16.67% |
Llama3 | 14.05% |
AYA-101 | 8.33% |
Language | Natural outputs |
Indonesian (ind) | 41.58% |
Vietnamese (vie) | 37.31% |
Thai (tha) | 34.21% |
Khmer (khm) | 29.21% |
Lao (lao) | 28.42% |
Malay (zlm) | 22.24% |
Burmese (mya) | 19.47% |
Filipino (fil) | 12.22% |
English (eng)† | 8.95% |
![Refer to caption](x8.png)
![Refer to caption](x9.png)
![Refer to caption](x10.png)
However, when prompted in eng, the performance of these multilingual baselines varies notably. PaliGemma’s performance collapses completely, while mBLIP’s performance shows both increases and decreases across different SEA languages. This raises the question of whether the multilingual VLMs can maintain consistent performance across different languages used in the instructions and the tasks. It highlights the need for further research into the mechanisms that drive these variations and how to achieve robust multilingual performance in VLMs across diverse linguistic contexts. Understanding these dynamics is crucial for improving VLMs’ generalization capabilities and ensuring equitable performance across all languages, despite most related works focusing on monolingual visual instruction tuning Liu et al. (2023b); Gong et al. (2023); Zhu et al. (2024).
4.2 Generation Quality in SEA Languages: Translationese vs. Natural Language
Classifying Translationese in SEA Languages
To analyze the generation quality of LLMs in SEA languages, we build a text classifier to discriminate between translationese and natural texts Riley et al. (2020). We construct a translationese classification training and testing dataset using 49 and 62 data subsets, respectively, covering approximately 39.9k and 51.5k sentences across English (eng) and 8 SEA languages: Indonesian (ind), Khmer (khm), Lao (lao), Burmese (mya), Filipino (fil), Thai (tha), Vietnamese (vie), and Malay (zlm). The training and test data are detailed in Appendix H.1.
We fine-tune a classifier from mDeBERTaV3 He et al. (2020, 2022) (https://huggingface.co/microsoft/mdeberta-v3-base) using these data and achieve 79.08% accuracy on the test set in predicting translationese across these 9 languages. The detailed results and ablation studies of our translationese classifier experiments are provided in Appendix H.2. This classifier enables us to assess the generation quality of LLMs by distinguishing between translationese and naturally occurring text, providing insights into the models’ performance in producing authentic language output.
Generation Quality of LLMs
We evaluate the generation quality of LLMs in the 9 languages covered by the classifier (English and 8 SEA languages) by generating answers to natural, general, and safety questions from Sea-Bench Nguyen et al. (2023). As shown in Table 2(a), LLMs with extensive language coverage but less focus on SEA languages, e.g., AYA-101 Üstün et al. (2024), GPT-4 OpenAI et al. (2024), mT0 Muennighoff et al. (2023); Xue et al. (2021), and Llama3 AI@Meta (2024), tend to produce natural sentences less than 20% of the time. In contrast, models with narrower language coverage but a greater focus on SEA languages, such as Cendol-Llama2 Cahyawijaya et al. (2024b), Sailor Dou et al. (2024), AYA-23 Aryabumi et al. (2024), and SEA-LION Singapore (2023), generate natural sentences over 35% of the time.
However, even the LLM with the least translationese generation, SEA-LION, only produces natural SEA sentences 57.71% of the time, highlighting a significant quality gap in generating natural sentences in SEA languages. As displayed in Table 2(b), the translationese issue varies across SEA languages. Languages such as Tagalog (tgl), Burmese (mya), and Malay (zlm) have more severe translationese problems, with existing LLMs producing natural sentences only 11.58%, 19.47%, and 22.24% of the time, respectively. This underscores the need for further improvements in LLMs to more effectively address the linguistic diversity and complexity of SEA languages.
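The percentages discussed above are rates of classifier decisions aggregated per model or per language. A minimal sketch of that aggregation step is shown below, assuming each generated answer has already been labeled `"natural"` or `"translationese"` by a classifier. The record layout, the function name `natural_rate`, and the toy rows (with placeholder model names) are hypothetical and do not reproduce any reported result.

```python
from collections import defaultdict
from typing import Dict, Iterable


def natural_rate(predictions: Iterable[Dict[str, str]], key: str) -> Dict[str, float]:
    """Percentage of outputs labeled "natural", grouped by `key` (e.g. model or language)."""
    counts = defaultdict(lambda: [0, 0])   # group -> [natural count, total count]
    for row in predictions:
        counts[row[key]][0] += row["label"] == "natural"
        counts[row[key]][1] += 1
    return {group: 100.0 * nat / total for group, (nat, total) in counts.items()}


# Toy classifier outputs for two placeholder models.
toy_predictions = [
    {"model": "model_a", "language": "ind", "label": "natural"},
    {"model": "model_a", "language": "tha", "label": "translationese"},
    {"model": "model_b", "language": "ind", "label": "translationese"},
    {"model": "model_b", "language": "tha", "label": "translationese"},
]

by_model = natural_rate(toy_predictions, key="model")
by_language = natural_rate(toy_predictions, key="language")
```

Because the classifier itself is imperfect, rates computed this way are estimates and inherit its error.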
5 Discussions
5.1 Resource Gaps in SEA
Coverage
SEACrowd covers 980 out of the 1,308 languages spoken in SEA (74.9%). Despite this high coverage, language representation in SEACrowd exhibits a very long-tail distribution, with over 700 languages having only 1 or 2 datasets, and only 23 languages having 20 datasets or more. These less represented languages typically exist only in the form of lexicons Asgari et al. (2020); List et al. (2022) or unlabeled data Leong et al. (2022); Kudugunta et al. (2024); Nguyen et al. (2024). Existing tasks in SEACrowd still cover only a small portion of languages. For instance, sentiment analysis data is available for only 22 languages, and named entity recognition (NER) data is available for just 17 languages. Furthermore, for modalities beyond text, SEA resources are extremely underrepresented. Approximately 90% of SEA indigenous languages lack both speech and VL datasets.
![Refer to caption](x11.png)
![Refer to caption](x12.png)
Quality
78.7% of the datasets in SEACrowd are published in peer-reviewed venues, and most of the data has undergone external validation. The overall quality of the datasets in SEACrowd is depicted in Figure 5(b). We compile the reported data construction methods by the authors, considering both the data collection method (i.e., data source) and label annotation validation (i.e., quality control). Nearly 19% of the datasets in SEACrowd have machine-generated and machine-translated annotations, while more than 80% were obtained from online texts (e.g., web crawling) and expert generation. In terms of label annotation validation, 62.4% of the datasets have been fully manually checked, while the remaining portion is partially validated and automatically checked. Note that these statistics only provide an initial indication of dataset collection quality on the surface and do not necessarily reflect the exact quality. Only a few datasets (6%) in SEACrowd report their detailed quality metrics (e.g., inter-annotator agreement scores). A deeper investigation is required for future work.
Cultural Relevance
The resource gap in SEA extends to the cultural aspect, where misrepresentation can lead to offensive behaviors, e.g., cultural appropriation and stereotyping Evans et al. (2020); Glotov (2023). As a proxy of the cultural relevance of SEA datasets, we manually curated 259 data subsets used in SEACrowd evaluation based on their data source. Specifically, we categorize them by whether they are 1) translated from another language, 2) crawled from local sources, or 3) hand-crafted to capture cultural relevance. In Figure 5(c), approximately 70% lack cultural relevance, as many are machine-translated from English sources. About 20% are taken from local news, social media, or other local outlets, which potentially contain some culturally relevant data. Only the remaining 10% are designed to consider cultural relevance, derived from studies highlighting serious deficiencies in cultural understanding by LLMs for underrepresented languages Kabra et al. (2023); Koto et al. (2023a); Wibowo et al. (2023); Liu et al. (2024); Koto et al. (2024).
5.2 Conclusion & Future Work
Southeast Asia is home to highly diverse languages and cultures; the majority of its people do not use English as their primary language. The utility of English-first AI is limited for the majority of Southeast Asian users, especially in critical sectors like healthcare and education. Through SEACrowd, we have explored the AI landscape in SEA and bridged the gaps in resources, evaluation, and naturalness analysis of AI models in SEA languages. Further, our initiative has nurtured an open-source research community, which will actively continue to add and maintain datasheets and dataloaders, as well as drive AI research and developments in SEA.
Nonetheless, AI development in SEA requires concentrated efforts by a range of stakeholders, who may prioritize differently when it comes to incorporating the region’s 1,300+ languages into AI models. Moving forward, our work suggests AI development in SEA should prioritize two key metrics: 1) potential utility and 2) resource equity (https://github.com/SEACrowd/globalutility).
Potential utility
Potential utility is defined as the gap between current utility and ideal utility, in which model capability acts as a proxy for utility. Based on potential utility, unsurprisingly the development of the national languages (except for English and Chinese used in Singapore), i.e., Indonesian (ind), Burmese (mya), Vietnamese (vie), Thai (tha), Filipino (fil), Khmer (khm), Malay (zlm), and Lao (lao) in Figure 6, will bring the biggest benefit. Among them, we identify notable gaps in the naturalness of Malay, Burmese, and Filipino AI-generated outputs (§4.2). Focused efforts in resource building for these languages may move the needle the most for utility. Beyond the national languages, growing local languages or dialects with large speaker bases, e.g., Javanese (jav), Sundanese (sun), and Hmong (hmn), is key.
Resource equity
Resource equity is defined as the gap between existing and ideal resource availability (Figure 6). We found that many local languages or dialects still fall short of the expected level of resources. These include Northeastern Thai (tts), Northern Thai (nod), Hmong Do (hmv), Southern Thai (sou), Cebuano (ceb), Ilocano (ilo), and others. Efforts to narrow these gaps would not only help preserve these languages but also ensure the continuation of the cultural heritage of the speakers of these languages. More details on SEA language prioritization for different weightings of demand can be found in Appendix I.
To improve these metrics, governments and industry leaders in the region should invest in R&D activities to improve regional language capability for both the national languages and local dialects. This could include funding for open data collection and collaborations with local communities to address the resource gap in local languages. This also requires long-term sustainable strategies, such as catalyzing profitable use cases based on inclusive AI models, promoting fair and responsible compensation schemes for data workers, and orchestrating win-win exemplar collaborations between data owners, AI, and application developers.
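The two metrics above are both gaps between a current level and an ideal level. One way to turn such gaps into a prioritization is sketched below, under the assumptions that the ideal level is 1.0, that current levels are normalized to [0, 1], and that each language's gap is scaled by a demand share proportional to population raised to an exponent `tau`. These modeling choices, the function names, and the toy numbers are illustrative assumptions and are not taken from the released analysis code.

```python
from typing import Dict


def demand_shares(populations: Dict[str, float], tau: float) -> Dict[str, float]:
    """Normalized demand per language, assumed proportional to population ** tau."""
    raw = {lang: pop ** tau for lang, pop in populations.items()}
    total = sum(raw.values())
    return {lang: value / total for lang, value in raw.items()}


def weighted_gaps(
    current: Dict[str, float],
    populations: Dict[str, float],
    tau: float = 1.0,
    ideal: float = 1.0,
) -> Dict[str, float]:
    """Demand-weighted gap to the ideal level, sorted from largest to smallest."""
    shares = demand_shares(populations, tau)
    gaps = {lang: shares[lang] * max(ideal - level, 0.0) for lang, level in current.items()}
    return dict(sorted(gaps.items(), key=lambda item: item[1], reverse=True))


# Toy inputs: placeholder language names, made-up populations and utility levels.
toy_populations = {"lang_a": 300.0, "lang_b": 100.0}
toy_utility = {"lang_a": 0.6, "lang_b": 0.2}

demographic = weighted_gaps(toy_utility, toy_populations, tau=1.0)
linguistic = weighted_gaps(toy_utility, toy_populations, tau=0.0)
```

In this toy setting the larger language ranks first when population counts (`tau=1.0`), while the less-served language ranks first when every language is weighted equally (`tau=0.0`), which illustrates why stakeholders with different notions of demand may prioritize differently.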
Acknowledgments
We would like to thank our amazing contributors: Joshua Spergel, Tiezheng Yu, Parinthapat Pengpun, Bin Wang, Ishan Jindal, Muhammad Satrio, Jipeng Zhang, Bhavish Pahwa, Haryo Akbarianto Wibowo, Hiroki Nomoto, Yohanes Sigit Purnomo W.P., Ahmad Fathan Hidayatullah, Bryan Wilie, Ruhiyah Faradishi Widiaputri, Rafif Rabbani, Fawwaz Mayda, Manoj Khatri, Supryadi Supryadi, Virach Sornlertlamvanich, Pavaris Ruangchutiphophan, Erland Hilman Fuadi, Mega Fransiska, Richardy Sapan, and Camilla Johnine Cosme for their hard work in submitting datasheets and implementing dataloaders for SEACrowd.
This work is supported by the National Research Foundation, Singapore under its AI Singapore Programme; PhD Fellowship Award, the Hong Kong University of Science and Technology; and PF20-43679 Hong Kong PhD Fellowship Scheme, Research Grant Council, Hong Kong. JMI is funded by National University Philippines and the UKRI Centre for Doctoral Training in Accountable, Responsible and Transparent AI [EP/S023437/1] of the University of Bath. In addition, we would like to express our gratitude to Cohere For AI for providing research grants that enabled us to perform our experiments using a commercial baseline, specifically Command-R.
Limitations
While our work covers nearly 1,000 SEA languages, many dialects that are considered to belong to a parent language are missing from our evaluation benchmark. For instance, for the Malay language, only Standard Malay (zsm) is evaluated, but not other dialects such as Sarawak Malay (zlm-sar). Furthermore, the majority of our datasets do not contain code-switched texts, although code-switching is a common linguistic phenomenon in SEA language usage (Aji et al., 2023). Moreover, the language coverage of different evaluation tasks varies significantly. For instance, NLP tasks cover 34 languages in total, whereas VL tasks only cover 4 languages.
Ethics Statement
In developing an evaluation benchmark for SEA languages, we have taken several steps to ensure ethical considerations are addressed comprehensively. First, the data used for this benchmark is sourced from publicly available resources, ensuring compliance with legal and ethical standards regarding data privacy. Where applicable, explicit consent was obtained from data contributors. Furthermore, all the datasets and resources utilized in this benchmark are used in accordance with their respective licenses. Second, our benchmark aims to be inclusive, representing a wide range of SEA languages, including those that are underrepresented in current linguistic resources. Lastly, our research process, including data collection, benchmark development, and evaluation methodologies, is entirely open-sourced and is documented transparently to enable reproducibility and accountability.
References
- Adelani et al. (2022a) David Adelani, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P. Dossou, Chris Emezue, Colin Leong, Michael Beukman, Shamsuddeen Muhammad, Guyo Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Peter Wairagala, Muhammad Umair Nasir, Benjamin Ajibade, Tunde Ajayi, Yvonne Gitau, Jade Abbott, Mohamed Ahmed, Millicent Ochieng, Anuoluwapo Aremu, Perez Ogayo, Jonathan Mukiibi, Fatoumata Ouoba Kabore, Godson Kalipe, Derguene Mbaye, Allahsera Auguste Tapo, Victoire Memdjokam Koagne, Edwin Munkoh-Buabeng, Valencia Wagner, Idris Abdulmumin, Ayodele Awokoya, Happy Buzaaba, Blessing Sibanda, Andiswa Bukula, and Sam Manthalu. 2022a. A few thousand translations go a long way! leveraging pre-trained models for African news translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3053–3070, Seattle, United States. Association for Computational Linguistics.
- Adelani et al. (2024) David Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba Alabi, Yanke Mao, Haonan Gao, and En-Shiun Lee. 2024. SIB-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 226–245, St. Julian’s, Malta. Association for Computational Linguistics.
- Adelani et al. (2022b) David Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson Kalipe, Derguene Mbaye, Amelia Taylor, Fatoumata Kabore, Chris Chinenye Emezue, Anuoluwapo Aremu, Perez Ogayo, Catherine Gitau, Edwin Munkoh-Buabeng, Victoire Memdjokam Koagne, Allahsera Auguste Tapo, Tebogo Macucwa, Vukosi Marivate, Mboning Tchiaze Elvis, Tajuddeen Gwadabe, Tosin Adewumi, Orevaoghene Ahia, and Joyce Nakatumba-Nabende. 2022b. MasakhaNER 2.0: Africa-centric transfer learning for named entity recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4488–4508, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Adelani et al. (2021) David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D’souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen H. Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Aremu Anuoluwapo, Catherine Gitau, Derguene Mbaye, Jesujoba Alabi, Seid Muhie Yimam, Tajuddeen Rabiu Gwadabe, Ignatius Ezeani, Rubungo Andre Niyongabo, Jonathan Mukiibi, Verrah Otiende, Iroro Orife, Davis David, Samba Ngom, Tosin Adewumi, Paul Rayson, Mofetoluwa Adeyemi, Gerald Muriuki, Emmanuel Anebi, Chiamaka Chukwuneke, Nkiruka Odu, Eric Peter Wairagala, Samuel Oyerinde, Clemencia Siro, Tobius Saul Bateesa, Temilola Oloyede, Yvonne Wambui, Victor Akinode, Deborah Nabagereka, Maurice Katusiime, Ayodele Awokoya, Mouhamadane MBOUP, Dibora Gebreyohannes, Henok Tilaye, Kelechi Nwaike, Degaga Wolde, Abdoulaye Faye, Blessing Sibanda, Orevaoghene Ahia, Bonaventure F. P. Dossou, Kelechi Ogueji, Thierno Ibrahima DIOP, Abdoulaye Diallo, Adewale Akinfaderin, Tendai Marengereke, and Salomey Osei. 2021. MasakhaNER: Named entity recognition for African languages. Transactions of the Association for Computational Linguistics, 9:1116–1131.
- Adelani et al. (2023) David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, Sana Al-azzawi, Blessing Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, Abraham Owodunni, Nnaemeka Obiefuna, Muhidin Mohamed, Shamsuddeen Hassan Muhammad, Teshome Mulugeta Ababu, Saheed Abdullahi Salahudeen, Mesay Gemeda Yigezu, Tajuddeen Gwadabe, Idris Abdulmumin, Mahlet Taye, Oluwabusayo Awoyomi, Iyanuoluwa Shode, Tolulope Adelani, Habiba Abdulganiyu, Abdul-Hakeem Omotayo, Adetola Adeeko, Abeeb Afolabi, Anuoluwapo Aremu, Olanrewaju Samuel, Clemencia Siro, Wangari Kimotho, Onyekachi Ogbu, Chinedu Mbonu, Chiamaka Chukwuneke, Samuel Fanijo, Jessica Ojo, Oyinkansola Awosan, Tadesse Kebede, Toadoum Sari Sakayo, Pamela Nyatsine, Freedmore Sidume, Oreen Yousuf, Mardiyyah Oduwole, Kanda Tshinu, Ussen Kimanuka, Thina Diko, Siyanda Nxakama, Sinodos Nigusse, Abdulmejid Johar, Shafie Mohamed, Fuad Mire Hassan, Moges Ahmed Mehamed, Evrard Ngabire, Jules Jules, Ivan Ssenkungu, and Pontus Stenetorp. 2023. MasakhaNEWS: News topic classification for African languages. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 144–159, Nusa Dua, Bali. Association for Computational Linguistics.
- Adilazuarda et al. (2023) Muhammad Farid Adilazuarda, Samuel Cahyawijaya, and Ayu Purwarianti. 2023. The obscure limitation of modular multilingual language models. ICLR Tiny Papers 2023.
- Adilazuarda et al. (2024) Muhammad Farid Adilazuarda, Sagnik Mukherjee, Pradhyumna Lavania, Siddhant Singh, Ashutosh Dwivedi, Alham Fikri Aji, Jacki O’Neill, Ashutosh Modi, and Monojit Choudhury. 2024. Towards measuring and modeling "culture" in llms: A survey. Preprint, arXiv:2403.15412.
- AI@Meta (2024) AI@Meta. 2024. Llama 3 model card.
- Aji et al. (2023) Alham Fikri Aji, Jessica Zosa Forde, Alyssa Marie Loo, Lintang Sutawika, Skyler Wang, Genta Indra Winata, Zheng-Xin Yong, Ruochen Zhang, A. Seza Doğruöz, Yin Lin Tan, and Jan Christian Blaise Cruz. 2023. Current status of NLP in south East Asia with insights from multilingualism and language diversity. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Tutorial Abstract, pages 8–13, Nusa Dua, Bali. Association for Computational Linguistics.
- Aji et al. (2022) Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya, Ade Romadhony, Rahmad Mahendra, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Timothy Baldwin, Jey Han Lau, and Sebastian Ruder. 2022. One country, 700+ languages: NLP challenges for underrepresented languages and dialects in Indonesia. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7226–7249, Dublin, Ireland. Association for Computational Linguistics.
- AlKhamissi et al. (2024) Badr AlKhamissi, Muhammad ElNokrashy, Mai AlKhamissi, and Mona Diab. 2024. Investigating cultural alignment of large language models. Preprint, arXiv:2402.13231.
- Almazrouei et al. (2023) Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. Falcon-40B: an open large language model with state-of-the-art performance.
- Ardila et al. (2020) Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. 2020. Common voice: A massively-multilingual speech corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222, Marseille, France. European Language Resources Association.
- Artetxe et al. (2020) Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, Online. Association for Computational Linguistics.
- Aryabumi et al. (2024) Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Kelly Marchisio, Sebastian Ruder, Acyr Locatelli, Julia Kreutzer, Nick Frosst, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. 2024. Aya 23: Open weight releases to further multilingual progress. Preprint, arXiv:2405.15032.
- Asai et al. (2023) Akari Asai, Sneha Kudugunta, Xinyan Velocity Yu, Terra Blevins, Hila B Gonen, Machel Reid, Yulia Tsvetkov, Sebastian Ruder, and Hannaneh Hajishirzi. 2023. BUFFET: Benchmarking large language models for cross-lingual few-shot transfer. Preprint, arXiv:2305.14857.
- Asgari et al. (2020) Ehsaneddin Asgari, Fabienne Braune, Benjamin Roth, Christoph Ringlstetter, and Mohammad Mofrad. 2020. UniSent: Universal adaptable sentiment lexica for 1000+ languages. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4113–4120, Marseille, France. European Language Resources Association.
- Astuti et al. (2023) Laksmita Widya Astuti, Yunita Sari, and Suprapto. 2023. Code-mixed sentiment analysis using transformer for twitter social media data. International Journal of Advanced Computer Science and Applications, 14(10).
- Babu et al. (2021) Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. 2021. Xls-r: Self-supervised cross-lingual speech representation learning at scale. Preprint, arXiv:2111.09296.
- Bandarkar et al. (2023) Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. 2023. The belebele benchmark: a parallel reading comprehension dataset in 122 language variants. arXiv preprint arXiv:2308.16884.
- Blasi et al. (2022) Damian Blasi, Antonios Anastasopoulos, and Graham Neubig. 2022. Systematic inequalities in language technology performance across the world’s languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5486–5505, Dublin, Ireland. Association for Computational Linguistics.
- Burns et al. (2020) Andrea Burns, Donghyun Kim, Derry Wijaya, Kate Saenko, and Bryan A Plummer. 2020. Learning to scale multilingual representations for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 197–213. Springer.
- Cahyawijaya et al. (2022) Samuel Cahyawijaya, Alham Fikri Aji, Holy Lovenia, Genta Indra Winata, Bryan Wilie, Rahmad Mahendra, Fajri Koto, David Moeljadi, Karissa Vincentio, Ade Romadhony, and Ayu Purwarianti. 2022. Nusacrowd: A call for open and reproducible nlp research in indonesian languages. Preprint, arXiv:2207.10524.
- Cahyawijaya et al. (2024a) Samuel Cahyawijaya, Delong Chen, Yejin Bang, Leila Khalatbari, Bryan Wilie, Ziwei Ji, Etsuko Ishii, and Pascale Fung. 2024a. High-dimension human value representation in large language models. arXiv preprint arXiv:2404.07900.
- Cahyawijaya et al. (2023a) Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Winata, Bryan Wilie, Fajri Koto, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Jennifer Santoso, David Moeljadi, Cahya Wirawan, Frederikus Hudi, Muhammad Satrio Wicaksono, Ivan Parmonangan, Ika Alfina, Ilham Firdausi Putra, Samsul Rahmadani, Yulianti Oenang, Ali Septiandri, James Jaya, Kaustubh Dhole, Arie Suryani, Rifki Afina Putri, Dan Su, Keith Stevens, Made Nindyatama Nityasya, Muhammad Adilazuarda, Ryan Hadiwijaya, Ryandito Diandaru, Tiezheng Yu, Vito Ghifari, Wenliang Dai, Yan Xu, Dyah Damapuspita, Haryo Wibowo, Cuk Tho, Ichwanul Karo Karo, Tirana Fatyanosa, Ziwei Ji, Graham Neubig, Timothy Baldwin, Sebastian Ruder, Pascale Fung, Herry Sujaini, Sakriani Sakti, and Ayu Purwarianti. 2023a. NusaCrowd: Open source initiative for Indonesian NLP resources. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13745–13818, Toronto, Canada. Association for Computational Linguistics.
- Cahyawijaya et al. (2023b) Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Dea Adhista, Emmanuel Dave, Sarah Oktavianti, Salsabil Akbar, Jhonson Lee, Nuur Shadieq, Tjeng Wawan Cenggoro, Hanung Linuwih, Bryan Wilie, Galih Muridan, Genta Winata, David Moeljadi, Alham Fikri Aji, Ayu Purwarianti, and Pascale Fung. 2023b. NusaWrites: Constructing high-quality corpora for underrepresented and extremely low-resource languages. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 921–945, Nusa Dua, Bali. Association for Computational Linguistics.
- Cahyawijaya et al. (2024b) Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Afina Putri, Emmanuel Dave, Jhonson Lee, Nuur Shadieq, Wawan Cenggoro, Salsabil Maulana Akbar, Muhammad Ihza Mahendra, Dea Annisayanti Putri, Bryan Wilie, Genta Indra Winata, Alham Fikri Aji, Ayu Purwarianti, and Pascale Fung. 2024b. Cendol: Open instruction-tuned generative large language models for indonesian languages. Preprint, arXiv:2404.06138.
- Cahyawijaya et al. (2021) Samuel Cahyawijaya, Genta Indra Winata, Bryan Wilie, Karissa Vincentio, Xiaohong Li, Adhiguna Kuncoro, Sebastian Ruder, Zhi Yuan Lim, Syafri Bahar, Masayu Khodra, Ayu Purwarianti, and Pascale Fung. 2021. IndoNLG: Benchmark and resources for evaluating Indonesian natural language generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8875–8898, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Catapang and Visperas (2023) Jasper Kyle Catapang and Moses Visperas. 2023. Emotion-based morality in Tagalog and English scenarios (EMoTES-3K): A parallel corpus for explaining (im)morality of actions. In Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages, pages 1–6, Tokyo, Japan. Association for Computational Linguistics.
- Communication et al. (2023) Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-jussà, Onur Celebi, Maha Elbayad, Cynthia Gao, Francisco Guzmán, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, and Skyler Wang. 2023. Seamlessm4t: Massively multilingual & multimodal machine translation. Preprint, arXiv:2308.11596.
- Conneau et al. (2021) Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2021. Unsupervised Cross-Lingual Representation Learning for Speech Recognition. In Proc. Interspeech 2021, pages 2426–2430.
- Conneau et al. (2022) Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, and Melvin Johnson. 2022. XTREME-S: Evaluating Cross-lingual Speech Representations. In Proc. Interspeech 2022, pages 3248–3252.
- Conneau et al. (2018) Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.
- Costa-jussà et al. (2024) Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Jeff Wang, and N. L. L. B. Team. 2024. Scaling neural machine translation to 200 languages. Nature.
- Dabre et al. (2022) Raj Dabre, Himani Shrotriya, Anoop Kunchukuttan, Ratish Puduppully, Mitesh Khapra, and Pratyush Kumar. 2022. IndicBART: A pre-trained model for indic natural language generation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1849–1863, Dublin, Ireland. Association for Computational Linguistics.
- Dac Lai et al. (2023) Viet Dac Lai, Chien Van Nguyen, Nghia Trung Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan A Rossi, and Thien Huu Nguyen. 2023. Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. arXiv e-prints, pages arXiv–2307.
- Dai et al. (2024) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2024. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36.
- Dorfman (1979) Robert Dorfman. 1979. A formula for the gini coefficient. The review of economics and statistics, pages 146–149.
- Dou et al. (2024) Longxu Dou, Qian Liu, Guangtao Zeng, Jia Guo, Jiahui Zhou, Wei Lu, and Min Lin. 2024. Sailor: Open language models for south-east asia. Preprint, arXiv:2404.03608.
- Dryer and Haspelmath (2013) Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online (v2020.3). Zenodo.
- Durmus et al. (2023) Esin Durmus, Karina Nguyen, Thomas I Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, et al. 2023. Towards measuring the representation of subjective global opinions in language models. arXiv preprint arXiv:2306.16388.
- Eberhard et al. (2021) David M. Eberhard, Gary F. Simons, and Charles D. Fennig. 2021. Ethnologue: Languages of the World. Twenty-fourth edition. Dallas, Texas: SIL International.
- Ebrahimi et al. (2022) Abteen Ebrahimi, Manuel Mager, Arturo Oncevay, Vishrav Chaudhary, Luis Chiruzzo, Angela Fan, John Ortega, Ricardo Ramos, Annette Rios, Ivan Vladimir Meza Ruiz, Gustavo Giménez-Lugo, Elisabeth Mager, Graham Neubig, Alexis Palmer, Rolando Coto-Solano, Thang Vu, and Katharina Kann. 2022. AmericasNLI: Evaluating zero-shot natural language understanding of pretrained multilingual models in truly low-resource languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6279–6299, Dublin, Ireland. Association for Computational Linguistics.
- Elias (2018) Alexander Elias. 2018. Lio and the central flores languages. Leiden: Leiden University Master thesis.
- Evans et al. (2020) Leanne M Evans, Crystasany R Turner, and Kelly R Allen. 2020. "Good teachers" with "good intentions": Misappropriations of culturally responsive pedagogy. Journal of Urban Learning, Teaching, and Research, 15(1):51–73.
- Federmann et al. (2022) Christian Federmann, Tom Kocmi, and Ying Xin. 2022. NTREX-128 – news test references for MT evaluation of 128 languages. In Proceedings of the First Workshop on Scaling Up Multilingual Evaluation, pages 21–24, Online. Association for Computational Linguistics.
- Gebru et al. (2021) Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. Datasheets for datasets. Communications of the ACM, 64(12):86–92.
- Geigle et al. (2023) Gregor Geigle, Abhay Jain, Radu Timofte, and Goran Glavaš. 2023. mblip: Efficient bootstrapping of multilingual vision-llms. arXiv, abs/2307.06930.
- Gemma Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu hui Chen, Zafarali Ahmed, Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet, Oriol Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, and Kathleen Kenealy. 2024. Gemma: Open models based on gemini research and technology. Preprint, arXiv:2403.08295.
- Glotov (2023) Sergei Glotov. 2023. Intercultural film literacy education against cultural misrepresentation: Finnish visual art teachers’ perspectives. Journal of Media Literacy Education, 15(1):31–43.
- Gong et al. (2023) Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. 2023. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790.
- Hammarström et al. (2024) Harald Hammarström, Robert Forkel, Martin Haspelmath, and Sebastian Bank. 2024. Glottolog 5.0. Leipzig: Max Planck Institute for Evolutionary Anthropology.
- Hasan et al. (2021) Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar. 2021. XL-sum: Large-scale multilingual abstractive summarization for 44 languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4693–4703, Online. Association for Computational Linguistics.
- He et al. (2022) Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2022. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. In The Eleventh International Conference on Learning Representations.
- He et al. (2020) Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations.
- Huang et al. (2021) Po-Yao Huang, Mandela Patrick, Junjie Hu, Graham Neubig, Florian Metze, and Alexander Hauptmann. 2021. Multilingual multimodal pre-training for zero-shot cross-lingual transfer of vision-language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2443–2459, Online. Association for Computational Linguistics.
- Huynh et al. (2022) Tin Van Huynh, Kiet Van Nguyen, and Ngan Luu-Thuy Nguyen. 2022. ViNLI: A Vietnamese corpus for studies on open-domain natural language inference. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3858–3872, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- Ichsan (2023) Muhammad Ichsan. 2023. Merak-7b: The llm for bahasa indonesia. Hugging Face Repository.
- Imperial et al. (2019) Joseph Marvin Imperial, Jeyrome Orosco, Shiela Mae Mazo, and Lany Maceda. 2019. Sentiment analysis of typhoon related tweets using standard and bidirectional recurrent neural networks. arXiv preprint arXiv:1908.01765.
- Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
- Jiang et al. (2022) Shengyi Jiang, Sihui Fu, Nankai Lin, and Yingwen Fu. 2022. Pretrained models and evaluation data for the khmer language. Tsinghua Science and Technology, 27(4):709–718.
- Joshi et al. (2020) Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.
- Juan et al. (2015) Sarah Samson Juan, Laurent Besacier, Benjamin Lecouteux, and Mohamed Dyab. 2015. Using resources from a closely-related language to develop asr for a very under-resourced language: A case study for iban. In Proceedings of INTERSPEECH, Dresden, Germany.
- Kabra et al. (2023) Anubha Kabra, Emmy Liu, Simran Khanuja, Alham Fikri Aji, Genta Winata, Samuel Cahyawijaya, Anuoluwapo Aremu, Perez Ogayo, and Graham Neubig. 2023. Multi-lingual and multi-cultural figurative language understanding. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8269–8284, Toronto, Canada. Association for Computational Linguistics.
- Kakwani et al. (2020) Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. 2020. IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4948–4961, Online. Association for Computational Linguistics.
- Karo et al. (2022) Ichwanul Muslim Karo Karo, Mohd Farhan Md Fudzee, Shahreen Kasim, and Azizul Azhar Ramli. 2022. Sentiment analysis in karonese tweet using machine learning. Indonesian Journal of Electrical Engineering and Informatics (IJEEI), 10(1):219–231.
- Khanuja et al. (2023) Simran Khanuja, Sebastian Ruder, and Partha Talukdar. 2023. Evaluating the diversity, equity, and inclusion of NLP technology: A case study for Indian languages. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1763–1777, Dubrovnik, Croatia. Association for Computational Linguistics.
- Koto et al. (2023a) Fajri Koto, Nurul Aisyah, Haonan Li, and Timothy Baldwin. 2023a. Large language models only pass primary school exams in Indonesia: A comprehensive test on IndoMMLU. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore. Association for Computational Linguistics.
- Koto et al. (2023b) Fajri Koto, Nurul Aisyah, Haonan Li, and Timothy Baldwin. 2023b. Large language models only pass primary school exams in Indonesia: A comprehensive test on IndoMMLU. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12359–12374, Singapore. Association for Computational Linguistics.
- Koto et al. (2022) Fajri Koto, Timothy Baldwin, and Jey Han Lau. 2022. Cloze evaluation for deeper understanding of commonsense stories in Indonesian. In Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022), pages 8–16, Dublin, Ireland. Association for Computational Linguistics.
- Koto and Koto (2020) Fajri Koto and Ikhwan Koto. 2020. Towards computational linguistics in Minangkabau language: Studies on sentiment analysis and machine translation. In Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation, pages 138–148, Hanoi, Vietnam. Association for Computational Linguistics.
- Koto et al. (2024) Fajri Koto, Rahmad Mahendra, Nurul Aisyah, and Timothy Baldwin. 2024. Indoculture: Exploring geographically-influenced cultural commonsense reasoning across eleven indonesian provinces. Preprint, arXiv:2404.01854.
- Kudugunta et al. (2024) Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. 2024. Madlad-400: a multilingual and document-level large audited dataset. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA. Curran Associates Inc.
- Kumar et al. (2022) Aman Kumar, Himani Shrotriya, Prachi Sahu, Amogh Mishra, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Mitesh M. Khapra, and Pratyush Kumar. 2022. IndicNLG benchmark: Multilingual datasets for diverse NLG tasks in Indic languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5363–5394, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Laurençon et al. (2024) Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. 2024. What matters when building vision-language models? Preprint, arXiv:2405.02246.
- Le and Luu (2023) Thang Le and Anh Luu. 2023. A parallel corpus for Vietnamese central-northern dialect text transfer. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13839–13855, Singapore. Association for Computational Linguistics.
- Leong et al. (2022) Colin Leong, Joshua Nemecek, Jacob Mansdorfer, Anna Filighera, Abraham Owodunni, and Daniel Whitenack. 2022. Bloom library: Multimodal datasets in 300+ languages for a variety of downstream tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8608–8621, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Leong et al. (2023) Wei Qi Leong, Jian Gang Ngui, Yosephine Susanto, Hamsawardhini Rengarajan, Kengatharaiyer Sarveswaran, and William Chandra Tjhi. 2023. Bhasa: A holistic southeast asian linguistic and cultural evaluation suite for large language models. arXiv preprint arXiv:2309.06085.
- Li et al. (2023a) Haonan Li, Fajri Koto, Minghao Wu, Alham Fikri Aji, and Timothy Baldwin. 2023a. Bactrian-x: A multilingual replicable instruction-following model with low-rank adaptation. arXiv preprint arXiv:2305.15011.
- Li et al. (2023b) Zejun Li, Zhihao Fan, Jingjing Chen, Qi Zhang, Xuanjing Huang, and Zhongyu Wei. 2023b. Unifying cross-lingual and cross-modal modeling towards weakly supervised multilingual vision-language pre-training. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5939–5958, Toronto, Canada. Association for Computational Linguistics.
- Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
- Lin et al. (2022) Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, and Xian Li. 2022. Few-shot learning with multilingual generative language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9019–9052, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- List et al. (2022) Johann-Mattis List, Robert Forkel, Simon J. Greenhill, Christoph Rzymski, Johannes Englisch, and Russell D. Gray. 2022. Lexibank, a public repository of standardized wordlists with computed phonological and lexical features. Scientific Data, 9(1):316.
- Liu et al. (2024) Chen Cecilia Liu, Fajri Koto, Timothy Baldwin, and Iryna Gurevych. 2024. Are multilingual llms culturally-diverse reasoners? an investigation into multicultural proverbs and sayings. Preprint, arXiv:2309.08591.
- Liu et al. (2023a) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023a. Improved baselines with visual instruction tuning.
- Liu et al. (2023b) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023b. Visual instruction tuning. In NeurIPS.
- Longpre et al. (2021) Shayne Longpre, Yi Lu, and Joachim Daiber. 2021. MKQA: A linguistically diverse benchmark for multilingual open domain question answering. Transactions of the Association for Computational Linguistics, 9:1389–1406.
- Longpre et al. (2023) Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, et al. 2023. The data provenance initiative: A large scale audit of dataset licensing & attribution in ai. arXiv preprint arXiv:2310.16787.
- Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.
- Mager et al. (2021) Manuel Mager, Arturo Oncevay, Annette Rios, Ivan Vladimir Meza Ruiz, Alexis Palmer, Graham Neubig, and Katharina Kann, editors. 2021. Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas. Association for Computational Linguistics, Online.
- Mahendra et al. (2021) Rahmad Mahendra, Alham Fikri Aji, Samuel Louvan, Fahrurrozi Rahman, and Clara Vania. 2021. IndoNLI: A natural language inference dataset for Indonesian. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10511–10527, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Muennighoff et al. (2023) Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2023. Crosslingual generalization through multitask finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15991–16111, Toronto, Canada. Association for Computational Linguistics.
- Muennighoff et al. (2022) Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.
- Muzad and Rahutomo (2016) Aad Muzad and Faisal Rahutomo. 2016. Korpus berita daring bahasa indonesia dengan depth first focused crawling. Prosiding Sentrinov (Seminar Nasional Terapan Riset Inovatif), 2(1):11–20.
- Neubig and Hu (2018) Graham Neubig and Junjie Hu. 2018. Rapid adaptation of neural machine translation to new languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 875–880, Brussels, Belgium. Association for Computational Linguistics.
- Nguyen et al. (2020) Kiet Nguyen, Vu Nguyen, Anh Nguyen, and Ngan Nguyen. 2020. A Vietnamese dataset for evaluating machine reading comprehension. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2595–2605, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- Nguyen et al. (2024) Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. 2024. CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4226–4237, Torino, Italia. ELRA and ICCL.
- Nguyen et al. (2023) Xuan-Phi Nguyen, Wenxuan Zhang, Li Xin, Mahani Aljunied, Weiwen Xu, Hou Pong Chan, Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, and Lidong Bing. 2023. Seallms - large language models for southeast asia. Preprint, arXiv:2312.00738.
- OpenAI et al. (2024) OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel 
Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. 
Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. 2024. Gpt-4 technical report. Preprint, arXiv:2303.08774.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Palen-Michel and Lignos (2023) Chester Palen-Michel and Constantine Lignos. 2023. LR-sum: Summarization for less-resourced languages. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6829–6844, Toronto, Canada. Association for Computational Linguistics.
- Phatthiyaphaibun et al. (2023) Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, and Can Udomcharoenchaikit. 2023. PyThaiNLP: Thai natural language processing in python. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 25–36, Singapore. Association for Computational Linguistics.
- Phatthiyaphaibun et al. (2024) Wannaphong Phatthiyaphaibun, Surapon Nonesung, Patomporn Payoungkhamdee, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Jitkapat Sawatphol, Chompakorn Chaksangchaichot, Ekapol Chuangsuwanich, and Sarana Nutanong. 2024. Wangchanlion and wangchanx mrc eval. Preprint, arXiv:2403.16127.
- Ponti et al. (2020) Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020. XCOPA: A multilingual dataset for causal commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2362–2376, Online. Association for Computational Linguistics.
- Popović (2015) Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.
- Popović (2017) Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.
- Pratap et al. (2024) Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. 2024. Scaling speech technology to 1,000+ languages. Journal of Machine Learning Research, 25(97):1–52.
- Project (2024) The Joshua Project. 2024. The joshua project.
- Purwarianti and Crisdayanti (2019) Ayu Purwarianti and Ida Ayu Putu Ari Crisdayanti. 2019. Improving bi-lstm performance for indonesian sentiment analysis using paragraph vector. In 2019 International Conference of Advanced Informatics: Concepts, Theory and Applications (ICAICTA), pages 1–5. IEEE.
- Purwarianti et al. (2007) Ayu Purwarianti, Masatoshi Tsuchiya, and Seiichi Nakagawa. 2007. A machine learning approach for Indonesian question answering system. In Artificial Intelligence and Applications, pages 573–578.
- Putra et al. (2024) I Made Suwija Putra, Daniel Siahaan, and Ahmad Saikhu. 2024. Snli indo: A recognizing textual entailment dataset in indonesian derived from the stanford natural language inference dataset. Data in Brief, 52:109998.
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.
- Radford et al. (2023) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine Mcleavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 28492–28518. PMLR.
- Riccosan and Saputra (2023) Riccosan and Karen Etania Saputra. 2023. Multilabel multiclass sentiment and emotion dataset from indonesian mobile application review. Data in Brief, 50:109576.
- Riley et al. (2020) Parker Riley, Isaac Caswell, Markus Freitag, and David Grangier. 2020. Translationese as a language in “multilingual” NMT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7737–7746, Online. Association for Computational Linguistics.
- Rizqullah et al. (2023) Muhammad Razif Rizqullah, Ayu Purwarianti, and Alham Fikri Aji. 2023. Qasina: Religious domain question answering using sirah nabawiyah. In 2023 10th International Conference on Advanced Informatics: Concept, Theory and Application (ICAICTA), pages 1–6. IEEE.
- Rouditchenko et al. (2023) Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, and James Glass. 2023. Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages. In Proc. INTERSPEECH 2023, pages 2268–2272.
- Ruder et al. (2023) Sebastian Ruder, Jonathan H Clark, Alexander Gutkin, Mihir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani, Parker Riley, Jean-Michel Sarr, Xinyi Wang, et al. 2023. Xtreme-up: A user-centric scarce-data benchmark for under-represented languages. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1856–1884.
- Sanh et al. (2021) Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Stella Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M. Rush. 2021. Multitask prompted training enables zero-shot task generalization. Preprint, arXiv:2110.08207.
- Sani et al. (2012) Auliya Sani, Sakriani Sakti, Graham Neubig, Tomoki Toda, Adi Mulyanto, and Satoshi Nakamura. 2012. Towards language preservation: Preliminary collection and vowel analysis of indonesian ethnic speech data. In 2012 International Conference on Speech Database and Assessments, pages 118–122.
- Schuhmann et al. (2022) Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294.
- Setya and Mahendra (2018) Ken Nabila Setya and Rahmad Mahendra. 2018. Semi-supervised textual entailment on indonesian wikipedia data. In International Conference on Computational Linguistics and Intelligent Text Processing, pages 416–427. Springer.
- Singapore (2023) AI Singapore. 2023. Sea-lion (southeast asian languages in one network): A family of large language models for southeast asia. https://github.com/aisingapore/sealion.
- Singh et al. (2024) Shivalika Singh, Freddie Vargus, Daniel D’souza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura OMahony, et al. 2024. Aya dataset: An open-access collection for multilingual instruction tuning. arXiv preprint arXiv:2402.06619.
- Søgaard (2022) Anders Søgaard. 2022. Should we ban English NLP for a year? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5254–5260, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Sutoyo et al. (2022) Rhio Sutoyo, Said Achmad, Andry Chowanda, Esther Widhi Andangsari, and Sani M. Isa. 2022. Prdect-id: Indonesian product reviews dataset for emotions classification tasks. Data in Brief, 44:108554.
- Täckström et al. (2012) Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 477–487, Montréal, Canada. Association for Computational Linguistics.
- Talat et al. (2022) Zeerak Talat, Aurélie Névéol, Stella Biderman, Miruna Clinciu, Manan Dey, Shayne Longpre, Sasha Luccioni, Maraim Masoud, Margaret Mitchell, Dragomir Radev, Shanya Sharma, Arjun Subramonian, Jaesung Tae, Samson Tan, Deepak Tunuguntla, and Oskar Van Der Wal. 2022. You reap what you sow: On the challenges of bias evaluation under multilingual settings. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, pages 26–41, virtual+Dublin. Association for Computational Linguistics.
- Thapliyal et al. (2022) Ashish V. Thapliyal, Jordi Pont Tuset, Xi Chen, and Radu Soricut. 2022. Crossmodal-3600: A massively multilingual multimodal evaluation dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 715–729, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Tran et al. (2021) Khanh Quoc Tran, Phap Ngoc Trinh, Khoa Nguyen-Anh Tran, An Tran-Hoai Le, Luan Van Ha, and Kiet Van Nguyen. 2021. An empirical investigation of online news classification on an open-domain, large-scale and high-quality dataset in vietnamese. In New Trends in Intelligent Software Methodologies, Tools and Techniques, pages 367–379. IOS Press.
- Üstün et al. (2024) Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, et al. 2024. Aya model: An instruction finetuned open-access multilingual language model. arXiv preprint arXiv:2402.07827.
- Van Nguyen et al. (2022) Kiet Van Nguyen, Tin Van Huynh, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, and Ngan Luu-Thuy Nguyen. 2022. New vietnamese corpus for machine reading comprehension of health news articles. ACM Trans. Asian Low-Resour. Lang. Inf. Process., 21(5).
- Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575.
- Wang et al. (2023) Bin Wang, Zhengyuan Liu, Xin Huang, Fangkai Jiao, Yang Ding, Ai Ti Aw, and Nancy F Chen. 2023. Seaeval for multilingual foundation models: From cross-lingual alignment to cultural reasoning. arXiv preprint arXiv:2309.04766.
- Wang et al. (2024) Bin Wang, Zhengyuan Liu, Xin Huang, Fangkai Jiao, Yang Ding, Ai Ti Aw, and Nancy F. Chen. 2024. Seaeval for multilingual foundation models: From cross-lingual alignment to cultural reasoning. NAACL.
- Wei et al. (2021) Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
- Wibowo et al. (2023) Haryo Akbarianto Wibowo, Erland Hilman Fuadi, Made Nindyatama Nityasya, Radityo Eko Prasojo, and Alham Fikri Aji. 2023. Copal-id: Indonesian language reasoning with local culture and nuances. arXiv preprint arXiv:2311.01012.
- Wilie et al. (2020) Bryan Wilie, Karissa Vincentio, Genta Indra Winata, Samuel Cahyawijaya, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, and Ayu Purwarianti. 2020. IndoNLU: Benchmark and resources for evaluating Indonesian natural language understanding. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 843–857, Suzhou, China. Association for Computational Linguistics.
- Winata et al. (2023) Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, Ade Romadhony, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Pascale Fung, Timothy Baldwin, Jey Han Lau, Rico Sennrich, and Sebastian Ruder. 2023. NusaX: Multilingual parallel sentiment dataset for 10 Indonesian local languages. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 815–834, Dubrovnik, Croatia. Association for Computational Linguistics.
- Winata et al. (2024) Genta Indra Winata, Ruochen Zhang, and David Ifeoluwa Adelani. 2024. Miners: Multilingual language models as semantic retrievers. arXiv preprint arXiv:2406.07424.
- Workshop et al. (2022) BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, et al. 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
- Xue et al. (2021) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
- Yong et al. (2023) Zheng Xin Yong, Ruochen Zhang, Jessica Forde, Skyler Wang, Arjun Subramonian, Holy Lovenia, Samuel Cahyawijaya, Genta Winata, Lintang Sutawika, Jan Christian Blaise Cruz, Yin Lin Tan, Long Phan, Rowena Garcia, Thamar Solorio, and Alham Aji. 2023. Prompting multilingual large language models to generate code-mixed texts: The case of South East Asian languages. In Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching, pages 43–63, Singapore. Association for Computational Linguistics.
- Zhang et al. (2023a) Ruochen Zhang, Samuel Cahyawijaya, Jan Christian Blaise Cruz, Genta Winata, and Alham Aji. 2023a. Multilingual large language models are not (yet) code-switchers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12567–12582, Singapore. Association for Computational Linguistics.
- Zhang et al. (2023b) Wenxuan Zhang, Mahani Aljunied, Chang Gao, Yew Ken Chia, and Lidong Bing. 2023b. M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models. In Advances in Neural Information Processing Systems, volume 36, pages 5484–5505. Curran Associates, Inc.
- Zhang et al. (2024) Wenxuan Zhang, Mahani Aljunied, Chang Gao, Yew Ken Chia, and Lidong Bing. 2024. M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models. Advances in Neural Information Processing Systems, 36.
- Zhu et al. (2024) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2024. Minigpt-4: Enhancing vision-language understanding with advanced large language models. ICLR.
Appendix A Summary of SEACrowd
Benchmark | # Languages | # Indigenous SEA Languages | # Datasets | # Tasks |
SEACrowd (ours)† | 39 | 38 | 254 | 13 (11 text, 1 speech, 1 vision) |
NusaCrowd† Cahyawijaya et al. (2023a) | 19 | 19 | 137 | 12 (11 text, 1 speech) |
BUFFET Asai et al. (2023) | 54 | N/A | 15 | 8 (8 text) |
XTREME-UP Ruder et al. (2023) | 88 | 11 | 269 | 9 (7 text, 1 speech, 1 vision) |
Addressing the resource gaps and challenges in AI development for Southeast Asian (SEA) languages is essential for our region’s sustainable and prosperous future. The underrepresentation of SEA languages in the pre-training of machine learning models severely degrades model quality for these languages. Additionally, the scarcity of high-quality datasets and evaluation tools further hinders progress in AI for SEA languages. The dominance of English-centric training data introduces cultural biases and fails to capture the local values and nuances of SEA cultures. To overcome these obstacles, SEACrowd provides a comprehensive and standardized resource center, along with evaluation tasks, for nearly 1,000 SEA indigenous and non-indigenous languages across various modalities. SEACrowd helps close the resource and evaluation gaps, enabling researchers and developers to improve the performance of AI models for SEA languages.
The journey does not end here. Concrete next steps are essential to drive AI advancement in Southeast Asia. Strategic investments in research and development, collaborations with local communities, and efforts toward language preservation are imperative. Governments, industry leaders, and stakeholders must prioritize the development of national and under-resourced local languages to ensure resource equity and promote inclusivity in AI technology. By taking bold actions, such as funding initiatives for data collection and model training, establishing partnerships with local communities, and focusing on language preservation, we can unlock the full potential of Southeast Asian languages. Not only will this spur economic growth, but it will also preserve the region’s rich cultural heritage.
In conclusion, developing AI for Southeast Asian languages is not a mere necessity but an opportunity to create a sustainable and prosperous future. By addressing resource gaps, accurately evaluating models, and fostering inclusive AI development, we can harness the power of SEA languages to drive long-term economic growth while preserving our region’s cultural diversity.
Appendix B Related Work
SEA data resources
LLM research for SEA languages is limited by the lack of available datasets and benchmarks. To date, resources for SEA NLP tasks are concentrated on relatively higher-resource SEA indigenous languages, such as Indonesian (Mahendra et al., 2021; Wilie et al., 2020; Cahyawijaya et al., 2021, 2023a) and Vietnamese (Nguyen et al., 2020; Huynh et al., 2022; Le and Luu, 2023; Van Nguyen et al., 2022). NusaCrowd (Cahyawijaya et al., 2023a) introduces the first multimodal benchmark for Indonesian languages, covering text and speech. Ruder et al. (2023) introduce a multimodal benchmark spanning 88 languages, 11 of which are indigenous to SEA.
Additionally, Asai et al. (2023) present an LLM benchmark for cross-lingual few-shot transfer, comprising 15 distinct tasks and 54 languages sourced from varied multilingual datasets. Furthermore, Dou et al. (2024) find that publicly available pre-training data for SEA languages suffer from quality issues such as textual duplicates and excessive occurrences of Unicode escapes. Meanwhile, LLMs pre-trained specifically for SEA languages suffer from limited language coverage; for instance, Cendol (Cahyawijaya et al., 2024b), Sailor (Dou et al., 2024), SEA-LION (Singapore, 2023), and SeaLLMs (Nguyen et al., 2023) each cover at most 11 languages, a count that includes English and Chinese.
Open-source Community Initiatives in NLP
Open-source and open-science communities play a crucial role in engaging native speakers to curate large-scale multilingual NLP resources. In the past, collaborative efforts have been organized to collect data and train multilingual language models either on a global scale (Workshop et al., 2022; Singh et al., 2024; Üstün et al., 2024) or on a regional level, e.g., Masakhane for African languages (Adelani et al., 2021, 2022b, 2022a, 2023), AI4Bharat for Indian languages (Kakwani et al., 2020; Kumar et al., 2022; Dabre et al., 2022, inter alia), and AmericasNLP for Latin American languages (Mager et al., 2021; Ebrahimi et al., 2022). In the SEA region, there have been community-based initiatives, e.g., IndoNLP, PyThaiNLP, and RojakNLP, to study NLP on Indonesian languages (Aji et al., 2022; Wilie et al., 2020; Cahyawijaya et al., 2021, 2023a), the Thai language (Phatthiyaphaibun et al., 2023), and the code-switching phenomenon in SEA (Aji et al., 2023; Yong et al., 2023; Winata et al., 2024), respectively.
Submission | Points | Max points |
Public datasheet | 2+bonus | 6 |
Dataloader | 3 | 6 if difficult |
Private datasheet | 1 | - |
Access to private data | 4+bonus | 10 if high-quality |
Datasheet review | 1 | 1 |
Dataloader review | 2 | 4 if difficult |
Private datasheet review | 0.5 | - |
Private data contact | 1 | 5 if succeeds |
Appendix C Contributing to SEACrowd
C.1 Open Contributions
We identify four tasks for open contribution in SEACrowd (landing page: https://github.com/SEACrowd). These tasks and the workflow of SEACrowd are heavily influenced by, and extend, NusaCrowd Cahyawijaya et al. (2023a, 2022), a collaborative effort to pool data resources for Indonesian NLP.
![Refer to caption](extracted/5712155/latex/figures/initiative/seacrowd-catalogue.png)
- Submitting Metadata for Existing Public Datasets. Contributors can submit detailed datasheets for existing datasets through the public datasheet form (https://form.jotform.com/team/232952680898069/seacrowd-sea-datasets). Contributors must provide key information such as the data license, size, language and dialect, and annotation method. Approved datasheets, as well as those still under review, are indexed in a monitoring spreadsheet and the SEACrowd Catalogue (Figure 7).
- Building a Dataloader. For datasheets approved in the previous task, contributors can further contribute by building a HuggingFace dataset loader to ensure that all datasets in SEACrowd are standardized in terms of formatting and usage. Contributors can follow the dataloader guide (https://github.com/SEACrowd/seacrowd-datahub/blob/master/DATALOADER.md) and the examples available in the SEACrowd Data Hub. Dataloader maintainers and reviewers also monitor self-assigned dataloader issues and, after 2 weeks of inactivity, ping contributors in case of a blocking impediment.
- Identifying Private AI Datasets for SEA Languages, Cultures, and/or Regions. Unfortunately, the data from a number of prior works involving SEA languages are still not publicly available. This may be due to several different reasons, including (but not limited to): non-release contracts related to funding, inclusion of private and personally identifiable data, and the use of explicitly private data such as those used by for-profit companies.
  In this task, contributors can search for works that contain private data and fill out a corresponding record form (papers with private dataset form: https://form.jotform.com/team/232952680898069/seacrowd-paper-with-private-dataset). The SEACrowd team then attempts to contact the original data owners and negotiate the open-sourcing of their resources.
- Opening a Private AI Dataset of SEA. If a contributor has previous work with closed data (or has been contacted by the SEACrowd team regarding closed-source data), they can decide to release their resources and register them in the collection via the public datasheet form. The resource will still be owned by the original contributor and is still tied to the contributor’s previous work, as SEACrowd simply catalogs it and records its now open-source license.
C.2 Measuring Contributions
![Refer to caption](extracted/5712155/latex/figures/initiative/seacrowd-timeline.png)
To be considered as a co-author, a contributor needs 20 contribution points. Submissions past the deadlines (see Appendix D.1) are still recorded, but contribution points are no longer given for them. To let contributors monitor how many points they have obtained, a contribution point tracker is provided and updated regularly. The purpose of the point system is not to create a barrier to collaboration but to reward rare and high-quality dataset entries. Table 4 describes the contribution points (contribution point guidelines: https://github.com/SEACrowd/seacrowd-datahub/blob/master/POINTS.md). A bonus of 1 point is given if the dataset modality is speech or vision. We also provide a bonus based on the language rarity in terms of available resources as defined by Joshi et al. (2020) (https://microsoft.github.io/linguisticdiversity/assets/lang2tax.txt), consisting of 1 point for languages in levels 1 and 2, and 2 points for languages in level 0 or absent from the list. For other contributions not mentioned in Table 4 (e.g., maintenance, design, experiment, paper writing, etc.), the amount of contribution points is adjusted to the bulk and the complexity of the relevant work.
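The bonus rules above can be sketched in a few lines of Python. This is only an illustration, not the project's actual tracking code: the function names are hypothetical, the base value of 2 points is the public datasheet entry in Table 4, and we assume that the modality and language-rarity bonuses simply add up.

```python
from typing import Optional

CO_AUTHOR_THRESHOLD = 20   # points required for co-authorship (from the text)
PUBLIC_DATASHEET_BASE = 2  # base points for a public datasheet (Table 4)


def modality_bonus(modality: str) -> int:
    """1 bonus point if the dataset modality is speech or vision, else 0."""
    return 1 if modality.lower() in {"speech", "vision"} else 0


def rarity_bonus(joshi_level: Optional[int]) -> int:
    """Bonus from the Joshi et al. (2020) resource level of the language.

    None stands for a language that is absent from the list.
    """
    if joshi_level is None or joshi_level == 0:
        return 2
    if joshi_level in (1, 2):
        return 1
    return 0


def public_datasheet_points(modality: str, joshi_level: Optional[int]) -> int:
    """Base points plus both bonuses (additivity is an assumption)."""
    return PUBLIC_DATASHEET_BASE + modality_bonus(modality) + rarity_bonus(joshi_level)


# A speech dataset in a language absent from the list: 2 + 1 + 2.
print(public_datasheet_points("speech", None))
```

Under these assumptions, a text dataset in a level 3 or higher language earns only the base points, while rarer languages and non-text modalities earn more.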
Appendix D Progression of SEACrowd
D.1 Timeline
SEACrowd released the open call for contributions on 1 November 2023. The call remained open until 31 March 2024 for datasheet submissions and until 15 May 2024 for both dataloader and private dataset submissions. SEACrowd contributors hold a biweekly discussion regarding the challenges they face while contributing, the next steps they should take to proceed, and/or experiment and research ideas for the paper. The detailed timeline can be seen in Figure 8.
D.2 Contribution Progress
Figure 9 shows the number of submissions for public datasheets, dataloader pull requests, and papers with private datasets in SEACrowd.
Appendix E Reviewing SEACrowd’s Submissions
We provide the complete reviewing guidelines in our Data Hub (reviewer SOP: https://github.com/SEACrowd/seacrowd-datahub/blob/master/REVIEWING.md).
E.1 Datasheet Reviewing
The datasheet reviewing standard operating procedure (SOP) ensures the integrity and completeness of datasets submitted to SEACrowd. It outlines procedures for verifying dataset availability, avoiding duplicates, and ensuring correctness and relevance to the SEA region. The SOP includes FAQs addressing common issues such as dataset duplicates and incorrect information, along with an approval checklist covering aspects like data availability, dataset splits, and licensing. Reviewers are instructed on how to handle various scenarios, including correcting errors and determining the allocation of points among multiple contributors. For instance, if a submitted datasheet has incorrect or missing information, the reviewer can either ask the contributor to fix it (with some guidance) or fix it themselves. Upon completion of the review, reviewers update the status, add notes and points, and await the generation of a GitHub issue for the approved datasheet.
![Refer to caption](extracted/5712155/latex/figures/initiative/seacrowd-contribution-progress.png)
E.2 Dataloader Reviewing
The dataloader reviewing SOP governs the review process for dataloaders in SEACrowd, ensuring adherence to the data structure and seacrowd schema and config standards. It specifies checks for metadata correctness, subset implementation, test script passing, and adherence to coding conventions. Additionally, it outlines dataloader config rules based on dataset types and provides guidelines for multilingual datasets. The SOP emphasizes the importance of reviewer collaboration, with each dataloader requiring two reviewers per submitted pull request, and outlines the approval and reviewer assignment process, either by allocation or by self-assignment based on availability and promptness.
Appendix F Schemas in SEACrowd
Schemas define and format the attributes of the dataset returned by a dataloader. For each dataloader, we implement 2 schema types: the source schema and the seacrowd schema. The source schema presents the dataset in a format similar to its original structure, while the seacrowd schema standardizes the data structure across similar tasks.
Subset ID | Language | Region | # Samples |
Sentiment Analysis *_seacrowd_text | |||
lazada_review_filipino | fil | Philippines | 1001 |
gklmip_sentiment | mya | Myanmar | 716 |
indolem_sentiment | ind | Indonesia | 1011 |
id_sentiment_analysis | ind | Indonesia | 10806 |
karonese_sentiment | btx | Indonesia | 1000 |
wisesight_thai_sentiment | tha | Thailand | 2671 |
wongnai_reviews | tha | Thailand | 6203 |
typhoon_yolanda_tweets | fil | Philippines | 153 |
smsa | ind | Indonesia | 500 |
prdect_id_sentiment | ind | Indonesia | 5400 |
id_sent_emo_mobile_apps_sentiment | ind | Indonesia | 21696 |
shopee_reviews_tagalog | fil | Philippines | 2250 |
nusatranslation_senti_abs | abs | Indonesia | 500 |
nusatranslation_senti_btk | btx | Indonesia | 1200 |
nusatranslation_senti_bew | bew | Indonesia | 1200 |
nusatranslation_senti_bhp | bhp | Indonesia | 500 |
nusatranslation_senti_jav | jav | Indonesia | 1200 |
nusatranslation_senti_mad | mad | Indonesia | 1200 |
nusatranslation_senti_mak | mak | Indonesia | 1200 |
nusatranslation_senti_min | min | Indonesia | 1200 |
nusatranslation_senti_mui | mui | Indonesia | 500 |
nusatranslation_senti_rej | rej | Indonesia | 500 |
nusatranslation_senti_sun | sun | Indonesia | 1200 |
nusax_senti_ind | ind | Indonesia | 400 |
nusax_senti_ace | ace | Indonesia | 400 |
nusax_senti_jav | jav | Indonesia | 400 |
nusax_senti_sun | sun | Indonesia | 400 |
nusax_senti_min | min | Indonesia | 400 |
nusax_senti_bug | bug | Indonesia | 400 |
nusax_senti_bbc | bbc | Indonesia | 400 |
nusax_senti_ban | ban | Indonesia | 400 |
nusax_senti_nij | nij | Indonesia | 400 |
nusax_senti_mad | mad | Indonesia | 400 |
nusax_senti_bjn | bjn | Indonesia | 400 |
nusax_senti_eng | eng | Non-indigenous | 400 |
indonglish | ind | Indonesia | 1011 |
Subset ID | Language | Region | # Samples |
NLI *_seacrowd_pairs | |||
indonli | ind | Indonesia | 5183 |
wrete | ind | Indonesia | 100 |
snli_indo | ind | Indonesia | 9823 |
myxnli | mya | Myanmar | 5010 |
xnli.tha | tha | Thailand | 5010 |
xnli.vie | vie | Vietnam | 5010 |
Subset ID | Language | Region | # Samples |
Topic Classification *_seacrowd_text | |||
gklmip_newsclass | khm | Cambodia | 1436 |
indonesian_news_dataset | ind | Indonesia | 2627 |
uit_vion | vie | Vietnam | 26000 |
sib_200_ace_Arab | ace | Indonesia | 204 |
sib_200_ace_Latn | ace | Indonesia | 204 |
sib_200_ban_Latn | ban | Indonesia | 204 |
sib_200_bjn_Arab | bjn | Indonesia | 204 |
sib_200_bjn_Latn | bjn | Indonesia | 204 |
sib_200_bug_Latn | bug | Indonesia | 204 |
sib_200_ceb_Latn | ceb | Philippines | 204 |
sib_200_ilo_Latn | ilo | Philippines | 204 |
sib_200_ind_Latn | ind | Indonesia | 204 |
sib_200_jav_Latn | jav | Indonesia | 204 |
sib_200_kac_Latn | kac | Myanmar | 204 |
sib_200_khm_Khmr | khm | Cambodia | 204 |
sib_200_lao_Laoo | lao | Laos | 204 |
sib_200_lus_Latn | lus | Myanmar | 204 |
sib_200_min_Arab | min | Indonesia | 204 |
sib_200_min_Latn | min | Indonesia | 204 |
sib_200_mya_Mymr | mya | Myanmar | 204 |
sib_200_pag_Latn | pag | Philippines | 204 |
sib_200_shn_Mymr | shn | Myanmar | 204 |
sib_200_sun_Latn | sun | Indonesia | 204 |
sib_200_tgl_Latn | fil | Philippines | 204 |
sib_200_tha_Thai | tha | Thailand | 204 |
sib_200_vie_Latn | vie | Non-indigenous | 204 |
sib_200_war_Latn | war | Philippines | 204 |
sib_200_zsm_Latn | zsm | Malaysia | 204 |
nusaparagraph_topic_btk | btx | Indonesia | 500 |
nusaparagraph_topic_bew | bew | Indonesia | 800 |
nusaparagraph_topic_bug | bug | Indonesia | 300 |
nusaparagraph_topic_jav | jav | Indonesia | 800 |
nusaparagraph_topic_mad | mad | Indonesia | 700 |
nusaparagraph_topic_mak | mak | Indonesia | 700 |
nusaparagraph_topic_min | min | Indonesia | 800 |
nusaparagraph_topic_mui | mui | Indonesia | 400 |
nusaparagraph_topic_rej | rej | Indonesia | 350 |
nusaparagraph_topic_sun | sun | Indonesia | 900 |
Subset ID | Language | Region | # Samples |
Commonsense Reasoning *_seacrowd_text/qa | |||
emotes_3k_tgl | fil | Philippines | 2905 |
emotes_3k_eng | eng | Non-indigenous | 2905 |
indo_story_cloze | ind | Indonesia | 1135 |
xstorycloze_id | ind | Indonesia | 1511 |
xstorycloze_my | mya | Myanmar | 1511 |
Subset ID | Language | Region | # Samples |
Standard Testing QA *_seacrowd_qa | |||
indommlu_ind | ind | Indonesia | 14979 |
indommlu_ban | ban | Indonesia | 14979 |
indommlu_mad | mad | Indonesia | 14979 |
indommlu_mak | mak | Indonesia | 14979 |
indommlu_sun | sun | Indonesia | 14979 |
indommlu_jav | jav | Indonesia | 14979 |
indommlu_bjn | bjn | Indonesia | 14979 |
indommlu_abl | abl | Indonesia | 14979 |
indommlu_nij | nij | Indonesia | 14979 |
seaeval_cross_mmlu_ind | ind | Indonesia | 150 |
seaeval_cross_mmlu_vie | vie | Vietnam | 150 |
seaeval_cross_mmlu_zlm | zsm | Malaysia | 150 |
seaeval_cross_mmlu_fil | fil | Philippines | 150 |
seaeval_cross_logiqa_ind | ind | Indonesia | 176 |
seaeval_cross_logiqa_vie | vie | Vietnam | 176 |
seaeval_cross_logiqa_zlm | zsm | Malaysia | 176 |
seaeval_cross_logiqa_fil | fil | Philippines | 176 |
m3exam_jav | jav | Indonesia | 371 |
m3exam_tha | tha | Thailand | 2168 |
m3exam_vie | vie | Vietnam | 1789 |
okapi_m_arc_ind | ind | Indonesia | 1170 |
okapi_m_arc_vie | vie | Vietnam | 1170 |
Cultural QA *_seacrowd_qa | |||
copal_colloquial | ind | Indonesia | 559 |
xcopa_tha | tha | Thailand | 500 |
xcopa_vie | vie | Vietnam | 500 |
xcopa_ind | ind | Indonesia | 500 |
seaeval_sg_eval_eng | eng | Non-indigenous | 103 |
seaeval_ph_eval_eng | eng | Non-indigenous | 100 |
mabl_ind | ind | Indonesia | 1140 |
mabl_jav | jav | Indonesia | 600 |
mabl_sun | sun | Indonesia | 600 |
Reading Comprehension QA *_seacrowd_qa | |||
belebele_ceb_latn | ceb | Philippines | 900 |
belebele_ilo_latn | ilo | Philippines | 900 |
belebele_ind_latn | ind | Indonesia | 900 |
belebele_jav_latn | jav | Indonesia | 900 |
belebele_kac_latn | kac | Myanmar | 900 |
belebele_khm_khmr | khm | Cambodia | 900 |
belebele_lao_laoo | lao | Laos | 900 |
belebele_mya_mymr | mya | Myanmar | 900 |
belebele_shn_mymr | shn | Myanmar | 900 |
belebele_sun_latn | sun | Indonesia | 900 |
belebele_tgl_latn | fil | Philippines | 900 |
belebele_tha_thai | tha | Thailand | 900 |
belebele_vie_latn | vie | Vietnam | 900 |
belebele_war_latn | war | Philippines | 900 |
belebele_zsm_latn | zsm | Malaysia | 900 |
Subset ID | Language | Region | # Samples |
Extractive & Abstractive QA *_seacrowd_qa | |||
facqa | ind | Indonesia | 311 |
iapp_squad | tha | Thailand | 739 |
qasina | ind | Indonesia | 500 |
mkqa_khm | khm | Cambodia | 10000 |
mkqa_zsm | zsm | Malaysia | 10000 |
mkqa_tha | tha | Thailand | 10000 |
mkqa_vie | vie | Vietnam | 10000 |
Subset ID | Language | Region | # Samples |
Summarization *_seacrowd_t2t | |||
lr_sum_ind | ind | Indonesia | 500 |
lr_sum_vie | vie | Vietnam | 1460 |
lr_sum_lao | lao | Laos | 1496 |
lr_sum_tha | tha | Thailand | 500 |
lr_sum_khm | khm | Cambodia | 486 |
lr_sum_mya | mya | Myanmar | 990 |
xl_sum_mya | mya | Myanmar | 570 |
xl_sum_ind | ind | Indonesia | 4780 |
xl_sum_tha | tha | Thailand | 826 |
xl_sum_vie | vie | Vietnam | 4013 |
F.1 NLP
- Unlabeled text (SSP). This schema could be used for language modeling in self-supervised pre-training. It consists of (id, text), where id denotes a unique row identifier of the dataset and text denotes an input text.
- Single-label text classification (TEXT). This schema could be used for sentiment analysis, emotion classification, legal classification, and others. It consists of (id, text, label), where id denotes a unique row identifier of the dataset, text denotes an input text, and label denotes a deterministic target variable.
- Multi-label text classification (TEXT MULTI). This schema could be used for hate speech detection and aspect-based sentiment analysis. It consists of (id, text, labels), where id denotes a unique row identifier of the dataset, text denotes an input text, and labels denotes a list of deterministic target variables.
- Text-to-text (T2T). This schema could be used for machine translation, summarization, and paraphrasing. It consists of (id, text_1, text_2, text_1_name, text_2_name), where id denotes a unique row identifier of the dataset, text_1 and text_2 denote an input text pair, and text_1_name and text_2_name denote the names of the input text pair (e.g., ind and jav for translation input text pairs, or document and summary for summarization input text pairs).
- Sequence labeling (SEQ LABEL). This schema could be used for named entity recognition (NER), POS tagging, and others. It consists of (id, tokens, labels), where id denotes a unique row identifier of the dataset, tokens denotes a list of tokens of an input text, and labels denotes a list of targets for the tokens.
- Question answering (QA). This schema could be used for extractive QA, multiple-choice QA, and others. It consists of (id, question_id, document_id, question, type, choices, context, answer), where id denotes a unique row identifier of the dataset, question_id denotes a unique identifier of the question, document_id denotes a unique identifier of the context document, question denotes an input question to be answered, type denotes the type of the QA task (e.g., extractive, multiple-choice, open-generative, closed-generative, etc.), choices denotes a list of answer choices (if required), context denotes a passage that serves as the background information of the question (if required), and answer denotes the gold answer to the question (if required).
- Single-label text pair classification (PAIRS). This schema could be used for textual entailment and next-sentence prediction. It consists of (id, text_1, text_2, label), where id denotes a unique row identifier of the dataset, text_1 and text_2 denote an input text pair, and label denotes the target variable.
- Single-label text pair classification with continuous values or regression (PAIRS SCORE). This schema could be used for answer grading and semantic textual similarity. It consists of (id, text_1, text_2, label), where id denotes a unique row identifier of the dataset, text_1 and text_2 denote an input text pair, and label denotes a target variable as a continuous value.
- Multi-label text pair classification (PAIRS MULTI). This schema could be used for morphological inflection. It consists of (id, text_1, text_2, labels), where id denotes a unique row identifier of the dataset, text_1 and text_2 denote an input text pair, and labels denotes a list of target variables.
- Knowledge base (KB). This schema could be used for constituency parsing, dependency parsing, coreference resolution, dialogue systems, and other tasks with complex structures. It consists of (id, passages, entities, events, coreferences, relations). Considering its intricate structure, we encourage readers to take a look at the implementation of the knowledge base schema.
- Tree (TREE). This schema could be used for constituency parsing. It assumes a document with subnode elements and a tree hierarchy. It consists of (id, passage, nodes), where id denotes a unique row identifier of the dataset, passage denotes the passage associated with that id and consists of (id, type, text, offsets), and nodes denotes the nodes associated with that id, each consisting of (id, type, text, offsets, subnodes).
- Conversational Chat (CHAT). This schema could be used for conversational chat and/or multi-turn conversation. It consists of (id, input, output, meta), where id denotes a unique row identifier of the dataset, input denotes a sequence of turns, each consisting of content (the input prompt) and role (the role of the entity providing the prompt), output denotes the answer to that input prompt, and meta denotes relevant details to allow some flexibility of the schema (if required).
- End-to-end Task Oriented Dialogue (TOD). This schema could be used for end-to-end task-oriented dialogue. It consists of (dialogue_idx, dialogue), where dialogue_idx denotes a unique row identifier of the dialogue and dialogue denotes core details such as the turn label, system utterance, turn idx, belief state (consisting of slots and act), user utterance, and system acts.
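To make the flat schemas above more concrete, the following sketch records their field names and checks whether a row provides all of them. It is an illustration only: the SCHEMA_FIELDS mapping and the missing_fields helper are hypothetical names, the example rows are invented, and the actual SEACrowd implementation (which builds on HuggingFace datasets) may differ in typing and structure.

```python
from typing import Any, Dict, List, Tuple

# Field names copied from the schema descriptions above.
# The mapping itself is a hypothetical helper, not part of SEACrowd.
SCHEMA_FIELDS: Dict[str, Tuple[str, ...]] = {
    "SSP": ("id", "text"),
    "TEXT": ("id", "text", "label"),
    "TEXT_MULTI": ("id", "text", "labels"),
    "T2T": ("id", "text_1", "text_2", "text_1_name", "text_2_name"),
    "SEQ_LABEL": ("id", "tokens", "labels"),
    "QA": ("id", "question_id", "document_id", "question",
           "type", "choices", "context", "answer"),
    "PAIRS": ("id", "text_1", "text_2", "label"),
}


def missing_fields(schema: str, row: Dict[str, Any]) -> List[str]:
    """Return the schema fields that are absent from a row, in schema order."""
    return [name for name in SCHEMA_FIELDS[schema] if name not in row]


# Invented example rows for illustration.
sentiment_row = {"id": "0", "text": "Filmnya bagus sekali", "label": "positive"}
translation_row = {"id": "0", "text_1": "Selamat pagi", "text_2": "Sugeng enjing"}

print(missing_fields("TEXT", sentiment_row))    # no fields missing
print(missing_fields("T2T", translation_row))   # the two *_name fields are missing
```

Keeping the field lists in one place reflects the purpose of the seacrowd schema: datasets for similar tasks expose the same keys, so downstream code does not need per-dataset handling.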
Subset ID | Language | Region | # Samples |
Image Captioning *_seacrowd_imtext | |||
xm3600_fil | fil | Philippines | 2760 |
xm3600_id | ind | Indonesia | 2775 |
xm3600_th | tha | Thailand | 2798 |
xm3600_vi | vie | Vietnam | 2855 |
F.2 Speech
- Speech-text (SPTEXT). This schema could be used for speech recognition, text-to-speech (TTS) or speech synthesis, and speech-to-text translation. It consists of (id, path, audio, text, speaker_id, metadata), where id denotes a unique row identifier of the dataset, path denotes the file path to an input audio source, audio denotes the audio data loaded from the corresponding path, text denotes an input text, speaker_id denotes a unique identifier of the speaker, and metadata denotes relevant details such as the age and gender of the speaker (if required).
- Speech-to-speech (S2S). This schema could be used for speech-to-speech translation. It consists of (id, path_1, audio_1, text_1, metadata_1, path_2, audio_2, text_2, metadata_2), where id denotes a unique row identifier of the dataset, path_1 and path_2 denote the file paths to the respective input audio sources, audio_1 and audio_2 denote the audio data loaded from the corresponding paths, text_1 and text_2 denote input texts, and metadata_1 and metadata_2 denote relevant details such as the age and gender of the speaker (if required).
- Speech Classification (SPEECH). This schema could be used for speech classification, speech-language identification, and speech-emotion recognition for single-label use only. It consists of (id, path, audio, speaker_id, labels, metadata), where id denotes a unique row identifier of the dataset, path denotes the file path to an input audio source, audio denotes the audio data loaded from the corresponding path, speaker_id denotes a unique identifier of the speaker, labels denotes the label of that particular speech (single-label only), and metadata denotes relevant details such as the age and gender of the speaker (if required).
- Speech Classification for Multilabel (SPEECH MULTILABEL). This schema could be used for speech classification, speech-language identification, and speech-emotion recognition for multi-label use only. It consists of (id, path, audio, speaker_id, labels, metadata), where id denotes a unique row identifier of the dataset, path denotes the file path to an input audio source, audio denotes the audio data loaded from the corresponding path, speaker_id denotes a unique identifier of the speaker, labels denotes the sequence of labels of that particular speech (multi-label only), and metadata denotes relevant details such as the age and gender of the speaker (if required).
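As a small illustration of how the speech schemas above relate to each other, the sketch below lists their field names and derives the S2S fields by suffixing the per-side fields with 1 and 2. The constant names are hypothetical and only the field names come from the descriptions above; this is not the SEACrowd implementation.

```python
from typing import Dict, Tuple

# Fields that S2S repeats for each side of the speech pair.
SIDE_FIELDS: Tuple[str, ...] = ("path", "audio", "text", "metadata")

# Hypothetical lookup table; field names follow the schema descriptions.
SPEECH_SCHEMA_FIELDS: Dict[str, Tuple[str, ...]] = {
    "SPTEXT": ("id", "path", "audio", "text", "speaker_id", "metadata"),
    "S2S": ("id",) + tuple(
        f"{name}_{side}" for side in (1, 2) for name in SIDE_FIELDS
    ),
    "SPEECH": ("id", "path", "audio", "speaker_id", "labels", "metadata"),
    "SPEECH_MULTILABEL": ("id", "path", "audio", "speaker_id", "labels", "metadata"),
}

for schema_name, fields in SPEECH_SCHEMA_FIELDS.items():
    print(schema_name, fields)
```

Note that SPEECH and SPEECH MULTILABEL share the same field names; they differ only in whether labels holds a single label or a sequence of labels.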
Subset ID | Language | Region | # Samples |
ASR *_seacrowd_sptext | |||
asr_ibsc | iba | Brunei | 473 |
commonvoice_120_ind | ind | Indonesia | 3647 |
commonvoice_120_tha | tha | Thailand | 10964 |
commonvoice_120_cnh | cnh | Myanmar | 763 |
commonvoice_120_vie | vie | Vietnam | 1302 |
fleurs_ind | ind | Indonesia | 687 |
fleurs_jav | jav | Indonesia | 728 |
fleurs_tha | tha | Thailand | 1021 |
fleurs_lao | lao | Laos | 405 |
fleurs_mya | mya | Myanmar | 880 |
fleurs_khm | khm | Cambodia | 771 |
fleurs_vie | vie | Vietnam | 857 |
fleurs_zlm | zlm | Malaysia | 749 |
fleurs_fil | fil | Philippines | 964 |
fleurs_ceb | ceb | Philippines | 541 |
indspeech_newstra_ethnicsr_nooverlap_jav | jav | Indonesia | 1000 |
indspeech_newstra_ethnicsr_nooverlap_sun | sun | Indonesia | 1000 |
indspeech_newstra_ethnicsr_nooverlap_ban | ban | Indonesia | 1000 |
indspeech_newstra_ethnicsr_nooverlap_btk | btx | Indonesia | 1000 |
Subset ID (Eng → XX) | Subset ID (XX → Eng) | Language | Region | # Samples |
MT (Eng ↔ XX) *_seacrowd_t2t | ||||
lio_and_central_flores_eng_ljl | lio_and_central_flores_ljl_eng | ljl | Indonesia | 1658 |
flores200_eng_Latn_ace_Latn | flores200_ace_Latn_eng_Latn | ace | Indonesia | 1012 |
flores200_eng_Latn_ban_Latn | flores200_ban_Latn_eng_Latn | ban | Indonesia | 1012 |
flores200_eng_Latn_bjn_Latn | flores200_bjn_Latn_eng_Latn | bjn | Indonesia | 1012 |
flores200_eng_Latn_bug_Latn | flores200_bug_Latn_eng_Latn | bug | Indonesia | 1012 |
flores200_eng_Latn_ceb_Latn | flores200_ceb_Latn_eng_Latn | ceb | Philippines | 1012 |
flores200_eng_Latn_ilo_Latn | flores200_ilo_Latn_eng_Latn | ilo | Philippines | 1012 |
flores200_eng_Latn_ind_Latn | flores200_ind_Latn_eng_Latn | ind | Indonesia | 1012 |
flores200_eng_Latn_jav_Latn | flores200_jav_Latn_eng_Latn | jav | Indonesia | 1012 |
flores200_eng_Latn_kac_Latn | flores200_kac_Latn_eng_Latn | kac | Myanmar | 1012 |
flores200_eng_Latn_khm_Khmr | flores200_khm_Khmr_eng_Latn | khm | Cambodia | 1012 |
flores200_eng_Latn_lao_Laoo | flores200_lao_Laoo_eng_Latn | lao | Laos | 1012 |
flores200_eng_Latn_lus_Latn | flores200_lus_Latn_eng_Latn | lus | Myanmar | 1012 |
flores200_eng_Latn_min_Latn | flores200_min_Latn_eng_Latn | min | Indonesia | 1012 |
flores200_eng_Latn_mya_Mymr | flores200_mya_Mymr_eng_Latn | mya | Myanmar | 1012 |
flores200_eng_Latn_pag_Latn | flores200_pag_Latn_eng_Latn | pag | Philippines | 1012 |
flores200_eng_Latn_shn_Mymr | flores200_shn_Mymr_eng_Latn | shn | Myanmar | 1012 |
flores200_eng_Latn_sun_Latn | flores200_sun_Latn_eng_Latn | sun | Indonesia | 1012 |
flores200_eng_Latn_tha_Thai | flores200_tha_Thai_eng_Latn | tha | Thailand | 1012 |
flores200_eng_Latn_vie_Latn | flores200_vie_Latn_eng_Latn | vie | Vietnam | 1012 |
flores200_eng_Latn_war_Latn | flores200_war_Latn_eng_Latn | war | Philippines | 1012 |
flores200_eng_Latn_zsm_Latn | flores200_zsm_Latn_eng_Latn | zsm | Malaysia | 1012 |
ntrex_128_eng-US_ind | ntrex_128_ind_eng-US | ind | Indonesia | 1997 |
ntrex_128_eng-US_mya | ntrex_128_mya_eng-US | mya | Myanmar | 1997 |
ntrex_128_eng-US_fil | ntrex_128_fil_eng-US | fil | Philippines | 1997 |
ntrex_128_eng-US_khm | ntrex_128_khm_eng-US | khm | Cambodia | 1997 |
ntrex_128_eng-US_lao | ntrex_128_lao_eng-US | lao | Laos | 1997 |
ntrex_128_eng-US_zlm | ntrex_128_zlm_eng-US | zsm | Malaysia | 1997 |
ntrex_128_eng-US_tha | ntrex_128_tha_eng-US | tha | Thailand | 1997 |
ntrex_128_eng-US_vie | ntrex_128_vie_eng-US | vie | Vietnam | 1997 |
ntrex_128_eng-US_hmv | ntrex_128_hmv_eng-US | hmv | Vietnam | 1997 |
nusax_mt_eng_ind | - | ind | Indonesia | 400 |
nusax_mt_eng_ace | nusax_mt_ace_eng | ace | Indonesia | 400 |
nusax_mt_eng_jav | nusax_mt_jav_eng | jav | Indonesia | 400 |
nusax_mt_eng_sun | nusax_mt_sun_eng | sun | Indonesia | 400 |
nusax_mt_eng_min | nusax_mt_min_eng | min | Indonesia | 400 |
nusax_mt_eng_bug | nusax_mt_bug_eng | bug | Indonesia | 400 |
nusax_mt_eng_bbc | nusax_mt_bbc_eng | bbc | Indonesia | 400 |
nusax_mt_eng_ban | nusax_mt_ban_eng | ban | Indonesia | 400 |
nusax_mt_eng_nij | nusax_mt_nij_eng | nij | Indonesia | 400 |
nusax_mt_eng_mad | nusax_mt_mad_eng | mad | Indonesia | 400 |
nusax_mt_eng_bjn | nusax_mt_bjn_eng | bjn | Indonesia | 400 |
F.3 VL
- Image-text (IMTEXT). This schema could be used for image captioning, text-to-image generation, and vision-language pre-training. It consists of (id, text, image_paths, metadata), where id denotes a unique row identifier of the dataset, text denotes an input text, image_paths denotes a list of paths to the input image sources, and metadata denotes relevant details such as visual concepts and labels (if required).
- General Image Classification (IMAGE). This schema could be used for both single-label and multi-label image classification. It consists of (id, labels, image_path, metadata), where id denotes a unique row identifier of the dataset, labels denotes the label of that particular image (can be single-label or multi-label), image_path denotes a list of paths to the input image sources, and metadata denotes relevant details such as visual concepts and labels (if required).
- Image Question Answering (IMQA). This schema could be used for image/visual question answering. It consists of (id, question_id, document_id, questions, type, choices, context, answer, image_paths, meta), where id denotes a unique row identifier of the dataset, question_id denotes a unique identifier of the question, document_id denotes a unique identifier of the context document, questions denotes the input question(s) to be answered, type denotes the type of the QA task (e.g., extractive, multiple-choice, open-generative, closed-generative, etc.), choices denotes a list of answer choices (if required), context denotes a passage that serves as the background information of the question (if required), answer denotes the gold answer to the question (if required), image_paths denotes a list of paths to the input image sources, and meta denotes relevant details to allow some flexibility of the schema (if required).
- General Video-to-Text (VIDEO). This schema could be used for video-to-text retrieval and video captioning. It consists of (id, video_path, text, metadata), where id denotes a unique row identifier of the dataset, video_path denotes the file path to an input video source, text denotes the text associated with that particular frame/video, and metadata denotes relevant details such as the resolution, duration, and FPS of the video (if required).
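As an illustration of the IMTEXT schema described above, a row for an image captioning dataset could be represented as follows. The ImageTextRow class, the file path, and the caption are invented for this sketch; only the four field names come from the schema description, and the value types are assumptions.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class ImageTextRow:
    """One IMTEXT row: (id, text, image_paths, metadata)."""
    id: str
    text: str                      # e.g., a caption
    image_paths: List[str]         # paths to the input image sources
    metadata: Dict[str, Any] = field(default_factory=dict)  # optional details


# Hypothetical row; the path and caption are placeholders.
row = ImageTextRow(
    id="0",
    text="Seekor kucing tidur di atas kursi",
    image_paths=["images/0001.jpg"],
)

print(asdict(row))
```

Leaving metadata as a free-form dictionary mirrors the "(if required)" wording in the schema description: datasets without extra annotations can simply leave it empty.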
Appendix G Supplementary Details for SEA Evaluation
Model | |||||
Commercial | |||||
GPT-4 | 0.199 | 0.192 | 0.155 | 0.118 | 0.066 |
Command-R | 0.201 | 0.198 | 0.185 | 0.168 | 0.126 |
English | |||||
Mistral | 0.161 | 0.160 | 0.159 | 0.162 | 0.150 |
Llama3 | 0.138 | 0.137 | 0.131 | 0.129 | 0.113 |
Falcon | 0.274 | 0.272 | 0.238 | 0.250 | 0.211 |
Multilingual | |||||
mT0 | 0.151 | 0.148 | 0.131 | 0.112 | 0.074 |
BLOOMZ | 0.238 | 0.236 | 0.228 | 0.217 | 0.167 |
BactrianX-Llama | 0.163 | 0.162 | 0.163 | 0.168 | 0.149 |
AYA-23 | 0.183 | 0.182 | 0.183 | 0.179 | 0.135 |
AYA-101 | 0.112 | 0.109 | 0.095 | 0.085 | 0.069 |
SEA regional | |||||
SEA-LION | 0.250 | 0.242 | 0.204 | 0.164 | 0.102 |
SeaLLM v2.5 | 0.137 | 0.133 | 0.116 | 0.097 | 0.069 |
Sailor | 0.152 | 0.151 | 0.145 | 0.139 | 0.113 |
SEA country | |||||
Cendol-mT5 | 0.407 | 0.404 | 0.378 | 0.328 | 0.200 |
Cendol-Llama2 | 0.294 | 0.290 | 0.267 | 0.232 | 0.149 |
Merak v4 | 0.209 | 0.207 | 0.199 | 0.190 | 0.155 |
WangchanX-Llama3 | 0.163 | 0.161 | 0.153 | 0.150 | 0.131 |
Malaysian Llama3 | 0.181 | 0.181 | 0.179 | 0.176 | 0.143 |
G.1 Datasets
Tables 5, 6, 7, 8, and 9 provide the details of data subsets used in the NLU evaluation. The sentiment analysis datasets are originally from NusaX Winata et al. (2023), NusaTranslation Cahyawijaya et al. (2023b), SentiTaglish (https://huggingface.co/datasets/ccosme/SentiTaglishProductsAndServices), SmSA Purwarianti and Crisdayanti (2019), PRDECT-ID Sutoyo et al. (2022), code-mixed Indonesian-English sentiment Astuti et al. (2023), Karonese tweet sentiment Karo et al. (2022), Typhoon Yolanda sentiment Imperial et al. (2019), GKLMIP Khmer sentiment Jiang et al. (2022), Wisesight sentiment corpus (https://github.com/PyThaiNLP/wisesight-sentiment), Filipino-Tagalog product reviews sentiment (https://github.com/EricEchemane/Filipino-Tagalog-Product-Reviews-Sentiment-Analysis), and multilabel sentiment of Indonesian mobile apps review Riccosan and Saputra (2023).
The topic classification datasets are originally from NusaParagraph Cahyawijaya et al. (2023b), UIT-ViON Tran et al. (2021), SIB-200 Adelani et al. (2024), GKLMIP Khmer news Jiang et al. (2022), and Indonesian news Muzad and Rahutomo (2016). The natural language inference datasets are originally from IndoNLI Mahendra et al. (2021), WreTe Setya and Mahendra (2018), SNLI Indo Putra et al. (2024), MyXNLI (https://huggingface.co/datasets/akhtet/myXNLI), and XNLI Conneau et al. (2018). The commonsense reasoning datasets are originally from XStoryCloze Lin et al. (2022), IndoCloze Koto et al. (2022), and EMoTES-3K Catapang and Visperas (2023).
The open-domain QA datasets are originally from IndoMMLU Koto et al. (2023b), SeaEval Wang et al. (2023), M3Exam Zhang et al. (2023b), and Okapi Dac Lai et al. (2023). The cultural QA datasets are originally from COPAL-ID Wibowo et al. (2023), XCOPA Ponti et al. (2020), SeaEval Wang et al. (2023), and Multilingual Fig-QA Kabra et al. (2023). The reading comprehension dataset is originally from Belebele Bandarkar et al. (2023).
Tables 10, 11, and 14 provide the details of data subsets used in the NLG evaluation. The summarization datasets are originally from LR-Sum Palen-Michel and Lignos (2023) and XL-Sum Hasan et al. (2021). The machine translation datasets are originally from Lio and the Central Flores corpus Elias (2018), Flores-200 Costa-jussà et al. (2024), and NTREX-128 Federmann et al. (2022). The question answering datasets are originally from FacQA Purwarianti et al. (2007), QASiNa Rizqullah et al. (2023), MKQA Longpre et al. (2021), and the Open Thai Wikipedia QA dataset (https://zenodo.org/records/4539916).
Tables 12 and 13 provide the details of data subsets used in the VL and speech evaluation. The image captioning dataset is originally from XM3600 Thapliyal et al. (2022). The speech recognition datasets are originally from the INDspeech NEWSTRA Ethnic collection Sani et al. (2012), ASR Iban Juan et al. (2015), FLEURS Conneau et al. (2022), and Common Voice Ardila et al. (2020).
G.2 Baselines
Tables 20, 21, and 22 report the details of the baseline models used in the SEACrowd evaluation (§3). For each baseline model, we provide the model size, the base model it originates from, the languages seen in its training corpora, and the URL where the model can be downloaded. This work does not aim to acquire and run every SEA-trained LLM available on the Internet, as this is computationally expensive. Rather, we initiate the exploration with a selection of publicly available models that serve as baselines for evaluating foundational capabilities on SEA languages, benchmarked on the NLU, NLG, speech, and vision tasks aggregated via SEACrowd.
Across the models listed in the tables, we prioritized diversity in terms of scale, openness, and coverage of SEA languages. For NLP tasks, we covered five language model families in the main experiments, namely commercial, English-only, multilingual, SEA regional, and SEA country-specific models. Instruction-tuned LLMs demonstrate the ability to generalize to unseen tasks Wei et al. (2021); Sanh et al. (2021); Ouyang et al. (2022). When these LLMs are based on a multilingual foundation, they have shown proficiency in generalizing across multiple languages Muennighoff et al. (2022); Adilazuarda et al. (2023); Zhang et al. (2023a). For NLU, we compute the weighted F1-score and obtain the answers via log-likelihood for open-source baselines or via string matching for commercial baselines.
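The NLU scoring procedure described above can be illustrated with a minimal sketch. It assumes a hypothetical `score(prompt, option)` callable that returns the log-likelihood a model assigns to an option given a prompt; that callable, the function names, the substring-based matching rule, and the toy labels are illustrative assumptions rather than the SEACrowd implementation.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def pick_label(prompt: str, options: Sequence[str],
               score: Callable[[str, str], float]) -> str:
    """Open-source baselines: keep the option with the highest log-likelihood.

    `score` is a hypothetical stand-in for a model call.
    """
    return max(options, key=lambda option: score(prompt, option))


def match_label(generation: str, options: Sequence[str]) -> Optional[str]:
    """Commercial baselines: return the first option found in the generated text.

    Case-insensitive substring matching is an assumption of this sketch.
    """
    text = generation.lower()
    for option in options:
        if option.lower() in text:
            return option
    return None


def weighted_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Per-class F1 averaged with weights equal to each class's gold support."""
    support = Counter(gold)
    total = 0.0
    for label, n_gold in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        n_pred = sum(1 for p in pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_gold
    return total / len(gold)
```

Weighting each class by its support keeps frequent classes from being under-counted on imbalanced label sets, which is presumably why a weighted average is preferred over a macro average here.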
For the speech benchmark, only two model families are available: multilingual models and models fine-tuned on specific SEA languages. For vision tasks, we covered English-only models and one multilingual model. These models use a visual backbone pre-trained on image-text alignment, e.g., CLIP Radford et al. (2021), to project image features into the input space of an existing pre-trained LM. In summary, we mostly explored open models readily accessible on HuggingFace, but we also included commercial models such as GPT-4 and Whisper v3 for performance benchmarking, reproducibility, and extension by future work.
Model | Hyperparameter | Value |
Logistic Regression | max_iter | 100 |
Logistic Regression | C | np.linspace(0.001, 10, 100) |
Naive Bayes | alpha | np.linspace(0.001, 1, 50) |
Naive Bayes | distribution | MultinomialNB |
SVM | C | 1 |
SVM | kernel | ["rbf", "linear"] |
G.3 Prompts
Tables 23, 24, and 25 describe the handwritten prompt templates used in the NLU, NLG, and VL evaluations (§3). For all tasks, we used zero-shot prompting as the baseline setup. Because of the task complexity and the distribution of the workload among volunteer contributors with available computing resources, we limited the experimental procedure for some setups so that results could be obtained in line with the target release dates. For NLU, we explored three prompt styles for each dataset from the core tasks, including commonsense reasoning, question answering, and NLI. For the more compute-intensive NLG and VL tasks, we used only one uniform prompt style; for VL, we also explored prompts translated into SEA languages, i.e., Filipino, Indonesian, Thai, and Vietnamese.
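As an illustration of how such templates can be instantiated in a zero-shot setup, the sketch below fills the bracketed placeholders of the first sentiment analysis template. The helper names, the comma-separated rendering of [OPTIONS], and the example sentence are assumptions made for this sketch; only the template string itself comes from the prompt tables.

```python
import re
from typing import Dict, List

# First sentiment analysis template from the NLU prompt table.
SENTIMENT_TEMPLATE_1 = (
    "Classify the sentiment of the text below.\n"
    "[INPUT] => Sentiment ([OPTIONS]): [LABEL_CHOICE]"
)

PLACEHOLDER = re.compile(r"\[([A-Z_]+)\]")


def fill_template(template: str, fields: Dict[str, str]) -> str:
    """Replace every [NAME] placeholder in one pass; unknown names are kept."""
    return PLACEHOLDER.sub(
        lambda m: fields.get(m.group(1), m.group(0)), template
    )


def candidate_prompts(template: str, text: str,
                      options: List[str]) -> Dict[str, str]:
    """Build one filled prompt per label, e.g. for log-likelihood scoring."""
    shared = {"INPUT": text, "OPTIONS": ", ".join(options)}
    return {
        label: fill_template(template, dict(shared, LABEL_CHOICE=label))
        for label in options
    }


# Made-up Indonesian example sentence ("the food is tasty").
prompts = candidate_prompts(
    SENTIMENT_TEMPLATE_1, "Makanannya enak", ["positive", "negative"]
)
print(prompts["positive"])
```

Substituting in a single regular-expression pass avoids accidentally expanding placeholder-like strings that occur inside the input text itself.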
Model | 3-label | HT vs. MT-Nat | MT vs. HT-Nat | Nat vs. HT-MT |
LR (TF-IDF) | 39.73 | 53.03 | 56.01 | 75.20 |
LR (BoW) | 45.63 | 55.90 | 61.39 | 75.60 |
NB (TF-IDF) | 33.43 | 49.53 | 50.55 | 73.05 |
NB (BoW) | 33.70 | 49.10 | 50.64 | 71.26 |
SVM (TF-IDF) | 39.55 | 52.63 | 55.10 | 76.40 |
SVM (BoW) | 46.84 | 56.85 | 61.40 | 75.65 |
mDeBERTa | 51.51 | 64.77 | 59.16 | 79.08 |
G.4 Evaluation Results
G.5 Language Equity Results
Table 15 presents the language equity of the LLMs used in the evaluation, measured by the Gini coefficient computed under different weightings of the number of speakers of each language.
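For readers unfamiliar with the metric, the following is a minimal sketch of a weighted Gini coefficient over per-language scores. The specific weighting scheme, in which each language is weighted by its speaker count raised to an exponent `tau`, is an assumption of this sketch in the spirit of Blasi et al. (2022); the exact formulation behind Table 15 may differ.

```python
from typing import List, Sequence


def speaker_weights(speakers: Sequence[float], tau: float) -> List[float]:
    """Hypothetical weighting: tau = 0 treats languages equally,
    tau = 1 weights each language by its number of speakers."""
    return [n ** tau for n in speakers]


def weighted_gini(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted Gini coefficient via the trapezoid rule on the Lorenz curve.

    0 means all languages score equally; larger values mean more inequality.
    """
    pairs = sorted(zip(values, weights))
    total_w = sum(w for _, w in pairs)
    total_vw = sum(v * w for v, w in pairs)
    if total_w <= 0 or total_vw <= 0:
        raise ValueError("weights and weighted values must sum to a positive number")
    cumulative = 0.0  # running share of the weighted score mass
    area = 0.0        # twice the area under the Lorenz curve
    for v, w in pairs:
        previous = cumulative
        cumulative += v * w / total_vw
        area += (w / total_w) * (previous + cumulative)
    return 1.0 - area
```

With `tau = 0` the coefficient reflects equity across languages regardless of size, whereas larger `tau` values shift the emphasis toward languages with more speakers.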
Country | Affiliation | Origin |
Indonesia | 16 | 31 |
Malaysia | 0 | 1 |
Philippines | 3 | 7 |
Singapore | 13 | 2 |
Thailand | 1 | 2 |
Vietnam | 0 | 1 |
Australia | 1 | 0 |
Brazil/Sweden | 0 | 1 |
Canada | 1 | 0 |
China | 2 | 8 |
Egypt | 0 | 1 |
Germany | 0 | 2 |
Hong Kong | 2 | 0 |
India | 0 | 1 |
Ireland | 1 | 0 |
Japan | 3 | 0 |
The Netherlands | 0 | 1 |
UAE | 5 | 0 |
UK | 4 | 0 |
USA | 9 | 1 |
Uzbekistan | 0 | 2 |
Appendix H Supplementary Details for Translationese Classifier
H.1 Training & Evaluation Data
We manually select the data subsets used for training and evaluating the translationese classifier, listed in Tables 28 and 29, respectively, and validate the text collection method of each. This validation is done by checking the relevant publication, domain, and annotation method. If the texts in a data subset are a product of machine or human translation, we regard them as translationese. We label data subsets with human-generated texts as natural data.
H.2 Experiments
We aim to assess the capability of ML models to differentiate between human-generated/natural samples (Nat), human-translated samples (HT), and machine-translated samples (MT). Our approach involves training classifiers using classical ML techniques and fine-tuning mDeBERTa models to enhance learning. Furthermore, we experiment by combining two label classes into one to evaluate the predictive difficulty of distinguishing between these labels. This analysis provides valuable insights into the relative similarity of the samples across these categories. The following section provides a comprehensive overview of our methodology for this study.
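The label-merging setup can be sketched as follows. The setting names mirror the column headers of the results table (3-label, HT vs. MT-Nat, MT vs. HT-Nat, Nat vs. HT-MT); the function and variable names are hypothetical and introduced only for illustration.

```python
from typing import List, Optional, Sequence

LABELS = ("HT", "MT", "Nat")

# For each setting, the label that stays on its own; None keeps all three.
SETTINGS = {
    "3-label": None,
    "HT vs. MT-Nat": "HT",
    "MT vs. HT-Nat": "MT",
    "Nat vs. HT-MT": "Nat",
}


def relabel(labels: Sequence[str], setting: str) -> List[str]:
    """Merge the two labels that are not kept into one combined class."""
    kept: Optional[str] = SETTINGS[setting]
    if kept is None:
        return list(labels)
    merged = "-".join(label for label in LABELS if label != kept)
    return [label if label == kept else merged for label in labels]


example = ["Nat", "HT", "MT", "HT"]
print(relabel(example, "Nat vs. HT-MT"))
```

If a binary setting is markedly easier than the 3-label one, the two merged classes are likely hard to tell apart from each other, which is the similarity signal this analysis is after.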
Classical ML
We use three classical machine learning methods: 1) Logistic Regression (LR), 2) Naive Bayes (NB), and 3) Support Vector Machine (SVM), each with two feature types: TF-IDF and bag-of-words (BoW). We run hyper-parameter tuning with grid search to find the best hyper-parameters for each method on the validation set, and report the results on the test set in Table 16.
Encoder LM
We explore fine-tuning an encoder-only LM to develop a translationese classifier. We use the mDeBERTa-v3-base model (https://huggingface.co/microsoft/mdeberta-v3-base) He et al. (2020, 2022), a multilingual encoder-only LM, as our backbone. We train the model with the AdamW optimizer Loshchilov and Hutter (2019) using a learning rate of 1e-5, a batch size of 256, and 500 warm-up steps for a maximum of 10 epochs. We apply early stopping with a patience of 3 epochs based on the validation accuracy. We show the results in Table 17.
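The schedule implied by these hyper-parameters can be sketched in plain Python. The linear shape of the warm-up, the constant rate afterwards, and the reading of early stopping as a patience of 3 non-improving epochs are assumptions of this sketch; the text specifies only the numeric values.

```python
from typing import Sequence, Tuple


def warmup_lr(step: int, base_lr: float = 1e-5, warmup_steps: int = 500) -> float:
    """Linear warm-up to base_lr, then constant.

    Any decay after warm-up is not specified in the text and is omitted here.
    """
    return base_lr * min(1.0, (step + 1) / warmup_steps)


def early_stopping(val_accuracies: Sequence[float], max_epochs: int = 10,
                   patience: int = 3) -> Tuple[int, float, int]:
    """Return (best_epoch, best_accuracy, last_epoch_run), epochs counted from 1.

    Training stops once validation accuracy has not improved for
    `patience` consecutive epochs, or when max_epochs is reached.
    """
    best_epoch, best_acc, last_epoch, stale = 0, float("-inf"), 0, 0
    for epoch, acc in enumerate(val_accuracies[:max_epochs], start=1):
        last_epoch = epoch
        if acc > best_acc:
            best_epoch, best_acc, stale = epoch, acc, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_acc, last_epoch


# Made-up validation accuracies, used only to exercise the stopping rule.
print(early_stopping([0.50, 0.55, 0.54, 0.53, 0.52, 0.60]))
```

In this toy run, the best checkpoint is the one from epoch 2, and training halts at epoch 5 without ever reaching the later, higher accuracy.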
No. | Name | C. Points |
1 | Holy Lovenia | 549 |
2 | Samuel Cahyawijaya | 480 |
3 | Rahmad Mahendra | 317 |
4 | Salsabil Maulana Akbar | 243 |
5 | Lester James V. Miranda | 234 |
6 | Zheng-Xin Yong | 164 |
7 | Jennifer Santoso | 164 |
8 | Elyanah Aco | 158 |
9 | Akhdan Fadhilah | 157 |
10 | Jonibek Mansurov | 132 |
11 | Fajri Koto | 121 |
12 | Joseph Marvin Imperial | 118 |
13 | Ruochen Zhang | 114 |
14 | Genta Indra Winata | 108 |
15 | Onno P. Kampman | 107 |
16 | Joel Ruben Antony Moniz | 93 |
17 | Muhammad Ravi Shulthan Habibi | 92 |
18 | Frederikus Hudi | 83 |
19 | Sedrick Keh | 81 |
20 | Alham Fikri Aji | 80 |
21 | Railey Montalan | 78 |
22 | Peerat Limkonchotiwat | 72 |
23 | Ryan Ignatius | 56 |
24 | Joanito Agili Lopo | 50 |
25 | William Nixon | 50 |
26 | Börje F. Karlsson | 49 |
27 | James Jaya | 48 |
28 | Ryandito Diandaru | 48 |
29 | Yuze Gao | 48 |
30 | William Tjhi | 46 |
31 | Patrick Amadeus | 46 |
32 | Bin Wang | 44 |
33 | Jan Christian Blaise Cruz | 43 |
34 | Chenxi Whitehouse | 36 |
35 | Ivan Halim Parmonangan | 36 |
36 | Maria Khelli | 36 |
37 | Sebastian Ruder | 35 |
38 | Wenyu Zhang | 34 |
39 | Lucky Susanto | 33 |
40 | Reynard Adha Ryanda | 32 |
41 | Sonny Lazuardi Hermawan | 30 |
42 | Dan John Velasco | 29 |
43 | Muhammad Dehan Al Kautsar | 29 |
44 | Willy Fitra Hendria | 29 |
45 | Yasmin Moslem | 29 |
46 | Noah Flynn | 28 |
47 | Muhammad Farid Adilazuarda | 27 |
48 | Haochen Li | 27 |
49 | Johanes Lee | 27 |
50 | R. Damanhuri | 27 |
51 | Shuo Sun | 27 |
52 | Muhammad Reza Qorib | 26 |
53 | Amirbek Djanibekov | 25 |
54 | Wei Qi Leong | 25 |
55 | Quyet V. Do | 24 |
56 | Niklas Muennighoff | 24 |
57 | Tanrada Pansuwan | 22 |
58 | Ilham Firdausi Putra | 21 |
59 | Yan Xu | 21 |
60 | Ayu Purwarianti | 20 |
61 | Ngee Chia Tai | 20 |
Appendix I Supplementary Details for SEA Language Prioritization
Based on the results of the global utility metric Blasi et al. (2022), we provide the top-20 SEA indigenous languages to be prioritized based on their demand (i.e., the number of SEA language speakers) and either their current utility (Figure 10) or their resource availability (Figure 11) (https://github.com/SEACrowd/globalutility). The current utility, also known as the model capability, is measured relative to the model performance on eng, whereas the resource availability is measured relative to 500, which is approximately the number of Korean-language datasets available on HuggingFace. Korean is chosen as the pivot because Joshi et al. (2020) consider it a higher-resource language than most.
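To make the two normalizations concrete, the sketch below computes a relative score against a reference (eng performance for utility, 500 datasets for resource availability) and ranks languages by unmet demand. Capping the relative score at 1 and ranking by demand share times (1 - score) are assumptions of this sketch, not necessarily the procedure behind Figures 10 and 11; the language names and numbers are placeholders.

```python
from typing import Dict, List

DATASET_PIVOT = 500  # approximate number of Korean datasets, per the text


def relative_score(value: float, reference: float) -> float:
    """Value relative to a reference, capped at 1 (the cap is an assumption)."""
    return min(value / reference, 1.0)


def rank_by_unmet_demand(speakers: Dict[str, float], scores: Dict[str, float],
                         top_k: int = 20) -> List[str]:
    """Rank languages by demand share times the gap to a full score.

    `scores` holds either relative utility or relative resource availability.
    """
    total = sum(speakers.values())
    gaps = {
        lang: speakers[lang] / total * (1.0 - scores[lang])
        for lang in speakers
    }
    return sorted(gaps, key=lambda lang: gaps[lang], reverse=True)[:top_k]


# Placeholder languages and made-up counts, for illustration only.
toy_speakers = {"lang_a": 1000, "lang_b": 100, "lang_c": 10}
toy_datasets = {"lang_a": 480, "lang_b": 50, "lang_c": 5}
availability = {
    lang: relative_score(n, DATASET_PIVOT) for lang, n in toy_datasets.items()
}
print(rank_by_unmet_demand(toy_speakers, availability))
```

Under this reading, a mid-sized language with few resources can outrank a much larger language that is already well served, which matches the intent of prioritizing by both demand and current coverage.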
Appendix J Contributor Demographics
Table 18 describes the geographical distribution of the authors in SEACrowd.
Appendix K Languages Under Study
Tables 30-48 present the list of SEA indigenous languages covered by SEACrowd. Information regarding the ISO 639-3 code, language name, region, and population is obtained from Eberhard et al. (2021); Hammarström et al. (2024); Project (2024); Dryer and Haspelmath (2013) and Wikipedia (https://www.wikipedia.org/).
Appendix L Amount of Contributions by Co-Authors
Table 19 provides a list of co-authors sorted by their amount of contributions in SEACrowd. The full details of their contributions can be seen in our contribution tracking.
Model name | Model size | Backbone | Seen langs | URL |
Commercial | ||||
GPT-4 | N/A | GPT-4 | N/A | https://openai.com/index/gpt-4/. We used turbo-2024-04-09 for NLU and gpt-4o-2024-05-13 for NLG. |
Command-R | 36B | Command-R | 2 SEA langs (vie, ind), 22 non-SEA langs | https://cohere.com/blog/command-r |
English | ||||
Mistral | 7B | Mistral | N/A | mistralai/Mistral-7B-Instruct-v0.3 |
Llama3 | 8B | Llama3 | N/A | meta-llama/Meta-Llama-3-8B-Instruct |
Falcon | 7B | Falcon | 0 SEA langs (mainly English) | tiiuae/falcon-7b-instruct |
Multilingual | ||||
mT0 | 3B | mT5 | 2 SEA langs (vie, ind), 43 non-SEA langs | bigscience/mt0-xl |
BLOOMZ | 7B | BLOOM | 2 SEA langs (vie, ind), 43 non-SEA langs | bigscience/bloomz-3b |
BactrianX-Llama | 7B | Llama | 6 SEA langs (ind, vie, khm, mya, tha, tgl), 46 non-SEA langs | MBZUAI/bactrian-x-llama-7b-merged |
AYA-23 | 8B | Command | 2 SEA langs (ind, vie), 21 non-SEA langs | CohereForAI/aya-23-8B |
AYA-101 | 13B | T5 | 9 SEA langs (ind, vie, tha, zsm, mya, ceb, fil, jav, sun), 92 non-SEA langs | CohereForAI/aya-101 |
SEA regional | ||||
SEA-LION | 7B | MPT | 8 SEA langs (ind, vie, tha, tgl, zsm, khm, lao, mya), 3 non-SEA langs | aisingapore/sea-lion-7b-instruct |
SeaLLM v2.5 | 7B | SeaLLM | 8 SEA langs (ind, vie, tha, tgl, zsm, khm, lao, mya) | SeaLLMs/SeaLLM-7B-v2.5 |
Sailor | 7B | Qwen 1.5 | 5 SEA langs (ind, vie, lao, zlm, tha), 2 non-SEA langs | sail/Sailor-7B-Chat |
SEA country | ||||
Cendol-mT5 | 3B | mT5 | 1 SEA lang (ind), 18 local Indonesian langs | indonlp/cendol-mt5-xl |
Cendol-Llama2 | 7B | Llama2 | 1 SEA lang (ind), 18 local Indonesian langs | indonlp/cendol-llama2-7b |
Merak v4 | 7B | Llama2 | 1 SEA lang (ind) | Ichsan2895/Merak-7B-v4 |
WangchanX-Llama3 | 8B | Llama3 | 4 SEA langs (ind, vie, tha, mya) and 26 non-SEA langs | airesearch/LLaMa3-8b-WangchanX-sft-Demo |
Malaysian Llama3 | 8B | Llama3 | 1 SEA lang (zlm) | mesolitica/malaysian-llama-3-8b-instruct-16k |
Model name | Model size | Backbone | Seen langs | URL |
Multilingual | ||||
Whisper v3 | 1.54B | Whisper v3 | 89 non-SEA & 9 SEA (ind, jav, lao, zlm, mya, tgl, tha, sun, vie) | openai/whisper-large-v3 |
MMS 1B | 1B | MMS | 993 non-SEA & 205 SEA (abp, ace, acn, agn, ahk, akb, alj, alp, amk, aoz, atb, atq, ayz, ban, bbc, bcl, bdg, bdq, bep, bgr, bhz, bkd, blt, blx, blz, bno, bpr, bps, bru, btd, bts, btx, bvz, bzi, ceb, cek, cfm, cgc, cmr, cnh, ctd, dbj, dnt, dnw, dtp, eip, frd, gbi, gor, had, hap, hil, hlt, hnn, hvn, iba, ifa, ifb, ifk, ifu, ify, ilo, ind, itv, jav, jmd, kac, kak, kdt, khg, khm, kje, kjg, klw, kmd, kml, knb, kne, kpq, kps, kqe, kqr, krj, krr, kvw, kxf, kxm, kyb, kyo, kyu, kzf, lao, law, lbw, lcp, lew, lex, lhu, lis, lje, ljp, llg, lnd, lsi, mad, mak, mbb, mbt, mej, mhx, mhy, min, mkn, mnb, mnw, mnx, mog, mqf, mqj, mqn, mrw, mtd, mtj, mvp, mwq, mwv, mya, myl, nfa, nia, nij, nlc, nlk, nod, npy, nst, obo, pag, pam, pce, pez, plw, pmf, ppk, prf, prk, prt, pse, ptu, pww, raw, rej, rgu, rhg, ril, rol, saj, sas, sbl, sda, sea, sgb, shn, sjm, slu, sml, sne, suc, sun, sxn, sya, sza, tbk, tbl, tby, tcz, tdj, tes, tgl, tha, tih, tlb, tnt, tom, tvw, twb, twe, twu, txa, txq, ubl, urk, ury, vie, war, wlo, xdy, xmm, xsb, xte, yka, yli, yva, zlm, zyp) | facebook/mms-1b-all |
Seamless M4T v2 | 2.3B | Seamless | 83 non-SEA & 9 SEA (ind, jav, khm, lao, mya, tgl, tha, vie, zlm) | facebook/seamless-m4t-v2-large |
Fine-tuned on specific language(s) | ||||
XLSR English | 300M | Wav2Vec2 | 46 non-SEA & 7 SEA (ceb, cnh, ind, lao, tam, tgl, vie) & fine-tuning language(s) | jonatasgrosman/wav2vec2-large-xlsr-53-english |
XLSR Ind-Jav-Sun | | | | indonesian-nlp/wav2vec2-indonesian-javanese-sundanese |
XLSR Indonesian | | | | Galuh/wav2vec2-large-xlsr-indonesian |
XLSR Thai | | | | wannaphong/wav2vec2-large-xlsr-53-th-cv8-newmm |
XLS-R Tagalog | | | | sil-ai/wav2vec2-bloom-speech-tgl |
XLS-R Burmese | | | | sil-ai/wav2vec2-bloom-speech-mya |
XLS-R Khmer | | | | vitouphy/wav2vec2-xls-r-300m-khmer |
Whisper Indonesian | 1.54B | Whisper | 89 non-SEA & 9 SEA (ind, jav, lao, msa, mya, tgl, tha, sun, vie) | cahya/whisper-large-id |
Whisper Thai | | | | biodatlab/whisper-th-large-v3-combined |
Whisper Khmer | | | | ksoky/whisper-large-khmer-asr |
Model name | Model size | Backbone | Pre-training images | URL |
English | ||||
LLaVA 1.5 | N/A | N/A | N/A | N/A |
LLaVA 1.6 | 7B | Mistral-7B | N/A | liuhaotian/llava-v1.6-mistral-7b |
Idefics2 | 8B | Mistral-7B-v0.1 | 1.5B | HuggingFaceM4/idefics2-8b |
PaliGemma | 2B | Gemma-2B | N/A | google/paligemma-3b-pt-224 |
Multilingual | ||||
mBLIP | N/A | blip2-flan-t5-xl | N/A | Gregor/mblip-mt0-xl |
No. | Prompt template |
Sentiment Analysis | |
1 | Classify the sentiment of the text below.\n[INPUT] => Sentiment ([OPTIONS]): [LABEL_CHOICE] |
2 | Predict the sentiment of the following text.\nText: [INPUT]\nAnswer with [OPTIONS]: [LABEL_CHOICE] |
3 | [INPUT]\nWhat would be the sentiment of the text above? [OPTIONS]? [LABEL_CHOICE] |
Topic Classification | |
1 | Classify the topic of the text below.\n[INPUT] => Topic ([OPTIONS]): [LABEL_CHOICE] |
2 | Predict the topic of the following text.\nText: [INPUT]\nAnswer with [OPTIONS]: [LABEL_CHOICE] |
3 | [INPUT]\nWhat would be the topic of the text above? [OPTIONS]? [LABEL_CHOICE] |
Commonsense Reasoning *_seacrowd_text | |
1 | Classify the morality of the text below.\n[INPUT] => Morality ([OPTIONS]): [LABEL_CHOICE] |
2 | Predict the morality of the following text.\nText: [INPUT]\nAnswer with [OPTIONS]: [LABEL_CHOICE] |
3 | [INPUT]\nWhat would be the morality of the text above? [OPTIONS]? [LABEL_CHOICE] |
Commonsense Reasoning *_seacrowd_qa | |
1 | Question: [QUESTION]\nWhat reply makes more sense to answer this question?\nChoices: [ANSWER_CHOICES]\nAnswer: [LABEL_CHOICE] |
2 | Based on the the following question: "[QUESTION]" and choices: [ANSWER_CHOICES] the correct answer is: [LABEL_CHOICE] |
3 | Question: [QUESTION]\nChoices: [ANSWER_CHOICES]\nThe correct answer to the given question is: [LABEL_CHOICE] |
All QAs | |
1 | Refer to the passage below and answer the following question:\nPassage: [CONTEXT]\nQuestion: [QUESTION]\nChoices: [ANSWER_CHOICES]\nAnswer: [LABEL_CHOICE] |
2 | [CONTEXT]\nBased on the above text, [QUESTION]\nChoices: [ANSWER_CHOICES]\nAnswer: [LABEL_CHOICE] |
3 | [CONTEXT]\nQuestion: [QUESTION]\nChoices:[ANSWER_CHOICES]\nReferring to the passage above, the correct answer to the given question is: [LABEL_CHOICE] |
NLI | |
1 | Hypothesis: [INPUT_A]\nPremise: [INPUT_B]\nQuestion: What is the relation between the hypothesis and the premise? [OPTIONS]? [LABEL_CHOICE] |
2 | Given the following premise and hypothesis:\nHypothesis: [INPUT_A]\nPremise: [INPUT_B]\nDetermine the logical relationship (([OPTIONS])): [LABEL_CHOICE] |
3 | Choose the most appropriate relationship ([OPTIONS]) between the premise and hypothesis:\nRelationship between "[INPUT_B]" and "[INPUT_A]": [LABEL_CHOICE] |
No. | Prompt template |
Machine Translation (MT) | |
1 | Translate the following text from [SOURCE] to [TARGET]. Give your translation directly.\nText: [INPUT]\nTranslation: |
Summarization | |
1 | Write a summary from the following text.\nText: [INPUT]\nSummary: |
Abstractive & Extractive QA | |
1 | Refer to the passage below and answer the following question:\nPassage: [CONTEXT]\nQuestion: [QUESTION]\nAnswer: |
Lang. | Prompt template |
Image Captioning | |
eng | Caption the following image in [LANGUAGE]. |
fil | Ilarawan ang sumusunod na larawan. |
ind | Deskripsikan gambar berikut. |
Model | abl | abs | ace | ban | bbc | bew | bhp | bjn | btx | bug | ceb | eng | fil | ilo | ind | jav | kac | khm | lao | lus | mad | mak | min | mui | mya | nij | pag | rej | shn | sun | tha | vie | war | zsm | Overall |
GPT-4 | 63.3 | 39.0 | 39.3 | 60.3 | 7.1 | 68.5 | 2.8 | 60.4 | 27.8 | 40.4 | 85.6 | 52.1 | 55.9 | 69.5 | 60.7 | 59.7 | 30.8 | 66.4 | 51.8 | 70.0 | 37.1 | 44.3 | 57.9 | 71.8 | 47.6 | 40.2 | 79.4 | 34.0 | 21.7 | 58.5 | 59.6 | 56.1 | 84.9 | 61.6 | 51.9 |
Command-R | 50.1 | 80.8 | 57.6 | 62.8 | 47.4 | 81.8 | 58.2 | 57.1 | 57.3 | 57.9 | 66.7 | 69.4 | 51.1 | 56.8 | 58.3 | 61.2 | 36.5 | 41.5 | 33.8 | 63.9 | 61.9 | 58.4 | 66.4 | 81.7 | 34.8 | 53.3 | 75.6 | 69.6 | 35.4 | 63.2 | 42.7 | 55.9 | 67.6 | 55.7 | 58.0 |
Mistral | 36.7 | 53.6 | 46.4 | 49.6 | 33.0 | 59.3 | 44.3 | 44.6 | 44.3 | 48.8 | 53.5 | 69.2 | 48.4 | 49.1 | 52.5 | 46.7 | 33.2 | 29.8 | 30.7 | 56.1 | 45.7 | 44.8 | 51.2 | 62.6 | 27.4 | 40.1 | 69.2 | 48.6 | 31.9 | 48.3 | 40.8 | 45.2 | 54.4 | 49.6 | 46.8 |
Llama3 | 37.3 | 40.3 | 43.2 | 48.9 | 34.8 | 44.5 | 32.6 | 42.2 | 38.5 | 42.9 | 51.2 | 59.5 | 45.2 | 46.7 | 49.2 | 44.4 | 28.5 | 34.6 | 30.3 | 46.8 | 39.0 | 38.0 | 43.6 | 49.2 | 35.2 | 39.6 | 60.5 | 38.5 | 31.1 | 45.2 | 43.8 | 45.5 | 50.3 | 49.0 | 42.6 |
Falcon | 21.1 | 63.2 | 13.3 | 19.0 | 23.0 | 37.9 | 62.1 | 15.6 | 31.9 | 15.7 | 19.5 | 43.7 | 25.1 | 18.8 | 30.8 | 27.0 | 14.2 | 10.2 | 12.7 | 15.0 | 30.3 | 32.3 | 23.6 | 37.0 | 18.0 | 23.0 | 18.8 | 36.0 | 14.1 | 28.2 | 15.9 | 18.8 | 19.1 | 17.4 | 25.1 |
mT0 | 37.6 | 63.6 | 43.7 | 51.2 | 37.0 | 66.1 | 38.4 | 43.6 | 41.3 | 50.3 | 62.5 | 49.4 | 41.0 | 59.0 | 47.2 | 56.0 | 40.9 | 57.5 | 61.2 | 57.0 | 46.7 | 45.8 | 52.6 | 68.8 | 45.9 | 40.9 | 62.6 | 47.8 | 47.0 | 58.8 | 41.8 | 41.4 | 61.4 | 49.4 | 50.5 |
BLOOMZ | 25.6 | 66.5 | 28.4 | 34.2 | 35.8 | 53.9 | 48.0 | 30.4 | 36.3 | 33.3 | 30.9 | 51.7 | 28.9 | 27.8 | 44.7 | 38.2 | 23.1 | 18.9 | 23.6 | 28.1 | 37.8 | 34.5 | 39.9 | 60.2 | 23.0 | 34.6 | 33.1 | 42.2 | 19.8 | 41.3 | 25.9 | 34.8 | 32.1 | 34.3 | 35.3 |
BactrianX-Llama | 24.9 | 48.6 | 21.2 | 28.5 | 26.9 | 33.4 | 45.9 | 22.8 | 31.4 | 22.7 | 27.9 | 45.6 | 32.0 | 24.3 | 38.3 | 30.0 | 19.9 | 17.0 | 20.7 | 21.0 | 30.0 | 28.8 | 26.2 | 35.7 | 22.8 | 27.2 | 26.5 | 29.2 | 20.5 | 30.2 | 24.5 | 27.1 | 28.3 | 31.5 | 28.6 |
AYA-23 | 43.3 | 21.2 | 26.9 | 35.0 | 24.3 | 31.2 | 16.8 | 30.9 | 25.1 | 26.5 | 36.0 | 50.8 | 33.5 | 32.7 | 46.8 | 36.9 | 20.5 | 15.1 | 22.0 | 27.4 | 31.0 | 31.7 | 27.3 | 35.5 | 23.7 | 37.3 | 32.6 | 22.8 | 20.8 | 34.9 | 32.7 | 44.8 | 37.1 | 47.9 | 31.3 |
AYA-101 | 42.5 | 64.3 | 71.2 | 65.2 | 58.8 | 68.2 | 43.3 | 63.5 | 52.7 | 60.7 | 71.7 | 62.8 | 52.8 | 65.0 | 54.2 | 62.6 | 43.1 | 62.2 | 67.8 | 71.8 | 56.9 | 49.0 | 69.3 | 70.2 | 51.5 | 57.2 | 75.7 | 52.9 | 53.8 | 67.2 | 49.5 | 48.0 | 70.5 | 56.4 | 59.8 |
SEA-LION | 10.3 | 62.3 | 13.5 | 16.5 | 21.3 | 35.3 | 60.3 | 13.4 | 31.8 | 15.2 | 13.6 | 26.6 | 20.6 | 10.2 | 27.6 | 21.4 | 8.7 | 16.8 | 15.2 | 12.5 | 26.8 | 28.3 | 22.8 | 34.6 | 23.0 | 16.0 | 14.4 | 34.1 | 9.7 | 23.4 | 16.3 | 14.7 | 14.2 | 13.3 | 21.9 |
SeaLLM v2.5 | 50.7 | 55.1 | 34.5 | 43.4 | 36.3 | 53.9 | 53.2 | 45.8 | 45.8 | 37.7 | 47.6 | 42.5 | 52.6 | 44.7 | 53.4 | 49.8 | 27.4 | 42.6 | 50.3 | 45.8 | 48.7 | 49.8 | 46.8 | 58.4 | 41.0 | 39.1 | 55.7 | 47.8 | 28.7 | 50.1 | 49.0 | 54.5 | 55.4 | 60.6 | 47.0 |
Sailor | 50.4 | 59.2 | 43.8 | 55.5 | 44.1 | 61.5 | 43.9 | 50.5 | 44.8 | 45.7 | 45.6 | 63.0 | 40.2 | 45.0 | 51.3 | 53.1 | 29.9 | 32.7 | 53.9 | 53.9 | 47.6 | 46.5 | 52.8 | 63.9 | 28.1 | 52.7 | 59.3 | 42.2 | 26.7 | 54.0 | 46.3 | 47.7 | 49.2 | 52.1 | 48.1 |
Cendol-mT5 | 15.0 | 98.5 | 38.3 | 42.3 | 84.7 | 99.4 | 95.6 | 33.3 | 92.6 | 68.6 | 14.1 | 38.7 | 23.8 | 12.2 | 33.4 | 50.5 | 10.4 | 20.3 | 15.3 | 9.6 | 76.5 | 70.2 | 65.2 | 99.6 | 16.6 | 52.6 | 12.8 | 98.9 | 7.2 | 56.6 | 26.4 | 14.7 | 15.1 | 15.9 | 44.8 |
Cendol-Llama2 | 17.5 | 80.0 | 30.8 | 33.5 | 60.6 | 49.3 | 73.4 | 27.9 | 45.1 | 32.3 | 18.7 | 36.8 | 21.4 | 17.8 | 37.4 | 35.1 | 14.7 | 13.2 | 15.9 | 15.0 | 46.3 | 38.1 | 37.1 | 51.6 | 19.9 | 40.3 | 17.7 | 47.7 | 16.5 | 38.5 | 20.6 | 17.3 | 18.5 | 18.4 | 32.5 |
Merak | 37.0 | 68.6 | 37.7 | 48.3 | 36.4 | 66.1 | 60.1 | 41.4 | 50.4 | 47.8 | 42.4 | 59.6 | 37.9 | 39.7 | 48.5 | 48.4 | 27.9 | 24.2 | 28.0 | 44.3 | 51.7 | 51.0 | 50.5 | 70.3 | 27.2 | 40.0 | 58.6 | 57.9 | 28.6 | 50.8 | 29.3 | 35.3 | 43.7 | 47.1 | 45.2 |
WangchanX-Llama3 | 38.4 | 59.3 | 26.8 | 35.2 | 35.0 | 43.3 | 56.9 | 31.6 | 38.3 | 31.2 | 32.3 | 57.6 | 36.6 | 29.3 | 45.0 | 38.7 | 23.7 | 24.3 | 25.1 | 26.6 | 40.4 | 41.4 | 34.8 | 43.6 | 31.6 | 37.0 | 31.2 | 42.9 | 23.5 | 39.8 | 36.5 | 38.4 | 31.3 | 37.0 | 36.6 |
Malaysian Llama3 | 38.9 | 62.3 | 38.1 | 41.9 | 39.2 | 46.9 | 58.3 | 39.5 | 40.5 | 35.9 | 37.8 | 55.5 | 34.5 | 33.1 | 48.6 | 42.6 | 24.7 | 18.9 | 20.4 | 33.6 | 42.1 | 41.0 | 42.5 | 48.5 | 22.2 | 39.6 | 46.8 | 41.1 | 19.6 | 44.0 | 33.7 | 34.6 | 37.7 | 49.9 | 39.2 |
Overall | 35.6 | 60.4 | 36.4 | 42.9 | 38.1 | 55.6 | 49.7 | 38.6 | 43.1 | 39.7 | 42.1 | 51.9 | 37.9 | 37.9 | 46.0 | 44.6 | 25.5 | 30.3 | 32.1 | 38.8 | 44.3 | 43.0 | 45.0 | 58.0 | 30.0 | 39.5 | 46.1 | 46.4 | 25.4 | 46.3 | 35.3 | 37.5 | 42.8 | 41.5 | 41.4 |
Model | ace | ban | bbc | bjn | bug | ceb | fil | hmv | ilo | ind | jav | kac | khm | lao | ljl | lus | mad | min | mya | nij | pag | shn | sun | tha | vie | war | zsm | Overall |
GPT-4 | 5.8 | 6.0 | 7.4 | 4.7 | 5.6 | 13.7 | 9.5 | 8.5 | 14.2 | 3.7 | 6.8 | 7.4 | 2.7 | 3.4 | 2.7 | 11.3 | 3.7 | 6.3 | 2.8 | 4.2 | 10.4 | 3.0 | 6.1 | 2.1 | 10.0 | 13.2 | 5.0 | 6.7 |
Command-R | 19.6 | 26.1 | 16.4 | 30.0 | 16.0 | 44.3 | 52.5 | 16.8 | 29.4 | 57.9 | 32.6 | 8.8 | 8.7 | 14.2 | 6.0 | 19.5 | 17.2 | 31.6 | 9.5 | 18.4 | 20.4 | 8.9 | 27.5 | 24.3 | 46.8 | 34.4 | 50.1 | 25.5 |
Mistral | 12.4 | 15.0 | 10.0 | 13.9 | 11.1 | 28.5 | 37.2 | 10.2 | 15.9 | 28.6 | 15.4 | 7.3 | 8.7 | 10.8 | 4.2 | 11.7 | 9.5 | 18.0 | 5.7 | 12.4 | 17.5 | 9.5 | 14.8 | 15.1 | 25.1 | 22.4 | 31.1 | 15.6 |
Llama3 | 11.0 | 12.3 | 8.1 | 13.8 | 7.6 | 25.1 | 33.2 | 7.6 | 18.4 | 21.9 | 17.0 | 4.8 | 6.5 | 5.8 | 3.2 | 9.6 | 8.5 | 16.4 | 4.5 | 9.5 | 11.8 | 6.3 | 15.1 | 9.6 | 21.7 | 20.5 | 25.2 | 13.2 |
Falcon | 7.3 | 9.5 | 8.2 | 8.3 | 7.9 | 18.6 | 23.6 | 6.6 | 9.7 | 15.3 | 7.7 | 6.0 | 3.1 | 3.1 | 4.2 | 9.3 | 6.6 | 11.8 | 1.8 | 8.7 | 12.9 | 4.5 | 7.7 | 2.4 | 13.5 | 13.5 | 17.0 | 9.2 |
mT0 | 4.8 | 5.6 | 3.7 | 5.7 | 3.1 | 4.6 | 6.8 | 4.5 | 3.8 | 29.3 | 5.8 | 2.1 | 4.3 | 6.1 | 1.7 | 3.4 | 3.6 | 6.5 | 5.0 | 3.5 | 3.6 | 3.5 | 6.8 | 9.4 | 19.6 | 6.1 | 9.1 | 6.4 |
BLOOMZ | 3.8 | 4.6 | 2.8 | 5.3 | 2.9 | 4.1 | 5.1 | 3.4 | 4.2 | 32.3 | 4.9 | 3.0 | 1.5 | 2.4 | 1.5 | 4.0 | 2.7 | 5.7 | 1.2 | 3.2 | 4.9 | 2.6 | 4.6 | 3.3 | 24.1 | 5.4 | 10.1 | 5.7 |
BactrianX-Llama | 10.9 | 11.6 | 8.9 | 12.3 | 8.8 | 22.0 | 32.1 | 8.5 | 12.1 | 25.1 | 11.4 | 6.9 | 6.4 | 8.2 | 4.1 | 10.9 | 8.7 | 14.1 | 4.3 | 8.4 | 15.2 | 8.0 | 11.4 | 10.8 | 19.4 | 16.6 | 23.4 | 12.6 |
AYA-23 | 9.3 | 10.5 | 8.0 | 11.6 | 6.9 | 14.2 | 17.5 | 5.6 | 8.3 | 18.3 | 11.3 | 5.7 | 4.0 | 5.9 | 2.7 | 8.1 | 7.6 | 12.2 | 3.3 | 9.0 | 8.8 | 6.5 | 10.4 | 6.8 | 24.3 | 10.6 | 17.7 | 9.8 |
AYA-101 | 26.4 | 26.8 | 14.6 | 21.6 | 12.6 | 49.3 | 46.6 | 33.3 | 25.8 | 49.5 | 38.8 | 12.2 | 25.9 | 37.2 | 4.4 | 17.8 | 13.4 | 29.7 | 17.6 | 13.2 | 23.3 | 20.4 | 35.6 | 22.2 | 36.5 | 36.9 | 41.9 | 27.2 |
SEA-LION | 7.2 | 8.1 | 6.5 | 9.3 | 5.8 | 12.5 | 17.1 | 4.9 | 7.0 | 13.9 | 7.9 | 5.3 | 7.0 | 9.6 | 2.0 | 7.6 | 6.0 | 9.5 | 4.8 | 6.6 | 8.4 | 4.9 | 8.0 | 5.9 | 21.2 | 10.3 | 14.1 | 8.6 |
SeaLLM v2.5 | 15.2 | 20.2 | 11.7 | 19.5 | 11.5 | 37.1 | 49.1 | 14.5 | 26.8 | 43.0 | 26.6 | 7.5 | 17.8 | 22.2 | 4.7 | 15.1 | 12.2 | 26.8 | 9.2 | 14.6 | 19.2 | 9.4 | 22.0 | 21.6 | 36.7 | 28.8 | 45.7 | 21.8 |
Sailor | 19.2 | 24.5 | 15.3 | 23.1 | 14.6 | 29.0 | 39.7 | 8.6 | 13.5 | 46.8 | 30.6 | 7.1 | 12.5 | 24.4 | 6.2 | 10.5 | 16.0 | 28.8 | 5.8 | 19.1 | 16.5 | 9.0 | 26.7 | 22.0 | 41.1 | 21.5 | 49.9 | 21.6 |
Cendol-mT5 | 8.3 | 11.4 | 14.2 | 11.6 | 6.9 | 7.2 | 8.4 | 4.7 | 5.5 | 35.8 | 17.5 | 4.0 | 6.3 | 8.5 | 2.0 | 5.2 | 6.1 | 10.5 | 2.9 | 8.8 | 6.6 | 4.1 | 17.1 | 5.5 | 4.4 | 6.4 | 20.5 | 9.3 |
Cendol-Llama2 | 8.6 | 10.0 | 14.4 | 19.3 | 6.6 | 6.9 | 8.2 | 6.4 | 6.4 | 36.1 | 19.1 | 5.5 | 3.0 | 4.3 | 4.1 | 4.5 | 14.1 | 22.0 | 1.9 | 17.5 | 5.4 | 4.8 | 17.3 | 3.4 | 8.1 | 7.6 | 22.0 | 10.6 |
Merak | 7.4 | 10.3 | 6.7 | 11.3 | 7.1 | 8.2 | 12.8 | 6.3 | 6.7 | 29.5 | 9.6 | 3.7 | 3.8 | 5.9 | 3.2 | 8.0 | 6.5 | 12.5 | 2.4 | 8.0 | 8.2 | 5.6 | 10.6 | 5.9 | 7.2 | 7.4 | 20.4 | 8.7 |
WangchanX-Llama3 | 19.8 | 24.4 | 14.3 | 28.9 | 13.4 | 42.2 | 48.6 | 12.7 | 29.4 | 50.1 | 29.4 | 7.7 | 18.1 | 19.7 | 6.0 | 17.6 | 15.6 | 30.0 | 10.4 | 18.1 | 22.4 | 13.9 | 28.0 | 25.1 | 39.2 | 35.5 | 45.4 | 24.7 |
Malaysian Llama3 | 15.2 | 17.3 | 12.3 | 22.2 | 11.1 | 19.7 | 24.0 | 8.7 | 12.6 | 38.6 | 19.4 | 7.2 | 6.7 | 9.0 | 5.9 | 10.6 | 12.4 | 23.5 | 4.2 | 14.3 | 13.9 | 8.3 | 19.0 | 14.2 | 17.3 | 15.6 | 44.4 | 15.8 |
Overall | 11.8 | 14.1 | 10.2 | 15.1 | 8.9 | 21.5 | 26.2 | 9.5 | 13.9 | 32.0 | 17.3 | 6.2 | 8.2 | 11.2 | 3.8 | 10.3 | 9.5 | 17.6 | 5.4 | 11.0 | 12.8 | 7.4 | 16.0 | 11.7 | 23.1 | 17.4 | 27.4 | 14.1 |
Lang. | Subset | Original Task | Domain | # Samples |
Translationese | ||||
eng | emotes_3k_eng_seacrowd_t2t | Commonsense Reasoning | Ethics | 2000 |
eng | aya_evaluation_suite_eng_seacrowd_t2t | Instruction Tuning | General | 400 |
ind | belebele_ind_latn_seacrowd_qa | QA | General | 1969 |
ind | parallel_asian_treebank_ind_eng_seacrowd_t2t | Machine Translation | News | 31 |
ind | aya_evaluation_suite_ind_seacrowd_t2t | Instruction Tuning | General | 4 |
ind | bactrian_x_id_seacrowd_t2t | Instruction Tuning | Mixed, Multi-domain, Wikipedia | 1972 |
ind | seaeval_cross_logiqa_ind_seacrowd_qa | Commonsense Reasoning, QA | Commentary, General, Multi-domain, Culture & heritage | 16 |
ind | seaeval_cross_mmlu_ind_seacrowd_qa | Commonsense Reasoning, QA | Commentary, General, Multi-domain, Culture & heritage | 8 |
khm | belebele_khm_khmr_seacrowd_qa | QA | General | 399 |
khm | khmer_alt_pos_seacrowd_seq_label | POS Tagging | News | 1595 |
khm | parallel_asian_treebank_khm_eng_seacrowd_t2t | Machine Translation | News | 6 |
khm | aya_evaluation_suite_khm_seacrowd_t2t | Instruction Tuning | General | 8 |
khm | bactrian_x_km_seacrowd_t2t | Instruction Tuning | Mixed, Multi-domain, Wikipedia | 1992 |
lao | belebele_lao_laoo_seacrowd_qa | QA | General | 1969 |
lao | parallel_asian_treebank_lao_eng_seacrowd_t2t | Machine Translation | News | 31 |
lao | aya_evaluation_suite_lao_seacrowd_t2t | Instruction Tuning | General | 400 |
mya | belebele_mya_mymr_seacrowd_qa | QA | General | 1969 |
mya | parallel_asian_treebank_mya_eng_seacrowd_t2t | Machine Translation | News | 31 |
mya | aya_evaluation_suite_mya_seacrowd_t2t | Instruction Tuning | General | 8 |
mya | bactrian_x_my_seacrowd_t2t | Instruction Tuning | Mixed, Multi-domain, Wikipedia | 1992 |
fil | belebele_tgl_latn_seacrowd_qa | QA | General | 2000 |
fil | bactrian_x_tl_seacrowd_t2t | Instruction Tuning | Mixed, Multi-domain, Wikipedia | 2000 |
tha | belebele_tha_thai_seacrowd_qa | QA | General | 1969 |
tha | parallel_asian_treebank_tha_eng_seacrowd_t2t | Machine Translation | News | 31 |
tha | aya_evaluation_suite_tha_seacrowd_t2t | Instruction Tuning | General | 8 |
tha | bactrian_x_th_seacrowd_t2t | Instruction Tuning | Mixed, Multi-domain, Wikipedia | 1992 |
vie | belebele_vie_latn_seacrowd_qa | QA | General | 1969 |
vie | parallel_asian_treebank_vie_eng_seacrowd_t2t | Machine Translation | News | 31 |
vie | aya_evaluation_suite_vie_seacrowd_t2t | Instruction Tuning | General | 4 |
vie | bactrian_x_vi_seacrowd_t2t | Instruction Tuning | Mixed, Multi-domain, Wikipedia | 1972 |
vie | seaeval_cross_logiqa_vie_seacrowd_qa | Commonsense Reasoning, QA | Commentary, General, Multi-domain, Culture & heritage | 16 |
vie | seaeval_cross_mmlu_vie_seacrowd_qa | Commonsense Reasoning, QA | Commentary, General, Multi-domain, Culture & heritage | 8 |
zlm | belebele_zsm_latn_seacrowd_qa | QA | General | 1969 |
zlm | parallel_asian_treebank_zlm_eng_seacrowd_t2t | Machine Translation | News | 31 |
zlm | aya_evaluation_suite_zsm_seacrowd_t2t | Instruction Tuning | General | 400 |
zlm | seaeval_cross_logiqa_zlm_seacrowd_qa | Commonsense Reasoning, QA | Commentary, General, Multi-domain, Culture & heritage | 1056 |
zlm | seaeval_cross_mmlu_zlm_seacrowd_qa | Commonsense Reasoning, QA | Commentary, General, Multi-domain, Culture & heritage | 300 |
Natural | ||||
eng | cosem_seacrowd_ssp | Language Modeling | Social media | 2000 |
ind | sea_bench_ind_seacrowd_t2t | Instruction Tuning | Commentary, General, Multi-domain, Culture & heritage | 200 |
khm | gklmip_newsclass_seacrowd_text | Sentiment Analysis | E-commerce | 1436 |
khm | sea_bench_khm_seacrowd_t2t | Instruction Tuning | Commentary, General, Multi-domain, Culture & heritage | 160 |
lao | sea_bench_lao_seacrowd_t2t | Instruction Tuning | Commentary, General, Multi-domain, Culture & heritage | 160 |
mya | gklmip_sentiment_seacrowd_text | Sentiment Analysis | E-commerce | 716 |
mya | sea_bench_mya_seacrowd_t2t | Instruction Tuning | Commentary, General, Multi-domain, Culture & heritage | 160 |
fil | sea_bench_tgl_seacrowd_t2t | Instruction Tuning | Commentary, General, Multi-domain, Culture & heritage | 160 |
tha | sea_bench_tha_seacrowd_t2t | Instruction Tuning | Commentary, General, Multi-domain, Culture & heritage | 40 |
tha | vistec_tp_th_21_seacrowd_seq_label | NER | Social media | 1960 |
vie | sea_bench_vie_seacrowd_t2t | Instruction Tuning | Commentary, General, Multi-domain, Culture & heritage | 200 |
zlm | sea_bench_zlm_seacrowd_t2t | Instruction Tuning | Commentary, General, Multi-domain, Culture & heritage | 160 |
Lang. | Subset | Original Task | Domain | # Samples |
Translationese | ||||
eng | emotes_3k_eng_seacrowd_t2t | Commonsense Reasoning | Ethics | 2000 |
eng | aya_evaluation_suite_eng_seacrowd_t2t | Instruction Tuning | General | 400 |
ind | belebele_ind_latn_seacrowd_qa | QA | General | 1969 |
ind | parallel_asian_treebank_ind_eng_seacrowd_t2t | MT | News | 31 |
ind | aya_evaluation_suite_ind_seacrowd_t2t | Instruction Tuning | General | 4 |
ind | bactrian_x_id_seacrowd_t2t | Instruction Tuning | Mixed, Multi-domain, Wikipedia | 1972 |
ind | seaeval_cross_logiqa_ind_seacrowd_qa | Commonsense Reasoning, QA | Commentary, General, Multi-domain, Culture & heritage | 16 |
ind | seaeval_cross_mmlu_ind_seacrowd_qa | Commonsense Reasoning, QA | Commentary, General, Multi-domain, Culture & heritage | 8 |
khm | belebele_khm_khmr_seacrowd_qa | QA | General | 399 |
khm | khmer_alt_pos_seacrowd_seq_label | POS Tagging | News | 1595 |
khm | parallel_asian_treebank_khm_eng_seacrowd_t2t | MT | News | 6 |
khm | aya_evaluation_suite_khm_seacrowd_t2t | Instruction Tuning | General | 8 |
khm | bactrian_x_km_seacrowd_t2t | Instruction Tuning | Mixed, Multi-domain, Wikipedia | 1992 |
lao | belebele_lao_laoo_seacrowd_qa | QA | General | 1969 |
lao | parallel_asian_treebank_lao_eng_seacrowd_t2t | MT | News | 31 |
lao | aya_evaluation_suite_lao_seacrowd_t2t | Instruction Tuning | General | 400 |
mya | belebele_mya_mymr_seacrowd_qa | QA | General | 1969 |
mya | parallel_asian_treebank_mya_eng_seacrowd_t2t | MT | News | 31 |
mya | aya_evaluation_suite_mya_seacrowd_t2t | Instruction Tuning | General | 8 |
mya | bactrian_x_my_seacrowd_t2t | Instruction Tuning | Mixed, Multi-domain, Wikipedia | 1992 |
fil | belebele_tgl_latn_seacrowd_qa | QA | General | 2000 |
fil | bactrian_x_tl_seacrowd_t2t | Instruction Tuning | Mixed, Multi-domain, Wikipedia | 2000 |
tha | belebele_tha_thai_seacrowd_qa | QA | General | 1969 |
tha | parallel_asian_treebank_tha_eng_seacrowd_t2t | MT | News | 31 |
tha | aya_evaluation_suite_tha_seacrowd_t2t | Instruction Tuning | General | 8 |
tha | bactrian_x_th_seacrowd_t2t | Instruction Tuning | Mixed, Multi-domain, Wikipedia | 1992 |
vie | belebele_vie_latn_seacrowd_qa | QA | General | 1969 |
vie | parallel_asian_treebank_vie_eng_seacrowd_t2t | MT | News | 31 |
vie | aya_evaluation_suite_vie_seacrowd_t2t | Instruction Tuning | General | 4 |
vie | bactrian_x_vi_seacrowd_t2t | Instruction Tuning | Mixed, Multi-domain, Wikipedia | 1972 |
vie | seaeval_cross_logiqa_vie_seacrowd_qa | Commonsense Reasoning, QA | Commentary, General, Multi-domain, Culture & heritage | 16 |
vie | seaeval_cross_mmlu_vie_seacrowd_qa | Commonsense Reasoning, QA | Commentary, General, Multi-domain, Culture & heritage | 8 |
zlm | belebele_zsm_latn_seacrowd_qa | QA | General | 1969 |
zlm | parallel_asian_treebank_zlm_eng_seacrowd_t2t | MT | News | 31 |
zlm | aya_evaluation_suite_zsm_seacrowd_t2t | Instruction Tuning | General | 400 |
zlm | seaeval_cross_logiqa_zlm_seacrowd_qa | Commonsense Reasoning, QA | Commentary, General, Multi-domain, Culture & heritage | 1056 |
zlm | seaeval_cross_mmlu_zlm_seacrowd_qa | Commonsense Reasoning, QA | Commentary, General, Multi-domain, Culture & heritage | 300 |
Natural | ||||
eng | cosem_seacrowd_ssp | Language Modeling | Social media | 2000 |
ind | sea_bench_ind_seacrowd_t2t | Instruction Tuning | Commentary, General, Multi-domain, Culture & heritage | 200 |
khm | gklmip_newsclass_seacrowd_text | Sentiment Analysis | E-commerce | 1436 |
khm | sea_bench_khm_seacrowd_t2t | Instruction Tuning | Commentary, General, Multi-domain, Culture & heritage | 160 |
lao | sea_bench_lao_seacrowd_t2t | Instruction Tuning | Commentary, General, Multi-domain, Culture & heritage | 160 |
mya | gklmip_sentiment_seacrowd_text | Sentiment Analysis | E-commerce | 716 |
mya | sea_bench_mya_seacrowd_t2t | Instruction Tuning | Commentary, General, Multi-domain, Culture & heritage | 160 |
fil | sea_bench_tgl_seacrowd_t2t | Instruction Tuning | Commentary, General, Multi-domain, Culture & heritage | 160 |
tha | sea_bench_tha_seacrowd_t2t | Instruction Tuning | Commentary, General, Multi-domain, Culture & heritage | 40 |
tha | vistec_tp_th_21_seacrowd_seq_label | NER | Social media | 1960 |
vie | sea_bench_vie_seacrowd_t2t | Instruction Tuning | Commentary, General, Multi-domain, Culture & heritage | 200 |
zlm | sea_bench_zlm_seacrowd_t2t | Instruction Tuning | Commentary, General, Multi-domain, Culture & heritage | 160 |
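The subset identifiers in the table above appear to follow a `<dataset>_seacrowd_<schema>` pattern, for example `belebele_ind_latn_seacrowd_qa`. As a minimal sketch, assuming that pattern holds for every row, the Python below splits an identifier into its dataset and schema parts and totals the sample counts per language. The helper names `split_subset` and `samples_per_language` are hypothetical and are not part of the SEACrowd code base; the rows are copied from the translationese part of the table.

```python
from collections import defaultdict

# A few (language, subset, samples) rows copied from the translationese rows above.
ROWS = [
    ("ind", "belebele_ind_latn_seacrowd_qa", 1969),
    ("ind", "parallel_asian_treebank_ind_eng_seacrowd_t2t", 31),
    ("ind", "aya_evaluation_suite_ind_seacrowd_t2t", 4),
    ("ind", "bactrian_x_id_seacrowd_t2t", 1972),
    ("ind", "seaeval_cross_logiqa_ind_seacrowd_qa", 16),
    ("ind", "seaeval_cross_mmlu_ind_seacrowd_qa", 8),
    ("fil", "belebele_tgl_latn_seacrowd_qa", 2000),
    ("fil", "bactrian_x_tl_seacrowd_t2t", 2000),
]


def split_subset(name):
    """Hypothetical helper: split '<dataset>_seacrowd_<schema>' into (dataset, schema)."""
    dataset, sep, schema = name.rpartition("_seacrowd_")
    if not sep:
        # The assumed naming pattern does not apply to this identifier.
        raise ValueError(f"unexpected subset name: {name}")
    return dataset, schema


def samples_per_language(rows):
    """Sum the sample counts of all subsets that share a language code."""
    totals = defaultdict(int)
    for lang, _subset, n_samples in rows:
        totals[lang] += n_samples
    return dict(totals)


if __name__ == "__main__":
    print(split_subset("belebele_ind_latn_seacrowd_qa"))
    print(samples_per_language(ROWS))
```

Splitting on the last occurrence of `_seacrowd_` keeps multi-part schema suffixes such as `seq_label` intact, which is why `rpartition` is used here rather than a split on every underscore.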
![Refer to caption](x13.png)
![Refer to caption](x14.png)
![Refer to caption](x15.png)
![Refer to caption](x16.png)
![Refer to caption](x17.png)
![Refer to caption](x18.png)
![Refer to caption](x19.png)
![Refer to caption](x20.png)
![Refer to caption](x21.png)
![Refer to caption](x22.png)
No. | ISO 639-3 | Language | Region(s) | Population |
In SEACrowd | ||||
1 | ind | Indonesian | Indonesia | <1B |
2 | jav | Javanese | Indonesia | <100M |
3 | vie | Vietnamese | Vietnam | <100M |
4 | tha | Thai | Thailand, Cambodia | <100M |
5 | fil | Filipino | Philippines | <100M |
6 | mya | Burmese | Myanmar | <100M |
7 | sun | Sunda | Indonesia | <100M |
8 | tgl | Tagalog | Philippines | <100M |
9 | khm | Khmer | Cambodia, Vietnam | <100M |
10 | ceb | Cebuano | Philippines | <100M |
11 | tts | Northeastern Thai | Thailand | <100M |
12 | zlm | Malay | Malaysia | <100M |
13 | zsm | Standard Malay | Malaysia, Brunei, Singapore | <100M |
No. | ISO 639-3 | Language | Region(s) | Population |
In SEACrowd | ||||
1 | ilo | Ilocano | Philippines | <10M |
2 | mad | Madura | Indonesia | <10M |
3 | nod | Northern Thai | Laos, Thailand | <10M |
4 | hil | Hiligaynon | Philippines | <10M |
5 | min | Minangkabau | Indonesia | <10M |
6 | bug | Bugis | Indonesia | <10M |
7 | bew | Betawi | Indonesia | <10M |
8 | sou | Southern Thai | Thailand | <10M |
9 | lao | Lao | Cambodia, Laos | <10M |
10 | hmv | Hmong Dô | Vietnam | <10M |
11 | ace | Aceh | Indonesia | <10M |
12 | bjn | Banjar | Indonesia | <10M |
13 | ban | Bali | Indonesia | <10M |
14 | shn | Shan | Myanmar, Thailand | <10M |
15 | mui | Musi | Indonesia | <10M |
16 | msi | Sabah Malay | Malaysia | <10M |
17 | meo | Kedah Malay | Malaysia, Thailand | <10M |
18 | pcc | Giáy | Vietnam | <10M |
19 | war | Waray-Waray | Philippines | <10M |
20 | mak | Makasar | Indonesia | <10M |
21 | bcl | Central Bikol | Philippines | <10M |
22 | xmm | Manado Malay | Indonesia | <10M |
23 | sas | Sasak | Indonesia | <10M |
24 | bbc | Batak Toba | Indonesia | <10M |
25 | pam | Kapampangan | Philippines | <10M |
26 | rki | Rakhine | Myanmar | <10M |
27 | tyz | Tày | Vietnam | <10M |
28 | abs | Ambonese Malay | Indonesia | <10M |
29 | pse | Central Malay | Indonesia | <10M |
30 | iba | Iban | Brunei, Indonesia, Malaysia | <10M |
31 | kxm | Northern Khmer | Thailand | <10M |
32 | khg | Khams Tibetan | Myanmar | <10M |
33 | ksw | S’gaw Karen | Myanmar, Thailand | <10M |
34 | btd | Batak Dairi | Indonesia | <10M |
35 | bts | Batak Simalungun | Indonesia | <10M |
36 | cbk | Chavacano | Philippines | <10M |
37 | pag | Pangasinan | Philippines | <10M |
38 | mtq | Muong | Vietnam | <10M |
39 | btm | Batak Mandailing | Indonesia | <10M |
40 | mdh | Maguindanaon | Philippines | <10M |
41 | pmy | Papuan Malay | Indonesia | <10M |
42 | gor | Gorontalo | Indonesia | <10M |
43 | jax | Jambi Malay | Indonesia | <10M |
44 | kjp | Pwo Eastern Karen | Myanmar, Thailand | <10M |
45 | max | North Moluccan Malay | Indonesia | <10M |
46 | mfa | Pattani Malay | Thailand | <10M |
Not in SEACrowd | ||||
47 | mfp | Makassar Indonesian | Indonesia | <10M |
No. | ISO 639-3 | Language | Region(s) | Population |
In SEACrowd | ||||
1 | nut | Nung | Vietnam | <1M |
2 | kac | Jingpho | Myanmar | <1M |
3 | tsg | Tausug | Philippines | <1M |
4 | nij | Ngaju | Indonesia | <1M |
5 | ljp | Lampung Api | Indonesia | <1M |
6 | mqy | Manggarai | Indonesia | <1M |
7 | mrw | Maranao | Philippines | <1M |
8 | nia | Nias | Indonesia | <1M |
9 | akb | Batak Angkola | Indonesia | <1M |
10 | sda | Toraja-Sa’dan | Indonesia | <1M |
11 | mnw | Mon | Myanmar, Thailand | <1M |
12 | hni | Hani | Laos, Vietnam | <1M |
13 | kjg | Khmu | Laos, Thailand, Vietnam | <1M |
14 | aoz | Uab Meto | Indonesia | <1M |
15 | blt | Tai Dam | Laos, Vietnam | <1M |
16 | lus | Mizo Chin | Myanmar | <1M |
17 | cps | Capiznon | Philippines | <1M |
18 | btx | Batak Karo | Indonesia | <1M |
19 | lis | Lisu | Myanmar | <1M |
20 | msb | Masbatenyo | Philippines | <1M |
21 | blk | Pa’o | Myanmar, Thailand | <1M |
22 | tdd | Tai Nüa | Myanmar | <1M |
23 | day | Land Dayak | Indonesia | <1M |
24 | xdy | Malayic Dayak | Indonesia | <1M |
25 | bhp | Bima | Indonesia | <1M |
26 | ibg | Ibanag | Philippines | <1M |
27 | zmi | Negeri Sembilan Malay | Malaysia | <1M |
28 | mdr | Mandar | Indonesia | <1M |
29 | kge | Komering | Indonesia | <1M |
30 | bdr | West Coast Bajau | Malaysia | <1M |
31 | kdt | Kuay | Cambodia, Laos, Thailand | <1M |
32 | prk | Parauk Wa | Myanmar | <1M |
33 | sgd | Surigaonon | Philippines | <1M |
34 | tet | Tetun | East Timor, Indonesia | <1M |
35 | bto | Rinconada Bikol | Philippines | <1M |
36 | tdt | Tetun Dili | East Timor | <1M |
37 | ium | Iu Mien | Laos, Vietnam | <1M |
38 | krj | Kinaray-a | Philippines | <1M |
39 | kyk | Kamayo | Philippines | <1M |
40 | lew | Ledo Kaili | Indonesia | <1M |
41 | mkn | Kupang Malay | Indonesia | <1M |
42 | rej | Rejang | Indonesia | <1M |
43 | mfb | Bangka | Indonesia | <1M |
44 | rob | Tae’ | Indonesia | <1M |
45 | lbw | Tolaki | Indonesia | <1M |
46 | knx | Kendayan | Indonesia, Malaysia | <1M |
47 | gay | Gayo | Indonesia | <1M |
48 | mnb | Muna | Indonesia | <1M |
49 | rbl | Miraya Bikol | Philippines | <1M |
50 | smw | Sumbawa | Indonesia | <1M |
51 | kxd | Brunei | Brunei | <1M |
52 | khb | Lü | Laos, Myanmar | <1M |
53 | lhu | Lahu | Laos, Myanmar | <1M |
54 | twh | Tai Dón | Laos, Vietnam | <1M |
55 | ysm | Myanmar Sign Language | Myanmar | <1M |
56 | dtp | Kadazan Dusun | Malaysia | <1M |
57 | fbl | West Albay Bikol | Philippines | <1M |
58 | kvr | Kerinci | Indonesia | <1M |
59 | pce | Ruching Palaung | Myanmar | <1M |
60 | mry | Mandaya | Philippines | <1M |
61 | nbe | Konyak Naga | Myanmar | <1M |
62 | tcz | Thado Chin | Myanmar | <1M |
63 | jra | Jarai | Cambodia, Vietnam | <1M |
64 | xbr | Kambera | Indonesia | <1M |
65 | mog | Mongondow | Indonesia | <1M |
66 | pwo | Pwo Western Karen | Myanmar | <1M |
67 | cja | Western Cham | Cambodia, Vietnam | <1M |
68 | ahk | Akha | Laos, Myanmar, Thailand | <1M |
69 | ssb | Southern Sama | Philippines | <1M |
70 | sxn | Sangir | Indonesia | <1M |
71 | btz | Batak Alas-Kluet | Indonesia | <1M |
72 | ctd | Tedim Chin | Myanmar | <1M |
73 | srv | Southern Sorsoganon | Philippines | <1M |
74 | abl | Lampung Nyo | Indonesia | <1M |
75 | dnw | Western Dani | Indonesia | <1M |
76 | ktp | Kaduo | Laos | <1M |
77 | slp | Lamaholot | Indonesia | <1M |
78 | rad | Rade | Vietnam | <1M |
79 | ski | Sika | Indonesia | <1M |
80 | kpm | Koho | Vietnam | <1M |
81 | bdq | Bahnar | Vietnam | <1M |
82 | bdl | Indonesian Bajau | Indonesia | <1M |
83 | bpr | Koronadal Blaan | Philippines | <1M |
84 | ccp | Chakma | Myanmar | <1M |
85 | kne | Kankanaey | Philippines | <1M |
86 | kyu | Western Kayah | Myanmar | <1M |
87 | mhy | Ma’anyan | Indonesia | <1M |
88 | tnt | Tontemboan | Indonesia | <1M |
89 | pll | Shwe Palaung | Myanmar | <1M |
90 | daw | Davawenyo | Philippines | <1M |
91 | cnh | Hakha Chin | Myanmar | <1M |
92 | syb | Central Subanen | Philippines | <1M |
93 | rbb | Rumai Palaung | Myanmar | <1M |
94 | pmf | Pamona | Indonesia | <1M |
95 | bln | Southern Catanduanes Bikol | Philippines | <1M |
96 | itv | Itawit | Philippines | <1M |
97 | pdu | Kayan | Myanmar | <1M |
98 | mgm | Mambae | East Timor | <1M |
99 | bhq | Tukang Besi South | Indonesia | <1M |
100 | sly | Selayar | Indonesia | <1M |
101 | mvp | Duri | Indonesia | <1M |
102 | bgz | Banggai | Indonesia | <1M |
103 | kjc | Coastal Konjo | Indonesia | <1M |
104 | suc | Western Subanon | Philippines | <1M |
105 | cyo | Cuyonon | Philippines | <1M |
106 | khc | Tukang Besi North | Indonesia | <1M |
107 | lhi | Lahu Shi | Myanmar | <1M |
108 | mel | Central Melanau | Malaysia | <1M |
109 | ibl | Ibaloi | Philippines | <1M |
110 | end | Ende | Indonesia | <1M |
111 | hvn | Hawu | Indonesia | <1M |
112 | kkv | Kangean | Indonesia | <1M |
113 | yka | Yakan | Philippines | <1M |
114 | ljl | Li’o | Indonesia | <1M |
115 | mkz | Makasae | East Timor | <1M |
116 | bkd | Binukid | Philippines | <1M |
117 | bkr | Bakumpai | Indonesia | <1M |
118 | ekg | Ekari | Indonesia | <1M |
119 | hnj | Hmong Njua | Laos, Thailand, Vietnam | <1M |
120 | kak | Kalanguya | Philippines | <1M |
121 | kkh | Khün | Myanmar | <1M |
122 | lbx | Lawangan | Indonesia | <1M |
123 | mhx | Lhao Vo | Myanmar | <1M |
124 | mqj | Mamasa | Indonesia | <1M |
125 | psp | Filipino Sign Language | Philippines | <1M |
126 | tgn | Tandaganon | Philippines | <1M |
Not in SEACrowd | ||||
127 | rhg | Rohingya | Myanmar | <1M |
128 | pht | Phu Thai | Laos, Thailand, Vietnam | <1M |
129 | tvn | Tavoyan | Myanmar | <1M |
130 | osi | Osing | Indonesia | <1M |
131 | ilp | Iranun | Philippines | <1M |
132 | kzs | Sugut Dusun | Malaysia | <1M |
133 | vkt | Tenggarong Kutai Malay | Indonesia | <1M |
134 | phu | Phuan | Laos, Thailand | <1M |
135 | csh | Asho Chin | Myanmar | <1M |
136 | mlc | Cao Lan | Vietnam | <1M |
137 | kjk | Highland Konjo | Indonesia | <1M |
138 | liw | Col | Indonesia | <1M |
139 | sss | So | Laos, Thailand | <1M |
140 | dnv | Danu | Myanmar | <1M |
141 | sdq | Semandang | Indonesia | <1M |
142 | tjl | Tai Laing | Myanmar | <1M |
No. | ISO 639-3 | Language | Region(s) | Population |
In SEACrowd | ||||
1 | adr | Adonara | Indonesia | <100K |
2 | sed | Sedang | Vietnam | <100K |
3 | blf | Buol | Indonesia | <100K |
4 | tbl | Tboli | Philippines | <100K |
5 | hre | Hre | Vietnam | <100K |
6 | rol | Romblomanon | Philippines | <100K |
7 | akl | Aklanon | Philippines | <100K |
8 | tdn | Tondano | Indonesia | <100K |
9 | bps | Sarangani Blaan | Philippines | <100K |
10 | kqr | Kimaragang | Malaysia | <100K |
11 | sml | Central Sama | Philippines | <100K |
12 | txs | Tonsea | Indonesia | <100K |
13 | stb | Northern Subanen | Philippines | <100K |
14 | bks | Northern Sorsoganon | Philippines | <100K |
15 | kei | Kei | Indonesia | <100K |
16 | klg | Tagakaulo | Philippines | <100K |
17 | tld | Talaud | Indonesia | <100K |
18 | atb | Zaiwa | Myanmar | <100K |
19 | sse | Balangingih Sama | Philippines | <100K |
20 | tes | Tengger | Indonesia | <100K |
21 | tyr | Tai Daeng | Laos, Vietnam | <100K |
22 | cia | Cia-Cia | Indonesia | <100K |
23 | gbi | Galela | Indonesia | <100K |
24 | otd | Ot Danum | Indonesia | <100K |
25 | cts | Northern Catanduanes Bikol | Philippines | <100K |
26 | loe | Saluan | Indonesia | <100K |
27 | bno | Bantoanon | Philippines | <100K |
28 | cmr | Mro-Khimi | Myanmar | <100K |
29 | ubl | Buhi’non Bikol | Philippines | <100K |
30 | cjm | Eastern Cham | Vietnam | <100K |
31 | bkx | Baikeno | East Timor | <100K |
32 | aaz | Amarasi | Indonesia | <100K |
33 | bhw | Biak | Indonesia | <100K |
34 | kqe | Kalagan | Philippines | <100K |
35 | xnn | Northern Kankanay | Philippines | <100K |
36 | xsb | Sambal | Philippines | <100K |
37 | cfm | Falam Chin | Myanmar | <100K |
38 | lbl | Libon Bikol | Philippines | <100K |
39 | wlo | Wolio | Indonesia | <100K |
40 | bth | Biatah Bidayuh | Indonesia, Malaysia | <100K |
41 | kem | Kemak | East Timor, Indonesia | <100K |
42 | raw | Rawang | Myanmar | <100K |
43 | tft | Ternate | Indonesia | <100K |
44 | zom | Zo | Myanmar | <100K |
45 | cnk | Khumi Chin | Myanmar | <100K |
46 | mqx | Mamuju | Indonesia | <100K |
47 | msm | Agusan Manobo | Philippines | <100K |
48 | nst | Tangshang Naga | Myanmar | <100K |
49 | nxg | Ngad’a | Indonesia | <100K |
50 | obo | Obo Manobo | Philippines | <100K |
51 | pww | Pwo Northern Karen | Thailand | <100K |
52 | sya | Siang | Indonesia | <100K |
53 | tom | Tombulu | Indonesia | <100K |
54 | xml | Malaysian Sign Language | Malaysia | <100K |
55 | mbs | Sarangani Manobo | Philippines | <100K |
56 | mwv | Mentawai | Indonesia | <100K |
57 | msk | Mansaka | Philippines | <100K |
58 | smk | Bolinao | Philippines | <100K |
59 | bfn | Bunak | East Timor, Indonesia | <100K |
60 | bgi | Bagobo-Klata | Philippines | <100K |
61 | drg | Rungus | Malaysia | <100K |
62 | kzf | Da’a Kaili | Indonesia | <100K |
63 | wew | Wejewa | Indonesia | <100K |
64 | rog | Northern Roglai | Vietnam | <100K |
65 | ilk | Bogkalot | Philippines | <100K |
66 | ktv | Eastern Katu | Vietnam | <100K |
67 | dnt | Mid Grand Valley Dani | Indonesia | <100K |
68 | frd | Fordata | Indonesia | <100K |
69 | mbt | Matigsalug Manobo | Philippines | <100K |
70 | nxe | Nage | Indonesia | <100K |
71 | ptt | Enrekang | Indonesia | <100K |
72 | tiy | Teduray | Philippines | <100K |
73 | tjg | Tunjung | Indonesia | <100K |
74 | wmm | Maiwa | Indonesia | <100K |
75 | sdo | Bukar-Sadong Bidayuh | Indonesia, Malaysia | <100K |
76 | kyp | Kang | Laos | <100K |
77 | tvo | Tidore | Indonesia | <100K |
78 | hos | Ho Chi Minh City Sign Language | Vietnam | <100K |
79 | mhs | Buru | Indonesia | <100K |
80 | sti | Bulo Stieng | Cambodia, Vietnam | <100K |
81 | law | Lauje | Indonesia | <100K |
82 | bgs | Tagabawa | Philippines | <100K |
83 | sjm | Mapun | Philippines | <100K |
84 | blr | Blang | Myanmar, Thailand | <100K |
85 | rgs | Southern Roglai | Vietnam | <100K |
86 | smr | Simeulue | Indonesia | <100K |
87 | czt | Zotung Chin | Myanmar | <100K |
88 | kvq | Geba Karen | Myanmar | <100K |
89 | mtd | Mualang | Indonesia | <100K |
90 | xxk | Ke’o | Indonesia | <100K |
91 | tkd | Tukudede | East Timor | <100K |
92 | kix | Khiamniungan Naga | Myanmar | <100K |
93 | bsb | Brunei Bisaya | Brunei, Malaysia | <100K |
94 | dao | Daai Chin | Myanmar | <100K |
95 | ddg | Fataluku | East Timor | <100K |
96 | mqn | Moronene | Indonesia | <100K |
97 | ges | Geser-Gorom | Indonesia | <100K |
98 | pho | Phunoi | Laos | <100K |
99 | slm | Pangutaran Sama | Philippines | <100K |
100 | hro | Haroi | Vietnam | <100K |
101 | ivv | Ivatan | Philippines | <100K |
102 | mrh | Mara Chin | Myanmar | <100K |
103 | btw | Butuanon | Philippines | <100K |
104 | cma | Maa | Vietnam | <100K |
105 | sbl | Botolan Sambal | Philippines | <100K |
106 | cmo | Central Mnong | Cambodia, Vietnam | <100K |
107 | blz | Balantak | Indonesia | <100K |
108 | tpu | Tampuan | Cambodia | <100K |
109 | blj | Bulungan | Indonesia | <100K |
110 | cgc | Kagayanen | Philippines | <100K |
111 | clu | Caluyanun | Philippines | <100K |
112 | cml | Koneq-koneq | Indonesia | <100K |
113 | gad | Gaddang | Philippines | <100K |
114 | hlt | Matu Chin | Myanmar | <100K |
115 | ifk | Tuwali Ifugao | Philippines | <100K |
116 | ifu | Mayoyao Ifugao | Philippines | <100K |
117 | knb | Lubuagan Kalinga | Philippines | <100K |
118 | ksx | Kedang | Indonesia | <100K |
119 | lcf | Lubu | Indonesia | <100K |
120 | lsi | Lacid | Myanmar | <100K |
121 | mba | Higaonon | Philippines | <100K |
122 | mng | Eastern Mnong | Vietnam | <100K |
123 | mro | Mru | Myanmar | <100K |
124 | mta | Cotabato Manobo | Philippines | <100K |
125 | set | Sentani | Indonesia | <100K |
126 | tmn | Taman | Indonesia | <100K |
127 | twu | Termanu | Indonesia | <100K |
128 | txm | Tomini | Indonesia | <100K |
129 | ulm | Ulumanda’ | Indonesia | <100K |
130 | wow | Wawonii | Indonesia | <100K |
131 | sne | Bau Bidayuh | Indonesia, Malaysia | <100K |
132 | tdf | Talieng | Laos | <100K |
133 | lbo | Laven | Laos | <100K |
134 | acn | Ngochang | Myanmar | <100K |
135 | tlb | Tobelo | Indonesia | <100K |
136 | ifa | Amganad Ifugao | Philippines | <100K |
137 | itd | Southern Tidung | Indonesia, Malaysia | <100K |
138 | pha | Pa-Hng | Vietnam | <100K |
139 | atd | Ata Manobo | Philippines | <100K |
140 | bru | Eastern Bru | Laos, Vietnam | <100K |
141 | kzp | Kaidipang | Indonesia | <100K |
142 | abx | Inabaknon | Philippines | <100K |
143 | aol | Alor | Indonesia | <100K |
144 | jmd | Yamdena | Indonesia | <100K |
145 | laa | Southern Subanen | Philippines | <100K |
146 | lmy | Lamboya | Indonesia | <100K |
147 | txe | Totoli | Indonesia | <100K |
148 | oyb | Oy | Laos | <100K |
149 | mlf | Mal | Laos, Thailand | <100K |
150 | lnd | Lundayeh | Brunei, Indonesia, Malaysia | <100K |
151 | prh | Porohanon | Philippines | <100K |
152 | brb | Brao | Cambodia, Laos, Vietnam | <100K |
153 | lbn | Rmeet | Laos | <100K |
154 | ilm | Iranun | Malaysia | <100K |
155 | ptu | Bambam | Indonesia | <100K |
156 | vkl | Kulisusu | Indonesia | <100K |
157 | blw | Balangao | Philippines | <100K |
158 | bsy | Sabah Bisaya | Malaysia | <100K |
159 | krr | Krung | Cambodia | <100K |
160 | dtb | Labuk-Kinabatangan Kadazan | Malaysia | <100K |
161 | ayz | Mai Brat | Indonesia | <100K |
162 | bac | Badui | Indonesia | <100K |
163 | brv | Western Bru | Laos, Thailand | <100K |
164 | bwp | Mandobo Bawah | Indonesia | <100K |
165 | dna | Upper Grand Valley Dani | Indonesia | <100K |
166 | dni | Lower Grand Valley Dani | Indonesia | <100K |
167 | dtr | Lotud | Malaysia | <100K |
168 | dun | Dusun Deyah | Indonesia | <100K |
169 | kje | Kisar | Indonesia | <100K |
170 | kli | Kalumpang | Indonesia | <100K |
171 | kod | Kodi | Indonesia | <100K |
172 | llg | Lole | Indonesia | <100K |
173 | lrt | Larantuka Malay | Indonesia | <100K |
174 | mnz | Moni | Indonesia | <100K |
175 | pea | Peranakan Indonesian | Indonesia | <100K |
176 | ppk | Uma | Indonesia | <100K |
177 | prt | Prai | Laos, Thailand | <100K |
178 | tmm | Tai Thanh | Vietnam | <100K |
179 | tnw | Tonsawang | Indonesia | <100K |
180 | twy | Tawoyan | Indonesia | <100K |
181 | txq | Tii | Indonesia | <100K |
182 | wlw | Walak | Indonesia | <100K |
183 | skh | Sikule | Indonesia | <100K |
184 | lbk | Central Bontok | Philippines | <100K |
185 | cje | Chru | Vietnam | <100K |
186 | hnn | Hanunoo | Philippines | <100K |
187 | tlu | Tulehu | Indonesia | <100K |
188 | wmh | Waima’a | East Timor | <100K |
189 | hrk | Haruku | Indonesia | <100K |
190 | lex | Luang | Indonesia | <100K |
191 | puo | Puoc | Vietnam | <100K |
192 | ren | Rengao | Vietnam | <100K |
193 | alp | Alune | Indonesia | <100K |
194 | bwe | Bwe Karen | Myanmar | <100K |
195 | tlt | Sou Nama | Indonesia | <100K |
196 | zyp | Zyphe Chin | Myanmar | <100K |
197 | abz | Abui | Indonesia | <100K |
198 | akg | Anakalangu | Indonesia | <100K |
199 | had | Hatam | Indonesia | <100K |
200 | htu | Hitu | Indonesia | <100K |
201 | nlc | Nalca | Indonesia | <100K |
202 | pac | Pacoh | Laos, Vietnam | <100K |
203 | yog | Yogad | Philippines | <100K |
204 | mxd | Modang | Indonesia | <100K |
205 | jeh | Jeh | Laos, Vietnam | <100K |
206 | kyn | Northern Binukidnon | Philippines | <100K |
207 | phg | Phuong | Vietnam | <100K |
208 | agn | Agutaynen | Philippines | <100K |
209 | cnw | Ngawn Chin | Myanmar | <100K |
210 | ila | Ile Ape | Indonesia | <100K |
211 | krd | Kairui-Midiki | East Timor | <100K |
212 | loa | Loloda | Indonesia | <100K |
213 | mbb | Western Bukidnon Manobo | Philippines | <100K |
214 | mwq | Müün Chin | Myanmar | <100K |
215 | nxa | Nauete | East Timor | <100K |
216 | prf | Paranan | Philippines | <100K |
217 | snl | Sangil | Philippines | <100K |
218 | tby | Tabaru | Indonesia | <100K |
219 | tea | Temiar | Malaysia | <100K |
220 | yli | Angguruk Yali | Indonesia | <100K |
221 | mej | Meyah | Indonesia | <100K |
222 | mbi | Ilianen Manobo | Philippines | <100K |
223 | plw | Brooke’s Point Palawano | Philippines | <100K |
224 | duu | Drung | Myanmar | <100K |
225 | heg | Helong | Indonesia | <100K |
226 | mzq | Mori Atas | Indonesia | <100K |
227 | uhn | Damal | Indonesia | <100K |
228 | xmz | Mori Bawah | Indonesia | <100K |
229 | kjm | Kháng | Vietnam | <100K |
230 | hal | Salang | Laos, Vietnam | <100K |
231 | idt | Idaté | East Timor | <100K |
232 | dok | Dondo | Indonesia | <100K |
233 | gal | Galolen | East Timor, Indonesia | <100K |
234 | ksc | Southern Kalinga | Philippines | <100K |
235 | txa | Tombonuo | Malaysia | <100K |
236 | ngt | Kriang | Laos | <100K |
237 | kmk | Limos Kalinga | Philippines | <100K |
238 | alo | Larike-Wakasihu | Indonesia | <100K |
239 | yno | Yong | Thailand | <100K |
240 | ril | Riang Lang | Myanmar | <100K |
241 | atq | Aralle-Tabulahan | Indonesia | <100K |
242 | cek | Eastern Khumi Chin | Myanmar | <100K |
243 | cua | Cua | Vietnam | <100K |
244 | mnx | Sougb | Indonesia | <100K |
245 | mqs | West Makian | Indonesia | <100K |
246 | nuf | Nusu | Myanmar | <100K |
247 | plc | Central Palawano | Philippines | <100K |
248 | plv | Southwest Palawano | Philippines | <100K |
249 | rgu | Rikou | Indonesia | <100K |
250 | szw | Sawai | Indonesia | <100K |
251 | tdj | Tajio | Indonesia | <100K |
252 | xkl | Mainstream Kenyah | Indonesia, Malaysia | <100K |
253 | yin | Riang Lai | Myanmar | <100K |
254 | lcl | Lisela | Indonesia | <100K |
255 | lra | Rara Bakati’ | Indonesia, Malaysia | <100K |
256 | bve | Berau Malay | Indonesia | <100K |
257 | kml | Tanudan Kalinga | Philippines | <100K |
258 | beu | Blagar | Indonesia | <100K |
259 | xem | Mateq | Indonesia | <100K |
260 | lev | Western Pantar | Indonesia | <100K |
261 | ptn | Patani | Indonesia | <100K |
262 | oog | Ong | Laos | <100K |
263 | spr | Saparua | Indonesia | <100K |
264 | amk | Ambai | Indonesia | <100K |
265 | ifb | Batad Ifugao | Philippines | <100K |
266 | aax | Mandobo Atas | Indonesia | <100K |
267 | bep | Behoa | Indonesia | <100K |
268 | bvy | Baybayanon | Philippines | <100K |
269 | csy | Siyin Chin | Myanmar | <100K |
270 | dbj | Ida’an | Malaysia | <100K |
271 | emb | Embaloh | Indonesia | <100K |
272 | iry | Iraya | Philippines | <100K |
273 | jak | Jakun | Malaysia | <100K |
274 | jaq | Yaqay | Indonesia | <100K |
275 | kps | Tehit | Indonesia | <100K |
276 | kvb | Kubu | Indonesia | <100K |
277 | kxf | Kawyaw | Myanmar | <100K |
278 | kyt | Kayagar | Indonesia | <100K |
279 | lje | Rampi | Indonesia | <100K |
280 | lur | Loura | Indonesia | <100K |
281 | mbd | Dibabawon Manobo | Philippines | <100K |
282 | mbf | Baba Malay | Singapore | <100K |
283 | mky | East Makian | Indonesia | <100K |
284 | mvd | Mamboru | Indonesia | <100K |
285 | ndx | Nduga | Indonesia | <100K |
286 | pez | Eastern Penan | Brunei, Malaysia | <100K |
287 | ple | Palu’e | Indonesia | <100K |
288 | sea | Semai | Malaysia | <100K |
289 | ssq | So’a | Indonesia | <100K |
290 | szb | Ngalum | Indonesia | <100K |
291 | tbk | Calamian Tagbanwa | Philippines | <100K |
292 | tbw | Tagbanwa | Philippines | <100K |
293 | txx | Tatana | Malaysia | <100K |
294 | wnk | Wanukaka | Indonesia | <100K |
295 | yva | Yawa | Indonesia | <100K |
Not in SEACrowd | ||||
296 | int | Intha | Myanmar | <100K |
297 | loc | Inonhan | Philippines | <100K |
298 | mqg | Kota Bangun Kutai Malay | Indonesia | <100K |
299 | bfx | Bantayanon | Philippines | <100K |
300 | tou | Tho | Vietnam | <100K |
301 | ncq | Northern Katang | Laos | <100K |
302 | bvu | Bukit Malay | Indonesia | <100K |
303 | byd | Benyadu’ | Indonesia | <100K |
304 | tsq | Thai Sign Language | Thailand | <100K |
305 | nyw | Nyaw | Thailand | <100K |
306 | rir | Ribun | Indonesia | <100K |
307 | scg | Sanggau | Indonesia | <100K |
308 | sct | Southern Katang | Laos | <100K |
309 | stt | Budeh Stieng | Vietnam | <100K |
310 | tco | Taungyo | Myanmar | <100K |
311 | vkk | Kaur | Indonesia | <100K |
312 | hab | Hanoi Sign Language | Vietnam | <100K |
313 | djo | Jangkang | Indonesia | <100K |
314 | sbx | Seberuang | Indonesia | <100K |
315 | lso | Laos Sign Language | Laos | <100K |
316 | sez | Senthang Chin | Myanmar | <100K |
317 | soa | Thai Song | Thailand | <100K |
318 | knl | Keninjal | Indonesia | <100K |
319 | tth | Upper Ta’oih | Laos, Vietnam | <100K |
320 | apg | Ampanang | Indonesia | <100K |
321 | mnn | Southern Mnong | Vietnam | <100K |
322 | pel | Pekal | Indonesia | <100K |
323 | zkd | Kadu | Myanmar | <100K |
324 | bkz | Bungku | Indonesia | <100K |
325 | mkx | Kinamiging Manobo | Philippines | <100K |
326 | bnu | Bentong | Indonesia | <100K |
327 | kxy | Kayong | Vietnam | <100K |
328 | mhp | Balinese Malay | Indonesia | <100K |
329 | unz | Unde Kaili | Indonesia | <100K |
330 | bld | Bolango | Indonesia | <100K |
331 | kuf | Western Katu | Laos | <100K |
332 | dnk | Dengka | Indonesia | <100K |
333 | mvv | Tagal Murut | Indonesia, Malaysia | <100K |
334 | skn | Kolibugan Subanon | Philippines | <100K |
335 | szn | Sula | Indonesia | <100K |
336 | cnb | Uppu Chin | Myanmar | <100K |
337 | bhv | Bahau | Indonesia | <100K |
338 | itt | Maeng Itneg | Philippines | <100K |
339 | hji | Haji | Indonesia | <100K |
340 | ghk | Geko Karen | Myanmar | <100K |
341 | kvl | Kayaw | Myanmar | <100K |
342 | tto | Lower Ta’oih | Laos | <100K |
343 | bdb | Basap | Indonesia | <100K |
344 | clj | Laitu Chin | Myanmar | <100K |
345 | clt | Lautu Chin | Myanmar | <100K |
346 | dup | Duano | Indonesia, Malaysia | <100K |
347 | kyb | Butbut Kalinga | Philippines | <100K |
348 | stg | Trieng | Vietnam | <100K |
349 | cbw | Kinabalian | Philippines | <100K |
350 | csv | Sumtu Chin | Myanmar | <100K |
351 | riu | Riung | Indonesia | <100K |
352 | srg | Sulod | Philippines | <100K |
353 | ity | Moyadan Itneg | Philippines | <100K |
354 | kkg | Mabaka Valley Kalinga | Philippines | <100K |
355 | bne | Bintauna | Indonesia | <100K |
356 | nlk | Ninia Yali | Indonesia | <100K |
357 | hik | Seit-Kaitetu | Indonesia | <100K |
358 | ksn | Kasiguranin | Philippines | <100K |
359 | tsl | Ts’ün-Lao | Vietnam | <100K |
360 | xao | Khao | Vietnam | <100K |
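The row numbering of the four complete language tables above gives a rough picture of coverage per population bucket. As a small worked example, the sketch below computes the share of listed languages that fall under the "In SEACrowd" group. The counts are read off the row numbers of those tables (for instance, rows 1 to 295 versus rows 296 to 360 in the <100K table); the bucket labels and the `coverage` helper are illustrative assumptions, not part of the SEACrowd tooling.

```python
# (population label, rows under "In SEACrowd", rows under "Not in SEACrowd"),
# counted from the row numbers of the four complete tables above.
BUCKETS = [
    ("<1B / <100M", 13, 0),
    ("<10M", 46, 1),
    ("<1M", 126, 16),
    ("<100K", 295, 65),
]


def coverage(in_hub, not_in_hub):
    """Fraction of listed languages that fall under the 'In SEACrowd' group."""
    return in_hub / (in_hub + not_in_hub)


if __name__ == "__main__":
    for label, in_hub, not_in_hub in BUCKETS:
        total = in_hub + not_in_hub
        print(f"{label:>12}: {in_hub}/{total} = {coverage(in_hub, not_in_hub):.1%}")
```

Under these counts, the share of listed languages in the "In SEACrowd" group decreases as the speaker population shrinks.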
No. | ISO 639-3 | Language | Region(s) | Population |
In SEACrowd | ||||
1 | xte | Ketengban | Indonesia | <10K |
2 | bna | Bonerate | Indonesia | <10K |
3 | bku | Buhid | Philippines | <10K |
4 | aws | South Awyu | Indonesia | <10K |
5 | woo | Manombai | Indonesia | <10K |
6 | asc | Casuarina Coast Asmat | Indonesia | <10K |
7 | tih | Timugon Murut | Malaysia | <10K |
8 | asl | Asilulu | Indonesia | <10K |
9 | sgb | Mag-antsi Ayta | Philippines | <10K |
10 | eky | Eastern Kayah | Myanmar, Thailand | <10K |
11 | ify | Keley-i Kallahan | Philippines | <10K |
12 | inl | Indonesian Sign Language | Indonesia | <10K |
13 | kgq | Kamoro | Indonesia | <10K |
14 | kht | Khamti | Myanmar | <10K |
15 | kpq | Korupun-Sela | Indonesia | <10K |
16 | kti | North Muyu | Indonesia | <10K |
17 | lcp | Western Lawa | Thailand | <10K |
18 | mtj | Moskona | Indonesia | <10K |
19 | slu | Selaru | Indonesia | <10K |
20 | tmw | Temuan | Malaysia | <10K |
21 | txt | Citak | Indonesia | <10K |
22 | whk | Wahau Kenyah | Indonesia | <10K |
23 | txn | West Tarangan | Indonesia | <10K |
24 | dro | Daro-Matu Melanau | Malaysia | <10K |
25 | awu | Central Awyu | Indonesia | <10K |
26 | itb | Binongan Itneg | Philippines | <10K |
27 | lti | Leti | Indonesia | <10K |
28 | saj | Sahu | Indonesia | <10K |
29 | kvv | Kola | Indonesia | <10K |
30 | kvu | Yinbaw | Myanmar | <10K |
31 | akc | Mpur | Indonesia | <10K |
32 | cns | Central Asmat | Indonesia | <10K |
33 | crw | Chrau | Vietnam | <10K |
34 | lwl | Eastern Lawa | Thailand | <10K |
35 | lzn | Lainong Naga | Myanmar | <10K |
36 | mrz | Marind | Indonesia | <10K |
37 | row | Dela-Oenale | Indonesia | <10K |
38 | sfe | Eastern Subanen | Philippines | <10K |
39 | ttd | Tutong | Brunei | <10K |
40 | iwo | Morop | Indonesia | <10K |
41 | twb | Tawbuid | Philippines | <10K |
42 | bhz | Bada | Indonesia | <10K |
43 | pwm | Molbog | Malaysia, Philippines | <10K |
44 | psa | Asue Awyu | Indonesia | <10K |
45 | ebk | Eastern Bontok | Philippines | <10K |
46 | tre | East Tarangan | Indonesia | <10K |
47 | npy | Napu | Indonesia | <10K |
48 | gdg | Ga’dang | Philippines | <10K |
49 | gir | Red Gelao | Vietnam | <10K |
50 | kll | Kagan Kalagan | Philippines | <10K |
51 | lwt | Lewotobi | Indonesia | <10K |
52 | moo | Monom | Vietnam | <10K |
53 | pnp | Pancana | Indonesia | <10K |
54 | tdr | Todrah | Vietnam | <10K |
55 | weo | Wemale | Indonesia | <10K |
56 | woi | Kamang | Indonesia | <10K |
57 | wrp | Waropen | Indonesia | <10K |
58 | lha | Laha | Vietnam | <10K |
59 | kvo | Dobel | Indonesia | <10K |
60 | mtg | Una | Indonesia | <10K |
61 | inn | Isinay | Philippines | <10K |
62 | ihp | Iha | Indonesia | <10K |
63 | jka | Kaera | Indonesia | <10K |
64 | myl | Moma | Indonesia | <10K |
65 | mmn | Minamanwa | Philippines | <10K |
66 | nxr | Ninggerum | Indonesia | <10K |
67 | blx | Mag-Indi Ayta | Philippines | <10K |
68 | duw | Dusun Witu | Indonesia | <10K |
69 | kgw | Karon Dori | Indonesia | <10K |
70 | kyo | Klon | Indonesia | <10K |
71 | lbt | Lachi | Vietnam | <10K |
72 | mli | Malimpung | Indonesia | <10K |
73 | nfa | Dhao | Indonesia | <10K |
74 | pdo | Padoe | Indonesia | <10K |
75 | raz | Rahambuu | Indonesia | <10K |
76 | tpg | Kula | Indonesia | <10K |
77 | urk | Urak Lawoi’ | Thailand | <10K |
78 | wad | Wamesa | Indonesia | <10K |
79 | wod | Wolani | Indonesia | <10K |
80 | wul | Silimo | Indonesia | <10K |
81 | yac | Pass Valley Yali | Indonesia | <10K |
82 | yoy | Yoy | Laos, Thailand | <10K |
83 | and | Ansus | Indonesia | <10K |
84 | mxn | Moi Kelim | Indonesia | <10K |
85 | tlv | Taliabu | Indonesia | <10K |
86 | bty | Bobot | Indonesia | <10K |
87 | duq | Dusun Malang | Indonesia | <10K |
88 | ums | Pendau | Indonesia | <10K |
89 | vbb | Southeast Babar | Indonesia | <10K |
90 | baj | Barakai | Indonesia | <10K |
91 | bgr | Bawm Chin | Myanmar | <10K |
92 | irr | Ir | Laos | <10K |
93 | nbq | Nggem | Indonesia | <10K |
94 | bqr | Burusu | Indonesia | <10K |
95 | kvd | Kui | Indonesia | <10K |
96 | bny | Bintulu | Malaysia | <10K |
97 | rka | Kraol | Cambodia | <10K |
98 | jah | Jah Hut | Malaysia | <10K |
99 | kys | Baram Kayan | Malaysia | <10K |
100 | smu | Somray | Cambodia | <10K |
101 | sza | Semelai | Malaysia | <10K |
102 | alk | Alak | Laos | <10K |
103 | anl | Anu-Khongso Chin | Myanmar | <10K |
104 | bei | Bakati’ | Indonesia | <10K |
105 | irh | Irarutu | Indonesia | <10K |
106 | kta | Katua | Vietnam | <10K |
107 | kts | South Muyu | Indonesia | <10K |
108 | kzi | Kelabit | Indonesia, Malaysia | <10K |
109 | lmr | Lamalera | Indonesia | <10K |
110 | mwt | Moken | Myanmar, Thailand | <10K |
111 | ntx | Tangkhul Naga | Myanmar | <10K |
112 | ror | Rongga | Indonesia | <10K |
113 | sdu | Sarudu | Indonesia | <10K |
114 | slz | Ma’ya | Indonesia | <10K |
115 | sre | Sara Bakati’ | Indonesia | <10K |
116 | tgb | Tobilung | Malaysia | <10K |
117 | twe | Teiwa | Indonesia | <10K |
118 | tyn | Kombai | Indonesia | <10K |
119 | wah | Watubela | Indonesia | <10K |
120 | nev | Nyaheun | Laos | <10K |
121 | klz | Kabola | Indonesia | <10K |
122 | awy | Edera Awyu | Indonesia | <10K |
123 | abd | Manide | Philippines | <10K |
124 | tnm | Tabla | Indonesia | <10K |
125 | skb | Saek | Laos, Thailand | <10K |
126 | kvw | Wersing | Indonesia | <10K |
127 | xod | Kokoda | Indonesia | <10K |
128 | bpq | Banda Malay | Indonesia | <10K |
129 | bay | Batuley | Indonesia | <10K |
130 | kgx | Kamaru | Indonesia | <10K |
131 | khe | Korowai | Indonesia | <10K |
132 | lkj | Remun | Malaysia | <10K |
133 | pku | Paku | Indonesia | <10K |
134 | saw | Sawi | Indonesia | <10K |
135 | tcg | Tamagario | Indonesia | <10K |
136 | pne | Western Penan | Malaysia | <10K |
137 | xks | Kumbewaha | Indonesia | <10K |
138 | pgu | Pagu | Indonesia | <10K |
139 | tpo | Tai Pao | Laos, Vietnam | <10K |
140 | zrs | Mairasi | Indonesia | <10K |
141 | kzz | Kalabra | Indonesia | <10K |
142 | bls | Balaesang | Indonesia | <10K |
143 | kuv | Kur | Indonesia | <10K |
144 | ree | Rejang Kayan | Malaysia | <10K |
145 | abp | Abellen Ayta | Philippines | <10K |
146 | adn | Adang | Indonesia | <10K |
147 | ahh | Aghu | Indonesia | <10K |
148 | bnd | Banda | Indonesia | <10K |
149 | bnq | Bantik | Indonesia | <10K |
150 | ckh | Chak | Myanmar | <10K |
151 | due | Umiray Dumaget Agta | Philippines | <10K |
152 | eip | Lik | Indonesia | <10K |
153 | kgr | Abun | Indonesia | <10K |
154 | kig | Kimaghima | Indonesia | <10K |
155 | nsy | Nasal | Indonesia | <10K |
156 | swt | Sawila | Indonesia | <10K |
157 | tmg | Ternateño | Indonesia | <10K |
158 | wms | Wambon | Indonesia | <10K |
159 | mhe | Mah Meri | Malaysia | <10K |
160 | bgl | Bo | Laos | <10K |
161 | bpv | Bian Marind | Indonesia | <10K |
162 | gzn | Gane | Indonesia | <10K |
163 | dmr | East Damar | Indonesia | <10K |
164 | obk | Southern Bontok | Philippines | <10K |
165 | bzl | Boano | Indonesia | <10K |
166 | hbu | Habun | East Timor | <10K |
167 | zng | Mang | Vietnam | <10K |
168 | gei | Gebe | Indonesia | <10K |
169 | spb | Sepa | Indonesia | <10K |
170 | agv | Remontado Dumagat | Philippines | <10K |
171 | bzq | Buli | Indonesia | <10K |
172 | brp | Barapasi | Indonesia | <10K |
173 | cbl | Bualkhaw Chin | Myanmar | <10K |
174 | grs | Gresi | Indonesia | <10K |
175 | jmn | Makuri Naga | Myanmar | <10K |
176 | kmt | Kemtuik | Indonesia | <10K |
177 | kwe | Kwerba | Indonesia | <10K |
178 | sko | Seko Tengah | Indonesia | <10K |
179 | wrs | Waris | Indonesia | <10K |
180 | kyi | Kiput | Malaysia | <10K |
181 | nrm | Narom | Malaysia | <10K |
182 | klw | Tado | Indonesia | <10K |
183 | spu | Sapuan | Laos | <10K |
184 | jei | Yei | Indonesia | <10K |
185 | sqq | Sou | Laos | <10K |
186 | awv | Jair Awyu | Indonesia | <10K |
187 | bup | Busoa | Indonesia | <10K |
188 | kkl | Kosarek Yale | Indonesia | <10K |
189 | zka | Kaimbulawa | Indonesia | <10K |
190 | kjr | Kurudu | Indonesia | <10K |
191 | alj | Alangan | Philippines | <10K |
192 | asy | Yaosakor Asmat | Indonesia | <10K |
193 | dms | Dampelas | Indonesia | <10K |
194 | enr | Emem | Indonesia | <10K |
195 | hnu | Hung | Laos, Vietnam | <10K |
196 | kwt | Kwesten | Indonesia | <10K |
197 | kyj | Karao | Philippines | <10K |
198 | lau | Laba | Indonesia | <10K |
199 | ley | Limola | Indonesia | <10K |
200 | mqf | Momuna | Indonesia | <10K |
201 | mqo | Modole | Indonesia | <10K |
202 | nir | Nimboran | Indonesia | <10K |
203 | pmo | Pom | Indonesia | <10K |
204 | sge | Segai | Indonesia | <10K |
205 | szc | Semaq Beri | Malaysia | <10K |
206 | tgt | Central Tagbanwa | Philippines | <10K |
207 | tty | Sikaritai | Indonesia | <10K |
208 | bgk | Bit | Laos | <10K |
209 | grm | Kota Marudu Talantang | Malaysia | <10K |
210 | srl | Isirawa | Indonesia | <10K |
211 | wbw | Woi | Indonesia | <10K |
212 | sib | Sebop | Malaysia | <10K |
213 | bnb | Bookan Murut | Malaysia | <10K |
214 | llm | Lasalimu | Indonesia | <10K |
215 | rmm | Roma | Indonesia | <10K |
216 | pcb | Pear | Cambodia | <10K |
217 | abc | Ambala Ayta | Philippines | <10K |
218 | nxx | Nafri | Indonesia | <10K |
219 | lwh | White Lachi | Vietnam | <10K |
220 | ury | Orya | Indonesia | <10K |
221 | irx | Kamberau | Indonesia | <10K |
222 | atk | Ati | Philippines | <10K |
223 | bgb | Bobongko | Indonesia | <10K |
224 | bvz | Bauzi | Indonesia | <10K |
225 | bzp | Kemberano | Indonesia | <10K |
226 | cbn | Nyahkur | Thailand | <10K |
227 | dbf | Edopi | Indonesia | <10K |
228 | eno | Enggano | Indonesia | <10K |
229 | mkm | Moklen | Thailand | <10K |
230 | nxl | South Nuaulu | Indonesia | <10K |
231 | vko | Kodeoha | Indonesia | <10K |
232 | wbb | Wabo | Indonesia | <10K |
233 | yir | North Awyu | Indonesia | <10K |
234 | zbc | Central Berawan | Malaysia | <10K |
235 | bya | Batak | Philippines | <10K |
236 | bdg | Bonggi | Malaysia | <10K |
237 | fau | Fayu | Indonesia | <10K |
238 | ilu | Ili’uun | Indonesia | <10K |
239 | yet | Yetfa | Indonesia | <10K |
240 | dmy | Sowari | Indonesia | <10K |
241 | ddw | Dawera-Daweloor | Indonesia | <10K |
242 | jhi | Jehai | Malaysia | <10K |
243 | xmt | Matbat | Indonesia | <10K |
244 | beg | Belait | Brunei | <10K |
245 | ivb | Ibatan | Philippines | <10K |
246 | oia | Oirata | Indonesia | <10K |
247 | bkl | Berik | Indonesia | <10K |
248 | duo | Dupaninan Agta | Philippines | <10K |
249 | kdw | Koneraw | Indonesia | <10K |
250 | msf | Mekwei | Indonesia | <10K |
251 | nqm | Ndom | Indonesia | <10K |
252 | sbg | Moi Lemas | Indonesia | <10K |
253 | seu | Serui-Laut | Indonesia | <10K |
254 | tve | Te’un | Indonesia | <10K |
255 | tzn | Tugun | Indonesia | <10K |
256 | wng | Wanggom | Indonesia | <10K |
257 | bnj | Bangon | Philippines | <10K |
258 | snv | Sa’ban | Indonesia, Malaysia | <10K |
259 | bdw | Baham | Indonesia | <10K |
260 | ran | Riantana | Indonesia | <10K |
261 | rnn | Roon | Indonesia | <10K |
262 | szp | Suabo | Indonesia | <10K |
263 | zbe | East Berawan | Malaysia | <10K |
264 | scb | Chut | Laos, Vietnam | <10K |
265 | tvm | Tela-Masbuar | Indonesia | <10K |
266 | udj | Ujir | Indonesia | <10K |
267 | agy | Southern Alta | Philippines | <10K |
268 | air | Airoran | Indonesia | <10K |
269 | aqm | Atohwaim | Indonesia | <10K |
270 | asi | Buruwai | Indonesia | <10K |
271 | att | Pamplona Atta | Philippines | <10K |
272 | bcd | North Babar | Indonesia | <10K |
273 | bnf | Masiwang | Indonesia | <10K |
274 | btq | Batek | Malaysia | <10K |
275 | cth | Thaiphum Chin | Myanmar | <10K |
276 | dem | Dem | Indonesia | <10K |
277 | dmg | Upper Kinabatangan | Malaysia | <10K |
278 | dnu | Danau | Myanmar | <10K |
279 | etz | Semimi | Indonesia | <10K |
280 | jbj | Arandai | Indonesia | <10K |
281 | kbv | Dla | Indonesia | <10K |
282 | kpu | Kafoa | Indonesia | <10K |
283 | kvy | Yintale | Myanmar | <10K |
284 | msg | Moraid | Indonesia | <10K |
285 | nks | North Asmat | Indonesia | <10K |
286 | pnx | Phong-Kniang | Laos | <10K |
287 | sob | Sobei | Indonesia | <10K |
288 | wgo | Ambel | Indonesia | <10K |
289 | wno | Wano | Indonesia | <10K |
290 | xse | Sempan | Indonesia | <10K |
291 | zbw | West Berawan | Malaysia | <10K |
Not in SEACrowd | ||||
292 | rbk | Northern Bontok | Philippines | <10K |
293 | kvt | Lahta | Myanmar | <10K |
294 | lbg | Laopang | Laos | <10K |
295 | stu | Samtao | Myanmar | <10K |
296 | kxk | Zayein | Myanmar | <10K |
297 | iti | Inlaud Itneg | Philippines | <10K |
298 | nqq | Chen-Kayu Naga | Myanmar | <10K |
299 | pnc | Pannei | Indonesia | <10K |
300 | zkn | Kanan | Myanmar | <10K |
301 | mlz | Malaynon | Philippines | <10K |
302 | khf | Khuen | Laos | <10K |
303 | kkx | Kohin | Indonesia | <10K |
304 | lmj | West Lembata | Indonesia | <10K |
305 | dkr | Kuijau | Malaysia | <10K |
306 | ebc | Beginci | Indonesia | <10K |
307 | mtw | Southern Binukidnon | Philippines | <10K |
308 | mqk | Rajah Kabunsuwan Manobo | Philippines | <10K |
309 | csx | Cambodian Sign Language | Cambodia | <10K |
310 | tis | Masadiit Itneg | Philippines | <10K |
311 | csj | Songlai Chin | Myanmar | <10K |
312 | mqc | Mangole | Indonesia | <10K |
313 | bpz | Bilba | Indonesia | <10K |
314 | lmf | South Lembata | Indonesia | <10K |
315 | wha | Sou Upaa | Indonesia | <10K |
316 | lkc | Kucong | Vietnam | <10K |
317 | mqa | Maba | Indonesia | <10K |
318 | lcq | Luhu | Indonesia | <10K |
319 | mjb | Makalero | East Timor | <10K |
320 | krv | Kavet | Cambodia | <10K |
321 | cey | Ekai Chin | Myanmar | <10K |
322 | kjt | Phrae Pwo Karen | Thailand | <10K |
323 | kuk | Kepo’ | Indonesia | <10K |
324 | put | Putoh | Indonesia | <10K |
325 | rjg | Rajong | Indonesia | <10K |
326 | sjb | Sajau Basap | Indonesia | <10K |
327 | tkz | Takua | Vietnam | <10K |
328 | amv | Ambelau | Indonesia | <10K |
329 | wlh | Welaun | East Timor, Indonesia | <10K |
330 | plz | Paluan Murut | Malaysia | <10K |
331 | jkp | Paku Karen | Myanmar | <10K |
332 | adb | Atauran | East Timor | <10K |
333 | nea | Eastern Ngad’a | Indonesia | <10K |
334 | ntd | Northern Tidung | Malaysia | <10K |
335 | phh | Phula | Vietnam | <10K |
336 | reb | Rembong | Indonesia | <10K |
337 | skx | Seko Padang | Indonesia | <10K |
338 | swu | Suwawa | Indonesia | <10K |
339 | tgr | Tareng | Laos | <10K |
340 | weu | Rawngtu Chin | Myanmar | <10K |
341 | sau | Saleman | Indonesia | <10K |
342 | thi | Tai Long | Laos | <10K |
343 | low | Tampias Lobu | Malaysia | <10K |
344 | npg | Ponyo-Gongwang Naga | Myanmar | <10K |
345 | ukk | Muak Sa-aak | Myanmar | <10K |
346 | tlq | Tai Loi | Laos, Myanmar | <10K |
347 | hkn | Mel-Khaonh | Cambodia | <10K |
348 | jkm | Mobwa Karen | Myanmar | <10K |
349 | lmq | Lamatuka | Indonesia | <10K |
350 | lvu | Levuka | Indonesia | <10K |
351 | lwe | Lewoeleng | Indonesia | <10K |
352 | rtc | Rungtu Chin | Myanmar | <10K |
353 | ruu | Lanas Lobu | Malaysia | <10K |
354 | tiu | Adasen | Philippines | <10K |
355 | umn | Paungnyuan Naga | Myanmar | <10K |
356 | lhh | Laha | Indonesia | <10K |
357 | bjx | Vanaw Kalinga | Philippines | <10K |
358 | bvt | Bati | Indonesia | <10K |
359 | kqv | Okolod | Indonesia, Malaysia | <10K |
360 | xkk | Kachok | Cambodia | <10K |
361 | iwk | I-wak | Philippines | <10K |
362 | lka | Lakalei | East Timor | <10K |
363 | bzn | Boano | Indonesia | <10K |
364 | sbr | Sembakung Murut | Indonesia, Malaysia | <10K |
365 | bfg | Busang Kayan | Indonesia | <10K |
366 | hap | Hupla | Indonesia | <10K |
367 | kxi | Keningau Murut | Malaysia | <10K |
368 | llq | Lolak | Indonesia | <10K |
369 | roc | Cacgia Roglai | Vietnam | <10K |
370 | sls | Singapore Sign Language | Singapore | <10K |
371 | ste | Liana-Seti | Indonesia | <10K |
372 | ulu | Uma’ Lung | Indonesia | <10K |
373 | wli | Waioli | Indonesia | <10K |
374 | wrx | Wae Rana | Indonesia | <10K |
375 | xhv | Khua | Laos, Vietnam | <10K |
376 | tdy | Tadyawan | Philippines | <10K |
377 | zbt | Batui | Indonesia | <10K |
378 | sws | Seluwasan | Indonesia | <10K |
379 | pni | Aoheng | Indonesia | <10K |
380 | tuj | Tugutil | Indonesia | <10K |
381 | nps | Nipsan | Indonesia | <10K |
382 | uan | Kuan | Laos | <10K |
383 | vbk | Southwestern Bontok | Philippines | <10K |
384 | dmv | Dumpas | Malaysia | <10K |
385 | xko | Kiorr | Laos | <10K |
386 | kve | Kalabakan Murut | Malaysia | <10K |
387 | mcm | Malaccan Portuguese Creole | Malaysia | <10K |
388 | ltu | Latu | Indonesia | <10K |
389 | gef | Gerai | Indonesia | <10K |
390 | cnc | Côông | Vietnam | <10K |
391 | bpo | Anasi | Indonesia | <10K |
392 | hld | Halang Doan | Laos, Vietnam | <10K |
393 | nxk | Kokak Naga | Myanmar | <10K |
394 | puj | Punan Tubu | Indonesia | <10K |
395 | xkn | Kayan River Kayan | Indonesia | <10K |
396 | ycp | Chepya | Laos | <10K |
397 | lcs | Lisabata-Nuniali | Indonesia | <10K |
398 | haf | Haiphong Sign Language | Vietnam | <10K |
399 | slt | Sila | Laos, Vietnam | <10K |
400 | kvh | Komodo | Indonesia | <10K |
401 | apf | Pahanan Agta | Philippines | <10K |
402 | bzb | Andio | Indonesia | <10K |
403 | jal | Yalahatan | Indonesia | <10K |
404 | mvr | Marau | Indonesia | <10K |
405 | agz | Mt. Iriga Agta | Philippines | <10K |
406 | dkk | Dakka | Indonesia | <10K |
407 | gak | Gamkonora | Indonesia | <10K |
408 | kmd | Majukayang Kalinga | Philippines | <10K |
409 | mqp | Manipa | Indonesia | <10K |
410 | pzn | Jejara Naga | Myanmar | <10K |
411 | xkd | Mendalam Kayan | Indonesia | <10K |
412 | xay | Kayan Mahakam | Indonesia | <10K |
413 | xky | Uma’ Lasan | Indonesia, Malaysia | <10K |
414 | mqq | Minokok | Malaysia | <10K |
415 | neo | Ná-Meo | Vietnam | <10K |
416 | tln | Talondo’ | Indonesia | <10K |
417 | bqy | Kata Kolok | Indonesia | <10K |
418 | mxr | Murik | Malaysia | <10K |
419 | nty | Mantsi | Vietnam | <10K |
420 | tev | Teor | Indonesia | <10K |
421 | ttp | Tombelala | Indonesia | <10K |
422 | ayt | Magbukun Ayta | Philippines | <10K |
423 | ckn | Kaang Chin | Myanmar | <10K |
424 | cno | Con | Laos | <10K |
425 | goq | Gorap | Indonesia | <10K |
426 | hov | Hovongan | Indonesia | <10K |
427 | lpn | Long Phuri Naga | Myanmar | <10K |
428 | nlq | Lao Naga | Myanmar | <10K |
429 | nqy | Akyaung Ari Naga | Myanmar | <10K |
430 | nuo | Ngoaun | Laos, Vietnam | <10K |
431 | psg | Penang Sign Language | Malaysia | <10K |
432 | ues | Kioko | Indonesia | <10K |
No. | ISO 639-3 | Language | Region(s) | Population |
In SEACrowd | ||||
1 | sow | Sowanda | Indonesia | <1K |
2 | duv | Duvle | Indonesia | <1K |
3 | hmu | Hamap | Indonesia | <1K |
4 | ktt | Ketum | Indonesia | <1K |
5 | mpz | Mpi | Thailand | <1K |
6 | tvw | Sedoa | Indonesia | <1K |
7 | syo | Su’ung | Cambodia | <1K |
8 | mgk | Mawes | Indonesia | <1K |
9 | mss | West Masela | Indonesia | <1K |
10 | dij | Dai | Indonesia | <1K |
11 | drn | West Damar | Indonesia | <1K |
12 | lji | Laiyolo | Indonesia | <1K |
13 | mth | Munggui | Indonesia | <1K |
14 | psn | Panasuan | Indonesia | <1K |
15 | ret | Reta | Indonesia | <1K |
16 | twg | Tereweng | Indonesia | <1K |
17 | bpg | Bonggo | Indonesia | <1K |
18 | agt | Central Cagayan Agta | Philippines | <1K |
19 | kvz | Tsaukambo | Indonesia | <1K |
20 | skp | Sekapan | Malaysia | <1K |
21 | bsm | Busami | Indonesia | <1K |
22 | bzi | Bisu | Thailand | <1K |
23 | kzm | Kais | Indonesia | <1K |
24 | mhz | Mor | Indonesia | <1K |
25 | nkj | Nakai | Indonesia | <1K |
26 | pru | Puragi | Indonesia | <1K |
27 | skv | Skou | Indonesia | <1K |
28 | laq | Qabiao | Vietnam | <1K |
29 | ssm | Semnam | Malaysia | <1K |
30 | slg | Selungai Murut | Indonesia, Malaysia | <1K |
31 | tpf | Tarpia | Indonesia | <1K |
32 | vto | Vitou | Indonesia | <1K |
33 | wsa | Warembori | Indonesia | <1K |
34 | dgc | Casiguran Dumagat Agta | Philippines | <1K |
35 | bfe | Betaf | Indonesia | <1K |
36 | kgb | Kawe | Indonesia | <1K |
37 | kwh | Kowiai | Indonesia | <1K |
38 | ppm | Papuma | Indonesia | <1K |
39 | tdi | Tomadino | Indonesia | <1K |
40 | tmu | Iau | Indonesia | <1K |
41 | uka | Kaburi | Indonesia | <1K |
42 | bkn | Bukitan | Indonesia, Malaysia | <1K |
43 | imr | Imroing | Indonesia | <1K |
44 | tgq | Tring | Malaysia | <1K |
45 | tlk | Taloki | Indonesia | <1K |
46 | ert | Eritai | Indonesia | <1K |
47 | lpe | Lepki | Indonesia | <1K |
48 | vme | East Masela | Indonesia | <1K |
49 | mxz | Central Masela | Indonesia | <1K |
50 | aos | Taikat | Indonesia | <1K |
51 | cog | Chong | Thailand | <1K |
52 | dpp | Papar | Malaysia | <1K |
53 | jet | Manem | Indonesia | <1K |
54 | kag | Kajaman | Malaysia | <1K |
55 | kgi | Selangor Sign Language | Malaysia | <1K |
56 | kly | Kalao | Indonesia | <1K |
57 | knd | Konda | Indonesia | <1K |
58 | kuc | Kwinsu | Indonesia | <1K |
59 | lvi | Lavi | Laos | <1K |
60 | nbn | Kuri | Indonesia | <1K |
61 | ner | Yahadian | Indonesia | <1K |
62 | oni | Onin | Indonesia | <1K |
63 | orz | Ormu | Indonesia | <1K |
64 | pkt | Maleng | Laos, Vietnam | <1K |
65 | rth | Ratahan | Indonesia | <1K |
66 | sbt | Kimki | Indonesia | <1K |
67 | tcm | Tanahmerah | Indonesia | <1K |
68 | trt | Tunggare | Indonesia | <1K |
69 | wtw | Wotu | Indonesia | <1K |
70 | xkq | Koroni | Indonesia | <1K |
71 | cwg | Cheq Wong | Malaysia | <1K |
72 | bpp | Kaure | Indonesia | <1K |
73 | isd | Isnag | Philippines | <1K |
74 | pna | Punan Bah-Biau | Malaysia | <1K |
75 | skz | Sekar | Indonesia | <1K |
76 | thm | Aheu | Thailand | <1K |
77 | toy | Topoiyo | Indonesia | <1K |
78 | dbe | Dabe | Indonesia | <1K |
79 | bvk | Bukat | Indonesia | <1K |
80 | dei | Demisa | Indonesia | <1K |
81 | jel | Yelmek | Indonesia | <1K |
82 | nun | Anong | Myanmar | <1K |
83 | opk | Kopkaka | Indonesia | <1K |
84 | pas | Papasena | Indonesia | <1K |
85 | tmj | Samarokena | Indonesia | <1K |
86 | urn | Uruangnirin | Indonesia | <1K |
87 | xau | Kauwera | Indonesia | <1K |
88 | kdy | Keijar | Indonesia | <1K |
89 | auu | Auye | Indonesia | <1K |
90 | auw | Awyi | Indonesia | <1K |
91 | flh | Foau | Indonesia | <1K |
92 | gop | Yeretuar | Indonesia | <1K |
93 | jau | Yaur | Indonesia | <1K |
94 | lhn | Lahanan | Malaysia | <1K |
95 | pee | Taje | Indonesia | <1K |
96 | phq | Phana’ | Laos | <1K |
97 | tnz | Ten’edn | Malaysia, Thailand | <1K |
98 | wru | Waru | Indonesia | <1K |
99 | sve | Serili | Indonesia | <1K |
100 | bgv | Warkay-Bipim | Indonesia | <1K |
101 | bhc | Biga | Indonesia | <1K |
102 | bqb | Bagusa | Indonesia | <1K |
103 | bsa | Abinomn | Indonesia | <1K |
104 | ccm | Malaccan Malay Creole | Malaysia | <1K |
105 | giq | Green Gelao | Vietnam | <1K |
106 | kja | Mlap | Indonesia | <1K |
107 | kzv | Komyandaret | Indonesia | <1K |
108 | mrf | Elseng | Indonesia | <1K |
109 | swr | Saweru | Indonesia | <1K |
110 | tad | Tause | Indonesia | <1K |
111 | tbp | Diebroud | Indonesia | <1K |
112 | tmo | Temoq | Malaysia | <1K |
113 | tyh | O’du | Laos, Vietnam | <1K |
114 | wuy | Wauyai | Indonesia | <1K |
115 | xwr | Kwerba Mamberamo | Indonesia | <1K |
116 | rmh | Murkim | Indonesia | <1K |
117 | tml | Tamnim Citak | Indonesia | <1K |
118 | wet | Perai | Indonesia | <1K |
119 | bqq | Biritai | Indonesia | <1K |
120 | brs | Baras | Indonesia | <1K |
121 | bzu | Burmeso | Indonesia | <1K |
122 | emw | Emplawas | Indonesia | <1K |
123 | kiq | Kosare | Indonesia | <1K |
124 | kiy | Kirikiri | Indonesia | <1K |
125 | kns | Kensiu | Malaysia, Thailand | <1K |
126 | lcc | Legenyem | Indonesia | <1K |
127 | mso | Mombum | Indonesia | <1K |
128 | mvx | Meoswar | Indonesia | <1K |
129 | sao | Sause | Indonesia | <1K |
130 | snu | Viid | Indonesia | <1K |
131 | tlg | Tofanma | Indonesia | <1K |
132 | kgv | Karas | Indonesia | <1K |
133 | lnh | Lanoh | Malaysia | <1K |
134 | asz | As | Indonesia | <1K |
135 | kbi | Kaptiau | Indonesia | <1K |
136 | msl | Molof | Indonesia | <1K |
137 | wfg | Zorop | Indonesia | <1K |
138 | dmu | Tebi | Indonesia | <1K |
139 | llk | Lelak | Malaysia | <1K |
140 | tcq | Kaiy | Indonesia | <1K |
141 | aqn | Northern Alta | Philippines | <1K |
142 | bnv | Beneraf | Indonesia | <1K |
143 | enc | En | Vietnam | <1K |
144 | erw | Erokwanas | Indonesia | <1K |
145 | jbr | Jofotek-Bromnya | Indonesia | <1K |
146 | khh | Kehu | Indonesia | <1K |
147 | khp | Kapauri | Indonesia | <1K |
148 | kxn | Kanowit-Tanjong Melanau | Malaysia | <1K |
149 | mmb | Momina | Indonesia | <1K |
150 | nec | Nedebang | Indonesia | <1K |
151 | nyl | Nyeu | Thailand | <1K |
152 | rac | Rasawa | Indonesia | <1K |
153 | tnu | Tai Khang | Laos | <1K |
154 | wai | Wares | Indonesia | <1K |
155 | yki | Yoke | Indonesia | <1K |
156 | bed | Bedoanas | Indonesia | <1K |
157 | mzt | Mintil | Malaysia | <1K |
158 | agf | Arguni | Indonesia | <1K |
159 | apx | Aputai | Indonesia | <1K |
160 | kcd | Ngkâlmpw Kanum | Indonesia | <1K |
161 | ugo | Ugong | Thailand | <1K |
162 | wbe | Waritai | Indonesia | <1K |
163 | mra | Mlabri | Laos, Thailand | <1K |
164 | afz | Obokuitai | Indonesia | <1K |
165 | mgf | Maklew | Indonesia | <1K |
166 | ttn | Towei | Indonesia | <1K |
167 | knq | Kintaq | Malaysia | <1K |
168 | ulf | Usku | Indonesia | <1K |
169 | awh | Awbono | Indonesia | <1K |
170 | bti | Burate | Indonesia | <1K |
171 | byl | Bayono | Indonesia | <1K |
172 | diy | Diuwe | Indonesia | <1K |
173 | kpi | Kofei | Indonesia | <1K |
174 | krz | Sota Kanum | Indonesia | <1K |
175 | kwr | Kwer | Indonesia | <1K |
176 | tfo | Tefaro | Indonesia | <1K |
177 | tkx | Tangko | Indonesia | <1K |
178 | tti | Tobati | Indonesia | <1K |
Not in SEACrowd | ||||
179 | lcd | Lola | Indonesia | <1K |
180 | ors | Orang Seletar | Malaysia | <1K |
181 | kpd | Koba | Indonesia | <1K |
182 | trx | Tringgus-Sembaan Bidayuh | Malaysia | <1K |
183 | kqt | Klias River Kadazan | Malaysia | <1K |
184 | atp | Pudtol Atta | Philippines | <1K |
185 | tcp | Tawr Chin | Myanmar | <1K |
186 | kyd | Karey | Indonesia | <1K |
187 | pyy | Pyen | Myanmar | <1K |
188 | ttw | Long Wat | Malaysia | <1K |
189 | xmx | Salawati | Indonesia | <1K |
190 | ymn | Sunum | Indonesia | <1K |
191 | wkd | Mo | Indonesia | <1K |
192 | abf | Abai Sungai | Malaysia | <1K |
193 | esy | Eskayan | Philippines | <1K |
194 | kzb | Kaibobo | Indonesia | <1K |
195 | njs | Nisa | Indonesia | <1K |
196 | nni | North Nuaulu | Indonesia | <1K |
197 | whu | Wahau Kayan | Indonesia | <1K |
198 | xke | Kereho | Indonesia | <1K |
199 | lce | Sekak | Indonesia | <1K |
200 | sdx | Sibu Melanau | Malaysia | <1K |
201 | bfk | Ban Khor Sign Language | Thailand | <1K |
202 | kax | Kao | Indonesia | <1K |
203 | srk | Serudung Murut | Malaysia | <1K |
204 | pud | Punan Aput | Indonesia | <1K |
205 | bgy | Benggoi | Indonesia | <1K |
206 | kzd | Kadai | Indonesia | <1K |
207 | kvp | Kompane | Indonesia | <1K |
208 | auq | Anus | Indonesia | <1K |
209 | azt | Faire Atta | Philippines | <1K |
210 | hud | Huaulu | Indonesia | <1K |
211 | lgh | Laghuu | Vietnam | <1K |
212 | tip | Trimuris | Indonesia | <1K |
213 | tyj | Tai Yo | Laos, Vietnam | <1K |
214 | tys | Tày Sa Pa | Vietnam | <1K |
215 | mqi | Mariri | Indonesia | <1K |
216 | pdn | Fedan | Indonesia | <1K |
217 | mnq | Minriq | Malaysia | <1K |
218 | daz | Dao | Indonesia | <1K |
219 | gnq | Gana | Malaysia | <1K |
220 | lrn | Lorang | Indonesia | <1K |
221 | bsu | Bahonsuai | Indonesia | <1K |
222 | puc | Punan Merap | Indonesia | <1K |
223 | rmx | Romam | Vietnam | <1K |
224 | tyl | Thu Lao | Vietnam | <1K |
225 | yrs | Yarsun | Indonesia | <1K |
226 | atl | Mt. Iraya Agta | Philippines | <1K |
227 | puf | Punan Merah | Indonesia | <1K |
228 | umi | Ukit | Malaysia | <1K |
229 | jvd | Javindo | Indonesia | <1K |
230 | srt | Sauri | Indonesia | <1K |
No. | ISO 639-3 | Language | Region(s) | Population |
In SEACrowd | ||||
1 | mnu | Mer | Indonesia | <100 |
2 | itx | Itik | Indonesia | <100 |
3 | kxq | Smärky Kanum | Indonesia | <100 |
4 | lix | Liabuku | Indonesia | <100 |
5 | awr | Awera | Indonesia | <100 |
6 | bdx | Budong-Budong | Indonesia | <100 |
7 | ire | Yeresiam | Indonesia | <100 |
8 | tds | Doutai | Indonesia | <100 |
9 | mrx | Dineor | Indonesia | <100 |
10 | amq | Amahai | Indonesia | <100 |
11 | kzu | Kayupulau | Indonesia | <100 |
12 | mok | Morori | Indonesia | <100 |
13 | plh | Paulohi | Indonesia | <100 |
14 | sgu | Salas | Indonesia | <100 |
15 | aip | Burumakok | Indonesia | <100 |
16 | dbn | Duriankere | Indonesia | <100 |
17 | dul | Inagta Alabat | Philippines | <100 |
18 | moq | Mor | Indonesia | <100 |
19 | naa | Namla | Indonesia | <100 |
20 | mvs | Massep | Indonesia | <100 |
21 | aem | Arem | Laos, Vietnam | <100 |
22 | mqr | Mander | Indonesia | <100 |
23 | xkw | Kembra | Indonesia | <100 |
24 | kkb | Kwerisa | Indonesia | <100 |
25 | atz | Arta | Philippines | <100 |
26 | ibh | Bih | Vietnam | <100 |
27 | khd | Bädi Kanum | Indonesia | <100 |
28 | nul | Nusa Laut | Indonesia | <100 |
29 | scq | Chung | Cambodia | <100 |
30 | mqt | Mok | Myanmar, Thailand | <10 |
31 | btj | Bacanese Malay | Indonesia | <10 |
32 | wor | Woria | Indonesia | <10 |
33 | spi | Saponi | Indonesia | <10 |
34 | dsn | Dusner | Indonesia | <10 |
35 | lgi | Lengilu | Indonesia | <10 |
36 | btn | Ratagnon | Philippines | <10 |
37 | tni | Tandia | Indonesia | <10 |
38 | huw | Hukumina | Indonesia | <10 |
39 | kzl | Kayeli | Indonesia | <10 |
40 | sxm | Samre | Cambodia, Thailand | <10 |
41 | hpo | Hpon | Myanmar | <10 |
42 | mpy | Mapia | Indonesia | <10 |
43 | nil | Nila | Indonesia | <10 |
44 | sbo | Sabüm | Malaysia | <10 |
45 | srw | Serua | Indonesia | <10 |
46 | tas | Tay Boi | Vietnam | <10 |
47 | xbn | Kenaboi | Malaysia | <10 |
48 | xxt | Tambora | Indonesia | <10 |
Not in SEACrowd | ||||
49 | orn | Orang Kanaq | Malaysia | <100 |
50 | lva | Makuva | East Timor | <100 |
51 | spg | Sihan | Malaysia | <100 |
52 | ibu | Ibu | Indonesia | <100 |
53 | pnm | Punan Batu | Malaysia | <100 |
54 | csd | Chiangmai Sign Language | Thailand | <100 |
55 | ays | Sorsogon Ayta | Philippines | <100 |
56 | lio | Liki | Indonesia | <100 |
57 | pey | Petjo | Indonesia | <100 |
58 | hti | Hoti | Indonesia | <100 |
59 | huk | Hulung | Indonesia | <100 |
60 | ism | Masimasi | Indonesia | <100 |
61 | kzx | Kamarian | Indonesia | <100 |
62 | pns | Ponosakan | Indonesia | <100 |
63 | agk | Katubung Agta | Philippines | <10 |
64 | nae | Naka’ela | Indonesia | <10 |
65 | atm | Ata | Philippines | <10 |
66 | ihb | Iha Based Pidgin | Indonesia | <10 |
67 | tvy | Timor Pidgin | East Timor | <10 |
68 | duy | Dicamay Agta | Philippines | <10 |
69 | dyg | Villa Viciosa Agta | Philippines | <10 |
70 | lox | Loun | Indonesia | <10 |
71 | onx | Onin Based Pidgin | Indonesia | <10 |
72 | tcl | Taman | Myanmar | <10 |
73 | vms | Moksela | Indonesia | <10 |
74 | wea | Wewaw | Myanmar | <10 |