Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages

Yong, Zheng-Xin; Zhang, Ruochen; Forde, Jessica Zosa; Wang, Skyler; Subramonian, Arjun; Lovenia, Holy; Cahyawijaya, Samuel; Winata, Genta Indra; Sutawika, Lintang; Cruz, Jan Christian Blaise; Tan, Yin Lin; Phan, Long; Garcia, Rowena; Solorio, Thamar; Aji, Alham Fikri

arXiv:2303.13592 (cs)

[Submitted on 23 Mar 2023 (v1), last revised 12 Sep 2023 (this version, v4)]

Abstract: While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems at generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero-shot manner to generate code-mixed data for seven languages in South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese, Tamil, and Singlish. We find that publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts: its performance varies with the prompt template and language pairing. For instance, ChatGPT generates fluent and natural texts in Singlish (an English-based creole spoken in Singapore), but for the English-Tamil language pair, the system mostly produces grammatically incorrect or semantically meaningless utterances. Furthermore, it may erroneously introduce languages not specified in the prompt. Based on our investigation, existing multilingual LLMs exhibit a wide range of proficiency in code-mixed data generation for SEA languages. As such, we advise against using LLMs in this context without extensive human checks.
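
The zero-shot setup described in the abstract (several prompt templates crossed with several language pairings) can be pictured with a minimal Python sketch. Everything specific here is an assumption made for illustration: the template wording, the topic list, and the names TEMPLATES, SINGLISH_TEMPLATE, and build_prompts are hypothetical and are not taken from the paper. The sketch only builds prompt strings; it does not call any model.

```python
from itertools import product

# Languages named in the abstract. Singlish is treated separately below
# because it is itself English-based rather than paired with English.
SEA_LANGUAGES = ["Indonesian", "Malay", "Chinese", "Tagalog", "Vietnamese", "Tamil"]

# Hypothetical templates for illustration only; NOT the paper's wording.
TEMPLATES = {
    "explicit": "Write a sentence about {topic} that mixes English and {language}.",
    "imitate": "Imagine you are a bilingual speaker of English and {language}. "
               "Talk about {topic}.",
}
SINGLISH_TEMPLATE = "Write a sentence about {topic} in Singlish."


def build_prompts(topics):
    """Return one record per (topic, language, template) combination."""
    prompts = []
    for topic, language, (name, template) in product(
        topics, SEA_LANGUAGES, TEMPLATES.items()
    ):
        prompts.append({
            "topic": topic,
            "language": language,
            "template": name,
            "prompt": template.format(topic=topic, language=language),
        })
    # Singlish gets a single dedicated template per topic.
    for topic in topics:
        prompts.append({
            "topic": topic,
            "language": "Singlish",
            "template": "singlish",
            "prompt": SINGLISH_TEMPLATE.format(topic=topic),
        })
    return prompts


if __name__ == "__main__":
    for record in build_prompts(["food", "weather"])[:3]:
        print(record["language"], "|", record["template"], "|", record["prompt"])
```

Keeping the topic, language, and template name alongside each prompt makes it straightforward to group human judgments of the generated text by template and language pairing, which is the comparison the abstract reports.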

Comments:	Updating Authors
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2303.13592 [cs.CL]
	(or arXiv:2303.13592v4 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2303.13592

Submission history

From: Zheng-Xin Yong
[v1] Thu, 23 Mar 2023 18:16:30 UTC (9,873 KB)
[v2] Thu, 30 Mar 2023 14:59:26 UTC (10,009 KB)
[v3] Thu, 7 Sep 2023 03:20:41 UTC (14,467 KB)
[v4] Tue, 12 Sep 2023 16:35:30 UTC (14,467 KB)
