A Parallel Corpus for Vietnamese Central-Northern Dialect Text Transfer

Thang Le; Luu Anh Tuan

doi:10.18653/v1/2023.findings-emnlp.925

Abstract

The Vietnamese language embodies dialectal variants closely attached to the nation’s three macro-regions: the Northern, Central and Southern regions. As the northern dialect forms the basis of the standard language, it is considered the prestige dialect. While the northern dialect differs from the remaining two in certain aspects, it shares an almost identical lexicon with the southern dialect, making their textual attributes nearly interchangeable. In contrast, the central dialect possesses a number of unique vocabulary items and is less mutually intelligible with the standard dialect. Through preliminary experiments, we observe that current NLP models lack an understanding of Vietnamese central dialect text, which most likely stems from the lack of resources. To facilitate research in this domain, we introduce a new parallel corpus for Vietnamese central-northern dialect text transfer. Via exhaustive benchmarking, we find that monolingual language models outperform their multilingual counterparts on the dialect transfer task. We further demonstrate that fine-tuned transfer models can seamlessly improve the performance of existing NLP systems on the central dialect domain, with dedicated results in translation and text-image retrieval tasks.

Anthology ID: 2023.findings-emnlp.925
Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 13839–13855
URL: https://aclanthology.org/2023.findings-emnlp.925
DOI: 10.18653/v1/2023.findings-emnlp.925
Cite (ACL): Thang Le and Anh Luu. 2023. A Parallel Corpus for Vietnamese Central-Northern Dialect Text Transfer. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13839–13855, Singapore. Association for Computational Linguistics.
Cite (Informal): A Parallel Corpus for Vietnamese Central-Northern Dialect Text Transfer (Le & Luu, Findings 2023)
PDF: https://aclanthology.org/2023.findings-emnlp.925.pdf