WangchanLion and WangchanX MRC Eval
Authors:
Wannaphong Phatthiyaphaibun,
Surapon Nonesung,
Patomporn Payoungkhamdee,
Peerat Limkonchotiwat,
Can Udomcharoenchaikit,
Jitkapat Sawatphol,
Chompakorn Chaksangchaichot,
Ekapol Chuangsuwanich,
Sarana Nutanong
Abstract:
This technical report describes the development of WangchanLion, an instruction fine-tuned model focusing on Machine Reading Comprehension (MRC) in the Thai language. Our model is based on SEA-LION and a collection of instruction-following datasets. To promote open research and reproducibility, we publicly release all training data, code, and the final model weights under the Apache-2 license. To assess the model's contextual understanding capability, we conducted extensive experimental studies using two Thai MRC datasets, XQuAD and Iapp_wiki_qa_squad. Experimental results demonstrate the model's ability to comprehend the context and produce an answer faithful to the reference answer in 0-shot and 1-shot settings. In addition, our evaluation goes beyond traditional MRC: we propose a new evaluation scheme that assesses the answer's correctness, helpfulness, conciseness, and contextuality. Our code is publicly available at https://github.com/vistec-AI/WangchanLion.
Submitted 23 April, 2024; v1 submitted 24 March, 2024;
originally announced March 2024.
Thai Wav2Vec2.0 with CommonVoice V8
Authors:
Wannaphong Phatthiyaphaibun,
Chompakorn Chaksangchaichot,
Peerat Limkonchotiwat,
Ekapol Chuangsuwanich,
Sarana Nutanong
Abstract:
Recently, Automatic Speech Recognition (ASR), a system that converts audio into text, has attracted considerable attention in the machine learning community, and many publicly available models have been released on HuggingFace. However, most of these ASR models are available in English; only a minority are available in Thai. Additionally, most Thai ASR models are closed-source, and the existing open-source models lack robustness. To address this problem, we train a new ASR model on a pre-trained XLSR-Wav2Vec model with the Thai CommonVoice corpus V8 and train a trigram language model to boost the performance of our ASR model. We hope that our models will be beneficial to individuals and the ASR community in Thailand.
Submitted 9 August, 2022;
originally announced August 2022.