-
Addressing Topic Leakage in Cross-Topic Evaluation for Authorship Verification
Authors:
Jitkapat Sawatphol,
Can Udomcharoenchaikit,
Sarana Nutanong
Abstract:
Authorship verification (AV) aims to identify whether a pair of texts has the same author. We address the challenge of evaluating AV models' robustness against topic shifts. The conventional evaluation assumes minimal topic overlap between training and test data. However, we argue that there can still be topic leakage in test data, causing misleading model performance and unstable rankings. To address this, we propose an evaluation method called Heterogeneity-Informed Topic Sampling (HITS), which creates a smaller dataset with a heterogeneously distributed topic set. Our experimental results demonstrate that HITS-sampled datasets yield a more stable ranking of models across random seeds and evaluation splits. Our contributions include: 1. An analysis of the causes and effects of topic leakage, 2. A demonstration of HITS in reducing the effects of topic leakage, and 3. The Robust Authorship Verification bENchmark (RAVEN), which enables a topic shortcut test to uncover AV models' reliance on topic-specific features.
Submitted 27 July, 2024;
originally announced July 2024.
-
Space Decomposition for Sentence Embedding
Authors:
Wuttikorn Ponwitayarat,
Peerat Limkonchotiwat,
Ekapol Chuangsuwanich,
Sarana Nutanong
Abstract:
Determining sentence pair similarity is crucial for various NLP tasks. A common technique to address this is typically evaluated on a continuous semantic textual similarity scale from 0 to 5. However, based on a linguistic observation in STS annotation guidelines, we found that the score in the range [4,5] indicates an upper-range sample, while the rest are lower-range samples. This necessitates a new approach to treating the upper-range and lower-range classes separately. In this paper, we introduce a novel embedding space decomposition method called MixSP utilizing a Mixture of Specialized Projectors, designed to distinguish and rank upper-range and lower-range samples accurately. The experimental results demonstrate that MixSP decreased the overlap representation between upper-range and lower-range classes significantly while outperforming competitors on STS and zero-shot benchmarks.
Submitted 5 June, 2024;
originally announced June 2024.
-
WangchanLion and WangchanX MRC Eval
Authors:
Wannaphong Phatthiyaphaibun,
Surapon Nonesung,
Patomporn Payoungkhamdee,
Peerat Limkonchotiwat,
Can Udomcharoenchaikit,
Jitkapat Sawatphol,
Chompakorn Chaksangchaichot,
Ekapol Chuangsuwanich,
Sarana Nutanong
Abstract:
This technical report describes the development of WangchanLion, an instruction fine-tuned model focusing on Machine Reading Comprehension (MRC) in the Thai language. Our model is based on SEA-LION and a collection of instruction following datasets. To promote open research and reproducibility, we publicly release all training data, code, and the final model weights under the Apache-2 license. To assess the contextual understanding capability, we conducted extensive experimental studies using two Thai MRC datasets, XQuAD and Iapp_wiki_qa_squad. Experimental results demonstrate the model's ability to comprehend the context and produce an answer faithful to the reference one in 0-shot and 1-shot settings. In addition, our evaluation goes beyond the traditional MRC. We propose a new evaluation scheme assessing the answer's correctness, helpfulness, conciseness, and contextuality. Our code is available publicly at https://github.com/vistec-AI/WangchanLion.
Submitted 23 April, 2024; v1 submitted 24 March, 2024;
originally announced March 2024.
-
An Efficient Self-Supervised Cross-View Training For Sentence Embedding
Authors:
Peerat Limkonchotiwat,
Wuttikorn Ponwitayarat,
Lalita Lowphansirikul,
Can Udomcharoenchaikit,
Ekapol Chuangsuwanich,
Sarana Nutanong
Abstract:
Self-supervised sentence representation learning is the task of constructing an embedding space for sentences without relying on human annotation efforts. One straightforward approach is to finetune a pretrained language model (PLM) with a representation learning method such as contrastive learning. While this approach achieves impressive performance on larger PLMs, the performance rapidly degrades as the number of parameters decreases. In this paper, we propose a framework called Self-supervised Cross-View Training (SCT) to narrow the performance gap between large and small PLMs. To evaluate the effectiveness of SCT, we compare it to 5 baseline and state-of-the-art competitors on seven Semantic Textual Similarity (STS) benchmarks using 5 PLMs with the number of parameters ranging from 4M to 340M. The experimental results show that SCT outperforms the competitors for PLMs with fewer than 100M parameters in 18 of 21 cases.
Submitted 6 November, 2023;
originally announced November 2023.
-
Typo-Robust Representation Learning for Dense Retrieval
Authors:
Panuthep Tasawong,
Wuttikorn Ponwitayarat,
Peerat Limkonchotiwat,
Can Udomcharoenchaikit,
Ekapol Chuangsuwanich,
Sarana Nutanong
Abstract:
Dense retrieval is a basic building block of information retrieval applications. One of the main challenges of dense retrieval in real-world settings is the handling of queries containing misspelled words. A popular approach for handling misspelled queries is minimizing the representation discrepancy between misspelled queries and their pristine counterparts. Unlike the existing approaches, which only focus on the alignment between misspelled and pristine queries, our method also improves the contrast between each misspelled query and its surrounding queries. To assess the effectiveness of our proposed method, we compare it against the existing competitors using two benchmark datasets and two base encoders. Our method outperforms the competitors in all cases with misspelled queries. Our code and models are available at https://github.com/panuthept/DST-DenseRetrieval.
Submitted 17 June, 2023;
originally announced June 2023.
-
Thai Wav2Vec2.0 with CommonVoice V8
Authors:
Wannaphong Phatthiyaphaibun,
Chompakorn Chaksangchaichot,
Peerat Limkonchotiwat,
Ekapol Chuangsuwanich,
Sarana Nutanong
Abstract:
Recently, Automatic Speech Recognition (ASR), a system that converts audio into text, has attracted a lot of attention in the machine learning community. Thus, many publicly available models have been released on HuggingFace. However, most of these ASR models are available in English; only a minority of the models are available in Thai. Additionally, most of the Thai ASR models are closed-source, and the performance of existing open-source models lacks robustness. To address this problem, we train a new ASR model on a pre-trained XLSR-Wav2Vec model with the Thai CommonVoice corpus V8 and train a trigram language model to boost the performance of our ASR model. We hope that our models will be beneficial to individuals and the ASR community in Thailand.
Submitted 9 August, 2022;
originally announced August 2022.
-
WangchanBERTa: Pretraining transformer-based Thai Language Models
Authors:
Lalita Lowphansirikul,
Charin Polpanumas,
Nawat Jantrakulchai,
Sarana Nutanong
Abstract:
Transformer-based language models, more specifically BERT-based architectures, have achieved state-of-the-art performance in many downstream tasks. However, for a relatively low-resource language such as Thai, the choices of models are limited to training a BERT-based model based on a much smaller dataset or finetuning multi-lingual models, both of which yield suboptimal downstream performance. Moreover, large-scale multi-lingual pretraining does not take into account language-specific features for Thai. To overcome these limitations, we pretrain a language model based on RoBERTa-base architecture on a large, deduplicated, cleaned training set (78GB in total size), curated from diverse domains of social media posts, news articles and other publicly available datasets. We apply text processing rules that are specific to Thai, most importantly preserving spaces, which are important chunk and sentence boundaries in Thai, before subword tokenization. We also experiment with word-level, syllable-level and SentencePiece tokenization with a smaller dataset to explore the effects of tokenization on downstream performance. Our model wangchanberta-base-att-spm-uncased trained on the 78.5GB dataset outperforms strong baselines (NBSVM, CRF and ULMFit) and multi-lingual models (XLMR and mBERT) on both sequence classification and token classification tasks in human-annotated, mono-lingual contexts.
Submitted 20 March, 2021; v1 submitted 23 January, 2021;
originally announced January 2021.
-
scb-mt-en-th-2020: A Large English-Thai Parallel Corpus
Authors:
Lalita Lowphansirikul,
Charin Polpanumas,
Attapol T. Rutherford,
Sarana Nutanong
Abstract:
The primary objective of our work is to build a large-scale English-Thai dataset for machine translation. We construct an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data and government documents. The methodologies for gathering data, building parallel texts and removing noisy sentence pairs are presented in a reproducible manner. We train machine translation models based on this dataset. Our models' performance is comparable to that of Google Translation API (as of May 2020) for Thai-English, and the models outperform Google when the Open Parallel Corpus (OPUS) is included in the training data for both Thai-English and English-Thai translation. The dataset, pre-trained models, and source code to reproduce our work are available for public use.
Submitted 7 July, 2020;
originally announced July 2020.
-
Shipper Cooperation in Stochastic Drone Delivery: A Dynamic Bayesian Game Approach
Authors:
Suttinee Sawadsitang,
Dusit Niyato,
Tan Puay Siew,
Ping Wang,
Sarana Nutanong
Abstract:
With recent technological innovation, unmanned aerial vehicles, known as drones, have found numerous applications including package and parcel delivery for shippers. Drone delivery offers benefits over conventional ground-based vehicle delivery in terms of faster speed, lower cost, greater environmental friendliness, and lower manpower requirements. However, most existing studies on drone delivery planning and scheduling focus on a single shipper and ignore uncertainty factors. As such, in this paper, we consider a scenario in which multiple shippers can cooperate to minimize their drone delivery cost. We propose the Bayesian Shipper Cooperation in Stochastic Drone Delivery (BCoSDD) framework. The framework is composed of three functions, i.e., package assignment, shipper cooperation formation and cost management. The uncertainties of drone breakdown and misbehavior of cooperative shippers are taken into account by using multistage stochastic programming optimization and a dynamic Bayesian coalition formation game. We conduct an extensive performance evaluation of the BCoSDD framework by using customer locations from the Solomon benchmark suite and a real Singapore logistics industry. The results show that the framework can help the shippers plan and schedule their drone delivery effectively.
Submitted 8 February, 2020;
originally announced February 2020.
-
P2P Networks for Content Sharing
Authors:
Choon Hoong Ding,
Sarana Nutanong,
Rajkumar Buyya
Abstract:
Peer-to-peer (P2P) technologies have been widely used for content sharing, popularly called "file-swapping" networks. This chapter gives a broad overview of content sharing P2P technologies. It starts with the fundamental concept of P2P computing followed by the analysis of network topologies used in peer-to-peer systems. Next, three milestone peer-to-peer technologies, Napster, Gnutella, and Fasttrack, are explored in detail, and the chapter concludes with a comparison table in the last section.
Submitted 10 February, 2004;
originally announced February 2004.