Showing 1–10 of 10 results for author: Nutanong, S

Searching in archive cs.
  1. arXiv:2407.19164  [pdf, other]

    cs.CL

    Addressing Topic Leakage in Cross-Topic Evaluation for Authorship Verification

    Authors: Jitkapat Sawatphol, Can Udomcharoenchaikit, Sarana Nutanong

    Abstract: Authorship verification (AV) aims to identify whether a pair of texts has the same author. We address the challenge of evaluating AV models' robustness against topic shifts. The conventional evaluation assumes minimal topic overlap between training and test data. However, we argue that there can still be topic leakage in test data, causing misleading model performance and unstable rankings. To add…

    Submitted 27 July, 2024; originally announced July 2024.

    Comments: Accepted to publish at Transactions of the Association for Computational Linguistics

  2. arXiv:2406.03125  [pdf, other]

    cs.CL

    Space Decomposition for Sentence Embedding

    Authors: Wuttikorn Ponwitayarat, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Sarana Nutanong

    Abstract: Determining sentence pair similarity is crucial for various NLP tasks. A common technique to address this is typically evaluated on a continuous semantic textual similarity scale from 0 to 5. However, based on a linguistic observation in STS annotation guidelines, we found that the score in the range [4,5] indicates an upper-range sample, while the rest are lower-range samples. This necessitates a…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: ACL Finding 2024. The code and pre-trained models are available at https://github.com/KornWtp/MixSP

  3. arXiv:2403.16127  [pdf, other]

    cs.CL cs.AI

    WangchanLion and WangchanX MRC Eval

    Authors: Wannaphong Phatthiyaphaibun, Surapon Nonesung, Patomporn Payoungkhamdee, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Jitkapat Sawatphol, Chompakorn Chaksangchaichot, Ekapol Chuangsuwanich, Sarana Nutanong

    Abstract: This technical report describes the development of WangchanLion, an instruction fine-tuned model focusing on Machine Reading Comprehension (MRC) in the Thai language. Our model is based on SEA-LION and a collection of instruction following datasets. To promote open research and reproducibility, we publicly release all training data, code, and the final model weights under the Apache-2 license. To…

    Submitted 23 April, 2024; v1 submitted 24 March, 2024; originally announced March 2024.

  4. arXiv:2311.03228  [pdf, other]

    cs.CL cs.AI

    An Efficient Self-Supervised Cross-View Training For Sentence Embedding

    Authors: Peerat Limkonchotiwat, Wuttikorn Ponwitayarat, Lalita Lowphansirikul, Can Udomcharoenchaikit, Ekapol Chuangsuwanich, Sarana Nutanong

    Abstract: Self-supervised sentence representation learning is the task of constructing an embedding space for sentences without relying on human annotation efforts. One straightforward approach is to finetune a pretrained language model (PLM) with a representation learning method such as contrastive learning. While this approach achieves impressive performance on larger PLMs, the performance rapidly degrade…

    Submitted 6 November, 2023; originally announced November 2023.

    Comments: Accepted to TACL. The code and pre-trained models are available at https://github.com/mrpeerat/SCT

  5. arXiv:2306.10348  [pdf, other]

    cs.IR cs.CL

    Typo-Robust Representation Learning for Dense Retrieval

    Authors: Panuthep Tasawong, Wuttikorn Ponwitayarat, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Ekapol Chuangsuwanich, Sarana Nutanong

    Abstract: Dense retrieval is a basic building block of information retrieval applications. One of the main challenges of dense retrieval in real-world settings is the handling of queries containing misspelled words. A popular approach for handling misspelled queries is minimizing the representations discrepancy between misspelled queries and their pristine ones. Unlike the existing approaches, which only fo…

    Submitted 17 June, 2023; originally announced June 2023.

    Comments: 5 pages, 2 figures

    ACM Class: I.2.7

  6. arXiv:2208.04799  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Thai Wav2Vec2.0 with CommonVoice V8

    Authors: Wannaphong Phatthiyaphaibun, Chompakorn Chaksangchaichot, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Sarana Nutanong

    Abstract: Recently, Automatic Speech Recognition (ASR), a system that converts audio into text, has caught a lot of attention in the machine learning community. Thus, a lot of publicly available models were released in HuggingFace. However, most of these ASR models are available in English; only a minority of the models are available in Thai. Additionally, most of the Thai ASR models are closed-sourced, and…

    Submitted 9 August, 2022; originally announced August 2022.

  7. arXiv:2101.09635  [pdf, ps, other]

    cs.CL

    WangchanBERTa: Pretraining transformer-based Thai Language Models

    Authors: Lalita Lowphansirikul, Charin Polpanumas, Nawat Jantrakulchai, Sarana Nutanong

    Abstract: Transformer-based language models, more specifically BERT-based architectures have achieved state-of-the-art performance in many downstream tasks. However, for a relatively low-resource language such as Thai, the choices of models are limited to training a BERT-based model based on a much smaller dataset or finetuning multi-lingual models, both of which yield suboptimal downstream performance. Mor…

    Submitted 20 March, 2021; v1 submitted 23 January, 2021; originally announced January 2021.

    Comments: 24 pages, edited the citation of the syllable-level tokenizer from [Chormai et al., 2020] to [Phatthiyaphaibun et al., 2020] as the authors used the syllable-level tokenizer from PyThaiNLP [Phatthiyaphaibun et al., 2020] in the experiments

  8. scb-mt-en-th-2020: A Large English-Thai Parallel Corpus

    Authors: Lalita Lowphansirikul, Charin Polpanumas, Attapol T. Rutherford, Sarana Nutanong

    Abstract: The primary objective of our work is to build a large-scale English-Thai dataset for machine translation. We construct an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data and government documents. Methodology for gathering data, building parallel texts and re…

    Submitted 7 July, 2020; originally announced July 2020.

    Comments: 35 pages, 4 figures

  9. arXiv:2002.03118  [pdf, other]

    cs.GT math.OC

    Shipper Cooperation in Stochastic Drone Delivery: A Dynamic Bayesian Game Approach

    Authors: Suttinee Sawadsitang, Dusit Niyato, Tan Puay Siew, Ping Wang, Sarana Nutanong

    Abstract: With the recent technological innovation, unmanned aerial vehicles, known as drones, have found numerous applications including package and parcel delivery for shippers. Drone delivery offers benefits over conventional ground-based vehicle delivery in terms of faster speed, lower cost, more environment-friendly, and less manpower needed. However, most of existing studies on drone delivery planning…

    Submitted 8 February, 2020; originally announced February 2020.

    Comments: 15 Pages, 10 figures, 2 tables. This paper is still under review

  10. arXiv:cs/0402018  [pdf]

    cs.DC

    P2P Networks for Content Sharing

    Authors: Choon Hoong Ding, Sarana Nutanong, Rajkumar Buyya

    Abstract: Peer-to-peer (P2P) technologies have been widely used for content sharing, popularly called "file-swapping" networks. This chapter gives a broad overview of content sharing P2P technologies. It starts with the fundamental concept of P2P computing followed by the analysis of network topologies used in peer-to-peer systems. Next, three milestone peer-to-peer technologies: Napster, Gnutella, and Fa…

    Submitted 10 February, 2004; originally announced February 2004.

    Comments: 35 pages, 26 figures

    Report number: GRIDS-TR-2003-7 ACM Class: C.2.4

    Journal ref: Technical Report, GRIDS-TR-2003-7, Grid Computing and Distributed Systems Laboratory, University of Melbourne, Australia, December 2003