Showing 1–5 of 5 results for author: Polpanumas, C

  1. arXiv:2406.06000  [pdf]

    cs.CL

    ThaiCoref: Thai Coreference Resolution Dataset

    Authors: Pontakorn Trakuekul, Wei Qi Leong, Charin Polpanumas, Jitkapat Sawatphol, William Chandra Tjhi, Attapol T. Rutherford

    Abstract: While coreference resolution is a well-established research area in Natural Language Processing (NLP), research focusing on Thai language remains limited due to the lack of large annotated corpora. In this work, we introduce ThaiCoref, a dataset for Thai coreference resolution. Our dataset comprises 777,271 tokens, 44,082 mentions and 10,429 entities across four text genres: university essays, new…

    Submitted 9 June, 2024; originally announced June 2024.

  2. arXiv:2405.07586  [pdf, other]

    cs.CL

    Thai Universal Dependency Treebank

    Authors: Panyut Sriwirote, Wei Qi Leong, Charin Polpanumas, Santhawat Thanyawong, William Chandra Tjhi, Wirote Aroonmanakun, Attapol T. Rutherford

    Abstract: Automatic dependency parsing of Thai sentences has been underexplored, as evidenced by the lack of large Thai dependency treebanks with complete dependency structures and the lack of a published systematic evaluation of state-of-the-art models, especially transformer-based parsers. In this work, we address these problems by introducing Thai Universal Dependency Treebank (TUD), a new largest Thai t…

    Submitted 13 May, 2024; originally announced May 2024.

  3. arXiv:2312.04649  [pdf, other]

    cs.CL

    PyThaiNLP: Thai Natural Language Processing in Python

    Authors: Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, Can Udomcharoenchaikit

    Abstract: We present PyThaiNLP, a free and open-source natural language processing (NLP) library for Thai language implemented in Python. It provides a wide range of software, models, and datasets for Thai language. We first provide a brief historical context of tools for Thai language prior to the development of PyThaiNLP. We then outline the functionalities it provided as well as datasets and pre-trained…

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: 12 pages, 2 figures, LaTeX; typos corrected, timeline clarified for section 2. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 25-36, Singapore. Empirical Methods in Natural Language Processing

    ACM Class: I.2.7

  4. arXiv:2101.09635  [pdf, ps, other]

    cs.CL

    WangchanBERTa: Pretraining transformer-based Thai Language Models

    Authors: Lalita Lowphansirikul, Charin Polpanumas, Nawat Jantrakulchai, Sarana Nutanong

    Abstract: Transformer-based language models, more specifically BERT-based architectures have achieved state-of-the-art performance in many downstream tasks. However, for a relatively low-resource language such as Thai, the choices of models are limited to training a BERT-based model based on a much smaller dataset or finetuning multi-lingual models, both of which yield suboptimal downstream performance. Mor…

    Submitted 20 March, 2021; v1 submitted 23 January, 2021; originally announced January 2021.

    Comments: 24 pages, edited the citation of the syllable-level tokenizer from [Chormai et al., 2020] to [Phatthiyaphaibun et al., 2020] as the authors used the syllable-level tokenizer from PyThaiNLP [Phatthiyaphaibun et al., 2020] in the experiments

  5. scb-mt-en-th-2020: A Large English-Thai Parallel Corpus

    Authors: Lalita Lowphansirikul, Charin Polpanumas, Attapol T. Rutherford, Sarana Nutanong

    Abstract: The primary objective of our work is to build a large-scale English-Thai dataset for machine translation. We construct an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data and government documents. Methodology for gathering data, building parallel texts and re…

    Submitted 7 July, 2020; originally announced July 2020.

    Comments: 35 pages, 4 figures