Showing 1–13 of 13 results for author: Mahendra, R

  1. arXiv:2406.10118  [pdf, other]

    cs.CL

    SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

    Authors: Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, Ryandito Diandaru, Yuze Gao, Patrick Amadeus, Bin Wang, Jan Christian Blaise Cruz, Chenxi Whitehouse, et al. (36 additional authors not shown)

    Abstract: Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due t…

    Submitted 8 July, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: https://github.com/SEACrowd

  2. arXiv:2404.01854  [pdf, other]

    cs.CL

    IndoCulture: Exploring Geographically-Influenced Cultural Commonsense Reasoning Across Eleven Indonesian Provinces

    Authors: Fajri Koto, Rahmad Mahendra, Nurul Aisyah, Timothy Baldwin

    Abstract: Although commonsense reasoning is greatly shaped by cultural and geographical factors, previous studies on language models have predominantly centered on English cultures, potentially resulting in an Anglocentric bias. In this paper, we introduce IndoCulture, aimed at understanding the influence of geographical factors on language model reasoning ability, with a specific emphasis on the diverse cu…

    Submitted 2 April, 2024; originally announced April 2024.

  3. arXiv:2212.09648  [pdf, other]

    cs.CL cs.AI

    NusaCrowd: Open Source Initiative for Indonesian NLP Resources

    Authors: Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Indra Winata, Bryan Wilie, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Fajri Koto, Jennifer Santoso, David Moeljadi, Cahya Wirawan, Frederikus Hudi, Ivan Halim Parmonangan, Ika Alfina, Muhammad Satrio Wicaksono, Ilham Firdausi Putra, Samsul Rahmadani, Yulianti Oenang, Ali Akbar Septiandri, James Jaya, Kaustubh D. Dhole, Arie Ardiyanti Suryani, Rifki Afina Putri, et al. (22 additional authors not shown)

    Abstract: We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple exp…

    Submitted 21 July, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

  4. arXiv:2207.10524  [pdf, other]

    cs.CL cs.AI

    NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages

    Authors: Samuel Cahyawijaya, Alham Fikri Aji, Holy Lovenia, Genta Indra Winata, Bryan Wilie, Rahmad Mahendra, Fajri Koto, David Moeljadi, Karissa Vincentio, Ade Romadhony, Ayu Purwarianti

    Abstract: At the center of the underlying issues that halt Indonesian natural language processing (NLP) research advancement, we find data scarcity. Resources in Indonesian languages, especially the local ones, are extremely scarce and underrepresented. Many Indonesian researchers do not publish their dataset. Furthermore, the few public datasets that we have are scattered across different platforms, thus m…

    Submitted 1 August, 2022; v1 submitted 21 July, 2022; originally announced July 2022.

  5. arXiv:2206.15359  [pdf, other]

    cs.CL cs.SI

    Two-Stage Classifier for COVID-19 Misinformation Detection Using BERT: a Study on Indonesian Tweets

    Authors: Douglas Raevan Faisal, Rahmad Mahendra

    Abstract: The COVID-19 pandemic has caused globally significant impacts since the beginning of 2020. This brought a lot of confusion to society, especially due to the spread of misinformation through social media. Although there were already several studies related to the detection of misinformation in social media data, most studies focused on the English dataset. Research on COVID-19 misinformation detect…

    Submitted 30 June, 2022; originally announced June 2022.

    Comments: 29 pages, 5 figures, submitted to Elsevier Journal

  6. arXiv:2205.15960  [pdf, other]

    cs.CL

    NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

    Authors: Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, Ade Romadhony, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Pascale Fung, Timothy Baldwin, Jey Han Lau, Rico Sennrich, Sebastian Ruder

    Abstract: Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing re…

    Submitted 12 April, 2023; v1 submitted 31 May, 2022; originally announced May 2022.

    Comments: EACL 2023

  7. arXiv:2203.13357  [pdf, other]

    cs.CL

    One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia

    Authors: Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya, Ade Romadhony, Rahmad Mahendra, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Timothy Baldwin, Jey Han Lau, Sebastian Ruder

    Abstract: NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation of the world, we provide an overview of the current state of NLP research for Indonesia's 700+ languages. We highlight challenges in Indonesian N…

    Submitted 24 March, 2022; originally announced March 2022.

    Comments: Accepted in ACL 2022

  8. arXiv:2202.07858  [pdf, ps, other]

    cs.CL cs.IR

    ITTC @ TREC 2021 Clinical Trials Track

    Authors: Thinh Hung Truong, Yulia Otmakhova, Rahmad Mahendra, Timothy Baldwin, Jey Han Lau, Trevor Cohn, Lawrence Cavedon, Damiano Spina, Karin Verspoor

    Abstract: This paper describes the submissions of the Natural Language Processing (NLP) team from the Australian Research Council Industrial Transformation Training Centre (ITTC) for Cognitive Computing in Medical Technologies to the TREC 2021 Clinical Trials Track. The task focuses on the problem of matching eligible clinical trials to topics constituting a summary of a patient's admission notes. We explor…

    Submitted 15 February, 2022; originally announced February 2022.

    Comments: 7 pages

  9. IndoNLI: A Natural Language Inference Dataset for Indonesian

    Authors: Rahmad Mahendra, Alham Fikri Aji, Samuel Louvan, Fahrurrozi Rahman, Clara Vania

    Abstract: We present IndoNLI, the first human-elicited NLI dataset for Indonesian. We adapt the data collection protocol for MNLI and collect nearly 18K sentence pairs annotated by crowd workers and experts. The expert-annotated data is used exclusively as a test set. It is designed to provide a challenging test-bed for Indonesian NLI by explicitly incorporating various linguistic phenomena such as numerica…

    Submitted 27 October, 2021; originally announced October 2021.

    Comments: Accepted at EMNLP 2021 main conference

    Journal ref: https://aclanthology.org/2021.emnlp-main.821/

  10. arXiv:2107.04798  [pdf, other]

    cs.DM math.CO

    Hamiltonicity: Variants and Generalization in $P_5$-free Chordal Bipartite graphs

    Authors: S. Aadhavan, R. Mahendra Kumar, P. Renjith, N. Sadagopan

    Abstract: A bipartite graph is chordal bipartite if every cycle of length at least six has a chord in it. Müller \cite{muller1996Hamiltonian} has shown that the Hamiltonian cycle problem is NP-complete on chordal bipartite graphs by presenting a polynomial-time reduction from the satisfiability problem. The microscopic view of the reduction instances reveals that the instances are $P_9$-free c…

    Submitted 10 July, 2021; originally announced July 2021.

    Comments: 23 pages, 8 figures

    MSC Class: 05C45; 05C38; 05C85

  11. arXiv:2011.03286  [pdf, other]

    cs.CL

    Semi-Supervised Low-Resource Style Transfer of Indonesian Informal to Formal Language with Iterative Forward-Translation

    Authors: Haryo Akbarianto Wibowo, Tatag Aziz Prawiro, Muhammad Ihsan, Alham Fikri Aji, Radityo Eko Prasojo, Rahmad Mahendra, Suci Fitriany

    Abstract: In its daily use, the Indonesian language is riddled with informality, that is, deviations from the standard in terms of vocabulary, spelling, and word order. On the other hand, current available Indonesian NLP models are typically developed with the standard Indonesian in mind. In this work, we address a style-transfer from informal to formal Indonesian as a low-resource machine translation probl…

    Submitted 22 December, 2020; v1 submitted 6 November, 2020; originally announced November 2020.

    Comments: 6 pages, Camera ready to be presented at IALP 2020

    MSC Class: 68T50

  12. arXiv:2009.05387  [pdf, other]

    cs.CL

    IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding

    Authors: Bryan Wilie, Karissa Vincentio, Genta Indra Winata, Samuel Cahyawijaya, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, Ayu Purwarianti

    Abstract: Although Indonesian is known to be the fourth most frequently used language over the internet, the research progress on this language in the natural language processing (NLP) is slow-moving due to a lack of available resources. In response, we introduce the first-ever vast resource for the training, evaluating, and benchmarking on Indonesian natural language understanding (IndoNLU) tasks. IndoNLU…

    Submitted 8 October, 2020; v1 submitted 11 September, 2020; originally announced September 2020.

    Comments: This paper will be presented in AACL-IJCNLP 2020 (with new results and acknowledgment)

  13. arXiv:1806.01523  [pdf, other]

    cs.CL

    Multi-Task Active Learning for Neural Semantic Role Labeling on Low Resource Conversational Corpus

    Authors: Fariz Ikhwantri, Samuel Louvan, Kemal Kurniawan, Bagas Abisena, Valdi Rachman, Alfan Farizki Wicaksono, Rahmad Mahendra

    Abstract: Most Semantic Role Labeling (SRL) approaches are supervised methods which require a significant amount of annotated corpus, and the annotation requires linguistic expertise. In this paper, we propose a Multi-Task Active Learning framework for Semantic Role Labeling with Entity Recognition (ER) as the auxiliary task to alleviate the need for extensive data and use additional information from ER to…

    Submitted 5 June, 2018; originally announced June 2018.

    Comments: ACL 2018 workshop on Deep Learning Approaches for Low-Resource NLP