Showing 1–4 of 4 results for author: Miranda, L J V

  1. arXiv:2406.10118  [pdf, other]

    cs.CL

    SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

    Authors: Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, Ryandito Diandaru, Yuze Gao, Patrick Amadeus, Bin Wang, Jan Christian Blaise Cruz, Chenxi Whitehouse, et al. (36 additional authors not shown)

    Abstract: Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due t…

    Submitted 8 July, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: https://github.com/SEACrowd

  2. arXiv:2311.07171  [pdf, other]

    cs.CL

    calamanCy: A Tagalog Natural Language Processing Toolkit

    Authors: Lester James V. Miranda

    Abstract: We introduce calamanCy, an open-source toolkit for constructing natural language processing (NLP) pipelines for Tagalog. It is built on top of spaCy, enabling easy experimentation and integration with other frameworks. calamanCy addresses the development gap by providing a consistent API for building NLP applications and offering general-purpose multitask models with out-of-the-box support for dep…

    Submitted 13 November, 2023; originally announced November 2023.

    Comments: To be published in The Third Workshop for NLP-OSS at EMNLP 2023

  3. arXiv:2311.07161  [pdf, other]

    cs.CL

    Developing a Named Entity Recognition Dataset for Tagalog

    Authors: Lester James V. Miranda

    Abstract: We present the development of a Named Entity Recognition (NER) dataset for Tagalog. This corpus helps fill the resource gap present in Philippine languages today, where NER resources are scarce. The texts were obtained from a pretraining corpora containing news reports, and were labeled by native speakers in an iterative fashion. The resulting dataset contains ~7.8k documents across three entity t…

    Submitted 13 November, 2023; originally announced November 2023.

    Comments: To be published in The First Workshop for Southeast Asian Language Processing 2023 at IJCNLP-AACL

  4. arXiv:1910.05571  [pdf, other]

    cs.LG

    Geomancer: An Open-Source Framework for Geospatial Feature Engineering

    Authors: Lester James V. Miranda, Mark Steve Samson, Alfiero K. Orden II, Bianca S. Silmaro, Ram K. De Guzman III, Stephanie S. Sy

    Abstract: This paper presents Geomancer, an open-source framework for geospatial feature engineering. It simplifies the acquisition of geospatial attributes for downstream, large-scale machine learning tasks. Geomancer leverages any geospatial dataset stored in a data warehouse, users need only to define the features (Spells) they want to create, and cast them on any spatial dataset. In addition, these feat…

    Submitted 12 October, 2019; originally announced October 2019.