
Showing 1–12 of 12 results for author: Gessler, L

Searching in archive cs.
  1. arXiv:2403.14840  [pdf, other]

    cs.CL

    TAMS: Translation-Assisted Morphological Segmentation

    Authors: Enora Rice, Ali Marashian, Luke Gessler, Alexis Palmer, Katharina von der Wense

    Abstract: Canonical morphological segmentation is the process of analyzing words into the standard (aka underlying) forms of their constituent morphemes. This is a core task in language documentation, and NLP systems have the potential to dramatically speed up this process. But in typical language documentation settings, training data for canonical morpheme segmentation is scarce, making it difficult to tra…

    Submitted 21 March, 2024; originally announced March 2024.

    Comments: Submitted to ACL ARR on December 15th 2023

  2. arXiv:2403.13560  [pdf, other]

    cs.CL

    eRST: A Signaled Graph Theory of Discourse Relations and Organization

    Authors: Amir Zeldes, Tatsuya Aoyama, Yang Janet Liu, Siyao Peng, Debopam Das, Luke Gessler

    Abstract: In this article we present Enhanced Rhetorical Structure Theory (eRST), a new theoretical framework for computational discourse analysis, based on an expansion of Rhetorical Structure Theory (RST). The framework encompasses discourse relation graphs with tree-breaking, non-projective and concurrent relations, as well as implicit and explicit signals which give explainable rationales to our analyse…

    Submitted 28 August, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

  3. arXiv:2311.00268  [pdf, other]

    cs.CL

    Syntactic Inductive Bias in Transformer Language Models: Especially Helpful for Low-Resource Languages?

    Authors: Luke Gessler, Nathan Schneider

    Abstract: A line of work on Transformer-based language models such as BERT has attempted to use syntactic inductive bias to enhance the pretraining process, on the theory that building syntactic structure into the training process should reduce the amount of data needed for training. But such methods are often tested for high-resource languages such as English. In this work, we investigate whether these met…

    Submitted 31 October, 2023; originally announced November 2023.

    Comments: Accepted at CoNLL 2023

  4. arXiv:2306.01966  [pdf, other]

    cs.CL

    GENTLE: A Genre-Diverse Multilayer Challenge Set for English NLP and Linguistic Evaluation

    Authors: Tatsuya Aoyama, Shabnam Behzad, Luke Gessler, Lauren Levine, Jessica Lin, Yang Janet Liu, Siyao Peng, Yilun Zhu, Amir Zeldes

    Abstract: We present GENTLE, a new mixed-genre English challenge corpus totaling 17K tokens and consisting of 8 unusual text types for out-of-domain evaluation: dictionary entries, esports commentaries, legal documents, medical notes, poetry, mathematical proofs, syllabuses, and threat letters. GENTLE is manually annotated for a variety of popular NLP tasks, including syntactic dependency parsing, entity re…

    Submitted 21 September, 2023; v1 submitted 2 June, 2023; originally announced June 2023.

    Comments: Camera-ready for LAW-XVII collocated with ACL 2023

  5. arXiv:2305.12612  [pdf, other]

    cs.CL

    PrOnto: Language Model Evaluations for 859 Languages

    Authors: Luke Gessler

    Abstract: Evaluation datasets are critical resources for measuring the quality of pretrained language models. However, due to the high cost of dataset annotation, these resources are scarce for most languages other than English, making it difficult to assess the quality of language models. In this work, we present a new method for evaluation dataset construction which enables any language with a New Testame…

    Submitted 28 March, 2024; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: Accepted at LREC-COLING 2024

  6. arXiv:2212.12510  [pdf, other]

    cs.CL

    MicroBERT: Effective Training of Low-resource Monolingual BERTs through Parameter Reduction and Multitask Learning

    Authors: Luke Gessler, Amir Zeldes

    Abstract: Transformer language models (TLMs) are critical for most NLP tasks, but they are difficult to create for low-resource languages because of how much pretraining data they require. In this work, we investigate two techniques for training monolingual TLMs in a low-resource setting: greatly reducing TLM size, and complementing the masked language modeling objective with two linguistically rich supervi…

    Submitted 4 January, 2023; v1 submitted 23 December, 2022; originally announced December 2022.

    Comments: Presented at MRL at EMNLP 2022 in Abu Dhabi. Code at https://github.com/lgessler/microbert and models at https://huggingface.co/lgessler

  7. arXiv:2109.09780  [pdf, other]

    cs.CL

    BERT Has Uncommon Sense: Similarity Ranking for Word Sense BERTology

    Authors: Luke Gessler, Nathan Schneider

    Abstract: An important question concerning contextualized word embedding (CWE) models like BERT is how well they can represent different word senses, especially those in the long tail of uncommon senses. Rather than build a WSD system as in previous work, we investigate contextualized embedding neighborhoods directly, formulating a query-by-example nearest neighbor retrieval task and examining ranking perfo…

    Submitted 20 September, 2021; originally announced September 2021.

    Comments: Accepted at BlackboxNLP 2021

  8. arXiv:2109.09777  [pdf, other]

    cs.CL

    DisCoDisCo at the DISRPT2021 Shared Task: A System for Discourse Segmentation, Classification, and Connective Detection

    Authors: Luke Gessler, Shabnam Behzad, Yang Janet Liu, Siyao Peng, Yilun Zhu, Amir Zeldes

    Abstract: This paper describes our submission to the DISRPT2021 Shared Task on Discourse Unit Segmentation, Connective Detection, and Relation Classification. Our system, called DisCoDisCo, is a Transformer-based neural classifier which enhances contextualized word embeddings (CWEs) with hand-crafted features, relying on tokenwise sequence tagging for discourse segmentation and connective detection, and a f…

    Submitted 20 September, 2021; originally announced September 2021.

    Comments: System submission for the CODI-DISRPT 2021 Shared Task on Discourse Processing across Formalisms. 1st place in all subtasks

  9. arXiv:2103.14961  [pdf, other]

    cs.CL

    Supersense and Sensibility: Proxy Tasks for Semantic Annotation of Prepositions

    Authors: Luke Gessler, Shira Wein, Nathan Schneider

    Abstract: Prepositional supersense annotation is time-consuming and requires expert training. Here, we present two sensible methods for obtaining prepositional supersense annotations by eliciting surface substitution and similarity judgments. Four pilot studies suggest that both methods have potential for producing prepositional supersense annotations that are comparable in quality to expert annotations.

    Submitted 27 March, 2021; originally announced March 2021.

    Comments: Presented at LAW XIV in 2020

  10. arXiv:2006.10677  [pdf, other]

    cs.CL

    AMALGUM -- A Free, Balanced, Multilayer English Web Corpus

    Authors: Luke Gessler, Siyao Peng, Yang Liu, Yilun Zhu, Shabnam Behzad, Amir Zeldes

    Abstract: We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources, the corpus is meant to offer a more sizable alternative to smaller manua…

    Submitted 18 June, 2020; originally announced June 2020.

    Comments: Accepted at LREC 2020. See https://www.aclweb.org/anthology/2020.lrec-1.648/ (note: ACL Anthology's title is currently out of date)

    Journal ref: In Proceedings of The 12th Language Resources and Evaluation Conference (pp. 5267-5275), 2020

  11. arXiv:2004.13203  [pdf, other]

    cs.CL

    A Summary of the First Workshop on Language Technology for Language Documentation and Revitalization

    Authors: Graham Neubig, Shruti Rijhwani, Alexis Palmer, Jordan MacKenzie, Hilaria Cruz, Xinjian Li, Matthew Lee, Aditi Chaudhary, Luke Gessler, Steven Abney, Shirley Anugrah Hayati, Antonios Anastasopoulos, Olga Zamaraeva, Emily Prud'hommeaux, Jennette Child, Sara Child, Rebecca Knowles, Sarah Moeller, Jeffrey Micher, Yiyuan Li, Sydney Zink, Mengzhou Xia, Roshan S Sharma, Patrick Littell

    Abstract: Despite recent advances in natural language processing and other language technology, the application of such technology to language documentation and conservation has been limited. In August 2019, a workshop was held at Carnegie Mellon University in Pittsburgh to attempt to bring together language community members, documentary linguists, and technologists to discuss how to bridge this gap and cr…

    Submitted 27 April, 2020; originally announced April 2020.

    Comments: Accepted at SLTU-CCURL 2020

  12. arXiv:2004.10353  [pdf, other]

    cs.CL

    Supervised Grapheme-to-Phoneme Conversion of Orthographic Schwas in Hindi and Punjabi

    Authors: Aryaman Arora, Luke Gessler, Nathan Schneider

    Abstract: Hindi grapheme-to-phoneme (G2P) conversion is mostly trivial, with one exception: whether a schwa represented in the orthography is pronounced or unpronounced (deleted). Previous work has attempted to predict schwa deletion in a rule-based fashion using prosodic or phonetic analysis. We present the first statistical schwa deletion classifier for Hindi, which relies solely on the orthography as the…

    Submitted 25 April, 2020; v1 submitted 21 April, 2020; originally announced April 2020.

    Comments: 4 pages, 1 figure. To be published in the 2020 Annual Conference of the Association for Computational Linguistics (https://acl2020.org/)

    ACM Class: I.2.7