
Showing 1–27 of 27 results for author: Gonen, H

  1. arXiv:2408.06518  [pdf, other]

    cs.CL

    Does Liking Yellow Imply Driving a School Bus? Semantic Leakage in Language Models

    Authors: Hila Gonen, Terra Blevins, Alisa Liu, Luke Zettlemoyer, Noah A. Smith

    Abstract: Despite their wide adoption, the biases and unintended behaviors of language models remain poorly understood. In this paper, we identify and characterize a phenomenon never discussed before, which we call semantic leakage, where models leak irrelevant information from the prompt into the generation in unexpected ways. We propose an evaluation setting to detect semantic leakage both by humans and a…

    Submitted 12 August, 2024; originally announced August 2024.

  2. arXiv:2407.08818  [pdf]

    cs.CL

    MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization

    Authors: Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Valentin Hofmann, Tomasz Limisiewicz, Yulia Tsvetkov, Noah A. Smith

    Abstract: In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost. Specifically, previous studies have reported multiple modeling biases that the current tokenization algorithms introduce to non-Latin script languages, the main one being over-segmentation. In this work, we propose MAGNET; multilingual adaptiv…

    Submitted 11 July, 2024; originally announced July 2024.

  3. arXiv:2406.19564  [pdf, other]

    cs.CL

    Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects

    Authors: Orevaoghene Ahia, Anuoluwapo Aremu, Diana Abagyan, Hila Gonen, David Ifeoluwa Adelani, Daud Abolade, Noah A. Smith, Yulia Tsvetkov

    Abstract: Yorùbá, an African language with roughly 47 million speakers, encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties for which there are little to no resources or tools. We take steps towards bridging this gap by introducing a new high-quality parallel…

    Submitted 27 June, 2024; originally announced June 2024.

  4. arXiv:2403.10691  [pdf, other]

    cs.CL cs.AI cs.LG

    MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling

    Authors: Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer

    Abstract: A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts. Although contemporary text encoding methods cover most of the world's writing systems, they exhibit bias towards the high-resource languages of the Global West. As a result, texts of underrepresented languages tend to be segmented into long sequences of linguistically m…

    Submitted 15 March, 2024; originally announced March 2024.

  5. arXiv:2401.10440  [pdf, other]

    cs.CL

    Breaking the Curse of Multilinguality with Cross-lingual Expert Language Models

    Authors: Terra Blevins, Tomasz Limisiewicz, Suchin Gururangan, Margaret Li, Hila Gonen, Noah A. Smith, Luke Zettlemoyer

    Abstract: Despite their popularity in non-English NLP, multilingual language models often underperform monolingual ones due to inter-language competition for model parameters. We propose Cross-lingual Expert Language Models (X-ELM), which mitigate this competition by independently training language models on subsets of the multilingual corpus. This process specializes X-ELMs to different languages while rem…

    Submitted 18 January, 2024; originally announced January 2024.

  6. arXiv:2311.09122  [pdf, other]

    cs.CL

    Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

    Authors: Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, Yuval Pinter

    Abstract: We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross-lingual consistent schema across 12 diverse langu…

    Submitted 29 June, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: NAACL 2024 Camera-ready

  7. arXiv:2310.14610  [pdf, other]

    cs.CL

    That was the last straw, we need more: Are Translation Systems Sensitive to Disambiguating Context?

    Authors: Jaechan Lee, Alisa Liu, Orevaoghene Ahia, Hila Gonen, Noah A. Smith

    Abstract: The translation of ambiguous text presents a challenge for translation systems, as it requires using the surrounding context to disambiguate the intended meaning as much as possible. While prior work has studied ambiguities that result from different grammatical features of the source and target language, we study semantic ambiguities that exist in the source (English in this work) itself. In part…

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023 Findings

  8. arXiv:2305.14857  [pdf, other]

    cs.CL

    BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer

    Authors: Akari Asai, Sneha Kudugunta, Xinyan Velocity Yu, Terra Blevins, Hila Gonen, Machel Reid, Yulia Tsvetkov, Sebastian Ruder, Hannaneh Hajishirzi

    Abstract: Despite remarkable advancements in few-shot generalization in natural language processing, most models are developed and evaluated primarily in English. To facilitate research on few-shot cross-lingual transfer, we introduce a new benchmark, called BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format and provides a fixed set of few-shot examples and instructi…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: The data and code are available at https://buffetfs.github.io/

  9. arXiv:2305.13707  [pdf, other]

    cs.CL

    Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models

    Authors: Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David R. Mortensen, Noah A. Smith, Yulia Tsvetkov

    Abstract: Language models have graduated from being research prototypes to commercialized products offered as web APIs, and recent works have highlighted the multilingual capabilities of these products. The API vendors charge their users based on usage, more specifically on the number of "tokens" processed or generated by the underlying language models. What constitutes a token, however, is training data…

    Submitted 23 May, 2023; originally announced May 2023.

  10. arXiv:2302.07856  [pdf, other]

    cs.CL cs.LG

    Dictionary-based Phrase-level Prompting of Large Language Models for Machine Translation

    Authors: Marjan Ghazvininejad, Hila Gonen, Luke Zettlemoyer

    Abstract: Large language models (LLMs) demonstrate remarkable machine translation (MT) abilities via prompting, even though they were not explicitly trained for this task. However, even given the incredible quantities of data they are trained on, LLMs can struggle to translate inputs with rare words, which are common in low resource or domain transfer scenarios. We show that LLM prompting can provide an eff…

    Submitted 15 February, 2023; originally announced February 2023.

  11. arXiv:2301.10472  [pdf, other]

    cs.CL cs.LG

    XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models

    Authors: Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa

    Abstract: Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multili…

    Submitted 13 October, 2023; v1 submitted 25 January, 2023; originally announced January 2023.

    Comments: EMNLP 2023

  12. arXiv:2212.10539  [pdf, other]

    cs.CL

    Toward Human Readable Prompt Tuning: Kubrick's The Shining is a good movie, and a good prompt too?

    Authors: Weijia Shi, Xiaochuang Han, Hila Gonen, Ari Holtzman, Yulia Tsvetkov, Luke Zettlemoyer

    Abstract: Large language models can perform new tasks in a zero-shot fashion, given natural language prompts that specify the desired behavior. Such prompts are typically hand engineered, but can also be learned with gradient-based methods from labeled data. However, it is underexplored what factors make the prompts effective, especially when the prompts are natural language. In this paper, we investigate c…

    Submitted 20 December, 2022; originally announced December 2022.

  13. arXiv:2212.04037  [pdf, other]

    cs.CL

    Demystifying Prompts in Language Models via Perplexity Estimation

    Authors: Hila Gonen, Srini Iyer, Terra Blevins, Noah A. Smith, Luke Zettlemoyer

    Abstract: Language models can be prompted to perform a wide variety of zero- and few-shot learning problems. However, performance varies significantly with the choice of prompt, and we do not yet understand why this happens or how to pick the best prompts. In this work, we analyze the factors that contribute to this variance and establish a new empirical hypothesis: the performance of a prompt is coupled wi…

    Submitted 7 December, 2022; originally announced December 2022.

  14. arXiv:2211.07830  [pdf, other]

    cs.CL

    Prompting Language Models for Linguistic Structure

    Authors: Terra Blevins, Hila Gonen, Luke Zettlemoyer

    Abstract: Although pretrained language models (PLMs) can be prompted to perform a wide range of language tasks, it remains an open question how much this ability comes from generalizable linguistic understanding versus surface-level lexical patterns. To test this, we present a structured prompting approach for linguistic structured prediction tasks, allowing us to perform zero- and few-shot sequence tagging…

    Submitted 20 May, 2023; v1 submitted 14 November, 2022; originally announced November 2022.

    Comments: ACL 2023

  15. arXiv:2205.11758  [pdf, other]

    cs.CL

    Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models

    Authors: Terra Blevins, Hila Gonen, Luke Zettlemoyer

    Abstract: The emergent cross-lingual transfer seen in multilingual pretrained models has sparked significant interest in studying their behavior. However, because these analyses have focused on fully trained multilingual models, little is known about the dynamics of the multilingual pretraining process. We investigate when these models acquire their in-language and cross-lingual abilities by probing checkpo…

    Submitted 22 October, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

    Comments: EMNLP 2022

  16. arXiv:2204.09168  [pdf, other]

    cs.CL

    Analyzing Gender Representation in Multilingual Models

    Authors: Hila Gonen, Shauli Ravfogel, Yoav Goldberg

    Abstract: Multilingual language models were shown to allow for nontrivial transfer across scripts and languages. In this work, we study the structure of the internal representations that enable this transfer. We focus on the representation of gender distinctions as a practical case study, and examine the extent to which the gender concept is encoded in shared subspaces across different languages. Our analys…

    Submitted 12 August, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

    Comments: Published at RepL4NLP 2022

  17. arXiv:2112.14330  [pdf, other]

    cs.CL

    Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora

    Authors: Hila Gonen, Ganesh Jawahar, Djamé Seddah, Yoav Goldberg

    Abstract: The problem of comparing two bodies of text and searching for words that differ in their usage between them arises often in digital humanities and computational social science. This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large. However, these methods often require extensive f…

    Submitted 28 December, 2021; originally announced December 2021.

    Comments: Published in ACL 2020

  18. arXiv:2104.09792  [pdf, other]

    cs.CL cs.LG

    Identifying Helpful Sentences in Product Reviews

    Authors: Iftah Gamzu, Hila Gonen, Gilad Kutiel, Ran Levy, Eugene Agichtein

    Abstract: In recent years online shopping has gained momentum and become an important venue for customers wishing to save time and simplify their shopping process. A key advantage of shopping online is the ability to read what other customers are saying about products of interest. In this work, we aim to maintain this advantage in situations where extreme brevity is needed, for example, when shopping by voi…

    Submitted 11 July, 2021; v1 submitted 20 April, 2021; originally announced April 2021.

  19. arXiv:2011.00335  [pdf, other]

    cs.CL

    Pick a Fight or Bite your Tongue: Investigation of Gender Differences in Idiomatic Language Usage

    Authors: Ella Rabinovich, Hila Gonen, Suzanne Stevenson

    Abstract: A large body of research on gender-linked language has established foundations regarding cross-gender differences in lexical, emotional, and topical preferences, along with their sociological underpinnings. We compile a novel, large and diverse corpus of spontaneous linguistic productions annotated with speakers' gender, and perform a first large-scale empirical study of distinctions in the usage…

    Submitted 31 October, 2020; originally announced November 2020.

    Comments: COLING'2020, 12 pages

  20. arXiv:2010.08275  [pdf, other]

    cs.CL

    It's not Greek to mBERT: Inducing Word-Level Translations from Multilingual BERT

    Authors: Hila Gonen, Shauli Ravfogel, Yanai Elazar, Yoav Goldberg

    Abstract: Recent works have demonstrated that multilingual BERT (mBERT) learns rich cross-lingual representations, that allow for transfer across languages. We study the word-level translation information embedded in mBERT and present two simple methods that expose remarkable translation capabilities with no fine-tuning. The results suggest that most of this information is encoded in a non-linear way, while…

    Submitted 16 October, 2020; originally announced October 2020.

    Comments: BlackboxNLP 2020

  21. arXiv:2004.14065  [pdf, other]

    cs.CL

    Automatically Identifying Gender Issues in Machine Translation using Perturbations

    Authors: Hila Gonen, Kellie Webster

    Abstract: The successful application of neural methods to machine translation has realized huge quality advances for the community. With these improvements, many have noted outstanding challenges, including the modeling and treatment of gendered language. While previous studies have identified issues using synthetic examples, we develop a novel technique to mine examples from real world data to explore chal…

    Submitted 14 October, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: Findings of EMNLP 2020

  22. arXiv:2004.07667  [pdf, other]

    cs.CL cs.LG

    Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection

    Authors: Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, Yoav Goldberg

    Abstract: The ability to control for the kinds of information encoded in neural representation has a variety of use cases, especially in light of the challenge of interpreting these models. We present Iterative Null-space Projection (INLP), a novel method for removing information from neural representations. Our method is based on repeated training of linear classifiers that predict a certain property we ai…

    Submitted 28 April, 2020; v1 submitted 16 April, 2020; originally announced April 2020.

    Comments: Accepted as a long paper in ACL 2020

  23. arXiv:1910.14161  [pdf, other]

    cs.CL

    How does Grammatical Gender Affect Noun Representations in Gender-Marking Languages?

    Authors: Hila Gonen, Yova Kementchedjhieva, Yoav Goldberg

    Abstract: Many natural languages assign grammatical gender also to inanimate nouns in the language. In such languages, words that relate to the gender-marked nouns are inflected to agree with the noun's gender. We show that this affects the word representations of inanimate nouns, resulting in nouns with the same gender being closer to each other than nouns with different gender. While "embedding debiasing"…

    Submitted 30 October, 2019; originally announced October 2019.

    Comments: CONLL 2019

  24. arXiv:1909.00871  [pdf, other]

    cs.CL cs.CY

    It's All in the Name: Mitigating Gender Bias with Name-Based Counterfactual Data Substitution

    Authors: Rowan Hall Maudslay, Hila Gonen, Ryan Cotterell, Simone Teufel

    Abstract: This paper treats gender bias latent in word embeddings. Previous mitigation attempts rely on the operationalisation of gender bias as a projection over a linear subspace. An alternative approach is Counterfactual Data Augmentation (CDA), in which a corpus is duplicated and augmented to remove bias, e.g. by swapping all inherently-gendered words in the copy. We perform an empirical comparison of t…

    Submitted 5 February, 2020; v1 submitted 2 September, 2019; originally announced September 2019.

    Comments: Correction to proof in appendix and minor changes

  25. arXiv:1903.03862  [pdf, other]

    cs.CL

    Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them

    Authors: Hila Gonen, Yoav Goldberg

    Abstract: Word embeddings are widely used in NLP for a vast range of tasks. It was shown that word embeddings derived from text corpora reflect gender biases in society. This phenomenon is pervasive and consistent across different word embedding models, causing serious concern. Several recent works tackle this problem, and propose methods for significantly reducing this gender bias in word embeddings, demon…

    Submitted 24 September, 2019; v1 submitted 9 March, 2019; originally announced March 2019.

    Comments: Accepted to NAACL 2019

  26. arXiv:1810.11895  [pdf, ps, other]

    cs.CL

    Language Modeling for Code-Switching: Evaluation, Integration of Monolingual Data, and Discriminative Training

    Authors: Hila Gonen, Yoav Goldberg

    Abstract: We focus on the problem of language modeling for code-switched language, in the context of automatic speech recognition (ASR). Language modeling for code-switched language is challenging for (at least) three reasons: (1) lack of available large-scale code-switched data for training; (2) lack of a replicable evaluation setup that is ASR directed yet isolates language modeling performance from the o…

    Submitted 10 November, 2019; v1 submitted 28 October, 2018; originally announced October 2018.

    Comments: EMNLP 2019

  27. arXiv:1611.08813  [pdf, other]

    cs.CL

    Semi Supervised Preposition-Sense Disambiguation using Multilingual Data

    Authors: Hila Gonen, Yoav Goldberg

    Abstract: Prepositions are very common and very ambiguous, and understanding their sense is critical for understanding the meaning of the sentence. Supervised corpora for the preposition-sense disambiguation task are small, suggesting a semi-supervised approach to the task. We show that signals from unannotated multilingual data can be used to improve supervised preposition-sense disambiguation. Our approac…

    Submitted 27 November, 2016; originally announced November 2016.

    Comments: 12 pages; COLING 2016