Showing 1–33 of 33 results for author: Pinter, Y

  1. arXiv:2407.01334  [pdf, other]

    cs.CL cs.CR

    Protecting Privacy in Classifiers by Token Manipulation

    Authors: Re'em Harel, Yair Elboher, Yuval Pinter

    Abstract: Using language models as a remote service entails sending private information to an untrusted provider. In addition, potential eavesdroppers can intercept the messages, thereby exposing the information. In this work, we explore the prospects of avoiding such data exposure at the level of text manipulation. We focus on text classification models, examining various token mapping and contextualized m…

    Submitted 3 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

    Comments: PrivateNLP@ACL 2024

  2. arXiv:2404.13292  [pdf, other]

    cs.CL cs.AI

    Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge

    Authors: Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella

    Abstract: The popular subword tokenizers of current language models, such as Byte-Pair Encoding (BPE), are known not to respect morpheme boundaries, which affects the downstream performance of the models. While many improved tokenization algorithms have been proposed, their evaluation and cross-comparison is still an open problem. As a solution, we propose a combined intrinsic-extrinsic evaluation framework…

    Submitted 20 April, 2024; originally announced April 2024.

  3. arXiv:2404.00397  [pdf, other]

    cs.CL

    An Analysis of BPE Vocabulary Trimming in Neural Machine Translation

    Authors: Marco Cognetta, Tatsuya Hiraoka, Naoaki Okazaki, Rico Sennrich, Yuval Pinter

    Abstract: We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords. The technique is available in popular tokenization libraries but has not been subjected to rigorous scientific scrutiny. While the removal of rare subwords is suggested as best practice in machine translation implementations, both as…

    Submitted 30 March, 2024; originally announced April 2024.

    Comments: 15 pages

  4. arXiv:2403.03521  [pdf, other]

    cs.CL

    BiVert: Bidirectional Vocabulary Evaluation using Relations for Machine Translation

    Authors: Carinne Cherf, Yuval Pinter

    Abstract: Neural machine translation (NMT) has progressed rapidly in the past few years, promising improvements and quality translations for different languages. Evaluation of this task is crucial to determine the quality of the translation. Overall, insufficient emphasis is placed on the actual sense of the translation in traditional methods. We propose a bidirectional semantic-based evaluation method desi…

    Submitted 6 March, 2024; originally announced March 2024.

    Comments: LREC-COLING 2024

  5. arXiv:2403.01289  [pdf, other]

    cs.CL

    Greed is All You Need: An Evaluation of Tokenizer Inference Methods

    Authors: Omri Uzan, Craig W. Schmidt, Chris Tanner, Yuval Pinter

    Abstract: While subword tokenizers such as BPE and WordPiece are typically used to build vocabularies for NLP models, the method of decoding text into a sequence of tokens from these vocabularies is often left unspecified, or ill-suited to the method in which they were constructed. We provide a controlled analysis of seven tokenizer inference methods across four different algorithms and three vocabulary siz…

    Submitted 31 May, 2024; v1 submitted 2 March, 2024; originally announced March 2024.

    Comments: ACL 2024 (main)

  6. arXiv:2402.18376  [pdf, other]

    cs.CL cs.AI

    Tokenization Is More Than Compression

    Authors: Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

    Abstract: Tokenization is a foundational step in Natural Language Processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has been suggested that the effectiveness of BPE stems from its ability to condense text into a relatively small number of tokens. We test the hypothesis that fewer…

    Submitted 28 February, 2024; originally announced February 2024.

    MSC Class: 68T50 ACM Class: I.2.7

  7. arXiv:2402.09126  [pdf, other]

    cs.DC cs.AI cs.CL cs.LG cs.SE

    MPIrigen: MPI Code Generation through Domain-Specific Language Models

    Authors: Nadav Schneider, Niranjan Hasabnis, Vy A. Vo, Tal Kadosh, Neva Krien, Mihai Capotă, Guy Tamir, Ted Willke, Nesreen Ahmed, Yuval Pinter, Timothy Mattson, Gal Oren

    Abstract: The imperative need to scale computation across numerous nodes highlights the significance of efficient parallel computing, particularly in the realm of Message Passing Interface (MPI) integration. The challenging parallel programming task of generating MPI-based parallel programs has remained unexplored. This study first investigates the performance of state-of-the-art language models in generati…

    Submitted 23 April, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

  8. arXiv:2312.13322  [pdf, other]

    cs.PL cs.AI cs.LG cs.SE

    Domain-Specific Code Language Models: Unraveling the Potential for HPC Codes and Tasks

    Authors: Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Mihai Capota, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren

    Abstract: With easier access to powerful compute resources, there is a growing trend in AI for software development to develop larger language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size and demand expensive compute resources for training. This is partly because these LLMs for HPC tasks are obtained by…

    Submitted 20 December, 2023; originally announced December 2023.

  9. arXiv:2312.11779  [pdf, other]

    cs.CL cs.AI cs.LG

    Tokenization Matters: Navigating Data-Scarce Tokenization for Gender Inclusive Language Technologies

    Authors: Anaelia Ovalle, Ninareh Mehrabi, Palash Goyal, Jwala Dhamala, Kai-Wei Chang, Richard Zemel, Aram Galstyan, Yuval Pinter, Rahul Gupta

    Abstract: Gender-inclusive NLP research has documented the harmful limitations of gender binary-centric large language models (LLM), such as the inability to correctly use gender-diverse English neopronouns (e.g., xe, zir, fae). While data scarcity is a known culprit, the precise mechanisms through which scarcity affects this behavior remain underexplored. We discover LLM misgendering is significantly influ…

    Submitted 6 April, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

    Comments: Accepted to NAACL 2024 findings

  10. arXiv:2311.09122  [pdf, other]

    cs.CL

    Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

    Authors: Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, Yuval Pinter

    Abstract: We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross-lingual consistent schema across 12 diverse langu…

    Submitted 29 June, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: NAACL 2024 Camera-ready

  11. arXiv:2310.13348  [pdf, other]

    cs.CL

    Analyzing Cognitive Plausibility of Subword Tokenization

    Authors: Lisa Beinborn, Yuval Pinter

    Abstract: Subword tokenization has become the de-facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect of a tokenization algorithm on the performance in downstream tasks, or on engineering criteria such as the compression rate. We present a new evaluation paradigm that focuses on the cognit…

    Submitted 20 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023 (main)

  12. arXiv:2310.11958  [pdf, other]

    cs.CL cs.LG

    Emptying the Ocean with a Spoon: Should We Edit Models?

    Authors: Yuval Pinter, Michael Elhadad

    Abstract: We call into question the recently popularized method of direct model editing as a means of correcting factual errors in LLM generations. We contrast model editing with three similar but distinct approaches that pursue better defined objectives: (1) retrieval-based architectures, which decouple factual memory from inference and linguistic capabilities embodied in LLMs; (2) concept erasure methods,…

    Submitted 18 October, 2023; originally announced October 2023.

    Comments: Findings of ACL: EMNLP 2023

  13. arXiv:2308.09440  [pdf, other]

    cs.CL cs.PL

    Scope is all you need: Transforming LLMs for HPC Code

    Authors: Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren

    Abstract: With easier access to powerful compute resources, there is a growing trend in the field of AI for software development to develop larger and larger language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size (e.g., billions of parameters) and demand expensive compute resources for training. We found…

    Submitted 29 September, 2023; v1 submitted 18 August, 2023; originally announced August 2023.

  14. arXiv:2308.08002  [pdf, ps, other]

    cs.DC cs.DB

    Quantifying OpenMP: Statistical Insights into Usage and Adoption

    Authors: Tal Kadosh, Niranjan Hasabnis, Timothy Mattson, Yuval Pinter, Gal Oren

    Abstract: In high-performance computing (HPC), the demand for efficient parallel programming models has grown dramatically since the end of Dennard Scaling and the subsequent move to multi-core CPUs. OpenMP stands out as a popular choice due to its simplicity and portability, offering a directive-driven approach for shared-memory parallel programming. Despite its wide adoption, however, there is a lack of c…

    Submitted 17 August, 2023; v1 submitted 15 August, 2023; originally announced August 2023.

  15. arXiv:2305.11999  [pdf, other]

    cs.DC cs.AI cs.LG cs.PF

    Advising OpenMP Parallelization via a Graph-Based Approach with Transformers

    Authors: Tal Kadosh, Nadav Schneider, Niranjan Hasabnis, Timothy Mattson, Yuval Pinter, Gal Oren

    Abstract: There is an ever-present need for shared memory parallelization schemes to exploit the full potential of multi-core architectures. The most common parallelization API addressing this need today is OpenMP. Nevertheless, writing parallel code manually is complex and effort-intensive. Thus, many deterministic source-to-source (S2S) compilers have emerged, intending to automate the process of translat…

    Submitted 16 May, 2023; originally announced May 2023.

  16. arXiv:2305.09438  [pdf, other]

    cs.DC cs.CL cs.LG

    MPI-rical: Data-Driven MPI Distributed Parallelism Assistance with Transformers

    Authors: Nadav Schneider, Tal Kadosh, Niranjan Hasabnis, Timothy Mattson, Yuval Pinter, Gal Oren

    Abstract: Message Passing Interface (MPI) plays a crucial role in distributed memory parallelization across multiple nodes. However, parallelizing MPI code manually, and specifically, performing domain decomposition, is a challenging, error-prone task. In this paper, we address this problem by developing MPI-RICAL, a novel data-driven, programming-assistance tool that assists programmers in writing domain d…

    Submitted 30 August, 2023; v1 submitted 16 May, 2023; originally announced May 2023.

  17. arXiv:2210.07095  [pdf, other]

    cs.CL

    Incorporating Context into Subword Vocabularies

    Authors: Shaked Yehezkel, Yuval Pinter

    Abstract: Most current popular subword tokenizers are trained based on word frequency statistics over a corpus, without considering information about co-occurrence or context. Nevertheless, the resulting vocabularies are used in language models' highly contextualized settings. We present SaGe, a tokenizer that tailors subwords for their downstream use by baking in the contextualized signal at the vocabulary…

    Submitted 10 February, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

    Comments: EACL 2023

  18. arXiv:2208.01561  [pdf, other]

    cs.CL

    Lost in Space Marking

    Authors: Cassandra L. Jacobs, Yuval Pinter

    Abstract: We look at a decision taken early in training a subword tokenizer, namely whether it should be the word-initial token that carries a special mark, or the word-final one. Based on surface-level considerations of efficiency and cohesion, as well as morphological coverage, we find that a Unigram LM tokenizer trained on pre-tokenized English text is better off marking the word-initial token, while one…

    Submitted 2 August, 2022; originally announced August 2022.

    Comments: Submission to SIGMORPHON 2021

  19. arXiv:2205.03608  [pdf, other]

    cs.CL

    UniMorph 4.0: Universal Morphology

    Authors: Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Benoît Sagot, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, et al. (71 additional authors not shown)

    Abstract: The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This pa…

    Submitted 19 June, 2022; v1 submitted 7 May, 2022; originally announced May 2022.

    Comments: LREC 2022; The first two authors made equal contributions

  20. arXiv:2204.12835  [pdf, other]

    cs.DC cs.CL cs.LG

    Learning to Parallelize in a Shared-Memory Environment with Transformers

    Authors: Re'em Harel, Yuval Pinter, Gal Oren

    Abstract: In past years, the world has switched to many-core and multi-core shared memory architectures. As a result, there is a growing need to utilize these architectures by introducing shared memory parallelization schemes to software applications. OpenMP is the most comprehensive API that implements such schemes, characterized by a readable interface. Nevertheless, introducing OpenMP into code is challe…

    Submitted 14 July, 2022; v1 submitted 27 April, 2022; originally announced April 2022.

  21. arXiv:2109.04876  [pdf, ps, other]

    cs.CL cs.LG

    Integrating Approaches to Word Representation

    Authors: Yuval Pinter

    Abstract: The problem of representing the atomic elements of language in modern neural learning systems is one of the central challenges of the field of natural language processing. I present a survey of the distributional, compositional, and relational approaches to addressing this task, and discuss various means of integrating them into systems, with special emphasis on the word level and the out-of-vocab…

    Submitted 10 September, 2021; originally announced September 2021.

    Comments: Adapted dissertation introduction

  22. arXiv:2108.00391  [pdf, other]

    cs.CL

    Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information

    Authors: Yuval Pinter, Amanda Stent, Mark Dredze, Jacob Eisenstein

    Abstract: Commonly-used transformer language models depend on a tokenization schema which sets an unchangeable subword vocabulary prior to pre-training, destined to be applied to all downstream tasks regardless of domain shift, novel word formations, or other sources of vocabulary mismatch. Recent work has shown that "token-free" models can be trained directly on characters or bytes, but training these mode…

    Submitted 1 August, 2021; originally announced August 2021.

  23. arXiv:2105.05209  [pdf, other]

    cs.CL

    Restoring Hebrew Diacritics Without a Dictionary

    Authors: Elazar Gershuni, Yuval Pinter

    Abstract: We demonstrate that it is feasible to diacritize Hebrew script without any human-curated resources other than plain diacritized text. We present NAKDIMON, a two-layer character level LSTM, that performs on par with much more complicated curation-dependent systems, across a diverse array of modern Hebrew sources.

    Submitted 10 May, 2022; v1 submitted 11 May, 2021; originally announced May 2021.

    Comments: Findings of NAACL 2022 (in press). 6 pages, 1 figure

  24. arXiv:2009.09123  [pdf, other]

    cs.CL cs.AI

    Will it Unblend?

    Authors: Yuval Pinter, Cassandra L. Jacobs, Jacob Eisenstein

    Abstract: Natural language processing systems often struggle with out-of-vocabulary (OOV) terms, which do not appear in training data. Blends, such as "innoventor", are one particularly challenging class of OOV, as they are formed by fusing together two or more bases that relate to the intended meaning in unpredictable manners and degrees. In this work, we run experiments on a novel dataset of English OOV b…

    Submitted 18 September, 2020; originally announced September 2020.

    Comments: Findings of EMNLP 2020

  25. arXiv:2005.00115  [pdf, other]

    cs.CL cs.AI cs.LG

    Learning to Faithfully Rationalize by Construction

    Authors: Sarthak Jain, Sarah Wiegreffe, Yuval Pinter, Byron C. Wallace

    Abstract: In many settings it is important for one to be able to understand why a model made a particular prediction. In NLP this often entails extracting snippets of an input text 'responsible for' corresponding model output; when such a snippet comprises tokens that indeed informed the model's prediction, it is a faithful explanation. In some settings, faithfulness may be critical to ensure transparency.…

    Submitted 30 April, 2020; originally announced May 2020.

    Comments: ACL2020 Camera Ready Submission

  26. arXiv:2003.03444  [pdf, ps, other]

    cs.CL

    NYTWIT: A Dataset of Novel Words in the New York Times

    Authors: Yuval Pinter, Cassandra L. Jacobs, Max Bittker

    Abstract: We present the New York Times Word Innovation Types dataset, or NYTWIT, a collection of over 2,500 novel English words published in the New York Times between November 2017 and March 2019, manually annotated for their class of novelty (such as lexical derivation, dialectal variation, blending, or compounding). We present baseline results for both uncontextual and contextual prediction of novelty c…

    Submitted 23 October, 2020; v1 submitted 6 March, 2020; originally announced March 2020.

    Comments: COLING 2020

  27. arXiv:1912.06876  [pdf, other]

    cs.LG stat.ML

    Attending Form and Context to Generate Specialized Out-of-Vocabulary Words Representations

    Authors: Nicolas Garneau, Jean-Samuel Leboeuf, Yuval Pinter, Luc Lamontagne

    Abstract: We propose a new contextual-compositional neural network layer that handles out-of-vocabulary (OOV) words in natural language processing (NLP) tagging tasks. This layer consists of a model that attends to both the character sequence and the context in which the OOV words appear. We show that our model learns to generate task-specific and sentence-dependent OOV word representations without…

    Submitted 14 December, 2019; originally announced December 2019.

  28. arXiv:1908.04626  [pdf, other]

    cs.CL

    Attention is not not Explanation

    Authors: Sarah Wiegreffe, Yuval Pinter

    Abstract: Attention mechanisms play a central role in NLP systems, especially within recurrent neural network (RNN) models. Recently, there has been increasing interest in whether or not the intermediate representations offered by these modules may be used to explain the reasoning for a model's prediction, and consequently reach insights regarding the model's decision-making process. A recent paper claims t…

    Submitted 5 September, 2019; v1 submitted 13 August, 2019; originally announced August 2019.

    Comments: Accepted to EMNLP 2019; related blog post at https://medium.com/@yuvalpinter/attention-is-not-not-explanation-dbc25b534017

  29. arXiv:1903.05041  [pdf, other]

    cs.CL

    Character Eyes: Seeing Language through Character-Level Taggers

    Authors: Yuval Pinter, Marc Marone, Jacob Eisenstein

    Abstract: Character-level models have been used extensively in recent years in NLP tasks as both supplements and replacements for closed-vocabulary token-level word representations. In one popular architecture, character-level LSTMs are used to feed token representations into a sequence tagger predicting token-level annotations such as part-of-speech (POS) tags. In this work, we examine the behavior of POS…

    Submitted 12 March, 2019; originally announced March 2019.

  30. arXiv:1808.08644  [pdf, ps, other]

    cs.CL

    Predicting Semantic Relations using Global Graph Properties

    Authors: Yuval Pinter, Jacob Eisenstein

    Abstract: Semantic graphs, such as WordNet, are resources which curate natural language on two distinguishable layers. On the local level, individual relations between synsets (semantic building blocks) such as hypernymy and meronymy enhance our understanding of the words used to express their meanings. Globally, analysis of graph-theoretic properties of the entire net sheds light on the structure of human…

    Submitted 26 August, 2018; originally announced August 2018.

    Comments: EMNLP 2018

  31. arXiv:1804.05088  [pdf, ps, other]

    cs.CL cs.SI

    Sí o no, què penses? Catalonian Independence and Linguistic Identity on Social Media

    Authors: Ian Stewart, Yuval Pinter, Jacob Eisenstein

    Abstract: Political identity is often manifested in language variation, but the relationship between the two is still relatively unexplored from a quantitative perspective. This study examines the use of Catalan, a language local to the semi-autonomous region of Catalonia in Spain, on Twitter in discourse related to the 2017 independence referendum. We corroborate prior findings that pro-independence tweets…

    Submitted 13 April, 2018; originally announced April 2018.

    Comments: NAACL 2018

  32. arXiv:1707.06961  [pdf, other]

    cs.CL

    Mimicking Word Embeddings using Subword RNNs

    Authors: Yuval Pinter, Robert Guthrie, Jacob Eisenstein

    Abstract: Word embeddings improve generalization over lexical features by placing each word in a lower-dimensional space, using distributional information obtained from unlabeled data. However, the effectiveness of word embeddings for downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which embeddings do not exist. In this paper, we present MIMICK, an approach to generating OOV word embed…

    Submitted 21 July, 2017; originally announced July 2017.

    Comments: EMNLP 2017

  33. arXiv:1605.02945  [pdf, ps, other]

    cs.CL cs.IR

    The Yahoo Query Treebank, V. 1.0

    Authors: Yuval Pinter, Roi Reichart, Idan Szpektor

    Abstract: A description and annotation guidelines for the Yahoo Webscope release of Query Treebank, Version 1.0, May 2016.

    Submitted 11 May, 2016; v1 submitted 10 May, 2016; originally announced May 2016.

    Comments: Co-released with the Webscope Dataset (L-28) and with Pinter et al., Syntactic Parsing of Web Queries with Question Intent, NAACL-HLT 2016