Showing 1–22 of 22 results for author: Goyal, T

Searching in archive cs.
  1. arXiv:2408.09840  [pdf, other]

    cs.LG math.NA physics.comp-ph

    Machine Learning with Physics Knowledge for Prediction: A Survey

    Authors: Joe Watson, Chen Song, Oliver Weeger, Theo Gruner, An T. Le, Kay Hansel, Ahmed Hendawy, Oleg Arenz, Will Trojak, Miles Cranmer, Carlo D'Eramo, Fabian Bülow, Tanmay Goyal, Jan Peters, Martin W. Hoffman

    Abstract: This survey examines the broad suite of methods and models for combining machine learning with physics knowledge for prediction and forecast, with a focus on partial differential equations. These methods have attracted significant interest due to their potential impact on advancing scientific research and industrial practices by improving predictive models with small- or large-scale datasets and e…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: 56 pages, 8 figures, 2 tables

  2. arXiv:2407.18940  [pdf, other]

    cs.IR cs.AI cs.CL cs.DL cs.LG

    LitSearch: A Retrieval Benchmark for Scientific Literature Search

    Authors: Anirudh Ajith, Mengzhou Xia, Alexis Chevalier, Tanya Goyal, Danqi Chen, Tianyu Gao

    Abstract: Literature search questions, such as "where can I find research on the evaluation of consistency in generated summaries?" pose significant challenges for modern search engines and retrieval systems. These questions often require a deep understanding of research concepts and the ability to reason over entire articles. In this work, we introduce LitSearch, a retrieval benchmark comprising 597 realis…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: Dataset and code available at https://github.com/princeton-nlp/LitSearch

  3. arXiv:2407.17468  [pdf, other]

    cs.CL cs.AI

    WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries

    Authors: Wenting Zhao, Tanya Goyal, Yu Ying Chiu, Liwei Jiang, Benjamin Newman, Abhilasha Ravichander, Khyathi Chandu, Ronan Le Bras, Claire Cardie, Yuntian Deng, Yejin Choi

    Abstract: While hallucinations of large language models (LLMs) prevail as a major challenge, existing evaluation benchmarks on factuality do not cover the diverse domains of knowledge that the real-world users of LLMs seek information about. To bridge this gap, we introduce WildHallucinations, a benchmark that evaluates factuality. It does so by prompting LLMs to generate information about entities mined fr…

    Submitted 24 July, 2024; originally announced July 2024.

  4. arXiv:2406.16264  [pdf, other]

    cs.CL cs.AI

    One Thousand and One Pairs: A "novel" challenge for long-context language models

    Authors: Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, Mohit Iyyer

    Abstract: Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? We address this question by creating NoCha, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently-published English fictional books, wr…

    Submitted 18 July, 2024; v1 submitted 23 June, 2024; originally announced June 2024.

    Comments: preprint, 37 pages

  5. arXiv:2405.01511  [pdf, other]

    cs.CL

    D2PO: Discriminator-Guided DPO with Response Evaluation Models

    Authors: Prasann Singhal, Nathan Lambert, Scott Niekum, Tanya Goyal, Greg Durrett

    Abstract: Varied approaches for aligning language models have been proposed, including supervised fine-tuning, RLHF, and direct optimization methods such as DPO. Although DPO has rapidly gained popularity due to its straightforward training process and competitive results, there is an open question of whether there remain practical advantages of using a discriminator, like a reward model, to evaluate respon…

    Submitted 6 August, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

    Comments: 20 pages, 12 figures, Accepted to COLM 2024

  6. arXiv:2404.01261  [pdf, other]

    cs.CL cs.AI

    FABLES: Evaluating faithfulness and content selection in book-length summarization

    Authors: Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal, Mohit Iyyer

    Abstract: While long-context large language models (LLMs) can technically summarize book-length documents (>100K tokens), the length and complexity of the documents have so far prohibited evaluations of input-dependent aspects like faithfulness. In this paper, we conduct the first large-scale human evaluation of faithfulness and content selection on LLM-generated summaries of fictional books. Our study miti…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: preprint - 39 pages

  7. arXiv:2403.20147  [pdf, other]

    cs.CL

    IndiBias: A Benchmark Dataset to Measure Social Biases in Language Models for Indian Context

    Authors: Nihar Ranjan Sahoo, Pranamya Prashant Kulkarni, Narjis Asad, Arif Ahmad, Tanu Goyal, Aparna Garimella, Pushpak Bhattacharyya

    Abstract: The pervasive influence of social biases in language data has sparked the need for benchmark datasets that capture and evaluate these biases in Large Language Models (LLMs). Existing efforts predominantly focus on English language and the Western context, leaving a void for a reliable dataset that encapsulates India's unique socio-cultural nuances. To bridge this gap, we introduce IndiBias, a comp…

    Submitted 3 April, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

  8. arXiv:2310.07641  [pdf, other]

    cs.CL cs.LG

    Evaluating Large Language Models at Evaluating Instruction Following

    Authors: Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, Danqi Chen

    Abstract: As research in large language models (LLMs) continues to accelerate, LLM-based evaluation has emerged as a scalable and cost-effective alternative to human evaluations for comparing the ever increasing list of models. This paper investigates the efficacy of these "LLM evaluators", particularly in using them to assess instruction following, a metric that gauges how closely generated text adheres…

    Submitted 16 April, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

  9. arXiv:2310.03716  [pdf, other]

    cs.CL cs.LG

    A Long Way to Go: Investigating Length Correlations in RLHF

    Authors: Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett

    Abstract: Great success has been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models, with open preference datasets enabling wider experimentation, particularly for "helpfulness" in tasks like dialogue and web question answering. Alongside these improvements, however, RLHF also often drives models to produce longer outputs. This paper demonstrates, on three divers…

    Submitted 10 July, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

    Comments: 21 pages, 13 figures, Accepted to COLM 2024

  10. arXiv:2310.00785  [pdf, other]

    cs.CL cs.AI cs.LG

    BooookScore: A systematic exploration of book-length summarization in the era of LLMs

    Authors: Yapei Chang, Kyle Lo, Tanya Goyal, Mohit Iyyer

    Abstract: Summarizing book-length documents (>100K tokens) that exceed the context window size of large language models (LLMs) requires first breaking the input document into smaller chunks and then prompting an LLM to merge, update, and compress chunk-level summaries. Despite the complexity and importance of this task, it has yet to be meaningfully studied due to the challenges of evaluation: existing book…

    Submitted 13 April, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

    Comments: ICLR 2024 camera-ready (updated Figure 1 and Table 2; corrected minor details in the explanation of hierarchical merging)

  11. arXiv:2303.01432  [pdf, other]

    cs.CL

    WiCE: Real-World Entailment for Claims in Wikipedia

    Authors: Ryo Kamoi, Tanya Goyal, Juan Diego Rodriguez, Greg Durrett

    Abstract: Textual entailment models are increasingly applied in settings like fact-checking, presupposition verification in question answering, or summary evaluation. However, these represent a significant domain shift from existing entailment datasets, and models underperform as a result. We propose WiCE, a new fine-grained textual entailment dataset built on natural claim and evidence pairs extracted from…

    Submitted 22 October, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

    Comments: EMNLP 2023

  12. arXiv:2210.06748  [pdf, other]

    cs.CL

    Shortcomings of Question Answering Based Factuality Frameworks for Error Localization

    Authors: Ryo Kamoi, Tanya Goyal, Greg Durrett

    Abstract: Despite recent progress in abstractive summarization, models often generate summaries with factual errors. Numerous approaches to detect these errors have been proposed, the most popular of which are question answering (QA)-based factuality metrics. These have been shown to work well at predicting summary-level factuality and have potential to localize errors within summaries, but this latter capa…

    Submitted 11 February, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

    Comments: EACL 2023

  13. arXiv:2209.12356  [pdf, other]

    cs.CL

    News Summarization and Evaluation in the Era of GPT-3

    Authors: Tanya Goyal, Junyi Jessy Li, Greg Durrett

    Abstract: The recent success of prompting large language models like GPT-3 has led to a paradigm shift in NLP research. In this paper, we study its impact on text summarization, focusing on the classic benchmark domain of news summarization. First, we investigate how GPT-3 compares against fine-tuned models trained on large summarization datasets. We show that not only do humans overwhelmingly prefer GPT-3…

    Submitted 23 May, 2023; v1 submitted 25 September, 2022; originally announced September 2022.

    Comments: All data shared at: https://tagoyal.github.io/zeroshot-news-annotations.html

  14. arXiv:2205.12854  [pdf, other]

    cs.CL cs.AI

    Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors

    Authors: Liyan Tang, Tanya Goyal, Alexander R. Fabbri, Philippe Laban, Jiacheng Xu, Semih Yavuz, Wojciech Kryściński, Justin F. Rousseau, Greg Durrett

    Abstract: The propensity of abstractive summarization models to make factual errors has been studied extensively, including design of metrics to detect factual errors and annotation of errors in current systems' outputs. However, the ever-evolving nature of summarization systems, metrics, and annotated benchmarks makes factuality evaluation a moving target, and drawing clear comparisons among metrics has be…

    Submitted 25 May, 2023; v1 submitted 25 May, 2022; originally announced May 2022.

    Comments: Accepted to ACL 2023

  15. arXiv:2205.09641  [pdf, other]

    cs.CL

    SNaC: Coherence Error Detection for Narrative Summarization

    Authors: Tanya Goyal, Junyi Jessy Li, Greg Durrett

    Abstract: Progress in summarizing long texts is inhibited by the lack of appropriate evaluation frameworks. When a long summary must be produced to appropriately cover the facets of that text, that summary needs to present a coherent narrative to be understandable by a reader, but current automatic and human evaluation methods fail to identify gaps in coherence. In this work, we introduce SNaC, a narrative…

    Submitted 28 October, 2022; v1 submitted 19 May, 2022; originally announced May 2022.

    Comments: EMNLP 2022

  16. arXiv:2112.02721  [pdf, other]

    cs.CL cs.AI cs.LG

    NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

    Authors: Kaustubh D. Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahendiran, Simon Mille, Ashish Shrivastava, Samson Tan, Tongshuang Wu, Jascha Sohl-Dickstein, Jinho D. Choi, Eduard Hovy, Ondrej Dusek, Sebastian Ruder, Sajant Anand, Nagender Aneja, Rabin Banjade, Lisa Barthe, Hanna Behnke, Ian Berlot-Attwell, Connor Boyle, Caroline Brun, Marco Antonio Sobrevilla Cabezudo, et al. (101 additional authors not shown)

    Abstract: Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data split…

    Submitted 11 October, 2022; v1 submitted 5 December, 2021; originally announced December 2021.

    Comments: 39 pages, repository at https://github.com/GEM-benchmark/NL-Augmenter

  17. arXiv:2110.08370  [pdf, other]

    cs.CL

    Training Dynamics for Text Summarization Models

    Authors: Tanya Goyal, Jiacheng Xu, Junyi Jessy Li, Greg Durrett

    Abstract: Pre-trained language models (e.g. BART) have shown impressive results when fine-tuned on large summarization datasets. However, little is understood about this fine-tuning process, including what knowledge is retained from pre-training time or how content selection and generation strategies are learnt across iterations. In this work, we analyze the training dynamics for generation models, focusing…

    Submitted 15 March, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

    Comments: ACL 2022 Findings

  18. arXiv:2110.04400  [pdf, other]

    cs.CL

    HydraSum: Disentangling Stylistic Features in Text Summarization using Multi-Decoder Models

    Authors: Tanya Goyal, Nazneen Fatema Rajani, Wenhao Liu, Wojciech Kryściński

    Abstract: Summarization systems make numerous "decisions" about summary properties during inference, e.g. degree of copying, specificity and length of outputs, etc. However, these are implicitly encoded within model parameters and specific styles cannot be enforced. To address this, we introduce HydraSum, a new summarization architecture that extends the single decoder framework of current models to a mixtu…

    Submitted 21 October, 2022; v1 submitted 8 October, 2021; originally announced October 2021.

    Comments: EMNLP 2022

  19. arXiv:2104.04302  [pdf, other]

    cs.CL

    Annotating and Modeling Fine-grained Factuality in Summarization

    Authors: Tanya Goyal, Greg Durrett

    Abstract: Recent pre-trained abstractive summarization systems have started to achieve credible performance, but a major barrier to their use in practice is their propensity to output summaries that are not faithful to the input and that contain factual errors. While a number of annotated datasets and statistical models for assessing factuality have been explored, there is no clear picture of what errors ar…

    Submitted 9 April, 2021; originally announced April 2021.

    Comments: NAACL 2021

  20. arXiv:2010.05478  [pdf, other]

    cs.CL

    Evaluating Factuality in Generation with Dependency-level Entailment

    Authors: Tanya Goyal, Greg Durrett

    Abstract: Despite significant progress in text generation models, a serious limitation is their tendency to produce text that is factually inconsistent with information in the input. Recent work has studied whether textual entailment systems can be used to identify factual errors; however, these sentence-level entailment models are trained to solve a different problem than generation filtering and they do n…

    Submitted 22 October, 2020; v1 submitted 12 October, 2020; originally announced October 2020.

    Comments: Findings of EMNLP 2020

  21. arXiv:2005.02013  [pdf, other]

    cs.CL

    Neural Syntactic Preordering for Controlled Paraphrase Generation

    Authors: Tanya Goyal, Greg Durrett

    Abstract: Paraphrasing natural language sentences is a multifaceted process: it might involve replacing individual words or short phrases, local rearrangement of content, or high-level restructuring like topicalization or passivization. Past approaches struggle to cover this space of paraphrase possibilities in an interpretable manner. Our work, inspired by pre-ordering literature in machine translation, us…

    Submitted 5 May, 2020; originally announced May 2020.

    Comments: ACL 2020 camera ready

  22. arXiv:1906.08287  [pdf, other]

    cs.CL

    Embedding time expressions for deep temporal ordering models

    Authors: Tanya Goyal, Greg Durrett

    Abstract: Data-driven models have demonstrated state-of-the-art performance in inferring the temporal ordering of events in text. However, these models often overlook explicit temporal signals, such as dates and time windows. Rule-based methods can be used to identify the temporal links between these time expressions (timexes), but they fail to capture timexes' interactions with events and are hard to integ…

    Submitted 19 June, 2019; originally announced June 2019.

    Comments: ACL 2019 (short paper)