
Showing 1–27 of 27 results for author: Ratner, A

Searching in archive cs.
  1. arXiv:2406.16008  [pdf, other]

    cs.CL cs.AI cs.LG

    Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization

    Authors: Cheng-Yu Hsieh, Yung-Sung Chuang, Chun-Liang Li, Zifeng Wang, Long T. Le, Abhishek Kumar, James Glass, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, Tomas Pfister

    Abstract: Large language models (LLMs), even when specifically trained to process long input contexts, struggle to capture relevant information located in the middle of their input. This phenomenon has been known as the lost-in-the-middle problem. In this work, we make three contributions. First, we set out to understand the factors that cause this phenomenon. In doing so, we establish a connection between…

    Submitted 3 July, 2024; v1 submitted 23 June, 2024; originally announced June 2024.

    Comments: ACL Findings 2024

  2. arXiv:2308.00675  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models

    Authors: Cheng-Yu Hsieh, Si-An Chen, Chun-Liang Li, Yasuhisa Fujii, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, Tomas Pfister

    Abstract: Today, large language models (LLMs) are taught to use new tools by providing a few demonstrations of the tool's usage. Unfortunately, demonstrations are hard to acquire, and can result in undesirable biased usage if the wrong demonstration is chosen. Even in the rare scenario that demonstrations are readily available, there is no principled selection protocol to determine how many and which ones t…

    Submitted 1 August, 2023; originally announced August 2023.

  3. arXiv:2306.15895  [pdf, other]

    cs.CL cs.AI cs.LG

    Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias

    Authors: Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander Ratner, Ranjay Krishna, Jiaming Shen, Chao Zhang

    Abstract: Large language models (LLMs) have been recently leveraged as training data generators for various natural language processing (NLP) tasks. While previous research has explored different approaches to training models using generated data, they generally rely on simple class-conditional prompts, which may limit the diversity of the generated data and inherit systematic biases of LLM. Thus, we invest…

    Submitted 17 October, 2023; v1 submitted 27 June, 2023; originally announced June 2023.

    Comments: Accepted to NeurIPS 2023 (Datasets and Benchmarks Track)

    Journal ref: NeurIPS 2023

  4. arXiv:2305.12224  [pdf, other]

    cs.LG stat.ML

    On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training

    Authors: Jieyu Zhang, Bohan Wang, Zhengyu Hu, Pang Wei Koh, Alexander Ratner

    Abstract: Pre-training datasets are critical for building state-of-the-art machine learning models, motivating rigorous study on their impact on downstream tasks. In this work, we study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset. Empirically, we found that with the…

    Submitted 1 December, 2023; v1 submitted 20 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023

  5. arXiv:2305.02375  [pdf, other]

    cs.DB cs.LG cs.MM

    MaskSearch: Querying Image Masks at Scale

    Authors: Dong He, Jieyu Zhang, Maureen Daum, Alexander Ratner, Magdalena Balazinska

    Abstract: Machine learning tasks over image databases often generate masks that annotate image content (e.g., saliency maps, segmentation maps, depth maps) and enable a variety of applications (e.g., determine if a model is learning spurious correlations or if an image was maliciously modified to mislead a model). While queries that retrieve examples based on mask properties are valuable to practitioners, e…

    Submitted 8 January, 2024; v1 submitted 3 May, 2023; originally announced May 2023.

  6. arXiv:2305.02301  [pdf, other]

    cs.CL cs.AI cs.LG

    Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

    Authors: Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister

    Abstract: Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve comparable performance to LLMs.…

    Submitted 5 July, 2023; v1 submitted 3 May, 2023; originally announced May 2023.

    Comments: Accepted to Findings of ACL 2023

  7. arXiv:2304.14108  [pdf, other]

    cs.CV cs.CL cs.LG

    DataComp: In search of the next generation of multimodal datasets

    Authors: Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexander Ratner, Shuran Song, et al. (9 additional authors not shown)

    Abstract: Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Commo…

    Submitted 20 October, 2023; v1 submitted 27 April, 2023; originally announced April 2023.

    Comments: NeurIPS 2023 Datasets and Benchmarks Track

  8. arXiv:2210.02724  [pdf, other]

    cs.LG stat.ME stat.ML

    Leveraging Instance Features for Label Aggregation in Programmatic Weak Supervision

    Authors: Jieyu Zhang, Linxin Song, Alexander Ratner

    Abstract: Programmatic Weak Supervision (PWS) has emerged as a widespread paradigm to synthesize training labels efficiently. The core component of PWS is the label model, which infers true labels by aggregating the outputs of multiple noisy supervision sources abstracted as labeling functions (LFs). Existing statistical label models typically rely only on the outputs of LF, ignoring the instance features w…

    Submitted 9 October, 2022; v1 submitted 6 October, 2022; originally announced October 2022.

    Comments: 16 pages

  9. arXiv:2208.01704  [pdf, ps, other]

    cs.LG

    Binary Classification with Positive Labeling Sources

    Authors: Jieyu Zhang, Yujing Wang, Yaming Yang, Yang Luo, Alexander Ratner

    Abstract: To create a large amount of training labels for machine learning models effectively and efficiently, researchers have turned to Weak Supervision (WS), which uses programmatic labeling sources rather than manual annotation. Existing works of WS for binary classification typically assume the presence of labeling sources that are able to assign both positive and negative labels to data in roughly bal…

    Submitted 2 August, 2022; originally announced August 2022.

    Comments: CIKM 2022 (short)

  10. arXiv:2205.12879  [pdf, other]

    cs.LG stat.AP stat.ML

    Understanding Programmatic Weak Supervision via Source-aware Influence Function

    Authors: Jieyu Zhang, Haonan Wang, Cheng-Yu Hsieh, Alexander Ratner

    Abstract: Programmatic Weak Supervision (PWS) aggregates the source votes of multiple weak supervision sources into probabilistic training labels, which are in turn used to train an end model. With its increasing popularity, it is critical to have some tool for users to understand the influence of each component (e.g., the source vote or training data) in the pipeline and interpret the end model behavior. T…

    Submitted 25 May, 2022; originally announced May 2022.

    Comments: 21 pages

  11. arXiv:2203.01382  [pdf, other]

    cs.LG stat.ML

    Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming

    Authors: Cheng-Yu Hsieh, Jieyu Zhang, Alexander Ratner

    Abstract: Weak Supervision (WS) techniques allow users to efficiently create large training datasets by programmatically labeling data with heuristic sources of supervision. While the success of WS relies heavily on the provided labeling heuristics, the process of how these heuristics are created in practice has remained under-explored. In this work, we formalize the development process of labeling heuristi…

    Submitted 23 October, 2022; v1 submitted 2 March, 2022; originally announced March 2022.

    Comments: To appear in PVLDB 2022

    Journal ref: Proceedings of the VLDB Endowment, 15(13): 4093 - 4105, 2022

  12. arXiv:2202.05433  [pdf, other]

    cs.LG cs.AI stat.AP

    A Survey on Programmatic Weak Supervision

    Authors: Jieyu Zhang, Cheng-Yu Hsieh, Yue Yu, Chao Zhang, Alexander Ratner

    Abstract: Labeling training data has become one of the major roadblocks to using machine learning. Among various weak supervision paradigms, programmatic weak supervision (PWS) has achieved remarkable success in easing the manual labeling bottleneck by programmatically synthesizing training labels from multiple potentially noisy supervision sources. This paper presents a comprehensive survey of recent advan…

    Submitted 14 February, 2022; v1 submitted 10 February, 2022; originally announced February 2022.

    Comments: 8 pages

  13. arXiv:2110.03484  [pdf, other]

    cs.LG stat.AP stat.ML

    Creating Training Sets via Weak Indirect Supervision

    Authors: Jieyu Zhang, Bohan Wang, Xiangchen Song, Yujing Wang, Yaming Yang, Jing Bai, Alexander Ratner

    Abstract: Creating labeled training sets has become one of the major roadblocks in machine learning. To address this, recent Weak Supervision (WS) frameworks synthesize training labels from multiple potentially noisy supervision sources. However, existing frameworks are restricted to supervision sources that share the same output space as the target task. To extend the scope of usable sources, we for…

    Submitted 14 March, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: ICLR 2022

  14. arXiv:2109.11377  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    WRENCH: A Comprehensive Benchmark for Weak Supervision

    Authors: Jieyu Zhang, Yue Yu, Yinghao Li, Yujing Wang, Yaming Yang, Mao Yang, Alexander Ratner

    Abstract: Recent Weak Supervision (WS) approaches have had widespread success in easing the bottleneck of labeling training data for machine learning by synthesizing labels from multiple potentially noisy supervision sources. However, proper measurement and analysis of these approaches remain a challenge. First, datasets used in existing works are often private and/or custom, limiting standardization. Secon…

    Submitted 11 October, 2021; v1 submitted 23 September, 2021; originally announced September 2021.

    Comments: NeurIPS 2021 Datasets and Benchmarks Track

  15. arXiv:2107.03690  [pdf, other]

    cs.LG

    Proceedings of the First Workshop on Weakly Supervised Learning (WeaSuL)

    Authors: Michael A. Hedderich, Benjamin Roth, Katharina Kann, Barbara Plank, Alex Ratner, Dietrich Klakow

    Abstract: Welcome to WeaSuL 2021, the First Workshop on Weakly Supervised Learning, co-located with ICLR 2021. In this workshop, we want to advance theory, methods and tools for allowing experts to express prior coded knowledge for automatic data annotations that can be used to train arbitrary deep neural networks for prediction. The ICLR 2021 Workshop on Weak Supervision aims at advancing methods that help…

    Submitted 8 July, 2021; originally announced July 2021.

  16. arXiv:1909.06349  [pdf, other]

    cs.LG cs.AI stat.ML

    Slice-based Learning: A Programming Model for Residual Learning in Critical Data Slices

    Authors: Vincent S. Chen, Sen Wu, Zhenzhen Weng, Alexander Ratner, Christopher Ré

    Abstract: In real-world machine learning applications, data subsets correspond to especially critical outcomes: vulnerable cyclist detections are safety-critical in an autonomous driving task, and "question" sentences might be important to a dialogue agent's language understanding for product purposes. While machine learning models can achieve high quality performance on coarse-grained metrics like F1-score…

    Submitted 29 February, 2020; v1 submitted 13 September, 2019; originally announced September 2019.

    Comments: NeurIPS 2019

  17. arXiv:1904.03257  [pdf, ps, other]

    cs.LG cs.DB cs.DC cs.SE stat.ML

    MLSys: The New Frontier of Machine Learning Systems

    Authors: Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, et al. (44 additional authors not shown)

    Abstract: Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a ne…

    Submitted 1 December, 2019; v1 submitted 29 March, 2019; originally announced April 2019.

  18. arXiv:1903.11101  [pdf, other]

    cs.LG eess.IV stat.ML

    Cross-Modal Data Programming Enables Rapid Medical Machine Learning

    Authors: Jared Dunnmon, Alexander Ratner, Nishith Khandwala, Khaled Saab, Matthew Markert, Hersh Sagreiya, Roger Goldman, Christopher Lee-Messer, Matthew Lungren, Daniel Rubin, Christopher Ré

    Abstract: Labeling training datasets has become a key barrier to building medical machine learning models. One strategy is to generate training labels programmatically, for example by applying natural language processing pipelines to text reports associated with imaging studies. We propose cross-modal data programming, which generalizes this intuitive strategy in a theoretically-grounded way that enables si…

    Submitted 26 March, 2019; originally announced March 2019.

  19. arXiv:1903.05844  [pdf, other]

    stat.ML cs.LG

    Learning Dependency Structures for Weak Supervision Models

    Authors: Paroma Varma, Frederic Sala, Ann He, Alexander Ratner, Christopher Ré

    Abstract: Labeling training data is a key bottleneck in the modern machine learning pipeline. Recent weak supervision approaches combine labels from multiple noisy sources by estimating their accuracies without access to ground truth labels; however, estimating the dependencies among these sources is a critical challenge. We focus on a robust PCA-based algorithm for learning these dependency structures, est…

    Submitted 14 March, 2019; originally announced March 2019.

  20. arXiv:1812.00417  [pdf, other]

    cs.LG stat.ML

    Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale

    Authors: Stephen H. Bach, Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Cassandra Xia, Souvik Sen, Alexander Ratner, Braden Hancock, Houman Alborzi, Rahul Kuchhal, Christopher Ré, Rob Malkin

    Abstract: Labeling training data is one of the most costly bottlenecks in developing machine learning-based applications. We present a first-of-its-kind study showing how existing knowledge resources from across an organization can be used as weak supervision in order to bring development time and cost down by an order of magnitude, and introduce Snorkel DryBell, a new weak supervision management system for…

    Submitted 3 June, 2019; v1 submitted 2 December, 2018; originally announced December 2018.

    Journal ref: Proceedings of the International Conference on Management of Data (SIGMOD), 2019

  21. arXiv:1810.02840  [pdf, other]

    stat.ML cs.LG

    Training Complex Models with Multi-Task Weak Supervision

    Authors: Alexander Ratner, Braden Hancock, Jared Dunnmon, Frederic Sala, Shreyash Pandey, Christopher Ré

    Abstract: As machine learning models continue to increase in complexity, collecting large hand-labeled training sets has become one of the biggest roadblocks in practice. Instead, weaker forms of supervision that provide noisier but cheaper labels are often used. However, these weak supervision sources have diverse and unknown accuracies, may output correlated labels, and may label different tasks or apply…

    Submitted 7 December, 2018; v1 submitted 5 October, 2018; originally announced October 2018.

  22. arXiv:1803.06084  [pdf, other]

    cs.LG stat.ML

    A Kernel Theory of Modern Data Augmentation

    Authors: Tri Dao, Albert Gu, Alexander J. Ratner, Virginia Smith, Christopher De Sa, Christopher Ré

    Abstract: Data augmentation, a technique in which a training set is expanded with class-preserving transformations, is ubiquitous in modern machine learning pipelines. In this paper, we seek to establish a theoretical framework for understanding data augmentation. We approach this from two directions: First, we provide a general model of augmentation as a Markov process, and show that kernels appear natural…

    Submitted 20 March, 2019; v1 submitted 16 March, 2018; originally announced March 2018.

  23. Snorkel: Rapid Training Data Creation with Weak Supervision

    Authors: Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, Christopher Ré

    Abstract: Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs w…

    Submitted 28 November, 2017; originally announced November 2017.

    Journal ref: Proceedings of the VLDB Endowment, 11(3), 269-282, 2017

  24. arXiv:1709.01643  [pdf, other]

    stat.ML cs.CV cs.LG

    Learning to Compose Domain-Specific Transformations for Data Augmentation

    Authors: Alexander J. Ratner, Henry R. Ehrenberg, Zeshan Hussain, Jared Dunnmon, Christopher Ré

    Abstract: Data augmentation is a ubiquitous technique for increasing the size of labeled training sets by leveraging task-specific data transformations that preserve class labels. While it is often easy for domain experts to specify individual transformations, constructing and tuning the more sophisticated compositions typically needed to achieve state-of-the-art results is a time-consuming manual task in p…

    Submitted 30 September, 2017; v1 submitted 5 September, 2017; originally announced September 2017.

    Comments: To appear at Neural Information Processing Systems (NIPS) 2017

    Journal ref: Advances in Neural Information Processing Systems 30, 2017, 3236--3246

  25. arXiv:1704.06360  [pdf, other]

    cs.CL

    SwellShark: A Generative Model for Biomedical Named Entity Recognition without Labeled Data

    Authors: Jason Fries, Sen Wu, Alex Ratner, Christopher Ré

    Abstract: We present SwellShark, a framework for building biomedical named entity recognition (NER) systems quickly and without hand-labeled data. Our approach views biomedical resources like lexicons as function primitives for autogenerating weak supervision. We then use a generative model to unify and denoise this supervision and construct large-scale, probabilistically labeled datasets for training high-…

    Submitted 20 April, 2017; originally announced April 2017.

  26. arXiv:1703.00854  [pdf, other]

    cs.LG stat.ML

    Learning the Structure of Generative Models without Labeled Data

    Authors: Stephen H. Bach, Bryan He, Alexander Ratner, Christopher Ré

    Abstract: Curating labeled training data has become the primary bottleneck in machine learning. Recent frameworks address this bottleneck with generative models to synthesize labels at scale from weak supervision sources. The generative model's dependency structure directly affects the quality of the estimated labels, but selecting a structure automatically without any labeled data is a distinct challenge.…

    Submitted 9 September, 2017; v1 submitted 2 March, 2017; originally announced March 2017.

    Journal ref: Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017

  27. arXiv:1605.07723  [pdf, other]

    stat.ML cs.AI cs.LG

    Data Programming: Creating Large Training Sets, Quickly

    Authors: Alexander Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, Christopher Ré

    Abstract: Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of training sets called data programming in which users express we…

    Submitted 8 January, 2017; v1 submitted 25 May, 2016; originally announced May 2016.

    Journal ref: Advances in Neural Information Processing Systems 29, 2016, 3567--3575