-
SiTe CiM: Signed Ternary Computing-in-Memory for Ultra-Low Precision Deep Neural Networks
Authors:
Niharika Thakuria,
Akul Malhotra,
Sandeep K. Thirumala,
Reena Elangovan,
Anand Raghunathan,
Sumeet K. Gupta
Abstract:
Ternary Deep Neural Networks (DNNs) have shown a large potential for highly energy-constrained systems by virtue of their low power operation (due to ultra-low precision) with only a mild degradation in accuracy. To enable an energy-efficient hardware substrate for such systems, we propose a compute-enabled memory design, referred to as SiTe CiM, which features computing-in-memory (CiM) of dot products between signed ternary (SiTe) inputs and weights. SiTe CiM is based on cross-coupling of two bit cells to enable CiM of dot products in the signed ternary regime. We explore SiTe CiM with 8T-SRAM, 3T-embedded DRAM (3T-eDRAM) and 3T-ferroelectric metal FET (FEMFET) memories. We propose two flavors of this technique, namely SiTe CiM I and SiTe CiM II. In SiTe CiM I, we employ two additional transistors per cell for cross-coupling, achieving fast CiM operations, albeit incurring an area overhead ranging from 18% to 34% (compared to standard ternary memories). In SiTe CiM II, four extra transistors are utilized for every 16 cells in a column, thereby incurring only 6% area cost (but leading to slower CiM than SiTe CiM I). Based on the array analysis, our designs achieve up to 88% lower CiM latency and 78% CiM energy savings across the various technologies considered, as compared to their respective near-memory computing counterparts. Further, we perform system-level analysis by incorporating SiTe CiM I/II arrays in a ternary DNN accelerator and show up to 7X throughput boost and up to 2.5X energy reduction compared to near-memory ternary DNN accelerators.
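To make the signed ternary dot product concrete, the following is a minimal software sketch, not the circuit described in the abstract. It assumes each value in {-1, 0, +1} is stored as two bits (an "is +1" bit and an "is -1" bit), loosely mirroring the two bit cells per ternary value; the function names `encode_ternary` and `ternary_dot` are hypothetical.

```python
def encode_ternary(values):
    """Split a signed ternary vector into two bit-planes: (is +1, is -1)."""
    for v in values:
        if v not in (-1, 0, 1):
            raise ValueError(f"not a signed ternary value: {v}")
    pos = [1 if v == 1 else 0 for v in values]
    neg = [1 if v == -1 else 0 for v in values]
    return pos, neg


def ternary_dot(inputs, weights):
    """Dot product of two signed ternary vectors using only bitwise matches."""
    if len(inputs) != len(weights):
        raise ValueError("vectors must have the same length")
    in_pos, in_neg = encode_ternary(inputs)
    w_pos, w_neg = encode_ternary(weights)
    # Same-sign pairs contribute +1; opposite-sign pairs contribute -1.
    plus = sum(a & b for a, b in zip(in_pos, w_pos)) + sum(a & b for a, b in zip(in_neg, w_neg))
    minus = sum(a & b for a, b in zip(in_pos, w_neg)) + sum(a & b for a, b in zip(in_neg, w_pos))
    return plus - minus


inputs = [1, -1, 0, 1, -1]
weights = [1, 1, -1, 0, -1]
result = ternary_dot(inputs, weights)  # 1 - 1 + 0 + 0 + 1 = 1
```

Any pair containing a zero contributes nothing, so the result depends only on counting matching and mismatching nonzero pairs.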
Submitted 24 August, 2024;
originally announced August 2024.
-
Simultaneous Trajectory Optimization and Contact Selection for Contact-rich Manipulation with High-Fidelity Geometry
Authors:
Mengchao Zhang,
Devesh K. Jha,
Arvind U. Raghunathan,
Kris Hauser
Abstract:
Contact-implicit trajectory optimization (CITO) is an effective method to plan complex trajectories for various contact-rich systems including manipulation and locomotion. CITO formulates a mathematical program with complementarity constraints (MPCC) that enforces that contact forces must be zero when points are not in contact. However, MPCC solve times increase steeply with the number of allowable points of contact, which limits CITO's applicability to problems in which only a few, simple geometries are allowed to make contact. This paper introduces simultaneous trajectory optimization and contact selection (STOCS), as an extension of CITO that overcomes this limitation. The innovation of STOCS is to identify salient contact points and times inside the iterative trajectory optimization process. This effectively reduces the number of variables and constraints in each MPCC invocation. The STOCS framework, instantiated with key contact identification subroutines, renders the optimization of manipulation trajectories computationally tractable even for high-fidelity geometries consisting of tens of thousands of vertices.
Submitted 23 July, 2024;
originally announced July 2024.
-
Understanding Finetuning for Factual Knowledge Extraction
Authors:
Gaurav Ghosal,
Tatsunori Hashimoto,
Aditi Raghunathan
Abstract:
In this work, we study the impact of QA fine-tuning data on downstream factuality. We show that fine-tuning on lesser-known facts that are poorly stored during pretraining yields significantly worse factuality than fine-tuning on well-known facts, even when all facts are seen during pretraining. We prove this phenomenon theoretically, showing that training on lesser-known facts can lead the model to ignore subject entity names and instead output a generic plausible response even when the relevant factual knowledge is encoded in the model. On three question answering benchmarks (PopQA, Entity Questions, and MMLU) and two language models (Llama-2-7B and Mistral-7B), we find that (i) finetuning on a completely factual but lesser-known subset of the data deteriorates downstream factuality (5-10%) and (ii) finetuning on a subset of better-known examples matches or outperforms finetuning on the entire dataset. Ultimately, our results shed light on the interaction between pretrained knowledge and finetuning data and demonstrate the importance of taking into account how facts are stored in the pretrained model when fine-tuning for knowledge-intensive tasks.
Submitted 20 June, 2024;
originally announced June 2024.
-
Adversarial Attacks on Multimodal Agents
Authors:
Chen Henry Wu,
Jing Yu Koh,
Ruslan Salakhutdinov,
Daniel Fried,
Aditi Raghunathan
Abstract:
Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments. In this paper, we show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks due to limited access to and knowledge about the environment. Our attacks use adversarial text strings to guide gradient-based perturbation over one trigger image in the environment: (1) our captioner attack targets white-box captioners if they are used to process images into captions as additional inputs to the VLM; (2) our CLIP attack targets a set of CLIP models jointly, which can transfer to proprietary VLMs. To evaluate the attacks, we curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena, an environment for web-based multimodal agent tasks. Within an L-infinity norm of 16/256 on a single image, the captioner attack can make a captioner-augmented GPT-4V agent execute the adversarial goals with a 75% success rate. When we remove the captioner or use GPT-4V to generate its own captions, the CLIP attack can achieve success rates of 21% and 43%, respectively. Experiments on agents based on other VLMs, such as Gemini-1.5, Claude-3, and GPT-4o, show interesting differences in their robustness. Further analysis reveals several key factors contributing to the attack's success, and we also discuss the implications for defenses. Project page: https://chenwu.io/attack-agent. Code and data: https://github.com/ChenWu98/agent-attack
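The L-infinity budget mentioned above bounds how far any single pixel may move. The sketch below illustrates only that constraint, not the authors' attack: it assumes pixel intensities are scaled to [0, 1] (an assumption, not stated in the abstract), and `project_linf` is a hypothetical helper name.

```python
EPSILON = 16 / 256  # perturbation budget quoted in the abstract


def project_linf(original, perturbed, epsilon=EPSILON):
    """Clip `perturbed` so each pixel stays within epsilon of `original` and inside [0, 1]."""
    projected = []
    for x, x_adv in zip(original, perturbed):
        # Limit the per-pixel change to the interval [-epsilon, +epsilon].
        delta = max(-epsilon, min(epsilon, x_adv - x))
        # Keep the result a valid pixel intensity (assumed range [0, 1]).
        projected.append(max(0.0, min(1.0, x + delta)))
    return projected


original = [0.0, 0.5, 1.0]
perturbed = [0.2, 0.45, 0.9]
projected = project_linf(original, perturbed)
```

In a gradient-based attack, a projection of this kind is typically applied after every update step so the perturbed image never leaves the allowed budget.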
Submitted 18 June, 2024;
originally announced June 2024.
-
Sharpness-Aware Minimization Enhances Feature Quality via Balanced Learning
Authors:
Jacob Mitchell Springer,
Vaishnavh Nagarajan,
Aditi Raghunathan
Abstract:
Sharpness-Aware Minimization (SAM) has emerged as a promising alternative optimizer to stochastic gradient descent (SGD). The originally proposed motivation behind SAM was to bias neural networks towards flatter minima that are believed to generalize better. However, recent studies have shown conflicting evidence on the relationship between flatness and generalization, suggesting that flatness does not fully explain SAM's success. Sidestepping this debate, we identify an orthogonal effect of SAM that is beneficial out-of-distribution: we argue that SAM implicitly balances the quality of diverse features. SAM achieves this effect by adaptively suppressing well-learned features, which gives the remaining features an opportunity to be learned. We show that this mechanism is beneficial in datasets that contain redundant or spurious features, where SGD falls for the simplicity bias and would not otherwise learn all available features. Our insights are supported by experiments on real data: we demonstrate that SAM improves the quality of features in datasets containing redundant or spurious features, including CelebA, Waterbirds, CIFAR-MNIST, and DomainBed.
Submitted 30 May, 2024;
originally announced May 2024.
-
Why is SAM Robust to Label Noise?
Authors:
Christina Baek,
Zico Kolter,
Aditi Raghunathan
Abstract:
Sharpness-Aware Minimization (SAM) is best known for achieving state-of-the-art performance on natural image and language tasks. However, its most pronounced improvements (of tens of percent) are instead in the presence of label noise. Understanding SAM's label noise robustness requires a departure from characterizing the robustness of minima lying in "flatter" regions of the loss landscape. In particular, the peak performance under label noise occurs with early stopping, far before the loss converges. We decompose SAM's robustness into two effects: one induced by changes to the logit term and the other induced by changes to the network Jacobian. The first can be observed in linear logistic regression, where SAM provably up-weights the gradient contribution from clean examples. Although this explicit up-weighting is also observable in neural networks, when we intervene and modify SAM to remove this effect, surprisingly, we see no visible degradation in performance. We infer that SAM's effect in deeper networks is instead explained entirely by the effect SAM has on the network Jacobian. We theoretically derive the implicit regularization induced by this Jacobian effect in two-layer linear networks. Motivated by our analysis, we see that cheaper alternatives to SAM that explicitly induce these regularization effects largely recover the benefits in deep networks trained on real-world datasets.
Submitted 6 May, 2024;
originally announced May 2024.
-
Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic
Authors:
Sachin Goyal,
Pratyush Maini,
Zachary C. Lipton,
Aditi Raghunathan,
J. Zico Kolter
Abstract:
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets. In recent times, data curation has gained prominence with several works developing strategies to retain 'high-quality' subsets of 'raw' scraped data. For instance, the LAION public dataset retained only 10% of the total crawled data. However, these strategies are typically developed agnostic of the available compute for training. In this paper, we first demonstrate that making filtering decisions independent of training compute is often suboptimal: the limited high-quality data rapidly loses its utility when repeated, eventually requiring the inclusion of 'unseen' but 'lower-quality' data. To address this quality-quantity tradeoff ($\texttt{QQT}$), we introduce neural scaling laws that account for the non-homogeneous nature of web data, an angle ignored in existing literature. Our scaling laws (i) characterize the $\textit{differing}$ 'utility' of various quality subsets of web data; (ii) account for how utility diminishes for a data point at its 'nth' repetition; and (iii) formulate the mutual interaction of various data pools when combined, enabling the estimation of model performance on a combination of multiple data pools without ever jointly training on them. Our key message is that data curation $\textit{cannot}$ be agnostic of the total compute that a model will be trained for. Our scaling laws allow us to curate the best possible pool for achieving top performance on Datacomp at various compute budgets, carving out a pareto-frontier for data curation. Code is available at https://github.com/locuslab/scaling_laws_data_filtering.
Submitted 10 April, 2024;
originally announced April 2024.
-
Predicting the Performance of Foundation Models via Agreement-on-the-Line
Authors:
Aman Mehra,
Rahul Saxena,
Taeyoun Kim,
Christina Baek,
Zico Kolter,
Aditi Raghunathan
Abstract:
Estimating out-of-distribution (OOD) performance in regimes where labels are scarce is critical for safely deploying foundation models. Recently, it was shown that ensembles of neural networks exhibit the phenomenon of "agreement-on-the-line", which can be leveraged to reliably predict OOD performance without labels. However, in contrast to classical neural networks that are trained on in-distribution data from scratch for numerous epochs, foundation models undergo minimal finetuning from heavily pretrained weights, which may reduce the ensemble diversity needed to observe agreement-on-the-line. In our work, we first demonstrate that when lightly finetuning multiple runs from a $\textit{single}$ foundation model, the choice of randomness during training (linear head initialization, data ordering, and data subsetting) can lead to drastically different levels of agreement-on-the-line in the resulting ensemble. Surprisingly, only random head initialization is able to reliably induce agreement-on-the-line in finetuned foundation models across vision and language benchmarks. Second, we demonstrate that ensembles of $\textit{multiple}$ foundation models pretrained on different datasets but finetuned on the same task can also show agreement-on-the-line. In total, by careful construction of a diverse ensemble, we can utilize agreement-on-the-line-based methods to predict the OOD performance of foundation models with high precision.
Submitted 1 April, 2024;
originally announced April 2024.
-
Ev-Edge: Efficient Execution of Event-based Vision Algorithms on Commodity Edge Platforms
Authors:
Shrihari Sridharan,
Surya Selvam,
Kaushik Roy,
Anand Raghunathan
Abstract:
Event cameras have emerged as a promising sensing modality for autonomous navigation systems, owing to their high temporal resolution, high dynamic range and negligible motion blur. To process the asynchronous temporal event streams from such sensors, recent research has shown that a mix of Artificial Neural Networks (ANNs), Spiking Neural Networks (SNNs) as well as hybrid SNN-ANN algorithms are necessary to achieve high accuracies across a range of perception tasks. However, we observe that executing such workloads on commodity edge platforms, which feature heterogeneous processing elements such as CPUs, GPUs and neural accelerators, results in inferior performance. This is due to the mismatch between the irregular nature of event streams and diverse characteristics of algorithms on the one hand and the underlying hardware platform on the other. We propose Ev-Edge, a framework that contains three key optimizations to boost the performance of event-based vision systems on edge platforms: (1) an Event2Sparse Frame converter directly transforms raw event streams into sparse frames, enabling the use of sparse libraries with minimal encoding overheads; (2) a Dynamic Sparse Frame Aggregator merges sparse frames at runtime by trading off the temporal granularity of events and computational demand, thereby improving hardware utilization; and (3) a Network Mapper maps concurrently executing tasks to different processing elements while also selecting layer precision by considering both compute and communication overheads. On several state-of-the-art networks for a range of autonomous navigation tasks, Ev-Edge achieves 1.28x-2.05x improvements in latency and 1.23x-2.15x in energy over an all-GPU implementation on the NVIDIA Jetson Xavier AGX platform for single-task execution scenarios. Ev-Edge also achieves 1.43x-1.81x latency improvements over round-robin scheduling methods in multi-task execution scenarios.
Submitted 23 March, 2024;
originally announced March 2024.
-
Testing the Limits of Jailbreaking Defenses with the Purple Problem
Authors:
Taeyoun Kim,
Suhas Kotha,
Aditi Raghunathan
Abstract:
The rise of "jailbreak" attacks on language models has led to a flurry of defenses aimed at preventing undesirable responses. We critically examine the two stages of the defense pipeline: (i) defining what constitutes unsafe outputs, and (ii) enforcing the definition via methods such as input processing or fine-tuning. To test the efficacy of existing enforcement mechanisms, we consider a simple and well-specified definition of unsafe outputs: outputs that contain the word "purple". Surprisingly, existing fine-tuning and input defenses fail on this simple problem, casting doubt on whether enforcement algorithms can be robust for more complicated definitions. We find that real safety benchmarks similarly test enforcement for a fixed definition. We hope that future research can lead to effective and fast enforcement as well as high-quality definitions used for enforcement and evaluation.
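The definition of an unsafe output used here is simple enough to state as code. The sketch below is only one plausible reading: it assumes a case-insensitive substring match, which the abstract does not specify, and the names `is_unsafe` and `violation_rate` are hypothetical.

```python
def is_unsafe(output: str) -> bool:
    """Return True if the output contains "purple" (case-insensitive substring, by assumption)."""
    return "purple" in output.lower()


def violation_rate(outputs):
    """Fraction of model outputs that violate the definition."""
    if not outputs:
        return 0.0
    return sum(is_unsafe(o) for o in outputs) / len(outputs)


outputs = ["The sky turned Purple at dusk.", "The sky is blue.", "I cannot say that color."]
rate = violation_rate(outputs)  # one of the three outputs contains the word
```

Because the definition is this easy to check, any failure of a defense can be attributed to enforcement rather than to ambiguity about what counts as unsafe.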
Submitted 24 June, 2024; v1 submitted 20 March, 2024;
originally announced March 2024.
-
Repetition Improves Language Model Embeddings
Authors:
Jacob Mitchell Springer,
Suhas Kotha,
Daniel Fried,
Graham Neubig,
Aditi Raghunathan
Abstract:
Recent approaches to improving the extraction of text embeddings from autoregressive large language models (LLMs) have largely focused on improvements to data, backbone pretrained language models, or improving task-differentiation via instructions. In this work, we address an architectural limitation of autoregressive models: token embeddings cannot contain information from tokens that appear later in the input. To address this limitation, we propose a simple approach, "echo embeddings," in which we repeat the input twice in context and extract embeddings from the second occurrence. We show that echo embeddings of early tokens can encode information about later tokens, allowing us to maximally leverage high-quality LLMs for embeddings. On the MTEB leaderboard, echo embeddings improve over classical embeddings by over 9% zero-shot and by around 0.7% when fine-tuned. Echo embeddings with a Mistral-7B model achieve state-of-the-art results compared to prior open-source models that do not leverage synthetic fine-tuning data.
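The idea of repeating the input and reading embeddings from the second occurrence can be illustrated with a toy causal encoder. This is a sketch under strong assumptions, not the paper's implementation: the "model" below is a hypothetical stand-in whose state at each position is just the running sum of the tokens seen so far, which is enough to show how causality limits what each position can encode.

```python
def toy_causal_states(tokens):
    """Toy causal encoder: the state at position i is the sum of tokens[0..i]."""
    states, running = [], 0
    for t in tokens:
        running += t
        states.append(running)
    return states


def classical_states(tokens):
    """Classical embeddings: encode the input once."""
    return toy_causal_states(tokens)


def echo_states(tokens):
    """Echo embeddings: feed the input twice, keep only the second occurrence."""
    states = toy_causal_states(tokens + tokens)
    return states[len(tokens):]


tokens = [1, 2, 3]
classical = classical_states(tokens)  # [1, 3, 6]
echo = echo_states(tokens)            # [7, 9, 12]

# Changing the last token leaves the classical state of the first token
# untouched, but changes its echo state.
classical_changed = classical_states([1, 2, 5])[0]  # still 1
echo_changed = echo_states([1, 2, 5])[0]            # 9 instead of 7
```

In the toy model, the first token's classical state sees only itself, while its echo state has already passed over the whole first copy of the input.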
Submitted 23 February, 2024;
originally announced February 2024.
-
What Values Do ImageNet-trained Classifiers Enact?
Authors:
Will Penman,
Joshua Babu,
Abhinaya Raghunathan
Abstract:
We identify "values" as actions that classifiers take that speak to open questions of significant social concern. Investigating a classifier's values builds on studies of social bias that uncover how classifiers participate in social processes beyond their creators' forethought. In our case, this participation involves what counts as nutritious, what it means to be modest, and more. Unlike AI social bias, however, a classifier's values are not necessarily morally loathsome. Attending to image classifiers' values can facilitate public debate and introspection about the future of society. To substantiate these claims, we report on an extensive examination of both ImageNet training/validation data and ImageNet-trained classifiers with custom testing data. We identify perceptual decision boundaries in 118 categories that address open questions in society, and through quantitative testing of rival datasets we find that ImageNet-trained classifiers enact at least 7 values through their perceptual decisions. To contextualize these results, we develop a conceptual framework that integrates values, social bias, and accuracy, and we describe a rhetorical method for identifying how context affects the values that a classifier enacts. We also discover that classifier performance does not straightforwardly reflect the proportions of subgroups in a training set. Our findings bring a rich sense of the social world to ML researchers that can be applied to other domains beyond computer vision.
Submitted 7 February, 2024;
originally announced February 2024.
-
AutoFT: Learning an Objective for Robust Fine-Tuning
Authors:
Caroline Choi,
Yoonho Lee,
Annie Chen,
Allan Zhou,
Aditi Raghunathan,
Chelsea Finn
Abstract:
Foundation models encode rich representations that can be adapted to downstream tasks by fine-tuning. However, fine-tuning a model on one data distribution often degrades performance under distribution shifts. Current approaches to robust fine-tuning use hand-crafted regularization techniques to constrain the fine-tuning process towards the pretrained model. Yet, it is hard to specify how to adapt relevant characteristics of the foundation model during fine-tuning, as this depends on how the pre-training, fine-tuning, and test data distributions relate to each other. We propose AutoFT, a data-driven approach for robust fine-tuning. Given a task, AutoFT searches for a fine-tuning procedure that enhances out-of-distribution (OOD) generalization. Specifically, AutoFT uses bi-level optimization to search for an objective function and hyperparameters that maximize post-adaptation performance on a small OOD validation set. We evaluate AutoFT on nine natural distribution shifts. Our experiments show that AutoFT significantly improves generalization to OOD inputs, outperforming existing robust fine-tuning methods. Notably, AutoFT achieves a new state-of-the-art on the WILDS iWildCam and FMoW benchmarks, outperforming the previous best methods by $6.0\%$ and $1.5\%$, respectively.
Submitted 7 March, 2024; v1 submitted 18 January, 2024;
originally announced January 2024.
-
Input Compression with Positional Consistency for Efficient Training and Inference of Transformer Neural Networks
Authors:
Amrit Nagarajan,
Anand Raghunathan
Abstract:
Transformers have rapidly increased in popularity in recent years, achieving state-of-the-art performance in processing text, images, audio and video. However, Transformers present large computational requirements for both training and inference, and are prone to overfitting during training. To address these challenges, we present Input Compression with Positional Consistency (ICPC), a new data augmentation method that, unlike prior augmentation techniques, simultaneously improves both generalization and training efficiency. ICPC applies varying levels of compression to each training sample in each epoch. This leads to smaller input sequences being processed by the Transformer, and hence faster training, while also alleviating overfitting by presenting each input with different compression levels. We introduce a consistency-aware position selection method in ICPC that enables accurate processing of compressed inputs without any changes to the underlying Transformer architecture. We detail compression-based augmentation methods for four different modalities: insignificant word pruning for text, resolution modulation for images, spatio-temporal resolution modulation for videos, and spectrogram size modulation for audio. ICPC also enables efficient variable-effort inference, where samples are first inferred at high compression levels, and progressively re-evaluated with lower compression for more challenging inputs. On 9 diverse tasks spanning 4 different modalities, ICPC improves accuracy by up to 1%, while also accelerating training and inference by up to 2.9X and 2.6X, respectively. Code is available at https://github.com/amrnag/ICPC.
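The variable-effort inference loop described above can be sketched as a simple cascade. This is an illustration under assumptions rather than the ICPC implementation: the abstract does not say how a "challenging" input is detected, so the confidence threshold here is an assumption, and `variable_effort_predict`, `toy_compress` and `toy_classify` are hypothetical names.

```python
def variable_effort_predict(sample, classify, compress, levels, threshold=0.9):
    """Try compression levels from highest to lowest; stop at the first confident prediction."""
    label, confidence, level = None, 0.0, None
    for level in sorted(levels, reverse=True):
        label, confidence = classify(compress(sample, level))
        if confidence >= threshold:  # assumed stopping rule
            break
    return label, confidence, level


# Toy stand-ins (hypothetical): compression keeps every `level`-th element,
# and the classifier grows more confident the more elements it sees.
def toy_compress(sample, level):
    return sample[::level]


def toy_classify(compressed):
    label = "positive" if sum(compressed) >= 0 else "negative"
    confidence = min(1.0, len(compressed) / 4)
    return label, confidence


sample = [1, -1, 2, 3, -2, 4, 1, 1]
prediction = variable_effort_predict(sample, toy_classify, toy_compress, levels=[1, 2, 4])
```

In this toy run the most compressed view (level 4) is not confident enough, so the sample is re-evaluated at level 2, where the prediction is accepted and the uncompressed pass is skipped.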
Submitted 22 November, 2023;
originally announced December 2023.
-
Complementary Benefits of Contrastive Learning and Self-Training Under Distribution Shift
Authors:
Saurabh Garg,
Amrith Setlur,
Zachary Chase Lipton,
Sivaraman Balakrishnan,
Virginia Smith,
Aditi Raghunathan
Abstract:
Self-training and contrastive learning have emerged as leading techniques for incorporating unlabeled data, both under distribution shift (unsupervised domain adaptation) and when it is absent (semi-supervised learning). However, despite the popularity and compatibility of these techniques, their efficacy in combination remains unexplored. In this paper, we undertake a systematic empirical investigation of this combination, finding that (i) in domain adaptation settings, self-training and contrastive learning offer significant complementary gains; and (ii) in semi-supervised learning settings, surprisingly, the benefits are not synergistic. Across eight distribution shift datasets (e.g., BREEDs, WILDS), we demonstrate that the combined method obtains 3--8% higher accuracy than either approach independently. We then theoretically analyze these techniques in a simplified model of distribution shift, demonstrating scenarios under which the features produced by contrastive learning can yield a good initialization for self-training to further amplify gains and achieve optimal performance, even when either method alone would fail.
Submitted 6 December, 2023;
originally announced December 2023.
-
Multitask Learning Can Improve Worst-Group Outcomes
Authors:
Atharva Kulkarni,
Lucio Dery,
Amrith Setlur,
Aditi Raghunathan,
Ameet Talwalkar,
Graham Neubig
Abstract:
In order to create machine learning systems that serve a variety of users well, it is vital to not only achieve high average performance but also ensure equitable outcomes across diverse groups. However, most machine learning methods are designed to improve a model's average performance on a chosen end task without consideration for their impact on worst group error. Multitask learning (MTL) is one such widely used technique. In this paper, we seek not only to understand the impact of MTL on worst-group accuracy but also to explore its potential as a tool to address the challenge of group-wise fairness. We primarily consider the standard setting of fine-tuning a pre-trained model, where, following recent work (Gururangan et al., 2020; Dery et al., 2023), we multitask the end task with the pre-training objective constructed from the end task data itself. In settings with few or no group annotations, we find that multitasking often, but not consistently, achieves better worst-group accuracy than Just-Train-Twice (JTT; Liu et al., 2021), a representative distributionally robust optimization (DRO) method. Leveraging insights from synthetic data experiments, we propose to modify standard MTL by regularizing the joint multitask representation space. We run a large number of fine-tuning experiments across computer vision and natural language processing datasets and find that our regularized MTL approach consistently outperforms JTT on both average and worst-group outcomes. Our official code can be found here: https://github.com/atharvajk98/MTL-group-robustness.
Submitted 28 February, 2024; v1 submitted 5 December, 2023;
originally announced December 2023.
-
LRMP: Layer Replication with Mixed Precision for Spatial In-memory DNN Accelerators
Authors:
Abinand Nallathambi,
Christin David Bose,
Wilfried Haensch,
Anand Raghunathan
Abstract:
In-memory computing (IMC) with non-volatile memories (NVMs) has emerged as a promising approach to address the rapidly growing computational demands of Deep Neural Networks (DNNs). Mapping DNN layers spatially onto NVM-based IMC accelerators achieves high degrees of parallelism. However, two challenges that arise in this approach are the highly non-uniform distribution of layer processing times and high area requirements. We propose LRMP, a method to jointly apply layer replication and mixed precision quantization to improve the performance of DNNs when mapped to area-constrained NVM-based IMC accelerators. LRMP uses a combination of reinforcement learning and integer linear programming to search the replication-quantization design space using a model that is closely informed by the target hardware architecture. Across five DNN benchmarks, LRMP achieves 2.8-9× latency and 11.8-19× throughput improvement at iso-accuracy.
Submitted 5 December, 2023;
originally announced December 2023.
-
Reliable Test-Time Adaptation via Agreement-on-the-Line
Authors:
Eungyeup Kim,
Mingjie Sun,
Aditi Raghunathan,
Zico Kolter
Abstract:
Test-time adaptation (TTA) methods aim to improve robustness to distribution shifts by adapting models using unlabeled data from the shifted test distribution. However, there remain unresolved challenges that undermine the reliability of TTA, which include difficulties in evaluating TTA performance, miscalibration after TTA, and unreliable hyperparameter tuning for adaptation. In this work, we make a notable and surprising observation that TTAed models strongly show the agreement-on-the-line phenomenon (Baek et al., 2022) across a wide range of distribution shifts. We find such linear trends occur consistently in a wide range of models adapted with various hyperparameters, and persist in distributions where the phenomenon fails to hold in vanilla models (i.e., before adaptation). We leverage these observations to make TTA methods more reliable from three perspectives: (i) estimating OOD accuracy (without labeled data) to determine when TTA helps and when it hurts, (ii) calibrating TTAed models without label information, and (iii) reliably determining hyperparameters for TTA without any labeled validation data. Through extensive experiments, we demonstrate that various TTA methods can be precisely evaluated, both in terms of their improvements and degradations. Moreover, our proposed methods for unsupervised calibration and hyperparameter tuning for TTA achieve results close to the ones assuming access to ground-truth labels, in terms of both OOD accuracy and calibration error.
Submitted 7 October, 2023;
originally announced October 2023.
-
Understanding Catastrophic Forgetting in Language Models via Implicit Inference
Authors:
Suhas Kotha,
Jacob Mitchell Springer,
Aditi Raghunathan
Abstract:
We lack a systematic understanding of the effects of fine-tuning (via methods such as instruction-tuning or reinforcement learning from human feedback), particularly on tasks outside the narrow fine-tuning distribution. In a simplified scenario, we demonstrate that improving performance on tasks within the fine-tuning data distribution comes at the expense of capabilities on other tasks. We hypothesize that language models implicitly infer the task of the prompt and that fine-tuning skews this inference towards tasks in the fine-tuning distribution. To test this, we propose Conjugate Prompting, which artificially makes the task look farther from the fine-tuning distribution while requiring the same capability, and we find that this recovers some of the pretraining capabilities in our synthetic setup. Since real-world fine-tuning distributions are predominantly English, we apply conjugate prompting to recover pretrained capabilities in LLMs by simply translating the prompts to different languages. This allows us to recover in-context learning abilities lost via instruction tuning, natural reasoning capability lost during code fine-tuning, and, more concerningly, harmful content generation suppressed by safety fine-tuning in chatbots like ChatGPT.
Submitted 13 April, 2024; v1 submitted 18 September, 2023;
originally announced September 2023.
-
Evaluation of STT-MRAM as a Scratchpad for Training in ML Accelerators
Authors:
Sourjya Roy,
Cheng Wang,
Anand Raghunathan
Abstract:
Progress in artificial intelligence and machine learning over the past decade has been driven by the ability to train larger deep neural networks (DNNs), leading to a compute demand that far exceeds the growth in hardware performance afforded by Moore's law. Training DNNs is an extremely memory-intensive process, requiring not just the model weights but also activations and gradients for an entire minibatch to be stored. The need to provide high-density and low-leakage on-chip memory motivates the exploration of emerging non-volatile memory for training accelerators. Spin-Transfer-Torque MRAM (STT-MRAM) offers several desirable properties for training accelerators, including 3-4x higher density than SRAM, significantly reduced leakage power, high endurance and reasonable access time. On the other hand, MRAM write operations require high write energy and latency due to the need to ensure reliable switching.
In this study, we perform a comprehensive device-to-system evaluation and co-optimization of STT-MRAM for efficient ML training accelerator design. We devise a cross-layer simulation framework to evaluate the effectiveness of STT-MRAM as a scratchpad replacing SRAM in a systolic-array-based DNN accelerator. To address the inefficiency of writes in STT-MRAM, we propose to reduce write voltage and duration. To evaluate the ensuing accuracy-efficiency trade-off, we conduct a thorough analysis of the error tolerance of input activations, weights, and errors during training. We propose heterogeneous memory configurations that enable training convergence with good accuracy. We show that MRAM provides up to 15-22x improvement in system-level energy across a suite of DNN benchmarks under iso-capacity and iso-area scenarios. Further optimizing STT-MRAM write operations can provide over 2x improvement in write energy for minimal degradation in application-level training accuracy.
Submitted 3 August, 2023;
originally announced August 2023.
-
Contextual Reliability: When Different Features Matter in Different Contexts
Authors:
Gaurav Ghosal,
Amrith Setlur,
Daniel S. Brown,
Anca D. Dragan,
Aditi Raghunathan
Abstract:
Deep neural networks often fail catastrophically by relying on spurious correlations. Most prior work assumes a clear dichotomy into spurious and reliable features; however, this is often unrealistic. For example, most of the time we do not want an autonomous car to simply copy the speed of surrounding cars -- we don't want our car to run a red light if a neighboring car does so. However, we cannot simply enforce invariance to next-lane speed, since it could provide valuable information about an unobservable pedestrian at a crosswalk. Thus, universally ignoring features that are sometimes (but not always) reliable can lead to non-robust performance. We formalize a new setting called contextual reliability which accounts for the fact that the "right" features to use may vary depending on the context. We propose and analyze a two-stage framework called Explicit Non-spurious feature Prediction (ENP) which first identifies the relevant features to use for a given context, then trains a model to rely exclusively on these features. Our work theoretically and empirically demonstrates the advantages of ENP over existing methods and provides new benchmarks for contextual reliability.
Submitted 19 July, 2023;
originally announced July 2023.
-
T-MARS: Improving Visual Representations by Circumventing Text Feature Learning
Authors:
Pratyush Maini,
Sachin Goyal,
Zachary C. Lipton,
J. Zico Kolter,
Aditi Raghunathan
Abstract:
Large web-sourced multimodal datasets have powered a slew of new methods for learning general-purpose visual representations, advancing the state of the art in computer vision and revolutionizing zero- and few-shot recognition. One crucial decision facing practitioners is how, if at all, to curate these ever-larger datasets. For example, the creators of the LAION-5B dataset chose to retain only image-caption pairs whose CLIP similarity score exceeded a designated threshold. In this paper, we propose a new state-of-the-art data filtering approach motivated by our observation that nearly 40% of LAION's images contain text that overlaps significantly with the caption. Intuitively, such data could be wasteful as it incentivizes models to perform optical character recognition rather than learning visual features. However, naively removing all such data could also be wasteful, as it throws away images that contain visual features (in addition to overlapping text). Our simple and scalable approach, T-MARS (Text Masking and Re-Scoring), filters out only those pairs where the text dominates the remaining visual features -- by first masking out the text and then filtering out those with a low CLIP similarity score of the masked image. Experimentally, T-MARS outperforms the top-ranked method on the "medium scale" of DataComp (a data filtering benchmark) by a margin of 6.5% on ImageNet and 4.7% on VTAB. Additionally, our systematic evaluation on various data pool sizes from 2M to 64M shows that the accuracy gains enjoyed by T-MARS linearly increase as data and compute are scaled exponentially. Code is available at https://github.com/locuslab/T-MARS.
Submitted 18 March, 2024; v1 submitted 6 July, 2023;
originally announced July 2023.
-
ALP: Action-Aware Embodied Learning for Perception
Authors:
Xinran Liang,
Anthony Han,
Wilson Yan,
Aditi Raghunathan,
Pieter Abbeel
Abstract:
Current methods in training and benchmarking vision models exhibit an over-reliance on passive, curated datasets. Although models trained on these datasets have shown strong performance in a wide variety of tasks such as classification, detection, and segmentation, they fundamentally are unable to generalize to an ever-evolving world due to constant out-of-distribution shifts of input data. Therefore, instead of training on fixed datasets, can we approach learning in a more human-centric and adaptive manner? In this paper, we introduce Action-Aware Embodied Learning for Perception (ALP), an embodied learning framework that incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective. Our method actively explores complex 3D environments both to learn generalizable task-agnostic visual representations and to collect downstream training data. We show that ALP outperforms existing baselines in several downstream perception tasks. In addition, we show that by training on actively collected data more relevant to the environment and task, our method generalizes more robustly to downstream tasks compared to models pre-trained on fixed datasets such as ImageNet.
Submitted 17 October, 2023; v1 submitted 16 June, 2023;
originally announced June 2023.
-
Simultaneous Trajectory Optimization and Contact Selection for Multi-Modal Manipulation Planning
Authors:
Mengchao Zhang,
Devesh K. Jha,
Arvind U. Raghunathan,
Kris Hauser
Abstract:
Complex dexterous manipulations require switching between prehensile and non-prehensile grasps, and sliding and pivoting the object against the environment. This paper presents a manipulation planner that is able to reason about diverse changes of contacts to discover such plans. It implements a hybrid approach that performs contact-implicit trajectory optimization for pivoting and sliding manipulation primitives and sampling-based planning to change between manipulation primitives and target object poses. The optimization method, simultaneous trajectory optimization and contact selection (STOCS), introduces an infinite programming framework to dynamically select from contact points and support forces between the object and environment during a manipulation primitive. To sequence manipulation primitives, a sampling-based tree-growing planner uses STOCS to construct a manipulation tree. We show that by using a powerful trajectory optimizer, the proposed planner can discover multi-modal manipulation trajectories involving grasping, sliding, and pivoting within a few dozen samples. The resulting trajectories are verified to enable a 6 DoF manipulator to manipulate physical objects successfully.
Submitted 10 June, 2023;
originally announced June 2023.
-
Covariance Steering for Uncertain Contact-rich Systems
Authors:
Yuki Shirai,
Devesh K. Jha,
Arvind U. Raghunathan
Abstract:
Planning and control for uncertain contact systems is challenging as it is not clear how to propagate uncertainty for planning. Contact-rich tasks can be modeled efficiently using complementarity constraints among other techniques. In this paper, we present a stochastic optimization technique with chance constraints for systems with stochastic complementarity constraints. We use a particle filter-based approach to propagate moments for the stochastic complementarity system. To circumvent the issues of open-loop chance-constrained planning, we propose a contact-aware controller for covariance steering of the complementarity system. Our optimization problem is formulated as a Non-Linear Programming (NLP) problem using bilevel optimization. We present an important-particle algorithm for numerical efficiency for the underlying control problem. We verify that our contact-aware closed-loop controller is able to steer the covariance of the states under stochastic contact-rich tasks.
Submitted 23 March, 2023;
originally announced March 2023.
-
Robust Pivoting Manipulation using Contact Implicit Bilevel Optimization
Authors:
Yuki Shirai,
Devesh K. Jha,
Arvind U. Raghunathan
Abstract:
Generalizable manipulation requires that robots be able to interact with novel objects and environments. This requirement makes manipulation extremely challenging as a robot has to reason about complex frictional interactions with uncertainty in physical properties of the object and the environment. In this paper, we study robust optimization for planning of pivoting manipulation in the presence of uncertainties. We present insights about how friction can be exploited to compensate for inaccuracies in the estimates of the physical properties during manipulation. Under certain assumptions, we derive analytical expressions for the stability margin provided by friction during pivoting manipulation. This margin is then used in a Contact Implicit Bilevel Optimization (CIBO) framework to optimize a trajectory that maximizes this stability margin to provide robustness against uncertainty in several physical parameters of the object. We present an analysis of the stability margin with respect to several parameters involved in the underlying bilevel optimization problem. We demonstrate our proposed method using a 6 DoF manipulator for manipulating several different objects. We also design and validate an MPC controller using the proposed algorithm which can track and regulate the position of the object during manipulation.
Submitted 4 July, 2024; v1 submitted 15 March, 2023;
originally announced March 2023.
-
X-Former: In-Memory Acceleration of Transformers
Authors:
Shrihari Sridharan,
Jacob R. Stevens,
Kaushik Roy,
Anand Raghunathan
Abstract:
Transformers have achieved great success in a wide variety of natural language processing (NLP) tasks due to the attention mechanism, which assigns an importance score for every word relative to other words in a sequence. However, these models are very large, often reaching hundreds of billions of parameters, and therefore require a large number of DRAM accesses. Hence, traditional deep neural network (DNN) accelerators such as GPUs and TPUs face limitations in processing Transformers efficiently. In-memory accelerators based on non-volatile memory promise to be an effective solution to this challenge, since they provide high storage density while performing massively parallel matrix-vector multiplications within memory arrays. However, attention score computations, which are frequently used in Transformers (unlike CNNs and RNNs), require matrix-vector multiplications (MVM) where both operands change dynamically for each input. As a result, conventional NVM-based accelerators incur high write latency and write energy when used for Transformers, and further suffer from the low endurance of most NVM technologies. To address these challenges, we present X-Former, a hybrid in-memory hardware accelerator that consists of both NVM and CMOS processing elements to execute transformer workloads efficiently. To improve the hardware utilization of X-Former, we also propose a sequence blocking dataflow, which overlaps the computations of the two processing elements and reduces execution time. Across several benchmarks, we show that X-Former achieves up to 85x and 7.5x improvements in latency and energy over an NVIDIA GeForce GTX 1060 GPU and up to 10.7x and 4.6x improvements in latency and energy over a state-of-the-art in-memory NVM accelerator.
Submitted 13 March, 2023;
originally announced March 2023.
-
Automatically Auditing Large Language Models via Discrete Optimization
Authors:
Erik Jones,
Anca Dragan,
Aditi Raghunathan,
Jacob Steinhardt
Abstract:
Auditing large language models for unexpected behaviors is critical to preempt catastrophic deployments, yet remains challenging. In this work, we cast auditing as an optimization problem, where we automatically search for input-output pairs that match a desired target behavior. For example, we might aim to find a non-toxic input that starts with "Barack Obama" that a model maps to a toxic output. This optimization problem is difficult to solve as the set of feasible points is sparse, the space is discrete, and the language models we audit are non-linear and high-dimensional. To combat these challenges, we introduce a discrete optimization algorithm, ARCA, that jointly and efficiently optimizes over inputs and outputs. Our approach automatically uncovers derogatory completions about celebrities (e.g. "Barack Obama is a legalized unborn" -> "child murderer"), produces French inputs that complete to English outputs, and finds inputs that generate a specific name. Our work offers a promising new tool to uncover models' failure-modes before deployment.
Submitted 8 March, 2023;
originally announced March 2023.
-
Bitrate-Constrained DRO: Beyond Worst Case Robustness To Unknown Group Shifts
Authors:
Amrith Setlur,
Don Dennis,
Benjamin Eysenbach,
Aditi Raghunathan,
Chelsea Finn,
Virginia Smith,
Sergey Levine
Abstract:
Training machine learning models robust to distribution shifts is critical for real-world applications. Some robust training algorithms (e.g., Group DRO) specialize to group shifts and require group information on all training points. Other methods (e.g., CVaR DRO) that do not need group annotations can be overly conservative, since they naively upweight high loss points which may form a contrived set that does not correspond to any meaningful group in the real world (e.g., when the high loss points are randomly mislabeled training points). In this work, we address limitations in prior approaches by assuming a more nuanced form of group shift: conditioned on the label, we assume that the true group function (indicator over group) is simple. For example, we may expect that group shifts occur along low bitrate features (e.g., image background, lighting). Thus, we aim to learn a model that maintains high accuracy on simple group functions realized by these low bitrate features, and that need not spend valuable model capacity achieving high accuracy on contrived groups of examples. Based on this, we consider the two-player game formulation of DRO where the adversary's capacity is bitrate-constrained. Our resulting practical algorithm, Bitrate-Constrained DRO (BR-DRO), does not require group information on training samples yet matches the performance of Group DRO on datasets that have training group annotations and that of CVaR DRO on long-tailed distributions. Our theoretical analysis reveals that in some settings the BR-DRO objective can provably yield statistically efficient and less conservative solutions than unconstrained CVaR DRO.
Submitted 11 October, 2023; v1 submitted 6 February, 2023;
originally announced February 2023.
-
Tactile Tool Manipulation
Authors:
Yuki Shirai,
Devesh K. Jha,
Arvind U. Raghunathan,
Dennis Hong
Abstract:
Humans can effortlessly perform very complex, dexterous manipulation tasks by reacting to sensor observations. In contrast, robots cannot perform reactive manipulation and they mostly operate in open-loop while interacting with their environment. Consequently, the current manipulation algorithms either are inefficient in performance or can only work in highly structured environments. In this paper, we present closed-loop control of a complex manipulation task where a robot uses a tool to interact with objects. Manipulation using a tool leads to complex kinematics and contact constraints that need to be satisfied for generating feasible manipulation trajectories. We first present an open-loop controller design using Non-Linear Programming (NLP) that satisfies these constraints. In order to design a closed-loop controller, we present a pose estimator of objects and tools using tactile sensors. Using our tactile estimator, we design a closed-loop controller based on Model Predictive Control (MPC). The proposed algorithm is verified using a 6 DoF manipulator on tasks using a variety of objects and tools. We verify that our closed-loop controller can successfully perform tool manipulation under several unexpected contacts. A video summarizing this work and the hardware experiments can be found at https://youtu.be/VsClK04qDhk.
Submitted 23 March, 2023; v1 submitted 16 January, 2023;
originally announced January 2023.
-
Learning Representations that Enable Generalization in Assistive Tasks
Authors:
Jerry Zhi-Yang He,
Aditi Raghunathan,
Daniel S. Brown,
Zackory Erickson,
Anca D. Dragan
Abstract:
Recent work in sim2real has successfully enabled robots to act in physical environments by training in simulation with a diverse "population" of environments (i.e. domain randomization). In this work, we focus on enabling generalization in assistive tasks: tasks in which the robot is acting to assist a user (e.g. helping someone with motor impairments with bathing or with scratching an itch). Such tasks are particularly interesting relative to prior sim2real successes because the environment now contains a human who is also acting. This complicates the problem because the diversity of human users (instead of merely physical environment parameters) is more difficult to capture in a population, thus increasing the likelihood of encountering out-of-distribution (OOD) human policies at test time. We advocate that generalization to such OOD policies benefits from (1) learning a good latent representation for human policies that test-time humans can accurately be mapped to, and (2) making that representation adaptable with test-time interaction data, instead of relying on it to perfectly capture the space of human policies based on the simulated population only. We study how to best learn such a representation by evaluating on purposefully constructed OOD test policies. We find that sim2real methods that encode environment (or population) parameters, and that work well in tasks that robots do in isolation, do not work well in assistance. In assistance, it seems crucial to train the representation based on the history of interaction directly, because that is what the robot will have access to at test time. Further, training these representations to then predict human actions not only gives them better structure, but also enables them to be fine-tuned at test time, when the robot observes the partner act. See https://adaptive-caregiver.github.io.
Submitted 5 December, 2022;
originally announced December 2022.
-
Finetune like you pretrain: Improved finetuning of zero-shot vision models
Authors:
Sachin Goyal,
Ananya Kumar,
Sankalp Garg,
Zico Kolter,
Aditi Raghunathan
Abstract:
Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety of benchmarks. However, recent works like WiseFT (Wortsman et al., 2021) and LP-FT (Kumar et al., 2022) have shown that even subtle differences in the finetuning process can lead to surprisingly large differences in the final performance, both for in-distribution (ID) and out-of-distribution (OOD) data. In this work, we show that a natural and simple approach of mimicking contrastive pretraining consistently outperforms alternative finetuning approaches. Specifically, we cast downstream class labels as text prompts and continue optimizing the contrastive loss between image embeddings and class-descriptive prompt embeddings (contrastive finetuning).
Our method consistently outperforms baselines across 7 distribution shifts, 6 transfer learning, and 3 few-shot learning benchmarks. On WILDS-iWILDCam, our proposed approach FLYP outperforms the top of the leaderboard by $2.3\%$ ID and $2.7\%$ OOD, giving the highest reported accuracy. Averaged across 7 OOD datasets (2 WILDS and 5 ImageNet associated shifts), FLYP gives gains of $4.2\%$ OOD over standard finetuning and outperforms the current state of the art (LP-FT) by more than $1\%$ both ID and OOD. Similarly, on 3 few-shot learning benchmarks, our approach gives gains up to $4.6\%$ over standard finetuning and $4.4\%$ over the state of the art. In total, these benchmarks establish contrastive finetuning as a simple, intuitive, and state-of-the-art approach for supervised finetuning of image-text models like CLIP. Code is available at https://github.com/locuslab/FLYP.
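As an illustration only (not the authors' released code), the contrastive objective described above can be sketched in plain Python. The sketch assumes each image in a batch is paired with the prompt embedding of its class, uses a symmetric cross-entropy over the cosine-similarity matrix, and picks a temperature of 0.07; the function names, the temperature value, and the assumption that all prompts in a batch are distinct are hypothetical choices made for this example.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    image_emb[i] and text_emb[i] are assumed to be the matched pair i
    (an image and the prompt describing its class).
    """
    imgs = [normalize(v) for v in image_emb]
    txts = [normalize(v) for v in text_emb]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text direction: each row should peak on its own column.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each column should peak on its own row.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

With matched pairs the loss is close to zero, and swapping the prompts makes it large, which is the signal that finetuning would minimize.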
Submitted 1 December, 2022;
originally announced December 2022.
-
Using Language to Extend to Unseen Domains
Authors:
Lisa Dunlap,
Clara Mohri,
Devin Guillory,
Han Zhang,
Trevor Darrell,
Joseph E. Gonzalez,
Aditi Raghunathan,
Anja Rohrbach
Abstract:
It is expensive to collect training data for every possible domain that a vision model may encounter when deployed. We instead consider how simply verbalizing the training domain (e.g. "photos of birds") as well as domains we want to extend to but do not have data for (e.g. "paintings of birds") can improve robustness. Using a multimodal model with a joint image and language embedding space, our method LADS learns a transformation of the image embeddings from the training domain to each unseen test domain, while preserving task relevant information. Without using any images from the unseen test domain, we show that over the extended domain containing both training and unseen test domains, LADS outperforms standard fine-tuning and ensemble approaches over a suite of four benchmarks targeting domain adaptation and dataset bias.
Submitted 29 April, 2023; v1 submitted 17 October, 2022;
originally announced October 2022.
-
Approximate Computing and the Efficient Machine Learning Expedition
Authors:
Jörg Henkel,
Hai Li,
Anand Raghunathan,
Mehdi B. Tahoori,
Swagath Venkataramani,
Xiaoxuan Yang,
Georgios Zervakis
Abstract:
Approximate computing (AxC) has long been accepted as a design alternative for efficient system implementation at the cost of relaxed accuracy requirements. Despite AxC research activities in various application domains, AxC thrived over the past decade when it was applied to Machine Learning (ML). The inherently approximate nature of ML models, together with the increased computational overheads of ML applications (which were effectively mitigated by corresponding approximations), led to a perfect match and a fruitful synergy. AxC for AI/ML has moved beyond academic prototypes. In this work, we shed light on the synergistic nature of AxC and ML and elucidate the impact of AxC in designing efficient ML systems. To that end, we present an overview and taxonomy of AxC for ML and use two descriptive application scenarios to demonstrate how AxC boosts the efficiency of ML systems.
Submitted 2 October, 2022;
originally announced October 2022.
-
Constrained Dynamic Movement Primitives for Safe Learning of Motor Skills
Authors:
Seiji Shaw,
Devesh K. Jha,
Arvind Raghunathan,
Radu Corcodel,
Diego Romeres,
George Konidaris,
Daniel Nikovski
Abstract:
Dynamic movement primitives are widely used for learning skills which can be demonstrated to a robot by a skilled human or controller. While their generalization capabilities and simple formulation make them very appealing to use, they possess no strong guarantees to satisfy operational safety constraints for a task. In this paper, we present constrained dynamic movement primitives (CDMP) which can allow for constraint satisfaction in the robot workspace. We present a formulation of a non-linear optimization to perturb the DMP forcing weights regressed by locally-weighted regression to admit a Zeroing Barrier Function (ZBF), which certifies workspace constraint satisfaction. We demonstrate the proposed CDMP under different constraints on the end-effector movement such as obstacle avoidance and workspace constraints on a physical robot. A video showing the implementation of the proposed algorithm using different manipulators in different environments can be found at https://youtu.be/hJegJJkJfys.
Submitted 28 September, 2022;
originally announced September 2022.
-
Transformer Networks for Predictive Group Elevator Control
Authors:
Jing Zhang,
Athanasios Tsiligkaridis,
Hiroshi Taguchi,
Arvind Raghunathan,
Daniel Nikovski
Abstract:
We propose a Predictive Group Elevator Scheduler that uses predictive information about passenger arrivals from a Transformer-based destination predictor and a linear regression model that predicts remaining time to destinations. Through extensive empirical evaluation, we find that the savings in Average Waiting Time (AWT) can exceed 50% for light arrival streams and reach around 15% for medium arrival streams in afternoon down-peak traffic regimes. Such results can be obtained after carefully setting the Predicted Probability of Going to Elevator (PPGE) threshold, thus avoiding a majority of false predictions for people heading to the elevator, while achieving as high as 80% of true predictive elevator landings after seeing only 60% of the whole trajectory of a passenger.
Submitted 15 August, 2022;
originally announced August 2022.
-
Test-Time Adaptation via Conjugate Pseudo-labels
Authors:
Sachin Goyal,
Mingjie Sun,
Aditi Raghunathan,
Zico Kolter
Abstract:
Test-time adaptation (TTA) refers to adapting neural networks to distribution shifts, with access to only the unlabeled test samples from the new domain at test-time. Prior TTA methods optimize over unsupervised objectives such as the entropy of model predictions in TENT [Wang et al., 2021], but it is unclear what exactly makes a good TTA loss. In this paper, we start by presenting a surprising phenomenon: if we attempt to meta-learn the best possible TTA loss over a wide class of functions, then we recover a function that is remarkably similar to (a temperature-scaled version of) the softmax-entropy employed by TENT. This only holds, however, if the classifier we are adapting is trained via cross-entropy; if trained via squared loss, a different best TTA loss emerges. To explain this phenomenon, we analyze TTA through the lens of the training loss's convex conjugate. We show that under natural conditions, this (unsupervised) conjugate function can be viewed as a good local approximation to the original supervised loss and indeed, it recovers the best losses found by meta-learning. This leads to a generic recipe that can be used to find a good TTA loss for any given supervised training loss function of a general class. Empirically, our approach consistently dominates other baselines over a wide range of benchmarks. Our approach is particularly of interest when applied to classifiers trained with novel loss functions, e.g., the recently-proposed PolyLoss, where it differs substantially from (and outperforms) an entropy-based loss. Further, we show that our approach can also be interpreted as a kind of self-training using a very specific soft label, which we refer to as the conjugate pseudolabel. Overall, our method provides a broad framework for better understanding and improving test-time adaptation. Code is available at https://github.com/locuslab/tta_conjugate.
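The link between cross-entropy training and the softmax-entropy loss can be checked numerically. The sketch below is not taken from the paper's code: it assumes the supervised loss has the form L(h, y) = f(h) - <y, h> with f = logsumexp (which is cross-entropy on logits h), and it takes the unsupervised conjugate loss to be f(h) - <grad f(h), h>. That functional form is this example's reading of the abstract, and the temperature scaling the abstract mentions is omitted. Under those assumptions grad f(h) = softmax(h), which plays the role of the soft pseudo-label, and the resulting loss equals the entropy of the softmax.

```python
import math

def logsumexp(h):
    """Numerically stable log-sum-exp of a list of logits."""
    m = max(h)
    return m + math.log(sum(math.exp(x - m) for x in h))

def softmax(h):
    """Softmax of the logits; also the gradient of logsumexp."""
    lse = logsumexp(h)
    return [math.exp(x - lse) for x in h]

def conjugate_loss_cross_entropy(h):
    """f(h) - <grad f(h), h> for f = logsumexp (assumed form, see lead-in)."""
    p = softmax(h)  # grad f(h), used here as the soft pseudo-label
    return logsumexp(h) - sum(pi * hi for pi, hi in zip(p, h))

def softmax_entropy(h):
    """Shannon entropy of softmax(h), the quantity minimized by TENT."""
    p = softmax(h)
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)
```

For any logit vector the two functions agree up to floating-point error, since -sum(p_i log p_i) = logsumexp(h) - sum(p_i h_i).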
Submitted 22 November, 2022; v1 submitted 20 July, 2022;
originally announced July 2022.
-
Calibrated ensembles can mitigate accuracy tradeoffs under distribution shift
Authors:
Ananya Kumar,
Tengyu Ma,
Percy Liang,
Aditi Raghunathan
Abstract:
We often see undesirable tradeoffs in robust machine learning where out-of-distribution (OOD) accuracy is at odds with in-distribution (ID) accuracy: a robust classifier obtained via specialized techniques such as removing spurious features often has better OOD but worse ID accuracy compared to a standard classifier trained via ERM. In this paper, we find that ID-calibrated ensembles -- where we simply ensemble the standard and robust models after calibrating on only ID data -- outperform prior state-of-the-art (based on self-training) on both ID and OOD accuracy. On eleven natural distribution shift datasets, ID-calibrated ensembles obtain the best of both worlds: strong ID accuracy and OOD accuracy. We analyze this method in stylized settings, and identify two important conditions for ensembles to perform well both ID and OOD: (1) we need to calibrate the standard and robust models (on ID data, because OOD data is unavailable), (2) OOD has no anticorrelated spurious features.
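A minimal sketch of the calibrate-then-ensemble idea follows. The abstract does not say how calibration or ensembling is performed, so everything specific here is an assumption made for illustration: calibration is done by temperature scaling with a grid search over the ID validation negative log-likelihood, and the ensemble averages the two calibrated probability vectors with equal weight.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood on a labeled (ID) validation set."""
    return -sum(math.log(softmax(row, temperature)[y])
                for row, y in zip(logit_rows, labels)) / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a grid that minimizes ID validation NLL."""
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25, 0.50, ..., 10.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

def calibrated_ensemble(std_logits, rob_logits, t_std, t_rob):
    """Average the calibrated probabilities of the standard and robust models."""
    p_std = softmax(std_logits, t_std)
    p_rob = softmax(rob_logits, t_rob)
    return [0.5 * (a + b) for a, b in zip(p_std, p_rob)]
```

An overconfident model (large logits, imperfect accuracy) receives a temperature above 1, so it no longer dominates the average purely because of its logit scale.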
Submitted 18 July, 2022;
originally announced July 2022.
-
Recursive McCormick Linearization of Multilinear Programs
Authors:
Arvind U Raghunathan,
Carlos Cardonha,
David Bergman,
Carlos J Nohra
Abstract:
Linear programming (LP) relaxations are widely employed in exact solution methods for multilinear programs (MLP). One example is the family of Recursive McCormick Linearization (RML) strategies, where bilinear products are replaced by artificial variables, which deliver a relaxation of the original problem when introduced together with concave and convex envelopes. In this article, we introduce the first systematic approach for identifying RMLs, in which we focus on the identification of linear relaxations with a small number of artificial variables and with strong LP bounds. We present a novel mechanism for representing all the possible RMLs, which we use to design an exact mixed-integer programming (MIP) formulation for the identification of minimum-size RMLs; we show that this problem is NP-hard in general, whereas a special case is fixed-parameter tractable. Moreover, we explore structural properties of our formulation to derive an exact MIP model that identifies RMLs of a given size with the best possible relaxation bound. Our numerical results on a collection of benchmarks indicate that our algorithms outperform the RML strategy implemented in state-of-the-art global optimization solvers.
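For readers unfamiliar with the building block, the standard McCormick envelopes for a single bilinear term w = x*y over a box can be written out directly. The sketch below only evaluates those four textbook inequalities at a point and shows one recursive step for a trilinear term; it is not the paper's MIP formulation, and the function names are hypothetical.

```python
def mccormick_bounds(x, y, x_lo, x_hi, y_lo, y_hi):
    """Bounds on w = x*y implied by the four McCormick inequalities at (x, y)."""
    lower = max(x_lo * y + x * y_lo - x_lo * y_lo,   # from (x - x_lo)(y - y_lo) >= 0
                x_hi * y + x * y_hi - x_hi * y_hi)   # from (x_hi - x)(y_hi - y) >= 0
    upper = min(x_hi * y + x * y_lo - x_hi * y_lo,   # from (x_hi - x)(y - y_lo) >= 0
                x_lo * y + x * y_hi - x_lo * y_hi)   # from (x - x_lo)(y_hi - y) >= 0
    return lower, upper

def product_range(x_lo, x_hi, y_lo, y_hi):
    """Interval bounds for the artificial variable w = x*y over the box."""
    corners = [x_lo * y_lo, x_lo * y_hi, x_hi * y_lo, x_hi * y_hi]
    return min(corners), max(corners)

def trilinear_bounds(x1, x2, x3, box):
    """One recursive linearization of x1*x2*x3: w12 = x1*x2, then w = w12*x3.

    `box` is a list of three (lo, hi) pairs. The grouping (x1*x2)*x3 is one
    of several possible orderings; others may give different relaxations.
    """
    (l1, u1), (l2, u2), (l3, u3) = box
    w12_lo, w12_hi = product_range(l1, u1, l2, u2)
    return mccormick_bounds(x1 * x2, x3, w12_lo, w12_hi, l3, u3)
```

On the unit box, `mccormick_bounds(0.5, 0.5, 0, 1, 0, 1)` brackets the true product 0.25 between 0.0 and 0.5. The choice of which products to group first is the degree of freedom that the abstract's search over RMLs addresses.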
Submitted 18 July, 2022;
originally announced July 2022.
-
Agreement-on-the-Line: Predicting the Performance of Neural Networks under Distribution Shift
Authors:
Christina Baek,
Yiding Jiang,
Aditi Raghunathan,
Zico Kolter
Abstract:
Recently, Miller et al. showed that a model's in-distribution (ID) accuracy has a strong linear correlation with its out-of-distribution (OOD) accuracy on several OOD benchmarks -- a phenomenon they dubbed ''accuracy-on-the-line''. While a useful tool for model selection (i.e., the model most likely to perform the best OOD is the one with highest ID accuracy), this fact does not help estimate the actual OOD performance of models without access to a labeled OOD validation set. In this paper, we show a similar but surprising phenomenon also holds for the agreement between pairs of neural network classifiers: whenever accuracy-on-the-line holds, we observe that the OOD agreement between the predictions of any pair of neural networks (with potentially different architectures) also observes a strong linear correlation with their ID agreement. Furthermore, we observe that the slope and bias of OOD vs ID agreement closely match those of OOD vs ID accuracy. This phenomenon, which we call agreement-on-the-line, has important practical applications: without any labeled data, we can predict the OOD accuracy of classifiers, since OOD agreement can be estimated with just unlabeled data. Our prediction algorithm outperforms previous methods both in shifts where agreement-on-the-line holds and, surprisingly, when accuracy is not on the line. This phenomenon also provides new insights into deep neural networks: unlike accuracy-on-the-line, agreement-on-the-line appears to only hold for neural network classifiers.
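The practical consequence can be sketched as a two-step estimator: fit a line to (ID agreement, OOD agreement) pairs, which need no labels, then reuse its slope and bias to map ID accuracy to a predicted OOD accuracy. This is a simplified illustration and not the paper's prediction algorithm: it fits an ordinary least-squares line on raw values and omits any rescaling of the axes, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and bias for y = slope * x + bias."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_ood_accuracy(id_agreement, ood_agreement, id_accuracy):
    """Reuse the agreement line to predict OOD accuracy from ID accuracy.

    id_agreement / ood_agreement: one value per model pair, computed from
    unlabeled predictions. id_accuracy: one value per model, from labeled
    ID data.
    """
    slope, bias = fit_line(id_agreement, ood_agreement)
    return [slope * acc + bias for acc in id_accuracy]
```

The estimate relies on the abstract's observation that both lines share slope and bias; where that fails, the prediction carries no guarantee.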
Submitted 10 May, 2023; v1 submitted 27 June, 2022;
originally announced June 2022.
-
A Co-design view of Compute in-Memory with Non-Volatile Elements for Neural Networks
Authors:
Wilfried Haensch,
Anand Raghunathan,
Kaushik Roy,
Bhaswar Chakrabarti,
Charudatta M. Phatak,
Cheng Wang,
Supratik Guha
Abstract:
Deep Learning neural networks are pervasive, but traditional computer architectures are reaching the limits of being able to efficiently execute them for the large workloads of today. They are limited by the von Neumann bottleneck: the high cost in energy and latency incurred in moving data between memory and the compute engine. Today, special CMOS designs address this bottleneck. The next generation of computing hardware will need to eliminate or dramatically mitigate this bottleneck. We discuss how compute-in-memory can play an important part in this development. Here, a non-volatile memory based cross-bar architecture forms the heart of an engine that uses an analog process to parallelize the matrix vector multiplication operation, repeatedly used in all neural network workloads. The cross-bar architecture, at times referred to as a neuromorphic approach, can be a key hardware element in future computing machines. In the first part of this review we take a co-design view of the design constraints and the demands it places on the new materials and memory devices that anchor the cross-bar architecture. In the second part, we review what is known about the different new non-volatile memory materials and devices suited for compute in-memory, and discuss the outlook and challenges.
Submitted 3 June, 2022;
originally announced June 2022.
-
STeP-CiM: Strain-enabled Ternary Precision Computation-in-Memory based on Non-Volatile 2D Piezoelectric Transistors
Authors:
Niharika Thakuria,
Reena Elangovan,
Sandeep K Thirumala,
Anand Raghunathan,
Sumeet K. Gupta
Abstract:
We propose 2D Piezoelectric FET (PeFET) based compute-enabled non-volatile memory for ternary deep neural networks (DNNs). PeFETs consist of a material with ferroelectric and piezoelectric properties coupled with a Transition Metal Dichalcogenide channel. We utilize (a) ferroelectricity to store binary bits (0/1) in the form of polarization (-P/+P) and (b) polarization dependent piezoelectricity to read the stored state by means of strain-induced bandgap change in the Transition Metal Dichalcogenide channel. The unique read mechanism of PeFETs enables us to expand the traditional association of +P (-P) with low (high) resistance states to their dual high (low) resistance depending on read voltage. Specifically, we demonstrate that +P (-P) stored in PeFETs can be dynamically configured in (a) a low (high) resistance state for positive read voltages and (b) their dual high (low) resistance states for negative read voltages, without incurring a read disturb. Such a feature, which we name Polarization Preserved Piezoelectric Effect Reversal with Dual Voltage Polarity (PiER), is unique to PeFETs and has not been shown in hitherto explored memories. We leverage PiER to propose a Strain-enabled Ternary Precision Computation-in-Memory (STeP-CiM) cell with capabilities of computing the scalar product of the stored weight and input, both of which are represented with signed ternary precision. Further, using multi word-line assertion of STeP-CiM cells, we achieve massively parallel computation of dot products of signed ternary inputs and weights. Our array level analysis shows 91% lower delay and improvements of 15% and 91% in energy for in-memory multiply-and-accumulate operations compared to near-memory design approaches based on SRAM and PeFET respectively. STeP-CiM exhibits up to 8.91x improvement in performance and 6.07x average improvement in energy over SRAM/PeFET based near-memory design.
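The arithmetic such an array computes can be written as a purely functional reference model. The sketch below says nothing about the PeFET circuit itself: the two-bit (positive, negative) encoding of each ternary value and the function names are assumptions chosen for illustration, showing that a signed ternary dot product reduces to bitwise ANDs plus two counters.

```python
def encode(t):
    """Map a ternary value to a hypothetical (positive_bit, negative_bit) pair."""
    if t not in (-1, 0, 1):
        raise ValueError(f"not a ternary value: {t}")
    return (1 if t == 1 else 0, 1 if t == -1 else 0)

def ternary_dot(inputs, weights):
    """Dot product of two signed ternary vectors using bitwise logic and counting."""
    plus = 0   # number of element-wise products equal to +1
    minus = 0  # number of element-wise products equal to -1
    for i, w in zip(inputs, weights):
        i_pos, i_neg = encode(i)
        w_pos, w_neg = encode(w)
        plus += (i_pos & w_pos) | (i_neg & w_neg)    # signs agree
        minus += (i_pos & w_neg) | (i_neg & w_pos)   # signs differ
    return plus - minus
```

For example, `ternary_dot([1, -1, 0, 1], [1, 1, -1, -1])` counts one +1 product and two -1 products, giving -1, the same value as the ordinary integer dot product.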
Submitted 30 March, 2022;
originally announced March 2022.
-
Robust Pivoting: Exploiting Frictional Stability Using Bilevel Optimization
Authors:
Yuki Shirai,
Devesh K. Jha,
Arvind Raghunathan,
Diego Romeres
Abstract:
Generalizable manipulation requires that robots be able to interact with novel objects and environments. This requirement makes manipulation extremely challenging as a robot has to reason about complex frictional interaction with uncertainty in physical properties of the object. In this paper, we study robust optimization for control of pivoting manipulation in the presence of uncertainties. We present insights about how friction can be exploited to compensate for the inaccuracies in the estimates of the physical properties during manipulation. In particular, we derive analytical expressions for the stability margin provided by friction during pivoting manipulation. This margin is then used in a bilevel trajectory optimization algorithm to design a controller that maximizes this stability margin to provide robustness against uncertainty in physical properties of the object. We demonstrate our proposed method using a 6 DoF manipulator for manipulating several different objects.
Submitted 21 March, 2022;
originally announced March 2022.
-
PYROBOCOP: Python-based Robotic Control & Optimization Package for Manipulation
Authors:
Arvind Raghunathan,
Devesh K. Jha,
Diego Romeres
Abstract:
PYROBOCOP is a Python-based package for control, optimization and estimation of robotic systems described by nonlinear Differential Algebraic Equations (DAEs). In particular, the package can handle systems with contacts that are described by complementarity constraints and provides a general framework for specifying obstacle avoidance constraints. The package performs direct transcription of the DAEs into a set of nonlinear equations by performing orthogonal collocation on finite elements. PYROBOCOP provides automatic reformulation of the complementarity constraints that are tractable to NLP solvers to perform optimization of robotic systems. The package is interfaced with ADOL-C[1] for obtaining sparse derivatives by automatic differentiation and IPOPT[2] for performing optimization. We evaluate PYROBOCOP on several manipulation problems for control and estimation.
Submitted 18 March, 2022;
originally announced March 2022.
-
Chance-Constrained Optimization in Contact-Rich Systems for Robust Manipulation
Authors:
Yuki Shirai,
Devesh K. Jha,
Arvind Raghunathan,
Diego Romeres
Abstract:
This paper presents a chance-constrained formulation for robust trajectory optimization during manipulation. In particular, we present a chance-constrained optimization for Stochastic Discrete-time Linear Complementarity Systems (SDLCS). To solve the optimization problem, we formulate Mixed-Integer Quadratic Programming with Chance Constraints (MIQPCC). In our formulation, we explicitly consider joint chance constraints for complementarity as well as states to capture the stochastic evolution of dynamics. We evaluate robustness of our optimized trajectories in simulation on several systems. The proposed approach outperforms some recent approaches for robust trajectory optimization for SDLCS.
Submitted 4 March, 2022;
originally announced March 2022.
-
Piezoelectric Strain FET (PeFET) based Non-Volatile Memories
Authors:
Niharika Thakuria,
Reena Elangovan,
Anand Raghunathan,
Sumeet K. Gupta
Abstract:
We propose non-volatile memory (NVM) designs based on Piezoelectric Strain FET (PeFET) utilizing a piezoelectric/ferroelectric (PE/FE such as PZT) coupled with 2D Transition Metal Dichalcogenide (2D-TMD such as MoS2) transistor. The proposed NVMs store bit information in the form of polarization (P) of the FE/PE, use electric-field driven P-switching for write and employ piezoelectricity induced dynamic bandgap modulation of 2D-TMD channel for bit sensing. We analyze PeFET with COMSOL based 3D modeling showing that the circuit-driven optimization of PeFET geometry is essential to achieve effective hammer-and-nail effect and adequate bandgap modulation for NVM read. Our results show that distinguishability of binary states of up to 11X is achieved in PeFETs. We propose various flavors of PeFET NVMs, namely (a) high density (HD) NVM featuring a compact access-transistor-less bit-cell, (b) 1T-1PeFET NVM with segmented architecture, targeted for optimized write energy and latency and (c) cross-coupled (CC) NVM offering a trade-off between area and latency. PeFET NVMs offer up to 7X smaller cell area, 66% lower write energy, 87% lower read energy and 44% faster read compared to 2D-FET SRAM. This comes at the cost of high write latency in PeFET NVMs, which can be minimized by virtue of optimized PE geometry.
Submitted 5 April, 2022; v1 submitted 28 February, 2022;
originally announced March 2022.
-
Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution
Authors:
Ananya Kumar,
Aditi Raghunathan,
Robbie Jones,
Tengyu Ma,
Percy Liang
Abstract:
When transferring a pretrained model to a downstream task, two popular methods are full fine-tuning (updating all the model parameters) and linear probing (updating only the last linear layer -- the "head"). It is well known that fine-tuning leads to better accuracy in-distribution (ID). However, in this paper, we find that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large. On 10 distribution shift datasets (Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR $\to$ STL, CIFAR10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting: fine-tuning overparameterized two-layer linear networks. We prove that the OOD error of fine-tuning is high when we initialize with a fixed or random head -- this is because while fine-tuning learns the head, the lower layers of the neural network change simultaneously and distort the pretrained features. Our analysis suggests that the easy two-step strategy of linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning heuristic, combines the benefits of both fine-tuning and linear probing. Empirically, LP-FT outperforms both fine-tuning and linear probing on the above datasets (1% better ID, 10% better OOD than full fine-tuning).
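The two-step LP-FT schedule can be illustrated on a deliberately tiny model. The sketch below is a toy, not the paper's experimental setup: the "network" is the scalar model y_hat = v * (b * x), where b stands in for the pretrained features and v for the head, and the data, learning rate, step counts, and zero head initialization are all assumptions chosen for the example.

```python
def mse_grads(b, v, data):
    """Gradients of mean squared error for the toy model y_hat = v * (b * x)."""
    n = len(data)
    grad_b = sum(2 * (v * b * x - y) * v * x for x, y in data) / n
    grad_v = sum(2 * (v * b * x - y) * b * x for x, y in data) / n
    return grad_b, grad_v

def train(b, v, data, steps, lr, update_features):
    """Gradient descent; the feature weight b is frozen unless update_features."""
    for _ in range(steps):
        grad_b, grad_v = mse_grads(b, v, data)
        if update_features:
            b -= lr * grad_b
        v -= lr * grad_v
    return b, v

def lp_ft(b_pretrained, data, lp_steps=200, ft_steps=200, lr=0.05):
    """Linear probing (head only), then full fine-tuning from the probed head."""
    b, v = train(b_pretrained, 0.0, data, lp_steps, lr, update_features=False)
    return train(b, v, data, ft_steps, lr, update_features=True)
```

On data such as `[(1.0, 2.0), (2.0, 4.0)]` with `b_pretrained = 1.0`, the probed head already fits the data, so the fine-tuning stage leaves b essentially unchanged. Calling `train(1.0, 0.0, data, 400, 0.05, True)` instead, which fine-tunes from the untrained head, moves b away from its pretrained value, a small-scale analogue of the feature distortion the abstract describes.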
Submitted 21 February, 2022;
originally announced February 2022.
-
An Explanation of In-context Learning as Implicit Bayesian Inference
Authors:
Sang Michael Xie,
Aditi Raghunathan,
Percy Liang,
Tengyu Ma
Abstract:
Large language models (LMs) such as GPT-3 have the surprising ability to do in-context learning, where the model learns to do a downstream task simply by conditioning on a prompt consisting of input-output examples. The LM learns from these examples without being explicitly pretrained to learn. Thus, it is unclear what enables in-context learning. In this paper, we study how in-context learning can emerge when pretraining documents have long-range coherence. Here, the LM must infer a latent document-level concept to generate coherent next tokens during pretraining. At test time, in-context learning occurs when the LM also infers a shared latent concept between examples in a prompt. We prove when this occurs despite a distribution mismatch between prompts and pretraining data in a setting where the pretraining distribution is a mixture of HMMs. In contrast to messy large-scale datasets used to train LMs capable of in-context learning, we generate a small-scale synthetic dataset (GINC) where Transformers and LSTMs both exhibit in-context learning. Beyond the theory, experiments on GINC exhibit large-scale real-world phenomena including improved in-context performance with model scaling (despite the same pretraining loss), sensitivity to example order, and instances where zero-shot is better than few-shot in-context learning.
Submitted 21 July, 2022; v1 submitted 3 November, 2021;
originally announced November 2021.
-
On the Opportunities and Risks of Foundation Models
Authors:
Rishi Bommasani,
Drew A. Hudson,
Ehsan Adeli,
Russ Altman,
Simran Arora,
Sydney von Arx,
Michael S. Bernstein,
Jeannette Bohg,
Antoine Bosselut,
Emma Brunskill,
Erik Brynjolfsson,
Shyamal Buch,
Dallas Card,
Rodrigo Castellon,
Niladri Chatterji,
Annie Chen,
Kathleen Creel,
Jared Quincy Davis,
Dora Demszky,
Chris Donahue,
Moussa Doumbouya,
Esin Durmus,
Stefano Ermon,
John Etchemendy,
Kawin Ethayarajh
, et al. (89 additional authors not shown)
Abstract:
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.
Submitted 12 July, 2022; v1 submitted 16 August, 2021;
originally announced August 2021.
-
Just Train Twice: Improving Group Robustness without Training Group Information
Authors:
Evan Zheran Liu,
Behzad Haghgoo,
Annie S. Chen,
Aditi Raghunathan,
Pang Wei Koh,
Shiori Sagawa,
Percy Liang,
Chelsea Finn
Abstract:
Standard training via empirical risk minimization (ERM) can produce models that achieve high accuracy on average but low accuracy on certain groups, especially in the presence of spurious correlations between the input and label. Prior approaches that achieve high worst-group accuracy, like group distributionally robust optimization (group DRO) require expensive group annotations for each training point, whereas approaches that do not use such group annotations typically achieve unsatisfactory worst-group accuracy. In this paper, we propose a simple two-stage approach, JTT, that first trains a standard ERM model for several epochs, and then trains a second model that upweights the training examples that the first model misclassified. Intuitively, this upweights examples from groups on which standard ERM models perform poorly, leading to improved worst-group performance. Averaged over four image classification and natural language processing tasks with spurious correlations, JTT closes 75% of the gap in worst-group accuracy between standard ERM and group DRO, while only requiring group annotations on a small validation set in order to tune hyperparameters.
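The second stage of the two-stage approach hinges on how training examples are reweighted. A minimal sketch of that step is shown below; the function names and the representation of upweighting as per-example loss weights are assumptions made for illustration (an implementation could equally duplicate the misclassified examples), and the upweight factor is a hyperparameter that the abstract says is tuned on a small validation set.

```python
def jtt_sample_weights(first_stage_preds, labels, upweight):
    """Weight `upweight` for examples the first model got wrong, 1.0 otherwise."""
    return [upweight if pred != label else 1.0
            for pred, label in zip(first_stage_preds, labels)]

def weighted_error(preds, labels, weights):
    """Weighted misclassification rate, the quantity the second stage targets."""
    wrong = sum(w for p, y, w in zip(preds, labels, weights) if p != y)
    return wrong / sum(weights)
```

With first-stage predictions `[0, 1, 1, 0]`, labels `[0, 1, 0, 0]`, and an upweight of 5.0, the weights are `[1.0, 1.0, 5.0, 1.0]`: the single misclassified example now accounts for 5/8 of the weighted objective that the second model is trained on.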
Submitted 27 September, 2021; v1 submitted 19 July, 2021;
originally announced July 2021.