-
Data curation via joint example selection further accelerates multimodal learning
Authors:
Talfan Evans,
Nikhil Parthasarathy,
Hamza Merzic,
Olivier J. Henaff
Abstract:
Data curation is an essential component of large-scale pretraining. In this work, we demonstrate that jointly selecting batches of data is more effective for learning than selecting examples independently. Multimodal contrastive objectives expose the dependencies between data and thus naturally yield criteria for measuring the joint learnability of a batch. We derive a simple and tractable algorithm for selecting such batches, which significantly accelerate training beyond individually-prioritized data points. As performance improves by selecting from larger super-batches, we also leverage recent advances in model approximation to reduce the associated computational overhead. As a result, our approach--multimodal contrastive learning with joint example selection (JEST)--surpasses state-of-the-art models with up to 13$\times$ fewer iterations and 10$\times$ less computation. Essential to the performance of JEST is the ability to steer the data selection process towards the distribution of smaller, well-curated datasets via pretrained reference models, exposing the level of data curation as a new dimension for neural scaling laws.
Submitted 25 June, 2024;
originally announced June 2024.
-
High Performance P300 Spellers Using GPT2 Word Prediction With Cross-Subject Training
Authors:
Nithin Parthasarathy,
James Soetedjo,
Saarang Panchavati,
Nitya Parthasarathy,
Corey Arnold,
Nader Pouratian,
William Speier
Abstract:
Amyotrophic lateral sclerosis (ALS) severely impairs patients' ability to communicate, often leading to a decline in their quality of life within a few years of diagnosis. The P300 speller brain-computer interface (BCI) offers an alternative communication method by interpreting a subject's EEG response to characters presented on a grid interface.
This paper addresses the common speed limitations encountered in training efficient P300-based multi-subject classifiers by introducing innovative "across-subject" classifiers. We leverage a combination of the second-generation Generative Pre-Trained Transformer (GPT2) and Dijkstra's algorithm to optimize stimuli and suggest word completion choices based on typing history. Additionally, we employ a multi-layered smoothing technique to accommodate out-of-vocabulary (OOV) words.
Through extensive simulations involving random sampling of EEG data from subjects, we demonstrate significant speed enhancements in typing passages containing rare and OOV words. These optimizations result in approximately 10% improvement in character-level typing speed and up to 40% improvement in multi-word prediction. We demonstrate that augmenting standard row/column highlighting techniques with layered word prediction yields close-to-optimal performance.
Furthermore, we explore both "within-subject" and "across-subject" training techniques, showing that speed improvements are consistent across both approaches.
Submitted 22 May, 2024;
originally announced May 2024.
-
Probing Biological and Artificial Neural Networks with Task-dependent Neural Manifolds
Authors:
Michael Kuoch,
Chi-Ning Chou,
Nikhil Parthasarathy,
Joel Dapello,
James J. DiCarlo,
Haim Sompolinsky,
SueYeon Chung
Abstract:
Recently, growth in our understanding of the computations performed in both biological and artificial neural networks has largely been driven by either low-level mechanistic studies or global normative approaches. However, concrete methodologies for bridging the gap between these levels of abstraction remain elusive. In this work, we investigate the internal mechanisms of neural networks through the lens of neural population geometry, aiming to provide understanding at an intermediate level of abstraction, as a way to bridge that gap. Utilizing manifold capacity theory (MCT) from statistical physics and manifold alignment analysis (MAA) from high-dimensional statistics, we probe the underlying organization of task-dependent manifolds in deep neural networks and macaque neural recordings. Specifically, we quantitatively characterize how different learning objectives lead to differences in the organizational strategies of these models and demonstrate how these geometric analyses are connected to the decodability of task-relevant information. These analyses present a strong direction for bridging mechanistic and normative theories in neural networks through neural population geometry, potentially opening up many future research avenues in both machine learning and neuroscience.
Submitted 21 December, 2023;
originally announced December 2023.
-
Layerwise complexity-matched learning yields an improved model of cortical area V2
Authors:
Nikhil Parthasarathy,
Olivier J. Hénaff,
Eero P. Simoncelli
Abstract:
Human ability to recognize complex visual patterns arises through transformations performed by successive areas in the ventral visual cortex. Deep neural networks trained end-to-end for object recognition approach human capabilities, and offer the best descriptions to date of neural responses in the late stages of the hierarchy. But these networks provide a poor account of the early stages, compared to traditional hand-engineered models, or models optimized for coding efficiency or prediction. Moreover, the gradient backpropagation used in end-to-end learning is generally considered to be biologically implausible. Here, we overcome both of these limitations by developing a bottom-up self-supervised training methodology that operates independently on successive layers. Specifically, we maximize feature similarity between pairs of locally-deformed natural image patches, while decorrelating features across patches sampled from other images. Crucially, the deformation amplitudes are adjusted proportionally to receptive field sizes in each layer, thus matching the task complexity to the capacity at each stage of processing. In comparison with architecture-matched versions of previous models, we demonstrate that our layerwise complexity-matched learning (LCL) formulation produces a two-stage model (LCL-V2) that is better aligned with selectivity properties and neural activity in primate area V2. We demonstrate that the complexity-matched learning paradigm is responsible for much of the emergence of the improved biological alignment. Finally, when the two-stage model is used as a fixed front-end for a deep network trained to perform object recognition, the resultant model (LCL-V2Net) is significantly better than standard end-to-end self-supervised, supervised, and adversarially-trained models in terms of generalization to out-of-distribution tasks and alignment with human behavior.
Submitted 18 July, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
Fifty Years of ISCA: A data-driven retrospective on key trends
Authors:
Gaurang Upasani,
Matthew D. Sinclair,
Adrian Sampson,
Parthasarathy Ranganathan,
David Patterson,
Shaan Shah,
Nidhi Parthasarathy,
Rutwik Jain
Abstract:
Computer Architecture, broadly, involves optimizing hardware and software for current and future processing systems. Although there are several other top venues to publish Computer Architecture research, including ASPLOS, HPCA, and MICRO, ISCA (the International Symposium on Computer Architecture) is one of the oldest, longest running, and most prestigious venues for publishing Computer Architecture research. Since 1973, except for 1975, ISCA has been organized annually. Accordingly, this year will be the 50th year of ISCA. Thus, we set out to analyze the past 50 years of ISCA to understand who and what has been driving and innovating computing systems thus far. Our analysis identifies several interesting trends that reflect how ISCA, and Computer Architecture in general, has grown and evolved in the past 50 years, including minicomputers, general-purpose uniprocessor CPUs, multiprocessor and multi-core CPUs, general-purpose GPUs, and accelerators.
Submitted 18 November, 2023; v1 submitted 6 June, 2023;
originally announced June 2023.
-
Towards In-context Scene Understanding
Authors:
Ivana Balažević,
David Steiner,
Nikhil Parthasarathy,
Relja Arandjelović,
Olivier J. Hénaff
Abstract:
In-context learning – the ability to configure a model's behavior with different prompts – has revolutionized the field of natural language processing, alleviating the need for task-specific models and paving the way for generalist models capable of assisting with any query. Computer vision, in contrast, has largely stayed in the former regime: specialized decoders and finetuning protocols are generally required to perform dense tasks such as semantic segmentation and depth estimation. In this work we explore a simple mechanism for in-context learning of such scene understanding tasks: nearest neighbor retrieval from a prompt of annotated features. We propose a new pretraining protocol – leveraging attention within and across images – which yields representations particularly useful in this regime. The resulting Hummingbird model, suitably prompted, performs various scene understanding tasks without modification while approaching the performance of specialists that have been finetuned for each task. Moreover, Hummingbird can be configured to perform new tasks much more efficiently than finetuned models, raising the possibility of scene understanding in the interactive assistant regime.
Submitted 31 October, 2023; v1 submitted 2 June, 2023;
originally announced June 2023.
-
Self-supervised video pretraining yields human-aligned visual representations
Authors:
Nikhil Parthasarathy,
S. M. Ali Eslami,
João Carreira,
Olivier J. Hénaff
Abstract:
Humans learn powerful representations of objects and scenes by observing how they evolve over time. Yet, outside of specific tasks that require explicit temporal understanding, static image pretraining remains the dominant paradigm for learning visual foundation models. We question this mismatch, and ask whether video pretraining can yield visual representations that bear the hallmarks of human perception: generalisation across tasks, robustness to perturbations, and consistency with human judgements. To that end we propose a novel procedure for curating videos, and develop a contrastive framework which learns from the complex transformations therein. This simple paradigm for distilling knowledge from videos, called VITO, yields general representations that far outperform prior video pretraining methods on image understanding tasks, and image pretraining methods on video understanding tasks. Moreover, VITO representations are significantly more robust to natural and synthetic deformations than image-, video-, and adversarially-trained ones. Finally, VITO's predictions are strongly aligned with human judgements, surpassing models that were specifically trained for that purpose. Together, these results suggest that video pretraining could be a simple way of learning unified, robust, and human-aligned representations of the visual world.
Submitted 25 July, 2023; v1 submitted 12 October, 2022;
originally announced October 2022.
-
Where Should I Spend My FLOPS? Efficiency Evaluations of Visual Pre-training Methods
Authors:
Skanda Koppula,
Yazhe Li,
Evan Shelhamer,
Andrew Jaegle,
Nikhil Parthasarathy,
Relja Arandjelovic,
João Carreira,
Olivier Hénaff
Abstract:
Self-supervised methods have achieved remarkable success in transfer learning, often achieving the same or better accuracy than supervised pre-training. Most prior work has done so by increasing pre-training computation by adding complex data augmentation, multiple views, or lengthy training schedules. In this work, we investigate a related, but orthogonal question: given a fixed FLOP budget, what are the best datasets, models, and (self-)supervised training methods for obtaining high accuracy on representative visual tasks? Given the availability of large datasets, this setting is often more relevant for both academic and industry labs alike. We examine five large-scale datasets (JFT-300M, ALIGN, ImageNet-1K, ImageNet-21K, and COCO) and six pre-training methods (CLIP, DINO, SimCLR, BYOL, Masked Autoencoding, and supervised). In a like-for-like fashion, we characterize their FLOP and CO$_2$ footprints, relative to their accuracy when transferred to a canonical image segmentation task. Our analysis reveals strong disparities in the computational efficiency of pre-training methods and their dependence on dataset quality. In particular, our results call into question the commonly-held assumption that self-supervised methods inherently scale to large, uncurated data. We therefore advocate for (1) paying closer attention to dataset curation and (2) reporting of accuracies in context of the total computational cost.
Submitted 18 October, 2022; v1 submitted 30 September, 2022;
originally announced September 2022.
-
Self-Supervised Learning of a Biologically-Inspired Visual Texture Model
Authors:
Nikhil Parthasarathy,
Eero P. Simoncelli
Abstract:
We develop a model for representing visual texture in a low-dimensional feature space, along with a novel self-supervised learning objective that is used to train it on an unlabeled database of texture images. Inspired by the architecture of primate visual cortex, the model uses a first stage of oriented linear filters (corresponding to cortical area V1), consisting of both rectified units (simple cells) and pooled phase-invariant units (complex cells). These responses are processed by a second stage (analogous to cortical area V2) consisting of convolutional filters followed by half-wave rectification and pooling to generate V2 'complex cell' responses. The second stage filters are trained on a set of unlabeled homogeneous texture images, using a novel contrastive objective that maximizes the distance between the distribution of V2 responses to individual images and the distribution of responses across all images. When evaluated on texture classification, the trained model achieves substantially greater data-efficiency than a variety of deep hierarchical model architectures. Moreover, we show that the learned model exhibits stronger representational similarity to texture responses of neural populations recorded in primate V2 than pre-trained deep CNNs.
Submitted 30 June, 2020;
originally announced June 2020.
-
A Linear Systems Theory of Normalizing Flows
Authors:
Reuben Feinman,
Nikhil Parthasarathy
Abstract:
Normalizing Flows are a promising new class of algorithms for unsupervised learning based on maximum likelihood optimization with change of variables. They offer to learn a factorized component representation for complex nonlinear data and, simultaneously, yield a density function that can evaluate likelihoods and generate samples. Despite these diverse offerings, applications of Normalizing Flows have focused primarily on sampling and likelihoods, with little emphasis placed on feature representation. A lack of theoretical foundation has left many open questions about how to interpret and apply the learned components of the model. We provide a new theoretical perspective of Normalizing Flows using the lens of linear systems theory, showing that optimal flows learn to represent the local covariance at each region of input space. Using this insight, we develop a new algorithm to extract interpretable component representations from the learned model, where components correspond to Cartesian dimensions and are scaled according to their manifold significance. In addition, we highlight a stability concern for the learning algorithm that was previously unaddressed, providing a theoretically-grounded solution to mediate the problem. Experiments with toy manifold learning datasets, as well as the MNIST image dataset, provide convincing support for our theory and tools.
Submitted 14 February, 2020; v1 submitted 15 July, 2019;
originally announced July 2019.