-
Weight Block Sparsity: Training, Compilation, and AI Engine Accelerators
Authors:
Paolo D'Alberto,
Taehee Jeong,
Akshai Jain,
Shreyas Manjunath,
Mrinal Sarmah,
Samuel Hsu,
Yaswanth Raparti,
Nitesh Pipralia
Abstract:
Nowadays, increasingly larger Deep Neural Networks (DNNs) are being developed, trained, and utilized. These networks require significant computational resources, putting a strain on both advanced and limited devices. Our solution is to implement {\em weight block sparsity}, which is a structured sparsity that is friendly to hardware. By zeroing certain sections of the convolution and fully connect…
▽ More
Nowadays, increasingly larger Deep Neural Networks (DNNs) are being developed, trained, and utilized. These networks require significant computational resources, putting a strain on both advanced and limited devices. Our solution is to implement {\em weight block sparsity}, which is a structured sparsity that is friendly to hardware. By zeroing certain sections of the convolution and fully connected layers parameters of pre-trained DNN models, we can efficiently speed up the DNN's inference process. This results in a smaller memory footprint, faster communication, and fewer operations.
Our work presents a vertical system that allows for the training of convolution and matrix multiplication weights to exploit 8x8 block sparsity on a single GPU within a reasonable amount of time. Compilers recognize this sparsity and use it for both data compaction and computation splitting into threads. Blocks like these take full advantage of both spatial and temporal locality, paving the way for fast vector operations and memory reuse. By using this system on a Resnet50 model, we were able to reduce the weight by half with minimal accuracy loss, resulting in a two-times faster inference speed. We will present performance estimates using accurate and complete code generation for AIE2 configuration sets (AMD Versal FPGAs) with Resnet50, Inception V3, and VGG16 to demonstrate the necessary synergy between hardware overlay designs and software stacks for compiling and executing machine learning applications.
△ Less
Submitted 12 July, 2024;
originally announced July 2024.
-
Calibrated Forecasting and Persuasion
Authors:
Atulya Jain,
Vianney Perchet
Abstract:
How should an expert send forecasts to maximize her utility subject to passing a calibration test? We consider a dynamic game where an expert sends probabilistic forecasts to a decision maker. The decision maker uses a calibration test based on past outcomes to verify the expert's forecasts. We characterize the optimal forecasting strategy by reducing the dynamic game to a static persuasion proble…
▽ More
How should an expert send forecasts to maximize her utility subject to passing a calibration test? We consider a dynamic game where an expert sends probabilistic forecasts to a decision maker. The decision maker uses a calibration test based on past outcomes to verify the expert's forecasts. We characterize the optimal forecasting strategy by reducing the dynamic game to a static persuasion problem. A distribution of forecasts is implementable by a calibrated strategy if and only if it is a mean-preserving contraction of the distribution of conditionals (honest forecasts). We characterize the value of information by comparing what an informed and uninformed expert can attain. Moreover, we consider a decision maker who uses regret minimization, instead of the calibration test, to take actions. We show that the expert can achieve the same payoff against a regret minimizer as under the calibration test, and in some instances, she can achieve strictly more.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
A machine learning pipeline for automated insect monitoring
Authors:
Aditya Jain,
Fagner Cunha,
Michael Bunsen,
Léonard Pasi,
Anna Viklund,
Maxim Larrivée,
David Rolnick
Abstract:
Climate change and other anthropogenic factors have led to a catastrophic decline in insects, endangering both biodiversity and the ecosystem services on which human society depends. Data on insect abundance, however, remains woefully inadequate. Camera traps, conventionally used for monitoring terrestrial vertebrates, are now being modified for insects, especially moths. We describe a complete, o…
▽ More
Climate change and other anthropogenic factors have led to a catastrophic decline in insects, endangering both biodiversity and the ecosystem services on which human society depends. Data on insect abundance, however, remains woefully inadequate. Camera traps, conventionally used for monitoring terrestrial vertebrates, are now being modified for insects, especially moths. We describe a complete, open-source machine learning-based software pipeline for automated monitoring of moths via camera traps, including object detection, moth/non-moth classification, fine-grained identification of moth species, and tracking individuals. We believe that our tools, which are already in use across three continents, represent the future of massively scalable data collection in entomology.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
Insect Identification in the Wild: The AMI Dataset
Authors:
Aditya Jain,
Fagner Cunha,
Michael James Bunsen,
Juan Sebastián Cañas,
Léonard Pasi,
Nathan Pinoy,
Flemming Helsing,
JoAnne Russo,
Marc Botham,
Michael Sabourin,
Jonathan Fréchette,
Alexandre Anctil,
Yacksecari Lopez,
Eduardo Navarro,
Filonila Perez Pimentel,
Ana Cecilia Zamora,
José Alejandro Ramirez Silva,
Jonathan Gagnon,
Tom August,
Kim Bjerge,
Alba Gomez Segura,
Marc Bélisle,
Yves Basset,
Kent P. McFarland,
David Roy
, et al. (3 additional authors not shown)
Abstract:
Insects represent half of all global biodiversity, yet many of the world's insects are disappearing, with severe implications for ecosystems and agriculture. Despite this crisis, data on insect diversity and abundance remain woefully inadequate, due to the scarcity of human experts and the lack of scalable tools for monitoring. Ecologists have started to adopt camera traps to record and study inse…
▽ More
Insects represent half of all global biodiversity, yet many of the world's insects are disappearing, with severe implications for ecosystems and agriculture. Despite this crisis, data on insect diversity and abundance remain woefully inadequate, due to the scarcity of human experts and the lack of scalable tools for monitoring. Ecologists have started to adopt camera traps to record and study insects, and have proposed computer vision algorithms as an answer for scalable data processing. However, insect monitoring in the wild poses unique challenges that have not yet been addressed within computer vision, including the combination of long-tailed data, extremely similar classes, and significant distribution shifts. We provide the first large-scale machine learning benchmarks for fine-grained insect recognition, designed to match real-world tasks faced by ecologists. Our contributions include a curated dataset of images from citizen science platforms and museums, and an expert-annotated dataset drawn from automated camera traps across multiple continents, designed to test out-of-distribution generalization under field conditions. We train and evaluate a variety of baseline algorithms and introduce a combination of data augmentation techniques that enhance generalization across geographies and hardware setups. Code and datasets are made publicly available.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
ESI-GAL: EEG Source Imaging-based Kinematics Parameter Estimation for Grasp and Lift Task
Authors:
Anant Jain,
Lalan Kumar
Abstract:
Electroencephalogram (EEG) signals-based motor kinematics prediction (MKP) has been an active area of research to develop brain-computer interface (BCI) systems such as exosuits, prostheses, and rehabilitation devices. However, EEG source imaging (ESI) based kinematics prediction is sparsely explored in the literature. In this study, pre-movement EEG features are utilized to predict three-dimensio…
▽ More
Electroencephalogram (EEG) signals-based motor kinematics prediction (MKP) has been an active area of research to develop brain-computer interface (BCI) systems such as exosuits, prostheses, and rehabilitation devices. However, EEG source imaging (ESI) based kinematics prediction is sparsely explored in the literature. In this study, pre-movement EEG features are utilized to predict three-dimensional (3D) hand kinematics for the grasp-and-lift motor task. A public dataset, WAY-EEG-GAL, is utilized for MKP analysis. In particular, sensor-domain (EEG data) and source-domain (ESI data) based features from the frontoparietal region are explored for MKP. Deep learning-based models are explored to achieve efficient kinematics decoding. Various time-lagged and window sizes are analyzed for hand kinematics prediction. Subsequently, intra-subject and inter-subject MKP analysis is performed to investigate the subject-specific and subject-independent motor-learning capabilities of the neural decoders. The Pearson correlation coefficient (PCC) is used as the performance metric for kinematics trajectory decoding. The rEEGNet neural decoder achieved the best performance with sensor-domain and source-domain features with a time lag and window size of 100 ms and 450 ms, respectively. The highest mean PCC values of 0.790, 0.795, and 0.637 are achieved using sensor-domain features, while 0.769, 0.777, and 0.647 are achieved using source-domain features in x, y, and z-directions, respectively. This study explores the feasibility of trajectory prediction using EEG sensor-domain and source-domain EEG features for the grasp-and-lift task. Furthermore, inter-subject trajectory estimation is performed using the proposed deep learning decoder with EEG source domain features.
△ Less
Submitted 4 July, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Diffusion Soup: Model Merging for Text-to-Image Diffusion Models
Authors:
Benjamin Biggs,
Arjun Seshadri,
Yang Zou,
Achin Jain,
Aditya Golatkar,
Yusheng Xie,
Alessandro Achille,
Ashwin Swaminathan,
Stefano Soatto
Abstract:
We present Diffusion Soup, a compartmentalization method for Text-to-Image Generation that averages the weights of diffusion models trained on sharded data. By construction, our approach enables training-free continual learning and unlearning with no additional memory or inference costs, since models corresponding to data shards can be added or removed by re-averaging. We show that Diffusion Soup…
▽ More
We present Diffusion Soup, a compartmentalization method for Text-to-Image Generation that averages the weights of diffusion models trained on sharded data. By construction, our approach enables training-free continual learning and unlearning with no additional memory or inference costs, since models corresponding to data shards can be added or removed by re-averaging. We show that Diffusion Soup samples from a point in weight space that approximates the geometric mean of the distributions of constituent datasets, which offers anti-memorization guarantees and enables zero-shot style mixing. Empirically, Diffusion Soup outperforms a paragon model trained on the union of all data shards and achieves a 30% improvement in Image Reward (.34 $\to$ .44) on domain sharded data, and a 59% improvement in IR (.37 $\to$ .59) on aesthetic data. In both cases, souping also prevails in TIFA score (respectively, 85.5 $\to$ 86.5 and 85.6 $\to$ 86.8). We demonstrate robust unlearning -- removing any individual domain shard only lowers performance by 1% in IR (.45 $\to$ .44) -- and validate our theoretical insights on anti-memorization using real data. Finally, we showcase Diffusion Soup's ability to blend the distinct styles of models finetuned on different shards, resulting in the zero-shot generation of hybrid styles.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Fast White-Box Adversarial Streaming Without a Random Oracle
Authors:
Ying Feng,
Aayush Jain,
David P. Woodruff
Abstract:
Recently, the question of adversarially robust streaming, where the stream is allowed to depend on the randomness of the streaming algorithm, has gained a lot of attention. In this work, we consider a strong white-box adversarial model (Ajtai et al. PODS 2022), in which the adversary has access to all past random coins and the parameters used by the streaming algorithm. We focus on the sparse reco…
▽ More
Recently, the question of adversarially robust streaming, where the stream is allowed to depend on the randomness of the streaming algorithm, has gained a lot of attention. In this work, we consider a strong white-box adversarial model (Ajtai et al. PODS 2022), in which the adversary has access to all past random coins and the parameters used by the streaming algorithm. We focus on the sparse recovery problem and extend our result to other tasks such as distinct element estimation and low-rank approximation of matrices and tensors. The main drawback of previous work is that it requires a random oracle, which is especially problematic in the streaming model since the amount of randomness is counted in the space complexity of a streaming algorithm. Also, the previous work suffers from large update time. We construct a near-optimal solution for the sparse recovery problem in white-box adversarial streams, based on the subexponentially secure Learning with Errors assumption. Importantly, our solution does not require a random oracle and has a polylogarithmic per item processing time. We also give results in a related white-box adversarially robust distributed model. Our constructions are based on homomorphic encryption schemes satisfying very mild structural properties that are currently satisfied by most known schemes.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
GenPalm: Contactless Palmprint Generation with Diffusion Models
Authors:
Steven A. Grosz,
Anil K. Jain
Abstract:
The scarcity of large-scale palmprint databases poses a significant bottleneck to advancements in contactless palmprint recognition. To address this, researchers have turned to synthetic data generation. While Generative Adversarial Networks (GANs) have been widely used, they suffer from instability and mode collapse. Recently, diffusion probabilistic models have emerged as a promising alternative…
▽ More
The scarcity of large-scale palmprint databases poses a significant bottleneck to advancements in contactless palmprint recognition. To address this, researchers have turned to synthetic data generation. While Generative Adversarial Networks (GANs) have been widely used, they suffer from instability and mode collapse. Recently, diffusion probabilistic models have emerged as a promising alternative, offering stable training and better distribution coverage. This paper introduces a novel palmprint generation method using diffusion probabilistic models, develops an end-to-end framework for synthesizing multiple palm identities, and validates the realism and utility of the generated palmprints. Experimental results demonstrate the effectiveness of our approach in generating palmprint images which enhance contactless palmprint recognition performance across several test databases utilizing challenging cross-database and time-separated evaluation protocols.
△ Less
Submitted 31 May, 2024;
originally announced June 2024.
-
A Comparative Study of CNN, ResNet, and Vision Transformers for Multi-Classification of Chest Diseases
Authors:
Ananya Jain,
Aviral Bhardwaj,
Kaushik Murali,
Isha Surani
Abstract:
Large language models, notably utilizing Transformer architectures, have emerged as powerful tools due to their scalability and ability to process large amounts of data. Dosovitskiy et al. expanded this architecture to introduce Vision Transformers (ViT), extending its applicability to image processing tasks. Motivated by this advancement, we fine-tuned two variants of ViT models, one pre-trained…
▽ More
Large language models, notably utilizing Transformer architectures, have emerged as powerful tools due to their scalability and ability to process large amounts of data. Dosovitskiy et al. expanded this architecture to introduce Vision Transformers (ViT), extending its applicability to image processing tasks. Motivated by this advancement, we fine-tuned two variants of ViT models, one pre-trained on ImageNet and another trained from scratch, using the NIH Chest X-ray dataset containing over 100,000 frontal-view X-ray images. Our study evaluates the performance of these models in the multi-label classification of 14 distinct diseases, while using Convolutional Neural Networks (CNNs) and ResNet architectures as baseline models for comparison. Through rigorous assessment based on accuracy metrics, we identify that the pre-trained ViT model surpasses CNNs and ResNet in this multilabel classification task, highlighting its potential for accurate diagnosis of various lung conditions from chest X-ray images.
△ Less
Submitted 31 May, 2024;
originally announced June 2024.
-
Bias in Motion: Theoretical Insights into the Dynamics of Bias in SGD Training
Authors:
Anchit Jain,
Rozhin Nobahari,
Aristide Baratin,
Stefano Sarao Mannelli
Abstract:
Machine learning systems often acquire biases by leveraging undesired features in the data, impacting accuracy variably across different sub-populations. Current understanding of bias formation mostly focuses on the initial and final stages of learning, leaving a gap in knowledge regarding the transient dynamics. To address this gap, this paper explores the evolution of bias in a teacher-student s…
▽ More
Machine learning systems often acquire biases by leveraging undesired features in the data, impacting accuracy variably across different sub-populations. Current understanding of bias formation mostly focuses on the initial and final stages of learning, leaving a gap in knowledge regarding the transient dynamics. To address this gap, this paper explores the evolution of bias in a teacher-student setup modeling different data sub-populations with a Gaussian-mixture model. We provide an analytical description of the stochastic gradient descent dynamics of a linear classifier in this setting, which we prove to be exact in high dimension. Notably, our analysis reveals how different properties of sub-populations influence bias at different timescales, showing a shifting preference of the classifier during training. Applying our findings to fairness and robustness, we delineate how and when heterogeneous data and spurious features can generate and amplify bias. We empirically validate our results in more complex scenarios by training deeper networks on synthetic and real datasets, including CIFAR10, MNIST, and CelebA.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Prompt Tuning Strikes Back: Customizing Foundation Models with Low-Rank Prompt Adaptation
Authors:
Abhinav Jain,
Swarat Chaudhuri,
Thomas Reps,
Chris Jermaine
Abstract:
Parameter-Efficient Fine-Tuning (PEFT) has become the standard for customising Foundation Models (FMs) to user-specific downstream tasks. However, typical PEFT methods require storing multiple task-specific adapters, creating scalability issues as these adapters must be housed and run at the FM server. Traditional prompt tuning offers a potential solution by customising them through task-specific…
▽ More
Parameter-Efficient Fine-Tuning (PEFT) has become the standard for customising Foundation Models (FMs) to user-specific downstream tasks. However, typical PEFT methods require storing multiple task-specific adapters, creating scalability issues as these adapters must be housed and run at the FM server. Traditional prompt tuning offers a potential solution by customising them through task-specific input prefixes, but it under-performs compared to other PEFT methods like LoRA. To address this gap, we propose Low-Rank Prompt Adaptation (LOPA), a prompt-tuning-based approach that performs on par with state-of-the-art PEFT methods and full fine-tuning while being more parameter-efficient and not requiring a server-based adapter. LOPA generates soft prompts by balancing between sharing task-specific information across instances and customization for each instance. It uses a low-rank decomposition of the soft-prompt component encoded for each instance to achieve parameter efficiency. We provide a comprehensive evaluation on multiple natural language understanding and code generation and understanding tasks across a wide range of foundation models with varying sizes.
△ Less
Submitted 24 May, 2024;
originally announced May 2024.
-
AlabOS: A Python-based Reconfigurable Workflow Management Framework for Autonomous Laboratories
Authors:
Yuxing Fei,
Bernardus Rendy,
Rishi Kumar,
Olympia Dartsi,
Hrushikesh P. Sahasrabuddhe,
Matthew J. McDermott,
Zheren Wang,
Nathan J. Szymanski,
Lauren N. Walters,
David Milsted,
Yan Zeng,
Anubhav Jain,
Gerbrand Ceder
Abstract:
The recent advent of autonomous laboratories, coupled with algorithms for high-throughput screening and active learning, promises to accelerate materials discovery and innovation. As these autonomous systems grow in complexity, the demand for robust and efficient workflow management software becomes increasingly critical. In this paper, we introduce AlabOS, a general-purpose software framework for…
▽ More
The recent advent of autonomous laboratories, coupled with algorithms for high-throughput screening and active learning, promises to accelerate materials discovery and innovation. As these autonomous systems grow in complexity, the demand for robust and efficient workflow management software becomes increasingly critical. In this paper, we introduce AlabOS, a general-purpose software framework for orchestrating experiments and managing resources, with an emphasis on automated laboratories for materials synthesis and characterization. We demonstrate the implementation of AlabOS in a prototype autonomous materials laboratory. AlabOS features a reconfigurable experiment workflow model, enabling the simultaneous execution of varied workflows composed of modular tasks. Therefore, AlabOS is well-suited to handle the rapidly changing experimental protocols defining the progress of self-driving laboratory development for materials research.
△ Less
Submitted 22 May, 2024;
originally announced May 2024.
-
SmartFlow: Robotic Process Automation using LLMs
Authors:
Arushi Jain,
Shubham Paliwal,
Monika Sharma,
Lovekesh Vig,
Gautam Shroff
Abstract:
Robotic Process Automation (RPA) systems face challenges in handling complex processes and diverse screen layouts that require advanced human-like decision-making capabilities. These systems typically rely on pixel-level encoding through drag-and-drop or automation frameworks such as Selenium to create navigation workflows, rather than visual understanding of screen elements. In this context, we p…
▽ More
Robotic Process Automation (RPA) systems face challenges in handling complex processes and diverse screen layouts that require advanced human-like decision-making capabilities. These systems typically rely on pixel-level encoding through drag-and-drop or automation frameworks such as Selenium to create navigation workflows, rather than visual understanding of screen elements. In this context, we present SmartFlow, an AI-based RPA system that uses pre-trained large language models (LLMs) coupled with deep-learning based image understanding. Our system can adapt to new scenarios, including changes in the user interface and variations in input data, without the need for human intervention. SmartFlow uses computer vision and natural language processing to perceive visible elements on the graphical user interface (GUI) and convert them into a textual representation. This information is then utilized by LLMs to generate a sequence of actions that are executed by a scripting engine to complete an assigned task. To assess the effectiveness of SmartFlow, we have developed a dataset that includes a set of generic enterprise applications with diverse layouts, which we are releasing for research use. Our evaluations on this dataset demonstrate that SmartFlow exhibits robustness across different layouts and applications. SmartFlow can automate a wide range of business processes such as form filling, customer service, invoice processing, and back-office operations. SmartFlow can thus assist organizations in enhancing productivity by automating an even larger fraction of screen-based workflows. The demo-video and dataset are available at https://smartflow-4c5a0a.webflow.io/.
△ Less
Submitted 21 May, 2024;
originally announced May 2024.
-
Multi-Subject Personalization
Authors:
Arushi Jain,
Shubham Paliwal,
Monika Sharma,
Vikram Jamwal,
Lovekesh Vig
Abstract:
Creative story illustration requires a consistent interplay of multiple characters or objects. However, conventional text-to-image models face significant challenges while producing images featuring multiple personalized subjects. For example, they distort the subject rendering, or the text descriptions fail to render coherent subject interactions. We present Multi-Subject Personalization (MSP) to…
▽ More
Creative story illustration requires a consistent interplay of multiple characters or objects. However, conventional text-to-image models face significant challenges while producing images featuring multiple personalized subjects. For example, they distort the subject rendering, or the text descriptions fail to render coherent subject interactions. We present Multi-Subject Personalization (MSP) to alleviate some of these challenges. We implement MSP using Stable Diffusion and assess our approach against other text-to-image models, showcasing its consistent generation of good-quality images representing intended subjects and interactions.
△ Less
Submitted 21 May, 2024;
originally announced May 2024.
-
CustomText: Customized Textual Image Generation using Diffusion Models
Authors:
Shubham Paliwal,
Arushi Jain,
Monika Sharma,
Vikram Jamwal,
Lovekesh Vig
Abstract:
Textual image generation spans diverse fields like advertising, education, product packaging, social media, information visualization, and branding. Despite recent strides in language-guided image synthesis using diffusion models, current models excel in image generation but struggle with accurate text rendering and offer limited control over font attributes. In this paper, we aim to enhance the s…
▽ More
Textual image generation spans diverse fields like advertising, education, product packaging, social media, information visualization, and branding. Despite recent strides in language-guided image synthesis using diffusion models, current models excel in image generation but struggle with accurate text rendering and offer limited control over font attributes. In this paper, we aim to enhance the synthesis of high-quality images with precise text customization, thereby contributing to the advancement of image generation models. We call our proposed method CustomText. Our implementation leverages a pre-trained TextDiffuser model to enable control over font color, background, and types. Additionally, to address the challenge of accurately rendering small-sized fonts, we train the ControlNet model for a consistency decoder, significantly enhancing text-generation performance. We assess the performance of CustomText in comparison to previous methods of textual image generation on the publicly available CTW-1500 dataset and a self-curated dataset for small-text generation, showcasing superior results.
△ Less
Submitted 21 May, 2024;
originally announced May 2024.
-
Adaptive Exploration for Data-Efficient General Value Function Evaluations
Authors:
Arushi Jain,
Josiah P. Hanna,
Doina Precup
Abstract:
General Value Functions (GVFs) (Sutton et al, 2011) are an established way to represent predictive knowledge in reinforcement learning. Each GVF computes the expected return for a given policy, based on a unique pseudo-reward. Multiple GVFs can be estimated in parallel using off-policy learning from a single stream of data, often sourced from a fixed behavior policy or pre-collected dataset. This…
▽ More
General Value Functions (GVFs) (Sutton et al, 2011) are an established way to represent predictive knowledge in reinforcement learning. Each GVF computes the expected return for a given policy, based on a unique pseudo-reward. Multiple GVFs can be estimated in parallel using off-policy learning from a single stream of data, often sourced from a fixed behavior policy or pre-collected dataset. This leaves an open question: how can behavior policy be chosen for data-efficient GVF learning? To address this gap, we propose GVFExplorer, which aims at learning a behavior policy that efficiently gathers data for evaluating multiple GVFs in parallel. This behavior policy selects actions in proportion to the total variance in the return across all GVFs, reducing the number of environmental interactions. To enable accurate variance estimation, we use a recently proposed temporal-difference-style variance estimator. We prove that each behavior policy update reduces the mean squared error in the summed predictions over all GVFs. We empirically demonstrate our method's performance in both tabular representations and nonlinear function approximation.
△ Less
Submitted 13 May, 2024;
originally announced May 2024.
-
Identifying Hate Speech Peddlers in Online Platforms. A Bayesian Social Learning Approach for Large Language Model Driven Decision-Makers
Authors:
Adit Jain,
Vikram Krishnamurthy
Abstract:
This paper studies the problem of autonomous agents performing Bayesian social learning for sequential detection when the observations of the state belong to a high-dimensional space and are expensive to analyze. Specifically, when the observations are textual, the Bayesian agent can use a large language model (LLM) as a map to get a low-dimensional private observation. The agent performs Bayesian…
▽ More
This paper studies the problem of autonomous agents performing Bayesian social learning for sequential detection when the observations of the state belong to a high-dimensional space and are expensive to analyze. Specifically, when the observations are textual, the Bayesian agent can use a large language model (LLM) as a map to get a low-dimensional private observation. The agent performs Bayesian learning and takes an action that minimizes the expected cost and is visible to subsequent agents. We prove that a sequence of such Bayesian agents herd in finite time to the public belief and take the same action disregarding the private observations. We propose a stopping time formulation for quickest time herding in social learning and optimally balance privacy and herding. Structural results are shown on the threshold nature of the optimal policy to the stopping time problem. We illustrate the application of our framework when autonomous Bayesian detectors aim to sequentially identify if a user is a hate speech peddler on an online platform by parsing text observations using an LLM. We numerically validate our results on real-world hate speech datasets. We show that autonomous Bayesian agents designed to flag hate speech peddlers in online platforms herd and misclassify the users when the public prior is strong. We also numerically show the effect of a threshold policy in delaying herding.
△ Less
Submitted 12 May, 2024;
originally announced May 2024.
-
Structured Reinforcement Learning for Incentivized Stochastic Covert Optimization
Authors:
Adit Jain,
Vikram Krishnamurthy
Abstract:
This paper studies how a stochastic gradient algorithm (SG) can be controlled to hide the estimate of the local stationary point from an eavesdropper. Such problems are of significant interest in distributed optimization settings like federated learning and inventory management. A learner queries a stochastic oracle and incentivizes the oracle to obtain noisy gradient measurements and perform SG.…
▽ More
This paper studies how a stochastic gradient algorithm (SG) can be controlled to hide the estimate of the local stationary point from an eavesdropper. Such problems are of significant interest in distributed optimization settings like federated learning and inventory management. A learner queries a stochastic oracle and incentivizes the oracle to obtain noisy gradient measurements and perform SG. The oracle probabilistically returns either a noisy gradient of the function} or a non-informative measurement, depending on the oracle state and incentive. The learner's query and incentive are visible to an eavesdropper who wishes to estimate the stationary point. This paper formulates the problem of the learner performing covert optimization by dynamically incentivizing the stochastic oracle and obfuscating the eavesdropper as a finite-horizon Markov decision process (MDP). Using conditions for interval-dominance on the cost and transition probability structure, we show that the optimal policy for the MDP has a monotone threshold structure. We propose searching for the optimal stationary policy with the threshold structure using a stochastic approximation algorithm and a multi-armed bandit approach. The effectiveness of our methods is numerically demonstrated on a covert federated learning hate-speech classification task.
△ Less
Submitted 12 May, 2024;
originally announced May 2024.
-
Mobius Transformation-Based Circular Motion Control for Unicycle Robots in Nonconcentric Circular Geofences
Authors:
Shubham Singh,
Anoop Jain
Abstract:
Nonuniform motion constraints are ubiquitous in robotic applications. Geofencing control is one such paradigm where the motion of a robot must be constrained within a predefined boundary. This paper addresses the problem of stabilizing a unicycle robot around a desired circular orbit while confining its motion within a nonconcentric external circular boundary. Our solution approach relies on the c…
▽ More
Nonuniform motion constraints are ubiquitous in robotic applications. Geofencing control is one such paradigm where the motion of a robot must be constrained within a predefined boundary. This paper addresses the problem of stabilizing a unicycle robot around a desired circular orbit while confining its motion within a nonconcentric external circular boundary. Our solution approach relies on the concept of the so-called Mobius transformation that, under certain practical conditions, maps two nonconcentric circles to a pair of concentric circles, and hence, results in uniform spatial motion constraints. The choice of such a Mobius transformation is governed by the roots of a quadratic equation in the post-design analysis that decides how the regions enclosed by the two circles are mapped onto the two planes. We show that the problem can be formulated either as a trajectory-constraining problem or an obstacle-avoidance problem in the transformed plane, depending on these roots. Exploiting the idea of the barrier Lyapunov function, we propose a unique control law that solves both these contrasting problems in the transformed plane and renders a solution to the original problem in the actual plane. By relating parameters of two planes under Mobius transformation and its inverse map, we further establish a connection between the control laws in two planes and determine the control law to be applied in the actual plane. Simulation and experimental results are provided to illustrate the key theoretical developments.
△ Less
Submitted 17 July, 2024; v1 submitted 11 May, 2024;
originally announced May 2024.
-
Uncovering implementable dormant pruning decisions from three different stakeholder perspectives
Authors:
Deanna Flynn,
Abhinav Jain,
Heather Knight,
Cristina G. Wilson,
Cindy Grimm
Abstract:
Dormant pruning, or the removal of unproductive portions of a tree while a tree is not actively growing, is an important orchard task to help maintain yield, requiring years to build expertise. Because of long training periods and an increasing labor shortage in agricultural jobs, pruning could benefit from robotic automation. However, to program robots to prune branches, we first need to understa…
▽ More
Dormant pruning, or the removal of unproductive portions of a tree while a tree is not actively growing, is an important orchard task to help maintain yield, requiring years to build expertise. Because of long training periods and an increasing labor shortage in agricultural jobs, pruning could benefit from robotic automation. However, to program robots to prune branches, we first need to understand how pruning decisions are made, and what variables in the environment (e.g., branch size and thickness) we need to capture. Working directly with three pruning stakeholders -- horticulturists, growers, and pruners -- we find that each group of human experts approaches pruning decision-making differently. To capture this knowledge, we present three studies and two extracted pruning protocols from field work conducted in Prosser, Washington in January 2022 and 2023. We interviewed six stakeholders (two in each group) and observed pruning across three cultivars -- Bing Cherries, Envy Apples, and Jazz Apples -- and two tree architectures -- Upright Fruiting Offshoot and V-Trellis. Leveraging participant interviews and video data, this analysis uses grounded coding to extract pruning terminology, discover horticultural contexts that influence pruning decisions, and find implementable pruning heuristics for autonomous systems. The results include a validated terminology set, which we offer for use by both pruning stakeholders and roboticists, to communicate general pruning concepts and heuristics. The results also highlight seven pruning heuristics utilizing this terminology set that would be relevant for use by future autonomous robot pruning systems, and characterize three discovered horticultural contexts (i.e., environmental management, crop-load management, and replacement wood) across all three cultivars.
△ Less
Submitted 7 May, 2024;
originally announced May 2024.
-
Diabetic Retinopathy Detection Using Quantum Transfer Learning
Authors:
Ankush Jain,
Rinav Gupta,
Jai Singhal
Abstract:
Diabetic Retinopathy (DR), a prevalent complication in diabetes patients, can lead to vision impairment due to lesions formed on the retina. Detecting DR at an advanced stage often results in irreversible blindness. The traditional process of diagnosing DR through retina fundus images by ophthalmologists is not only time-intensive but also expensive. While classical transfer learning models have b…
▽ More
Diabetic Retinopathy (DR), a prevalent complication in diabetes patients, can lead to vision impairment due to lesions formed on the retina. Detecting DR at an advanced stage often results in irreversible blindness. The traditional process of diagnosing DR through retina fundus images by ophthalmologists is not only time-intensive but also expensive. While classical transfer learning models have been widely adopted for computer-aided detection of DR, their high maintenance costs can hinder their detection efficiency. In contrast, Quantum Transfer Learning offers a more effective solution to this challenge. This approach is notably advantageous because it operates on heuristic principles, making it highly optimized for the task. Our proposed methodology leverages this hybrid quantum transfer learning technique to detect DR. To construct our model, we utilize the APTOS 2019 Blindness Detection dataset, available on Kaggle. We employ the ResNet-18, ResNet34, ResNet50, ResNet101, ResNet152 and Inception V3, pre-trained classical neural networks, for the initial feature extraction. For the classification stage, we use a Variational Quantum Classifier. Our hybrid quantum model has shown remarkable results, achieving an accuracy of 97% for ResNet-18. This demonstrates that quantum computing, when integrated with quantum machine learning, can perform tasks with a level of power and efficiency unattainable by classical computers alone. By harnessing these advanced technologies, we can significantly improve the detection and diagnosis of Diabetic Retinopathy, potentially saving many from the risk of blindness.
Keywords: Diabetic Retinopathy, Quantum Transfer Learning, Deep Learning
△ Less
Submitted 2 May, 2024;
originally announced May 2024.
-
Hide and Seek: How Does Watermarking Impact Face Recognition?
Authors:
Yuguang Yao,
Steven Grosz,
Sijia Liu,
Anil Jain
Abstract:
The recent progress in generative models has revolutionized the synthesis of highly realistic images, including face images. This technological development has undoubtedly helped face recognition, such as training data augmentation for higher recognition accuracy and data privacy. However, it has also introduced novel challenges concerning the responsible use and proper attribution of computer gen…
▽ More
The recent progress in generative models has revolutionized the synthesis of highly realistic images, including face images. This technological development has undoubtedly helped face recognition, such as training data augmentation for higher recognition accuracy and data privacy. However, it has also introduced novel challenges concerning the responsible use and proper attribution of computer generated images. We investigate the impact of digital watermarking, a technique for embedding ownership signatures into images, on the effectiveness of face recognition models. We propose a comprehensive pipeline that integrates face image generation, watermarking, and face recognition to systematically examine this question. The proposed watermarking scheme, based on an encoder-decoder architecture, successfully embeds and recovers signatures from both real and synthetic face images while preserving their visual fidelity. Through extensive experiments, we unveil that while watermarking enables robust image attribution, it results in a slight decline in face recognition accuracy, particularly evident for face images with challenging poses and expressions. Additionally, we find that directly training face recognition models on watermarked images offers only a limited alleviation of this performance decline. Our findings underscore the intricate trade off between watermarking and face recognition accuracy. This work represents a pivotal step towards the responsible utilization of generative models in face recognition and serves to initiate discussions regarding the broader implications of watermarking in biometrics.
△ Less
Submitted 29 April, 2024;
originally announced April 2024.
-
Impact of Traffic-Following on Order of Autonomous Airspace Operations
Authors:
Anahita Jain,
Husni R. Idris,
John-Paul Clarke
Abstract:
In this paper, we investigate the dynamic emergence of traffic order in a distributed multi-agent system, aiming to minimize inefficiencies that stem from unnecessary structural impositions. We introduce a methodology for developing a dynamically-updating traffic pattern map of the airspace by leveraging information about the consistency and frequency of flow directions used by current as well as…
▽ More
In this paper, we investigate the dynamic emergence of traffic order in a distributed multi-agent system, aiming to minimize inefficiencies that stem from unnecessary structural impositions. We introduce a methodology for developing a dynamically-updating traffic pattern map of the airspace by leveraging information about the consistency and frequency of flow directions used by current as well as preceding traffic. Informed by this map, an agent can discern the degree to which it is advantageous to follow traffic by trading off utilities such as time and order. We show that for the traffic levels studied, for low degrees of traffic-following behavior, there is minimal penalty in terms of aircraft travel times while improving the overall orderliness of the airspace. On the other hand, heightened traffic-following behavior may result in increased aircraft travel times, while marginally reducing the overall entropy of the airspace. Ultimately, the methods and metrics presented in this paper can be used to optimally and dynamically adjust an agent's traffic-following behavior based on these trade-offs.
△ Less
Submitted 3 June, 2024; v1 submitted 26 April, 2024;
originally announced April 2024.
-
How Could AI Support Design Education? A Study Across Fields Fuels Situating Analytics
Authors:
Ajit Jain,
Andruid Kerne,
Hannah Fowler,
Jinsil Seo,
Galen Newman,
Nic Lupfer,
Aaron Perrine
Abstract:
We use the process and findings from a case study of design educators' practices of assessment and feedback to fuel theorizing about how to make AI useful in service of human experience. We build on Suchman's theory of situated actions. We perform a qualitative study of 11 educators in 5 fields, who teach design processes situated in project-based learning contexts. Through qualitative data gather…
▽ More
We use the process and findings from a case study of design educators' practices of assessment and feedback to fuel theorizing about how to make AI useful in service of human experience. We build on Suchman's theory of situated actions. We perform a qualitative study of 11 educators in 5 fields, who teach design processes situated in project-based learning contexts. Through qualitative data gathering and analysis, we derive codes: design process; assessment and feedback challenges; and computational support.
We twice invoke creative cognition's family resemblance principle. First, to explain how design instructors already use assessment rubrics and second, to explain the analogous role for design creativity analytics: no particular trait is necessary or sufficient; each only tends to indicate good design work. Human teachers remain essential. We develop a set of situated design creativity analytics--Fluency, Flexibility, Visual Consistency, Multiscale Organization, and Legible Contrast--to support instructors' efforts, by providing on-demand, learning objectives-based assessment and feedback to students.
We theorize a methodology, which we call situating analytics, firstly because making AI support living human activity depends on aligning what analytics measure with situated practices. Further, we realize that analytics can become most significant to users by situating them through interfaces that integrate them into the material contexts of their use. Here, this means situating design creativity analytics into actual design environments. Through the case study, we identify situating analytics as a methodology for explaining analytics to users, because the iterative process of alignment with practice has the potential to enable data scientists to derive analytics that make sense as part of and support situated human experiences.
△ Less
Submitted 26 April, 2024;
originally announced April 2024.
-
Universal Fingerprint Generation: Controllable Diffusion Model with Multimodal Conditions
Authors:
Steven A. Grosz,
Anil K. Jain
Abstract:
The utilization of synthetic data for fingerprint recognition has garnered increased attention due to its potential to alleviate privacy concerns surrounding sensitive biometric data. However, current methods for generating fingerprints have limitations in creating impressions of the same finger with useful intra-class variations. To tackle this challenge, we present GenPrint, a framework to produ…
▽ More
The utilization of synthetic data for fingerprint recognition has garnered increased attention due to its potential to alleviate privacy concerns surrounding sensitive biometric data. However, current methods for generating fingerprints have limitations in creating impressions of the same finger with useful intra-class variations. To tackle this challenge, we present GenPrint, a framework to produce fingerprint images of various types while maintaining identity and offering humanly understandable control over different appearance factors such as fingerprint class, acquisition type, sensor device, and quality level. Unlike previous fingerprint generation approaches, GenPrint is not confined to replicating style characteristics from the training dataset alone: it enables the generation of novel styles from unseen devices without requiring additional fine-tuning. To accomplish these objectives, we developed GenPrint using latent diffusion models with multimodal conditions (text and image) for consistent generation of style and identity. Our experiments leverage a variety of publicly available datasets for training and evaluation. Results demonstrate the benefits of GenPrint in terms of identity preservation, explainable control, and universality of generated images. Importantly, the GenPrint-generated images yield comparable or even superior accuracy to models trained solely on real data and further enhances performance when augmenting the diversity of existing real fingerprint datasets.
△ Less
Submitted 21 April, 2024;
originally announced April 2024.
-
GenCHiP: Generating Robot Policy Code for High-Precision and Contact-Rich Manipulation Tasks
Authors:
Kaylee Burns,
Ajinkya Jain,
Keegan Go,
Fei Xia,
Michael Stark,
Stefan Schaal,
Karol Hausman
Abstract:
Large Language Models (LLMs) have been successful at generating robot policy code, but so far these results have been limited to high-level tasks that do not require precise movement. It is an open question how well such approaches work for tasks that require reasoning over contact forces and working within tight success tolerances. We find that, with the right action space, LLMs are capable of su…
▽ More
Large Language Models (LLMs) have been successful at generating robot policy code, but so far these results have been limited to high-level tasks that do not require precise movement. It is an open question how well such approaches work for tasks that require reasoning over contact forces and working within tight success tolerances. We find that, with the right action space, LLMs are capable of successfully generating policies for a variety of contact-rich and high-precision manipulation tasks, even under noisy conditions, such as perceptual errors or grasping inaccuracies. Specifically, we reparameterize the action space to include compliance with constraints on the interaction forces and stiffnesses involved in reaching a target pose. We validate this approach on subtasks derived from the Functional Manipulation Benchmark (FMB) and NIST Task Board Benchmarks. Exposing this action space alongside methods for estimating object poses improves policy generation with an LLM by greater than 3x and 4x when compared to non-compliant action spaces
△ Less
Submitted 9 April, 2024;
originally announced April 2024.
-
CoBT: Collaborative Programming of Behaviour Trees from One Demonstration for Robot Manipulation
Authors:
Aayush Jain,
Philip Long,
Valeria Villani,
John D. Kelleher,
Maria Chiara Leva
Abstract:
Mass customization and shorter manufacturing cycles are becoming more important among small and medium-sized companies. However, classical industrial robots struggle to cope with product variation and dynamic environments. In this paper, we present CoBT, a collaborative programming by demonstration framework for generating reactive and modular behavior trees. CoBT relies on a single demonstration…
▽ More
Mass customization and shorter manufacturing cycles are becoming more important among small and medium-sized companies. However, classical industrial robots struggle to cope with product variation and dynamic environments. In this paper, we present CoBT, a collaborative programming by demonstration framework for generating reactive and modular behavior trees. CoBT relies on a single demonstration and a combination of data-driven machine learning methods with logic-based declarative learning to learn a task, thus eliminating the need for programming expertise or long development times. The proposed framework is experimentally validated on 7 manipulation tasks and we show that CoBT achieves approx. 93% success rate overall with an average of 7.5s programming time. We conduct a pilot study with non-expert users to provide feedback regarding the usability of CoBT.
△ Less
Submitted 10 April, 2024; v1 submitted 8 April, 2024;
originally announced April 2024.
-
Indexing Analytics to Instances: How Integrating a Dashboard can Support Design Education
Authors:
Ajit Jain,
Andruid Kerne,
Nic Lupfer,
Gabriel Britain,
Aaron Perrine,
Yoonsuck Choe,
John Keyser,
Ruihong Huang,
Jinsil Seo,
Annie Sungkajun,
Robert Lightfoot,
Timothy McGuire
Abstract:
We investigate how to use AI-based analytics to support design education. The analytics at hand measure multiscale design, that is, students' use of space and scale to visually and conceptually organize their design work. With the goal of making the analytics intelligible to instructors, we developed a research artifact integrating a design analytics dashboard with design instances, and the design…
▽ More
We investigate how to use AI-based analytics to support design education. The analytics at hand measure multiscale design, that is, students' use of space and scale to visually and conceptually organize their design work. With the goal of making the analytics intelligible to instructors, we developed a research artifact integrating a design analytics dashboard with design instances, and the design environment that students use to create them. We theorize about how Suchman's notion of mutual intelligibility requires contextualized investigation of AI in order to develop findings about how analytics work for people. We studied the research artifact in 5 situated course contexts, in 3 departments. A total of 236 students used the multiscale design environment. The 9 instructors who taught those students experienced the analytics via the new research artifact.
We derive findings from a qualitative analysis of interviews with instructors regarding their experiences. Instructors reflected on how the analytics and their presentation in the dashboard have the potential to affect design education. We develop research implications addressing: (1) how indexing design analytics in the dashboard to actual design work instances helps design instructors reflect on what they mean and, more broadly, is a technique for how AI-based design analytics can support instructors' assessment and feedback experiences in situated course contexts; and (2) how multiscale design analytics, in particular, have the potential to support design education. By indexing, we mean linking which provides context, here connecting the numbers of the analytics with visually annotated design work instances.
△ Less
Submitted 8 April, 2024;
originally announced April 2024.
-
KeyPoint Relative Position Encoding for Face Recognition
Authors:
Minchul Kim,
Yiyang Su,
Feng Liu,
Anil Jain,
Xiaoming Liu
Abstract:
In this paper, we address the challenge of making ViT models more robust to unseen affine transformations. Such robustness becomes useful in various recognition tasks such as face recognition when image alignment failures occur. We propose a novel method called KP-RPE, which leverages key points (e.g.~facial landmarks) to make ViT more resilient to scale, translation, and pose variations. We begin…
▽ More
In this paper, we address the challenge of making ViT models more robust to unseen affine transformations. Such robustness becomes useful in various recognition tasks such as face recognition when image alignment failures occur. We propose a novel method called KP-RPE, which leverages key points (e.g.~facial landmarks) to make ViT more resilient to scale, translation, and pose variations. We begin with the observation that Relative Position Encoding (RPE) is a good way to bring affine transform generalization to ViTs. RPE, however, can only inject the model with prior knowledge that nearby pixels are more important than far pixels. Keypoint RPE (KP-RPE) is an extension of this principle, where the significance of pixels is not solely dictated by their proximity but also by their relative positions to specific keypoints within the image. By anchoring the significance of pixels around keypoints, the model can more effectively retain spatial relationships, even when those relationships are disrupted by affine transformations. We show the merit of KP-RPE in face and gait recognition. The experimental results demonstrate the effectiveness in improving face recognition performance from low-quality images, particularly where alignment is prone to failure. Code and pre-trained models are available.
△ Less
Submitted 21 March, 2024;
originally announced March 2024.
-
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Authors:
Alexander Khazatsky,
Karl Pertsch,
Suraj Nair,
Ashwin Balakrishna,
Sudeep Dasari,
Siddharth Karamcheti,
Soroush Nasiriany,
Mohan Kumar Srirama,
Lawrence Yunliang Chen,
Kirsty Ellis,
Peter David Fagan,
Joey Hejna,
Masha Itkina,
Marion Lepert,
Yecheng Jason Ma,
Patrick Tree Miller,
Jimmy Wu,
Suneel Belkhale,
Shivin Dass,
Huy Ha,
Arhan Jain,
Abraham Lee,
Youngwoon Lee,
Marius Memmel,
Sungjae Park
, et al. (74 additional authors not shown)
Abstract:
The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a resu…
▽ More
The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a result, even the most general robot manipulation policies today are mostly trained on data collected in a small number of environments with limited scene and task diversity. In this work, we introduce DROID (Distributed Robot Interaction Dataset), a diverse robot manipulation dataset with 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 84 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup.
△ Less
Submitted 19 March, 2024;
originally announced March 2024.
-
Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity
Authors:
Siddharth Joshi,
Arnav Jain,
Ali Payani,
Baharan Mirzasoleiman
Abstract:
Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving the quality of the pre-training data has been shown to be much more effective in improving CLIP's performance than increasing its volume. Nevertheless, finding…
▽ More
Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving the quality of the pre-training data has been shown to be much more effective in improving CLIP's performance than increasing its volume. Nevertheless, finding small subsets of training data that provably generalize the best has remained an open question. In this work, we propose the first theoretically rigorous data selection method for CLIP. We show that subsets that closely preserve the cross-covariance of the images and captions of the full data provably achieve a superior generalization performance. Our extensive experiments on ConceptualCaptions3M and ConceptualCaptions12M demonstrate that subsets found by \method\ achieve over 2.7x and 1.4x the accuracy of the next best baseline on ImageNet and its shifted versions. Moreover, we show that our subsets obtain 1.5x the average accuracy across 11 downstream datasets, of the next best baseline. The code is available at: https://github.com/BigML-CS-UCLA/clipcov-data-efficient-clip.
△ Less
Submitted 19 March, 2024; v1 submitted 18 March, 2024;
originally announced March 2024.
-
Alpha-wolves and Alpha-mammals: Exploring Dictionary Attacks on Iris Recognition Systems
Authors:
Sudipta Banerjee,
Anubhav Jain,
Zehua Jiang,
Nasir Memon,
Julian Togelius,
Arun Ross
Abstract:
A dictionary attack in a biometric system entails the use of a small number of strategically generated images or templates to successfully match with a large number of identities, thereby compromising security. We focus on dictionary attacks at the template level, specifically the IrisCodes used in iris recognition systems. We present an hitherto unknown vulnerability wherein we mix IrisCodes usin…
▽ More
A dictionary attack in a biometric system entails the use of a small number of strategically generated images or templates to successfully match with a large number of identities, thereby compromising security. We focus on dictionary attacks at the template level, specifically the IrisCodes used in iris recognition systems. We present an hitherto unknown vulnerability wherein we mix IrisCodes using simple bitwise operators to generate alpha-mixtures - alpha-wolves (combining a set of "wolf" samples) and alpha-mammals (combining a set of users selected via search optimization) that increase false matches. We evaluate this vulnerability using the IITD, CASIA-IrisV4-Thousand and Synthetic datasets, and observe that an alpha-wolf (from two wolves) can match upto 71 identities @FMR=0.001%, while an alpha-mammal (from two identities) can match upto 133 other identities @FMR=0.01% on the IITD dataset.
△ Less
Submitted 20 November, 2023;
originally announced March 2024.
-
Agonist-Antagonist Pouch Motors: Bidirectional Soft Actuators Enhanced by Thermally Responsive Peltier Elements
Authors:
Trevor Exley,
Rashmi Wijesundara,
Nathan Tan,
Akshay Sunkara,
Xinyu He,
Shuopu Wang,
Bonnie Chan,
Aditya Jain,
Luis Espinosa,
Amir Jafari
Abstract:
In this study, we introduce a novel Mylar-based pouch motor design that leverages the reversible actuation capabilities of Peltier junctions to enable agonist-antagonist muscle mimicry in soft robotics. Addressing the limitations of traditional silicone-based materials, such as leakage and phase-change fluid degradation, our pouch motors filled with Novec 7000 provide a durable and leak-proof solu…
▽ More
In this study, we introduce a novel Mylar-based pouch motor design that leverages the reversible actuation capabilities of Peltier junctions to enable agonist-antagonist muscle mimicry in soft robotics. Addressing the limitations of traditional silicone-based materials, such as leakage and phase-change fluid degradation, our pouch motors filled with Novec 7000 provide a durable and leak-proof solution for geometric modeling. The integration of flexible Peltier junctions offers a significant advantage over conventional Joule heating methods by allowing active and reversible heating and cooling cycles. This innovation not only enhances the reliability and longevity of soft robotic applications but also broadens the scope of design possibilities, including the development of agonist-antagonist artificial muscles, grippers with can manipulate through flexion and extension, and an anchor-slip style simple crawler design. Our findings indicate that this approach could lead to more efficient, versatile, and durable robotic systems, marking a significant advancement in the field of soft robotics.
△ Less
Submitted 16 March, 2024;
originally announced March 2024.
-
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Authors:
Brandon McKinzie,
Zhe Gan,
Jean-Philippe Fauconnier,
Sam Dodge,
Bowen Zhang,
Philipp Dufter,
Dhruti Shah,
Xianzhi Du,
Futang Peng,
Floris Weers,
Anton Belyi,
Haotian Zhang,
Karanjeet Singh,
Doug Kang,
Ankur Jain,
Hongyu Hè,
Max Schwarzer,
Tom Gunter,
Xiang Kong,
Aonan Zhang,
Jianyu Wang,
Chong Wang,
Nan Du,
Tao Lei,
Sam Wiseman
, et al. (7 additional authors not shown)
Abstract:
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for la…
▽ More
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.
△ Less
Submitted 18 April, 2024; v1 submitted 14 March, 2024;
originally announced March 2024.
-
OpenVPN is Open to VPN Fingerprinting
Authors:
Diwen Xue,
Reethika Ramesh,
Arham Jain,
Michalis Kallitsis,
J. Alex Halderman,
Jedidiah R. Crandall,
Roya Ensafi
Abstract:
VPN adoption has seen steady growth over the past decade due to increased public awareness of privacy and surveillance threats. In response, certain governments are attempting to restrict VPN access by identifying connections using "dual use" DPI technology. To investigate the potential for VPN blocking, we develop mechanisms for accurately fingerprinting connections using OpenVPN, the most popula…
▽ More
VPN adoption has seen steady growth over the past decade due to increased public awareness of privacy and surveillance threats. In response, certain governments are attempting to restrict VPN access by identifying connections using "dual use" DPI technology. To investigate the potential for VPN blocking, we develop mechanisms for accurately fingerprinting connections using OpenVPN, the most popular protocol for commercial VPN services. We identify three fingerprints based on protocol features such as byte pattern, packet size, and server response. Playing the role of an attacker who controls the network, we design a two-phase framework that performs passive fingerprinting and active probing in sequence. We evaluate our framework in partnership with a million-user ISP and find that we identify over 85% of OpenVPN flows with only negligible false positives, suggesting that OpenVPN-based services can be effectively blocked with little collateral damage. Although some commercial VPNs implement countermeasures to avoid detection, our framework successfully identified connections to 34 out of 41 "obfuscated" VPN configurations. We discuss the implications of the VPN fingerprintability for different threat models and propose short-term defenses. In the longer term, we urge commercial VPN providers to be more transparent about their obfuscation approaches and to adopt more principled detection countermeasures, such as those developed in censorship circumvention research.
△ Less
Submitted 6 March, 2024;
originally announced March 2024.
-
RT-Sketch: Goal-Conditioned Imitation Learning from Hand-Drawn Sketches
Authors:
Priya Sundaresan,
Quan Vuong,
Jiayuan Gu,
Peng Xu,
Ted Xiao,
Sean Kirmani,
Tianhe Yu,
Michael Stark,
Ajinkya Jain,
Karol Hausman,
Dorsa Sadigh,
Jeannette Bohg,
Stefan Schaal
Abstract:
Natural language and images are commonly used as goal representations in goal-conditioned imitation learning (IL). However, natural language can be ambiguous and images can be over-specified. In this work, we propose hand-drawn sketches as a modality for goal specification in visual imitation learning. Sketches are easy for users to provide on the fly like language, but similar to images they can…
▽ More
Natural language and images are commonly used as goal representations in goal-conditioned imitation learning (IL). However, natural language can be ambiguous and images can be over-specified. In this work, we propose hand-drawn sketches as a modality for goal specification in visual imitation learning. Sketches are easy for users to provide on the fly like language, but similar to images they can also help a downstream policy to be spatially-aware and even go beyond images to disambiguate task-relevant from task-irrelevant objects. We present RT-Sketch, a goal-conditioned policy for manipulation that takes a hand-drawn sketch of the desired scene as input, and outputs actions. We train RT-Sketch on a dataset of paired trajectories and corresponding synthetically generated goal sketches. We evaluate this approach on six manipulation skills involving tabletop object rearrangements on an articulated countertop. Experimentally we find that RT-Sketch is able to perform on a similar level to image or language-conditioned agents in straightforward settings, while achieving greater robustness when language goals are ambiguous or visual distractors are present. Additionally, we show that RT-Sketch has the capacity to interpret and act upon sketches with varied levels of specificity, ranging from minimal line drawings to detailed, colored drawings. For supplementary material and videos, please refer to our website: http://rt-sketch.github.io.
△ Less
Submitted 5 March, 2024;
originally announced March 2024.
-
Barrier Functions Inspired Reward Shaping for Reinforcement Learning
Authors:
Nilaksh Nilaksh,
Abhishek Ranjan,
Shreenabh Agrawal,
Aayush Jain,
Pushpak Jagtap,
Shishir Kolathaya
Abstract:
Reinforcement Learning (RL) has progressed from simple control tasks to complex real-world challenges with large state spaces. While RL excels in these tasks, training time remains a limitation. Reward shaping is a popular solution, but existing methods often rely on value functions, which face scalability issues. This paper presents a novel safety-oriented reward-shaping framework inspired by bar…
▽ More
Reinforcement Learning (RL) has progressed from simple control tasks to complex real-world challenges with large state spaces. While RL excels in these tasks, training time remains a limitation. Reward shaping is a popular solution, but existing methods often rely on value functions, which face scalability issues. This paper presents a novel safety-oriented reward-shaping framework inspired by barrier functions, offering simplicity and ease of implementation across various environments and tasks. To evaluate the effectiveness of the proposed reward formulations, we conduct simulation experiments on CartPole, Ant, and Humanoid environments, along with real-world deployment on the Unitree Go1 quadruped robot. Our results demonstrate that our method leads to 1.4-2.8 times faster convergence and as low as 50-60% actuation effort compared to the vanilla reward. In a sim-to-real experiment with the Go1 robot, we demonstrated better control and dynamics of the bot with our reward framework.
△ Less
Submitted 1 April, 2024; v1 submitted 3 March, 2024;
originally announced March 2024.
-
SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code
Authors:
Ziniu Hu,
Ahmet Iscen,
Aashi Jain,
Thomas Kipf,
Yisong Yue,
David A. Ross,
Cordelia Schmid,
Alireza Fathi
Abstract:
This paper introduces SceneCraft, a Large Language Model (LLM) Agent converting text descriptions into Blender-executable Python scripts which render complex scenes with up to a hundred 3D assets. This process requires complex spatial planning and arrangement. We tackle these challenges through a combination of advanced abstraction, strategic planning, and library learning. SceneCraft first models…
▽ More
This paper introduces SceneCraft, a Large Language Model (LLM) Agent converting text descriptions into Blender-executable Python scripts which render complex scenes with up to a hundred 3D assets. This process requires complex spatial planning and arrangement. We tackle these challenges through a combination of advanced abstraction, strategic planning, and library learning. SceneCraft first models a scene graph as a blueprint, detailing the spatial relationships among assets in the scene. SceneCraft then writes Python scripts based on this graph, translating relationships into numerical constraints for asset layout. Next, SceneCraft leverages the perceptual strengths of vision-language foundation models like GPT-V to analyze rendered images and iteratively refine the scene. On top of this process, SceneCraft features a library learning mechanism that compiles common script functions into a reusable library, facilitating continuous self-improvement without expensive LLM parameter tuning. Our evaluation demonstrates that SceneCraft surpasses existing LLM-based agents in rendering complex scenes, as shown by its adherence to constraints and favorable human assessments. We also showcase the broader application potential of SceneCraft by reconstructing detailed 3D scenes from the Sintel movie and guiding a video generative model with generated scenes as intermediary control signal.
△ Less
Submitted 2 March, 2024;
originally announced March 2024.
-
T-HITL Effectively Addresses Problematic Associations in Image Generation and Maintains Overall Visual Quality
Authors:
Susan Epstein,
Li Chen,
Alessandro Vecchiato,
Ankit Jain
Abstract:
Generative AI image models may inadvertently generate problematic representations of people. Past research has noted that millions of users engage daily across the world with these models and that the models, including through problematic representations of people, have the potential to compound and accelerate real-world discrimination and other harms (Bianchi et al, 2023). In this paper, we focus…
▽ More
Generative AI image models may inadvertently generate problematic representations of people. Past research has noted that millions of users engage daily across the world with these models and that the models, including through problematic representations of people, have the potential to compound and accelerate real-world discrimination and other harms (Bianchi et al, 2023). In this paper, we focus on addressing the generation of problematic associations between demographic groups and semantic concepts that may reflect and reinforce negative narratives embedded in social data. Building on sociological literature (Blumer, 1958) and mapping representations to model behaviors, we have developed a taxonomy to study problematic associations in image generation models. We explore the effectiveness of fine tuning at the model level as a method to address these associations, identifying a potential reduction in visual quality as a limitation of traditional fine tuning. We also propose a new methodology with twice-human-in-the-loop (T-HITL) that promises improvements in both reducing problematic associations and also maintaining visual quality. We demonstrate the effectiveness of T-HITL by providing evidence of three problematic associations addressed by T-HITL at the model level. Our contributions to scholarship are two-fold. By defining problematic associations in the context of machine learning models and generative AI, we introduce a conceptual and technical taxonomy for addressing some of these associations. Finally, we provide a method, T-HITL, that addresses these associations and simultaneously maintains visual quality of image model generations. This mitigation need not be a tradeoff, but rather an enhancement.
△ Less
Submitted 26 February, 2024;
originally announced February 2024.
-
Comparing skill of historical rainfall data based monsoon rainfall prediction in India with NCEP-NWP forecasts
Authors:
Apoorva Narula,
Aastha Jain,
Jatin Batra,
Sandeep Juneja
Abstract:
In this draft we consider the problem of forecasting rainfall across India during the four monsoon months, one day as well as three days in advance. We train neural networks using historical daily gridded precipitation data for India obtained from IMD for the time period $1901- 2022$, at a spatial resolution of $1^{\circ} \times 1^{\circ}$. This is compared with the numerical weather prediction (N…
▽ More
In this draft we consider the problem of forecasting rainfall across India during the four monsoon months, one day as well as three days in advance. We train neural networks using historical daily gridded precipitation data for India obtained from IMD for the time period $1901- 2022$, at a spatial resolution of $1^{\circ} \times 1^{\circ}$. This is compared with the numerical weather prediction (NWP) forecasts obtained from NCEP (National Centre for Environmental Prediction) available for the period 2011-2022. We conduct a detailed country wide analysis and separately analyze some of the most populated cities in India. Our conclusion is that forecasts obtained by applying deep learning to historical rainfall data are more accurate compared to NWP forecasts as well as predictions based on persistence. On average, compared to our predictions, forecasts from NCEP-NWP model have about 34% higher error for a single day prediction, and over 68% higher error for a three day prediction. Similarly, persistence estimates report a 29% higher error in a single day forecast, and over 54% error in a three day forecast. We further observe that data up to 20 days in the past is useful in reducing errors of one and three day forecasts, when a transformer based learning architecture, and to a lesser extent when an LSTM is used. A key conclusion suggested by our preliminary analysis is that NWP forecasts can be substantially improved upon through more and diverse data relevant to monsoon prediction combined with carefully selected neural network architecture.
△ Less
Submitted 12 February, 2024;
originally announced February 2024.
-
Diffusion-ES: Gradient-free Planning with Diffusion for Autonomous Driving and Zero-Shot Instruction Following
Authors:
Brian Yang,
Huangyuan Su,
Nikolaos Gkanatsios,
Tsung-Wei Ke,
Ayush Jain,
Jeff Schneider,
Katerina Fragkiadaki
Abstract:
Diffusion models excel at modeling complex and multimodal trajectory distributions for decision-making and control. Reward-gradient guided denoising has been recently proposed to generate trajectories that maximize both a differentiable reward function and the likelihood under the data distribution captured by a diffusion model. Reward-gradient guided denoising requires a differentiable reward fun…
▽ More
Diffusion models excel at modeling complex and multimodal trajectory distributions for decision-making and control. Reward-gradient guided denoising has been recently proposed to generate trajectories that maximize both a differentiable reward function and the likelihood under the data distribution captured by a diffusion model. Reward-gradient guided denoising requires a differentiable reward function fitted to both clean and noised samples, limiting its applicability as a general trajectory optimizer. In this paper, we propose DiffusionES, a method that combines gradient-free optimization with trajectory denoising to optimize black-box non-differentiable objectives while staying in the data manifold. Diffusion-ES samples trajectories during evolutionary search from a diffusion model and scores them using a black-box reward function. It mutates high-scoring trajectories using a truncated diffusion process that applies a small number of noising and denoising steps, allowing for much more efficient exploration of the solution space. We show that DiffusionES achieves state-of-the-art performance on nuPlan, an established closed-loop planning benchmark for autonomous driving. Diffusion-ES outperforms existing sampling-based planners, reactive deterministic or diffusion-based policies, and reward-gradient guidance. Additionally, we show that unlike prior guidance methods, our method can optimize non-differentiable language-shaped reward functions generated by few-shot LLM prompting. When guided by a human teacher that issues instructions to follow, our method can generate novel, highly complex behaviors, such as aggressive lane weaving, which are not present in the training data. This allows us to solve the hardest nuPlan scenarios which are beyond the capabilities of existing trajectory optimization methods and driving policies.
△ Less
Submitted 16 July, 2024; v1 submitted 9 February, 2024;
originally announced February 2024.
-
AttnLRP: Attention-Aware Layer-Wise Relevance Propagation for Transformers
Authors:
Reduan Achtibat,
Sayed Mohammad Vakilzadeh Hatefi,
Maximilian Dreyer,
Aakriti Jain,
Thomas Wiegand,
Sebastian Lapuschkin,
Wojciech Samek
Abstract:
Large Language Models are prone to biased predictions and hallucinations, underlining the paramount importance of understanding their model-internal reasoning process. However, achieving faithful attributions for the entirety of a black-box transformer model and maintaining computational efficiency is an unsolved challenge. By extending the Layer-wise Relevance Propagation attribution method to ha…
▽ More
Large Language Models are prone to biased predictions and hallucinations, underlining the paramount importance of understanding their model-internal reasoning process. However, achieving faithful attributions for the entirety of a black-box transformer model and maintaining computational efficiency is an unsolved challenge. By extending the Layer-wise Relevance Propagation attribution method to handle attention layers, we address these challenges effectively. While partial solutions exist, our method is the first to faithfully and holistically attribute not only input but also latent representations of transformer models with the computational efficiency similar to a single backward pass. Through extensive evaluations against existing methods on LLaMa 2, Mixtral 8x7b, Flan-T5 and vision transformer architectures, we demonstrate that our proposed approach surpasses alternative methods in terms of faithfulness and enables the understanding of latent representations, opening up the door for concept-based explanations. We provide an LRP library at https://github.com/rachtibat/LRP-eXplains-Transformers.
△ Less
Submitted 10 June, 2024; v1 submitted 8 February, 2024;
originally announced February 2024.
-
On the Effect of Image Resolution on Semantic Segmentation
Authors:
Ritambhara Singh,
Abhishek Jain,
Pietro Perona,
Shivani Agarwal,
Junfeng Yang
Abstract:
High-resolution semantic segmentation requires substantial computational resources. Traditional approaches in the field typically downscale the input images before processing and then upscale the low-resolution outputs back to their original dimensions. While this strategy effectively identifies broad regions, it often misses finer details. In this study, we demonstrate that a streamlined model ca…
▽ More
High-resolution semantic segmentation requires substantial computational resources. Traditional approaches in the field typically downscale the input images before processing and then upscale the low-resolution outputs back to their original dimensions. While this strategy effectively identifies broad regions, it often misses finer details. In this study, we demonstrate that a streamlined model capable of directly producing high-resolution segmentations can match the performance of more complex systems that generate lower-resolution results. By simplifying the network architecture, we enable the processing of images at their native resolution. Our approach leverages a bottom-up information propagation technique across various scales, which we have empirically shown to enhance segmentation accuracy. We have rigorously tested our method using leading-edge semantic segmentation datasets. Specifically, for the Cityscapes dataset, we further boost accuracy by applying the Noisy Student Training technique.
△ Less
Submitted 7 February, 2024;
originally announced February 2024.
-
Scaling laws for learning with real and surrogate data
Authors:
Ayush Jain,
Andrea Montanari,
Eren Sasoglu
Abstract:
Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and a bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, e.g. data collected under different circumstances or synthesized by generative models. We refer to such data as `surrogate data.' We introduce…
▽ More
Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and a bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, e.g. data collected under different circumstances or synthesized by generative models. We refer to such data as `surrogate data.' We introduce a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training. We analyze mathematically this method under several classical statistical models, and validate our findings empirically on datasets from different domains. Our main findings are: $(i)$ Integrating surrogate data can significantly reduce the test error on the original distribution. Surprisingly, this can happen even when the surrogate data is unrelated to the original ones. We trace back this behavior to the classical Stein's paradox. $(ii)$ In order to reap the benefit of surrogate data, it is crucial to use optimally weighted ERM. $(iii)$ The test error of models trained on mixtures of real and surrogate data is approximately described by a scaling law. This scaling law can be used to predict the optimal weighting scheme, and to choose the amount of surrogate data to add.
△ Less
Submitted 28 June, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
Lossy Cryptography from Code-Based Assumptions
Authors:
Quang Dao,
Aayush Jain
Abstract:
Over the past few decades, we have seen a proliferation of advanced cryptographic primitives with lossy or homomorphic properties built from various assumptions such as Quadratic Residuosity, Decisional Diffie-Hellman, and Learning with Errors. These primitives imply hard problems in the complexity class $SZK$ (statistical zero-knowledge); as a consequence, they can only be based on assumptions th…
▽ More
Over the past few decades, we have seen a proliferation of advanced cryptographic primitives with lossy or homomorphic properties built from various assumptions such as Quadratic Residuosity, Decisional Diffie-Hellman, and Learning with Errors. These primitives imply hard problems in the complexity class $SZK$ (statistical zero-knowledge); as a consequence, they can only be based on assumptions that are broken in $BPP^{SZK}$. This poses a barrier for building advanced primitives from code-based assumptions, as the only known such assumption is Learning Parity with Noise (LPN) with an extremely low noise rate $\frac{\log^2 n}{n}$, which is broken in quasi-polynomial time.
In this work, we propose a new code-based assumption: Dense-Sparse LPN, that falls in the complexity class $BPP^{SZK}$ and is conjectured to be secure against subexponential time adversaries. Our assumption is a variant of LPN that is inspired by McEliece's cryptosystem and random $k\mbox{-}$XOR in average-case complexity.
We leverage our assumption to build lossy trapdoor functions (Peikert-Waters STOC 08). This gives the first post-quantum alternative to the lattice-based construction in the original paper. Lossy trapdoor functions, being a fundamental cryptographic tool, are known to enable a broad spectrum of both lossy and non-lossy cryptographic primitives; our construction thus implies these primitives in a generic manner. In particular, we achieve collision-resistant hash functions with plausible subexponential security, improving over a prior construction from LPN with noise rate $\frac{\log^2 n}{n}$ that is only quasi-polynomially secure.
△ Less
Submitted 5 February, 2024;
originally announced February 2024.
-
LatticeGraphNet: A two-scale graph neural operator for simulating lattice structures
Authors:
Ayush Jain,
Ehsan Haghighat,
Sai Nelaturi
Abstract:
This study introduces a two-scale Graph Neural Operator (GNO), namely, LatticeGraphNet (LGN), designed as a surrogate model for costly nonlinear finite-element simulations of three-dimensional latticed parts and structures. LGN has two networks: LGN-i, learning the reduced dynamics of lattices, and LGN-ii, learning the mapping from the reduced representation onto the tetrahedral mesh. LGN can pred…
▽ More
This study introduces a two-scale Graph Neural Operator (GNO), namely, LatticeGraphNet (LGN), designed as a surrogate model for costly nonlinear finite-element simulations of three-dimensional latticed parts and structures. LGN has two networks: LGN-i, learning the reduced dynamics of lattices, and LGN-ii, learning the mapping from the reduced representation onto the tetrahedral mesh. LGN can predict deformation for arbitrary lattices, therefore the name operator. Our approach significantly reduces inference time while maintaining high accuracy for unseen simulations, establishing the use of GNOs as efficient surrogate models for evaluating mechanical responses of lattices and structures.
△ Less
Submitted 1 February, 2024;
originally announced February 2024.
-
PBSCSR: The Piano Bootleg Score Composer Style Recognition Dataset
Authors:
Arhan Jain,
Alec Bunn,
Austin Pham,
TJ Tsai
Abstract:
This article motivates, describes, and presents the PBSCSR dataset for studying composer style recognition of piano sheet music. Our overarching goal was to create a dataset for studying composer style recognition that is "as accessible as MNIST and as challenging as ImageNet". To achieve this goal, we use a previously proposed feature representation of sheet music called a bootleg score, which en…
▽ More
This article motivates, describes, and presents the PBSCSR dataset for studying composer style recognition of piano sheet music. Our overarching goal was to create a dataset for studying composer style recognition that is "as accessible as MNIST and as challenging as ImageNet". To achieve this goal, we use a previously proposed feature representation of sheet music called a bootleg score, which encodes the position of noteheads relative to the staff lines. Using this representation, we sample fixed-length bootleg score fragments from piano sheet music images on IMSLP. The dataset itself contains 40,000 62x64 bootleg score images for a 9-way classification task, 100,000 62x64 bootleg score images for a 100-way classification task, and 29,310 unlabeled variable-length bootleg score images for pretraining. The labeled data is presented in a form that mirrors MNIST images, in order to make it extremely easy to visualize, manipulate, and train models in an efficient manner. Additionally, we include relevant metadata to allow access to the underlying raw sheet music images and other related data on IMSLP. We describe several research tasks that could be studied with the dataset, including variations of composer style recognition in a few-shot or zero-shot setting. For tasks that have previously proposed models, we release code and baseline results for future works to compare against. We also discuss open research questions that the PBSCSR data is especially well suited to facilitate research on and areas of fruitful exploration in future work.
△ Less
Submitted 7 February, 2024; v1 submitted 30 January, 2024;
originally announced January 2024.
-
Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets
Authors:
Kumar Abhishek,
Aditi Jain,
Ghassan Hamarneh
Abstract:
The remarkable progress of deep learning in dermatological tasks has brought us closer to achieving diagnostic accuracies comparable to those of human experts. However, while large datasets play a crucial role in the development of reliable deep neural network models, the quality of data therein and their correct usage are of paramount importance. Several factors can impact data quality, such as t…
▽ More
The remarkable progress of deep learning in dermatological tasks has brought us closer to achieving diagnostic accuracies comparable to those of human experts. However, while large datasets play a crucial role in the development of reliable deep neural network models, the quality of data therein and their correct usage are of paramount importance. Several factors can impact data quality, such as the presence of duplicates, data leakage across train-test partitions, mislabeled images, and the absence of a well-defined test partition. In this paper, we conduct meticulous analyses of two popular dermatological image datasets: DermaMNIST and Fitzpatrick17k, uncovering these data quality issues, measure the effects of these problems on the benchmark results, and propose corrections to the datasets. Besides ensuring the reproducibility of our analysis, by making our analysis pipeline and the accompanying code publicly available, we aim to encourage similar explorations and to facilitate the identification and addressing of potential data quality issues in other large datasets.
△ Less
Submitted 25 January, 2024;
originally announced January 2024.
-
Explaining Drift using Shapley Values
Authors:
Narayanan U. Edakunni,
Utkarsh Tekriwal,
Anukriti Jain
Abstract:
Machine learning models often deteriorate in their performance when they are used to predict the outcomes over data on which they were not trained. These scenarios can often arise in real world when the distribution of data changes gradually or abruptly due to major events like a pandemic. There have been many attempts in machine learning research to come up with techniques that are resilient to s…
▽ More
Machine learning models often deteriorate in their performance when they are used to predict the outcomes over data on which they were not trained. These scenarios can often arise in real world when the distribution of data changes gradually or abruptly due to major events like a pandemic. There have been many attempts in machine learning research to come up with techniques that are resilient to such Concept drifts. However, there is no principled framework to identify the drivers behind the drift in model performance. In this paper, we propose a novel framework - DBShap that uses Shapley values to identify the main contributors of the drift and quantify their respective contributions. The proposed framework not only quantifies the importance of individual features in driving the drift but also includes the change in the underlying relation between the input and output as a possible driver. The explanation provided by DBShap can be used to understand the root cause behind the drift and use it to make the model resilient to the drift.
△ Less
Submitted 18 January, 2024;
originally announced January 2024.
-
Mobile Contactless Palmprint Recognition: Use of Multiscale, Multimodel Embeddings
Authors:
Steven A. Grosz,
Akash Godbole,
Anil K. Jain
Abstract:
Contactless palmprints are comprised of both global and local discriminative features. Most prior work focuses on extracting global features or local features alone for palmprint matching, whereas this research introduces a novel framework that combines global and local features for enhanced palmprint matching accuracy. Leveraging recent advancements in deep learning, this study integrates a visio…
▽ More
Contactless palmprints are comprised of both global and local discriminative features. Most prior work focuses on extracting global features or local features alone for palmprint matching, whereas this research introduces a novel framework that combines global and local features for enhanced palmprint matching accuracy. Leveraging recent advancements in deep learning, this study integrates a vision transformer (ViT) and a convolutional neural network (CNN) to extract complementary local and global features. Next, a mobile-based, end-to-end palmprint recognition system is developed, referred to as Palm-ID. On top of the ViT and CNN features, Palm-ID incorporates a palmprint enhancement module and efficient dimensionality reduction (for faster matching). Palm-ID balances the trade-off between accuracy and latency, requiring just 18ms to extract a template of size 516 bytes, which can be efficiently searched against a 10,000 palmprint gallery in 0.33ms on an AMD EPYC 7543 32-Core CPU utilizing 128-threads. Cross-database matching protocols and evaluations on large-scale operational datasets demonstrate the robustness of the proposed method, achieving a TAR of 98.06% at FAR=0.01% on a newly collected, time-separated dataset. To show a practical deployment of the end-to-end system, the entire recognition pipeline is embedded within a mobile device for enhanced user privacy and security.
△ Less
Submitted 15 January, 2024;
originally announced January 2024.