-
A Software Engineering Perspective on Testing Large Language Models: Research, Practice, Tools and Benchmarks
Authors:
Sinclair Hudson,
Sophia Jit,
Boyue Caroline Hu,
Marsha Chechik
Abstract:
Large Language Models (LLMs) are rapidly becoming ubiquitous both as stand-alone tools and as components of current and future software systems. To enable usage of LLMs in the high-stake or safety-critical systems of 2030, they need to undergo rigorous testing. Software Engineering (SE) research on testing Machine Learning (ML) components and ML-based systems has systematically explored many topics such as test input generation and robustness. We believe knowledge about tools, benchmarks, research and practitioner views related to LLM testing needs to be similarly organized. To this end, we present a taxonomy of LLM testing topics and conduct preliminary studies of state of the art and practice approaches to research, open-source tools and benchmarks for LLM testing, mapping results onto this taxonomy. Our goal is to identify gaps requiring more research and engineering effort and inspire a clearer communication between LLM practitioners and the SE research community.
Submitted 12 June, 2024;
originally announced June 2024.
-
Portable, heterogeneous ensemble workflows at scale using libEnsemble
Authors:
Stephen Hudson,
Jeffrey Larson,
John-Luke Navarro,
Stefan M. Wild
Abstract:
libEnsemble is a Python-based toolkit for running dynamic ensembles, developed as part of the DOE Exascale Computing Project. The toolkit utilizes a unique generator–simulator–allocator paradigm, where generators produce input for simulators, simulators evaluate those inputs, and allocators decide whether and when a simulator or generator should be called. The generator steers the ensemble based on simulation results. Generators may, for example, apply methods for numerical optimization, machine learning, or statistical calibration. libEnsemble communicates between a manager and workers. We overview the unique characteristics of libEnsemble as well as current and potential interoperability with other packages in the workflow ecosystem. We highlight libEnsemble's dynamic resource features: libEnsemble can detect system resources, such as available nodes, cores, and GPUs, and assign these in a portable way. These features allow users to specify the number of processors and GPUs required for each simulation; and resources will be automatically assigned on a wide range of systems, including Frontier, Aurora, and Perlmutter. Such ensembles can include multiple simulation types, some using GPUs and others using only CPUs, sharing nodes for maximum efficiency. We also describe the benefits of libEnsemble's generator–simulator coupling, which easily exposes to the user the ability to cancel, and portably kill, running simulations based on models that are updated with intermediate simulation output. We demonstrate libEnsemble's capabilities, scalability, and scientific impact via a Gaussian process surrogate training problem for the longitudinal density profile at the exit of a plasma accelerator stage. The study uses gpCAM for the surrogate model and employs either Wake-T or WarpX simulations, highlighting efficient use of resources that can easily extend to exascale.
Submitted 23 July, 2024; v1 submitted 6 March, 2024;
originally announced March 2024.
-
Science Communications for Explainable Artificial Intelligence
Authors:
Simon Hudson,
Matija Franklin
Abstract:
Artificial Intelligence (AI) has a communication problem. XAI methods have been used to make AI more understandable and helped resolve some of the transparency issues that inhibit AI's broader usability. However, user evaluation studies reveal that the often numerical explanations provided by XAI methods have not always been effective for many types of users of AI systems. This article aims to adapt the major communications models from Science Communications into a framework for practitioners to understand, influence, and integrate the context of audiences both for their communications supporting AI literacy in the public and in designing XAI systems that are more adaptive to different users.
Submitted 30 August, 2023;
originally announced August 2023.
-
DAS-N2N: Machine learning Distributed Acoustic Sensing (DAS) signal denoising without clean data
Authors:
Sacha Lapins,
Antony Butcher,
J. -Michael Kendall,
Thomas S. Hudson,
Anna L. Stork,
Maximilian J. Werner,
Jemma Gunning,
Alex M. Brisbourne
Abstract:
This article presents a weakly supervised machine learning method, which we call DAS-N2N, for suppressing strong random noise in distributed acoustic sensing (DAS) recordings. DAS-N2N requires no manually produced labels (i.e., pre-determined examples of clean event signals or sections of noise) for training and aims to map random noise processes to a chosen summary statistic, such as the distribution mean, median or mode, whilst retaining the true underlying signal. This is achieved by splicing (joining together) two fibres hosted within a single optical cable, recording two noisy copies of the same underlying signal corrupted by different independent realizations of random observational noise. A deep learning model can then be trained using only these two noisy copies of the data to produce a near fully-denoised copy. Once the model is trained, only noisy data from a single fibre is required. Using a dataset from a DAS array deployed on the surface of the Rutford Ice Stream in Antarctica, we demonstrate that DAS-N2N greatly suppresses incoherent noise and enhances the signal-to-noise ratios (SNR) of natural microseismic icequake events. We further show that this approach is inherently more efficient and effective than standard stop/pass band and white noise (e.g., Wiener) filtering routines, as well as a comparable self-supervised learning method based on masking individual DAS channels. Our preferred model for this task is lightweight, processing 30 seconds of data recorded at a sampling frequency of 1000 Hz over 985 channels (approx. 1 km of fibre) in <1 s. Due to the high noise levels in DAS recordings, efficient data-driven denoising methods, such as DAS-N2N, will prove essential to time-critical DAS earthquake detection, particularly in the case of microseismic monitoring.
Submitted 24 November, 2023; v1 submitted 17 April, 2023;
originally announced April 2023.
-
Frames for signal processing on Cayley graphs
Authors:
Kathryn Beck,
Mahya Ghandehari,
Skyler Hudson,
Jenna Paltenstein
Abstract:
The spectral decomposition of graph adjacency matrices is an essential ingredient in the design of graph signal processing (GSP) techniques. When the adjacency matrix has multi-dimensional eigenspaces, it is desirable to base GSP constructions on a particular eigenbasis (the 'preferred basis'). In this paper, we provide an explicit and detailed representation-theoretic account for the spectral decomposition of the adjacency matrix of a (weighted) Cayley graph, which results in a preferred basis. Our method applies to all weighted (not necessarily quasi-Abelian) Cayley graphs, and provides descriptions of eigenvalues and eigenvectors based on the coefficient functions of the representations of the underlying group. Next, we use such bases to build frames that are suitable for developing signal processing on such graphs. These are the Frobenius–Schur frames and Cayley frames, for which we provide a characterization and a practical recipe for their construction.
Submitted 9 February, 2024; v1 submitted 5 March, 2023;
originally announced March 2023.
-
Enabling hand gesture customization on wrist-worn devices
Authors:
Xuhai Xu,
Jun Gong,
Carolina Brum,
Lilian Liang,
Bongsoo Suh,
Kumar Gupta,
Yash Agarwal,
Laurence Lindsey,
Runchang Kang,
Behrooz Shahsavari,
Tu Nguyen,
Heriberto Nieto,
Scott E. Hudson,
Charlie Maalouf,
Seyed Mousavi,
Gierad Laput
Abstract:
We present a framework for gesture customization requiring minimal examples from users, all without degrading the performance of existing gesture sets. To achieve this, we first deployed a large-scale study (N=500+) to collect data and train an accelerometer-gyroscope recognition model with a cross-user accuracy of 95.7% and a false-positive rate of 0.6 per hour when tested on everyday non-gesture data. Next, we design a few-shot learning framework which derives a lightweight model from our pre-trained model, enabling knowledge transfer without performance degradation. We validate our approach through a user study (N=20) examining on-device customization from 12 new gestures, resulting in an average accuracy of 55.3%, 83.1%, and 87.2% on using one, three, or five shots when adding a new gesture, while maintaining the same recognition accuracy and false-positive rate from the pre-existing gesture set. We further evaluate the usability of our real-time implementation with a user experience study (N=20). Our results highlight the effectiveness, learnability, and usability of our customization framework. Our approach paves the way for a future where users are no longer bound to pre-existing gestures, freeing them to creatively introduce new gestures tailored to their preferences and abilities.
Submitted 19 April, 2022; v1 submitted 29 March, 2022;
originally announced March 2022.
-
Sim-to-Real Domain Adaptation for Lane Detection and Classification in Autonomous Driving
Authors:
Chuqing Hu,
Sinclair Hudson,
Martin Ethier,
Mohammad Al-Sharman,
Derek Rayside,
William Melek
Abstract:
While supervised detection and classification frameworks in autonomous driving require large labelled datasets to converge, Unsupervised Domain Adaptation (UDA) approaches, facilitated by synthetic data generated from photo-real simulated environments, are considered low-cost and less time-consuming solutions. In this paper, we propose UDA schemes using adversarial discriminative and generative methods for lane detection and classification applications in autonomous driving. We also present Simulanes dataset generator to create a synthetic dataset that is naturalistic utilizing CARLA's vast traffic scenarios and weather conditions. The proposed UDA frameworks take the synthesized dataset with labels as the source domain, whereas the target domain is the unlabelled real-world data. Using adversarial generative and feature discriminators, the learnt models are tuned to predict the lane location and class in the target domain. The proposed techniques are evaluated using both real-world and our synthetic datasets. The results manifest that the proposed methods have shown superiority over other baseline schemes in terms of detection and classification accuracy and consistency. The ablation study reveals that the size of the simulation dataset plays important roles in the classification performance of the proposed methods. Our UDA frameworks are available at https://github.com/anita-hu/sim2real-lane-detection and our dataset generator is released at https://github.com/anita-hu/simulanes
Submitted 30 May, 2022; v1 submitted 14 February, 2022;
originally announced February 2022.
-
libEnsemble: A Library to Coordinate the Concurrent Evaluation of Dynamic Ensembles of Calculations
Authors:
Stephen Hudson,
Jeffrey Larson,
John-Luke Navarro,
Stefan M. Wild
Abstract:
Almost all applications stop scaling at some point; those that don't are seldom performant when considering time to solution on anything but aspirational/unicorn resources. Recognizing these tradeoffs as well as greater user functionality in a near-term exascale computing era, we present libEnsemble, a library aimed at particular scalability- and capability-stretching uses. libEnsemble enables running concurrent instances of an application in dynamically allocated ensembles through an extensible Python library. We highlight the structure, execution, and capabilities of the library on leading pre-exascale environments as well as advanced capabilities for exascale environments and beyond.
Submitted 16 April, 2021;
originally announced April 2021.
-
Rapid Convergence: The Outcomes of Making PPE during a Healthcare Crisis
Authors:
Kelly Mack,
Megan Hofmann,
Udaya Lakshmi,
Jerry Cao,
Nayha Auradkar,
Rosa I. Arriaga,
Scott E. Hudson,
Jennifer Mankoff
Abstract:
The NIH 3D Print Exchange is a public and open source repository for primarily 3D printable medical device designs with contributions from expert-amateur makers, engineers from industry and academia, and clinicians. In response to the COVID-19 pandemic, a collection was formed to foster submissions of low-cost, local manufacture of personal protective equipment (PPE). We systematically evaluated the 623 submissions in this collection to understand: what makers contributed, how they were made, who made them, and key characteristics of their designs. Our analysis reveals an immediate design convergence to derivatives of a few initial designs affiliated with NIH partners (e.g., universities, the Veteran's Health Administration, America Makes) and major for-profit groups (e.g., Prusa). The NIH worked to review safe and effective designs but was quickly overloaded by derivative works. We found that the vast majority were never reviewed (81.3%) while 10.4% of those reviewed were deemed safe for clinical (5.6%) or community use (4.8%). Our work contributes insights into: the outcomes of distributed, community-based, medical making; features the community accepted as "safe" making; and how platforms can support regulated maker activities in high-risk domains (e.g., healthcare).
Submitted 19 January, 2021;
originally announced January 2021.