-
Uniform-in-Phase-Space Data Selection with Iterative Normalizing Flows
Authors:
Malik Hassanaly,
Bruce A. Perry,
Michael E. Mueller,
Shashank Yellapantula
Abstract:
Improvements in computational and experimental capabilities are rapidly increasing the amount of scientific data that is routinely generated. In applications that are constrained by memory and computational intensity, excessively large datasets may hinder scientific discovery, making data reduction a critical component of data-driven methods. Datasets are growing in two directions: the number of data points and their dimensionality. Whereas dimension reduction typically aims at describing each data sample in a lower-dimensional space, the focus here is on reducing the number of data points. A strategy is proposed to select data points such that they uniformly span the phase-space of the data. The proposed algorithm relies on estimating the probability map of the data and using it to construct an acceptance probability. An iterative method is used to accurately estimate the probability of the rare data points when only a small subset of the dataset is used to construct the probability map. Instead of binning the phase-space to estimate the probability map, its functional form is approximated with a normalizing flow. Therefore, the method naturally extends to high-dimensional datasets. The proposed framework is demonstrated as a viable pathway to enable data-efficient machine learning when abundant data is available. An implementation of the method is available in a companion repository (https://github.com/NREL/Phase-space-sampling).
Submitted 27 February, 2023; v1 submitted 28 December, 2021;
originally announced December 2021.
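The acceptance-probability idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes 1-D data and swaps in a simple histogram as the density estimate, which is exactly the binning step the paper replaces with a normalizing flow. The function names and the per-bin target found by bisection are hypothetical choices made for illustration.

```python
import random
from collections import Counter


def acceptance_probabilities(points, n_select, n_bins=20):
    """Per-point acceptance probability, roughly proportional to 1 / density.

    Assumption: density is estimated with a histogram (a stand-in for the
    normalizing flow described in the abstract).
    """
    lo, hi = min(points), max(points)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all points coincide
    bins = [min(int((x - lo) / width), n_bins - 1) for x in points]
    counts = Counter(bins)

    # Find a per-bin target t such that sum_b min(count_b, t) == n_select.
    # Dense bins are thinned down to t points; sparse bins are kept whole.
    low, high = 0.0, float(max(counts.values()))
    for _ in range(60):
        mid = 0.5 * (low + high)
        if sum(min(c, mid) for c in counts.values()) < n_select:
            low = mid
        else:
            high = mid
    target = high
    return [min(1.0, target / counts[b]) for b in bins]


def select_uniform(points, n_select, n_bins=20, seed=0):
    """Keep each point independently with its acceptance probability."""
    rng = random.Random(seed)
    probs = acceptance_probabilities(points, n_select, n_bins)
    return [x for x, a in zip(points, probs) if rng.random() < a]
```

With this rule, points in sparsely populated regions are always kept, while points in dense regions are kept with a probability inversely proportional to the local count, so the expected number of selected points equals `n_select`.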
-
Path Planning for Shepherding a Swarm in a Cluttered Environment using Differential Evolution
Authors:
Saber Elsayed,
Hemant Singh,
Essam Debie,
Anthony Perry,
Benjamin Campbell,
Robert Hunjet,
Hussein Abbass
Abstract:
Shepherding involves herding a swarm of agents (\emph{sheep}) by another control agent (\emph{sheepdog}) towards a goal. Multiple approaches have been documented in the literature to model this behaviour. In this paper, we present a modification to a well-known shepherding approach, and show, via simulation, that this modification improves shepherding efficacy. We then argue that given the complexity arising from obstacle-laden environments, path planning approaches could further enhance this model. To validate this hypothesis, we present a 2-stage evolutionary-based path planning algorithm for shepherding a swarm of agents in 2D environments. In the first stage, the algorithm attempts to find the best path for the sheepdog to move from its initial location to a strategic driving location behind the sheep. In the second stage, it calculates and optimises a path for the sheep. It does so by using \emph{way points} on that path as the sequential sub-goals for the sheepdog to aim towards. The proposed algorithm is evaluated in obstacle-laden environments via simulation, with further improvements achieved.
Submitted 28 August, 2020;
originally announced August 2020.
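As a rough illustration of how differential evolution can search over way points, here is a minimal DE/rand/1/bin sketch. It is not the paper's two-stage algorithm: it assumes an obstacle-free 10 x 10 workspace and uses plain path length as the cost, and every name and parameter value (`de_plan`, `F`, `CR`, the population size) is a hypothetical choice for illustration.

```python
import math
import random


def path_length(start, goal, waypoints):
    """Total Euclidean length of start -> waypoints -> goal."""
    pts = [start] + list(waypoints) + [goal]
    return sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))


def de_plan(start, goal, n_way=3, pop_size=20, gens=100, F=0.5, CR=0.9, seed=0):
    """Optimise 2-D way points with DE/rand/1/bin (obstacles are NOT modelled)."""
    rng = random.Random(seed)
    dim = 2 * n_way

    def decode(vec):
        return [(vec[i], vec[i + 1]) for i in range(0, dim, 2)]

    def cost(vec):
        return path_length(start, goal, decode(vec))

    pop = [[rng.uniform(0.0, 10.0) for _ in range(dim)] for _ in range(pop_size)]
    costs = [cost(v) for v in pop]
    for _ in range(gens):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = [
                pop[a][k] + F * (pop[b][k] - pop[c][k])
                if (rng.random() < CR or k == j_rand)
                else pop[i][k]
                for k in range(dim)
            ]
            # Greedy selection: keep the trial only if it is no worse.
            trial_cost = cost(trial)
            if trial_cost <= costs[i]:
                pop[i], costs[i] = trial, trial_cost
    best = min(range(pop_size), key=costs.__getitem__)
    return decode(pop[best]), costs[best]
```

In an obstacle-laden setting, the cost function would additionally need to penalise segments that intersect obstacles; that part is left out here.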
-
The Medical Scribe: Corpus Development and Model Performance Analyses
Authors:
Izhak Shafran,
Nan Du,
Linh Tran,
Amanda Perry,
Lauren Keyes,
Mark Knichel,
Ashley Domin,
Lei Huang,
Yuhui Chen,
Gang Li,
Mingqiu Wang,
Laurent El Shafey,
Hagen Soltau,
Justin S. Paul
Abstract:
There is a growing interest in creating tools to assist in clinical note generation using the audio of provider-patient encounters. Motivated by this goal and with the help of providers and medical scribes, we developed an annotation scheme to extract relevant clinical concepts. We used this annotation scheme to label a corpus of about 6k clinical encounters. This was used to train a state-of-the-art tagging model. We report ontologies, labeling results, model performances, and detailed analyses of the results. Our results show that the entities related to medications can be extracted with a relatively high accuracy of 0.90 F-score, followed by symptoms at 0.72 F-score, and conditions at 0.57 F-score. In our task, we not only identify where the symptoms are mentioned but also map them to canonical forms as they appear in the clinical notes. Of the different types of errors, in about 19-38% of the cases, we find that the model output was correct, and about 17-32% of the errors do not impact the clinical note. Taken together, the models developed in this work are more useful than the F-scores reflect, making it a promising approach for practical applications.
Submitted 11 March, 2020;
originally announced March 2020.
-
Overcomplete Independent Component Analysis via SDP
Authors:
Anastasia Podosinnikova,
Amelia Perry,
Alexander Wein,
Francis Bach,
Alexandre d'Aspremont,
David Sontag
Abstract:
We present a novel algorithm for overcomplete independent components analysis (ICA), where the number of latent sources k exceeds the dimension p of observed variables. Previous algorithms either suffer from high computational complexity or make strong assumptions about the form of the mixing matrix. Our algorithm does not make any sparsity assumption yet enjoys favorable computational and theoretical properties. Our algorithm consists of two main steps: (a) estimation of the Hessians of the cumulant generating function (as opposed to the fourth and higher order cumulants used by most algorithms) and (b) a novel semi-definite programming (SDP) relaxation for recovering a mixing component. We show that this relaxation can be efficiently solved with a projected accelerated gradient descent method, which makes the whole algorithm computationally practical. Moreover, we conjecture that the proposed program recovers a mixing component at the rate k < p^2/4 and prove that a mixing component can be recovered with high probability when k < (2 - epsilon) p log p, provided the original components are sampled uniformly at random on the hypersphere. Experiments are provided on synthetic data and the CIFAR-10 dataset of real images.
Submitted 24 January, 2019;
originally announced January 2019.
-
Optimality and Sub-optimality of PCA I: Spiked Random Matrix Models
Authors:
Amelia Perry,
Alexander S. Wein,
Afonso S. Bandeira,
Ankur Moitra
Abstract:
A central problem of random matrix theory is to understand the eigenvalues of spiked random matrix models, introduced by Johnstone, in which a prominent eigenvector (or "spike") is planted into a random matrix. These distributions form natural statistical models for principal component analysis (PCA) problems throughout the sciences. Baik, Ben Arous and Péché showed that the spiked Wishart ensemble exhibits a sharp phase transition asymptotically: when the spike strength is above a critical threshold, it is possible to detect the presence of a spike based on the top eigenvalue, and below the threshold the top eigenvalue provides no information. Such results form the basis of our understanding of when PCA can detect a low-rank signal in the presence of noise. However, under structural assumptions on the spike, not all information is necessarily contained in the spectrum. We study the statistical limits of tests for the presence of a spike, including non-spectral tests. Our results leverage Le Cam's notion of contiguity, and include:
i) For the Gaussian Wigner ensemble, we show that PCA achieves the optimal detection threshold for certain natural priors for the spike.
ii) For any non-Gaussian Wigner ensemble, PCA is sub-optimal for detection. However, an efficient variant of PCA achieves the optimal threshold (for natural priors) by pre-transforming the matrix entries.
iii) For the Gaussian Wishart ensemble, the PCA threshold is optimal for positive spikes (for natural priors) but this is not always the case for negative spikes.
Submitted 12 July, 2018; v1 submitted 2 July, 2018;
originally announced July 2018.
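For reference, the spiked Wishart phase transition mentioned in the abstract above is commonly written as follows. The notation is an assumption introduced here and does not come from the abstract: $n$ samples in dimension $p$ with $p/n \to \gamma$, population covariance $I + \beta x x^\top$ with $\|x\| = 1$ and $\beta > 0$, and $\lambda_{\max}$ the top eigenvalue of the sample covariance matrix.

```latex
\lambda_{\max} \;\longrightarrow\;
\begin{cases}
(1+\beta)\left(1+\dfrac{\gamma}{\beta}\right), & \beta > \sqrt{\gamma},\\[6pt]
\left(1+\sqrt{\gamma}\right)^{2}, & \beta \le \sqrt{\gamma}.
\end{cases}
```

Below the threshold $\beta = \sqrt{\gamma}$ the limit coincides with the edge of the noise spectrum, which is why the top eigenvalue alone carries no information about the spike in that regime.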
-
Notes on computational-to-statistical gaps: predictions using statistical physics
Authors:
Afonso S. Bandeira,
Amelia Perry,
Alexander S. Wein
Abstract:
In these notes we describe heuristics to predict computational-to-statistical gaps in certain statistical problems. These are regimes in which the underlying statistical problem is information-theoretically possible although no efficient algorithm exists, rendering the problem essentially unsolvable for large instances. The methods we describe here are based on mature, albeit non-rigorous, tools from statistical physics.
These notes are based on a lecture series given by the authors at the Courant Institute of Mathematical Sciences in New York City, on May 16th, 2017.
Submitted 20 April, 2018; v1 submitted 29 March, 2018;
originally announced March 2018.
-
Estimation under group actions: recovering orbits from invariants
Authors:
Afonso S. Bandeira,
Ben Blum-Smith,
Joe Kileel,
Amelia Perry,
Jonathan Niles-Weed,
Alexander S. Wein
Abstract:
We study a class of orbit recovery problems in which we observe independent copies of an unknown element of $\mathbb{R}^p$, each linearly acted upon by a random element of some group (such as $\mathbb{Z}/p$ or $\mathrm{SO}(3)$) and then corrupted by additive Gaussian noise. We prove matching upper and lower bounds on the number of samples required to approximately recover the group orbit of this unknown element with high probability. These bounds, based on quantitative techniques in invariant theory, give a precise correspondence between the statistical difficulty of the estimation problem and algebraic properties of the group. Furthermore, we give computer-assisted procedures to certify these properties that are computationally efficient in many cases of interest.
The model is motivated by geometric problems in signal processing, computer vision, and structural biology, and applies to the reconstruction problem in cryo-electron microscopy (cryo-EM), a problem of significant practical interest. Our results allow us to verify (for a given problem size) that if cryo-EM images are corrupted by noise with variance $\sigma^2$, the number of images required to recover the molecule structure scales as $\sigma^6$. We match this bound with a novel (albeit computationally expensive) algorithm for ab initio reconstruction in cryo-EM, based on invariant features of degree at most 3. We further discuss how to recover multiple molecular structures from mixed (or heterogeneous) cryo-EM samples.
Submitted 13 June, 2023; v1 submitted 29 December, 2017;
originally announced December 2017.
-
The sample complexity of multi-reference alignment
Authors:
Amelia Perry,
Jonathan Weed,
Afonso S. Bandeira,
Philippe Rigollet,
Amit Singer
Abstract:
The growing role of data-driven approaches to scientific discovery has unveiled a large class of models that involve latent transformations with a rigid algebraic constraint. Three-dimensional molecule reconstruction in Cryo-Electron Microscopy (cryo-EM) is a central problem in this class. Despite decades of algorithmic and software development, there is still little theoretical understanding of the sample complexity of this problem, that is, the number of images required for 3-D reconstruction. Here we consider multi-reference alignment (MRA), a simple model that captures fundamental aspects of the statistical and algorithmic challenges arising in cryo-EM and related problems. In MRA, an unknown signal is subject to two types of corruption: a latent cyclic shift and the more traditional additive white noise. The goal is to recover the signal at a certain precision from independent samples. While at high signal-to-noise ratio (SNR), the number of observations needed to recover a generic signal is proportional to $1/\mathrm{SNR}$, we prove that it rises to a surprising $1/\mathrm{SNR}^3$ in the low SNR regime. This precise phenomenon was observed empirically more than twenty years ago for cryo-EM but has remained unexplained to date. Furthermore, our techniques can easily be extended to the heterogeneous MRA model where the samples come from a mixture of signals, as is often the case in applications such as cryo-EM, where molecules may have different conformations. This provides a first step towards a statistical theory for heterogeneous cryo-EM.
Submitted 3 June, 2019; v1 submitted 4 July, 2017;
originally announced July 2017.
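The MRA observation model described in the abstract above (a latent cyclic shift followed by additive white noise) can be simulated with a short sketch. The function names are hypothetical, and the mean estimator shown is only the simplest shift-invariant feature, not the estimator analysed in the paper.

```python
import random


def mra_observation(signal, sigma, rng):
    """One MRA sample: a uniformly random cyclic shift of `signal`
    plus independent N(0, sigma^2) noise on every entry."""
    shift = rng.randrange(len(signal))
    shifted = signal[shift:] + signal[:shift]
    return [v + rng.gauss(0.0, sigma) for v in shifted]


def estimate_mean(observations):
    """Estimate the signal mean, which is unchanged by cyclic shifts,
    by averaging every entry of every observation."""
    total = sum(sum(obs) for obs in observations)
    count = sum(len(obs) for obs in observations)
    return total / count
```

Because the shift is unknown, averaging the observations entry by entry only recovers the signal mean; recovering the signal itself requires higher-order shift-invariant features, which is where the noise dependence of the sample complexity becomes steeper.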
-
Statistical limits of spiked tensor models
Authors:
Amelia Perry,
Alexander S. Wein,
Afonso S. Bandeira
Abstract:
We study the statistical limits of both detecting and estimating a rank-one deformation of a symmetric random Gaussian tensor. We establish upper and lower bounds on the critical signal-to-noise ratio, under a variety of priors for the planted vector: (i) a uniformly sampled unit vector, (ii) i.i.d. $\pm 1$ entries, and (iii) a sparse vector where a constant fraction $\rho$ of entries are i.i.d. $\pm 1$ and the rest are zero. For each of these cases, our upper and lower bounds match up to a $1+o(1)$ factor as the order $d$ of the tensor becomes large. For sparse signals (iii), our bounds are also asymptotically tight in the sparse limit $\rho \to 0$ for any fixed $d$ (including the $d=2$ case of sparse PCA). Our upper bounds for (i) demonstrate a phenomenon reminiscent of the work of Baik, Ben Arous and Péché: an `eigenvalue' of a perturbed tensor emerges from the bulk at a strictly lower signal-to-noise ratio than when the perturbation itself exceeds the bulk; we quantify the size of this effect. We also provide some general results for larger classes of priors. In particular, the large $d$ asymptotics of the threshold location differs between problems with discrete priors versus continuous priors. Finally, for priors (i) and (ii) we carry out the replica prediction from statistical physics, which is conjectured to give the exact information-theoretic threshold for any fixed $d$.
Of independent interest, we introduce a new improvement to the second moment method for contiguity, on which our lower bounds are based. Our technique conditions away from rare `bad' events that depend on interactions between the signal and noise. This enables us to close $\sqrt{2}$-factor gaps present in several previous works.
Submitted 24 January, 2017; v1 submitted 22 December, 2016;
originally announced December 2016.
-
Message-passing algorithms for synchronization problems over compact groups
Authors:
Amelia Perry,
Alexander S. Wein,
Afonso S. Bandeira,
Ankur Moitra
Abstract:
Various alignment problems arising in cryo-electron microscopy, community detection, time synchronization, computer vision, and other fields fall into a common framework of synchronization problems over compact groups such as Z/L, U(1), or SO(3). The goal of such problems is to estimate an unknown vector of group elements given noisy relative observations. We present an efficient iterative algorithm to solve a large class of these problems, allowing for any compact group, with measurements on multiple 'frequency channels' (Fourier modes, or more generally, irreducible representations of the group). Our algorithm is a highly efficient iterative method following the blueprint of approximate message passing (AMP), which has recently arisen as a central technique for inference problems such as structured low-rank estimation and compressed sensing. We augment the standard ideas of AMP with ideas from representation theory so that the algorithm can work with distributions over compact groups. Using standard but non-rigorous methods from statistical physics we analyze the behavior of our algorithm on a Gaussian noise model, identifying phases where the problem is easy, (computationally) hard, and (statistically) impossible. In particular, such evidence predicts that our algorithm is information-theoretically optimal in many cases, and that the remaining cases show evidence of statistical-to-computational gaps.
Submitted 14 October, 2016;
originally announced October 2016.
-
Crossing the Road Without Traffic Lights: An Android-based Safety Device
Authors:
Adi Perry,
Dor Verbin,
Nahum Kiryati
Abstract:
In the absence of pedestrian crossing lights, finding a safe moment to cross the road is often hazardous and challenging, especially for people with visual impairments. We present a reliable low-cost solution, an Android device attached to a traffic sign or lighting pole near the crossing, indicating whether it is safe to cross the road. The indication can be by sound, display, vibration, and various communication modalities provided by the Android device. The integral system camera is aimed at approaching traffic. Optical flow is computed from the incoming video stream, and projected onto an influx map, automatically acquired during a brief training period. The crossing safety is determined based on a 1-dimensional temporal signal derived from the projection. We implemented the complete system on a Samsung Galaxy K-Zoom Android smartphone, and obtained real-time operation. The system achieves promising experimental results, providing pedestrians with sufficiently early warning of approaching vehicles. The system can serve as a stand-alone safety device that can be installed where pedestrian crossing lights are ruled out. Requiring no dedicated infrastructure, it can be powered by a solar panel and remotely maintained via the cellular network.
Submitted 11 October, 2016;
originally announced October 2016.
-
Optimality and Sub-optimality of PCA for Spiked Random Matrices and Synchronization
Authors:
Amelia Perry,
Alexander S. Wein,
Afonso S. Bandeira,
Ankur Moitra
Abstract:
A central problem of random matrix theory is to understand the eigenvalues of spiked random matrix models, in which a prominent eigenvector is planted into a random matrix. These distributions form natural statistical models for principal component analysis (PCA) problems throughout the sciences. Baik, Ben Arous and Péché showed that the spiked Wishart ensemble exhibits a sharp phase transition asymptotically: when the signal strength is above a critical threshold, it is possible to detect the presence of a spike based on the top eigenvalue, and below the threshold the top eigenvalue provides no information. Such results form the basis of our understanding of when PCA can detect a low-rank signal in the presence of noise.
However, not all the information about the spike is necessarily contained in the spectrum. We study the fundamental limitations of statistical methods, including non-spectral ones. Our results include:
I) For the Gaussian Wigner ensemble, we show that PCA achieves the optimal detection threshold for a variety of benign priors for the spike. We extend previous work on the spherically symmetric and i.i.d. Rademacher priors through an elementary, unified analysis.
II) For any non-Gaussian Wigner ensemble, we show that PCA is always suboptimal for detection. However, a variant of PCA achieves the optimal threshold (for benign priors) by pre-transforming the matrix entries according to a carefully designed function. This approach has been stated before, and we give a rigorous and general analysis.
III) For both the Gaussian Wishart ensemble and various synchronization problems over groups, we show that inefficient procedures can work below the threshold where PCA succeeds, whereas no known efficient algorithm achieves this. This conjectural gap between what is statistically possible and what can be done efficiently remains open.
Submitted 23 December, 2016; v1 submitted 18 September, 2016;
originally announced September 2016.
-
A semidefinite program for unbalanced multisection in the stochastic block model
Authors:
Amelia Perry,
Alexander S. Wein
Abstract:
We propose a semidefinite programming (SDP) algorithm for community detection in the stochastic block model, a popular model for networks with latent community structure. We prove that our algorithm achieves exact recovery of the latent communities, up to the information-theoretic limits determined by Abbe and Sandon (2015). Our result extends prior SDP approaches by allowing for many communities of different sizes. By virtue of a semidefinite approach, our algorithms succeed against a semirandom variant of the stochastic block model, guaranteeing a form of robustness and generalization. We further explore how semirandom models can lend insight into both the strengths and limitations of SDPs in this setting.
Submitted 2 December, 2016; v1 submitted 20 July, 2015;
originally announced July 2015.