Search | arXiv e-print repository

Cannabis Impairment Monitoring Using Objective Eye Tracking Analytics

Authors: Jon Allen, Leah Brickson, Jan van Merkensteijn, Daniel Beeler, Jamshid Ghajar

Abstract: The continuing growth in cannabis legalization necessitates the development of rapid, objective methods for assessing impairment to ensure public and occupational safety. Traditional measurement techniques are subjective, time-consuming, and do not directly measure physical impairment. This study introduces objective metrics derived from eye-tracking analytics to address these limitations. We em… ▽ More The continuing growth in cannabis legalization necessitates the development of rapid, objective methods for assessing impairment to ensure public and occupational safety. Traditional measurement techniques are subjective, time-consuming, and do not directly measure physical impairment. This study introduces objective metrics derived from eye-tracking analytics to address these limitations. We employed a head-mounted display to present 20 subjects with smooth pursuit performance, horizontal saccade, and simple reaction time tasks. Individual and group performance was compared before and after cannabis use. Results demonstrated significant changes in oculomotor control post-cannabis consumption, with smooth pursuit performance showing the most substantial signal. The objective eye-tracking data was used to develop supervised learning models, achieving a classification accuracy of 89% for distinguishing between sober and impaired states when normalized against baseline measures. Eye-tracking is the optimal candidate for a portable, rapid, and objective tool for assessing cannabis impairment, offering significant improvements over current subjective and indirect methods. △ Less

Submitted 24 June, 2024; originally announced July 2024.

arXiv:2407.05333 [pdf, other]

Generating multi-scale NMC particles with radial grain architectures using spatial stochastics and GANs

Authors: Lukas Fuchs, Orkun Furat, Donal P. Finegan, Jeffery Allen, Francois L. E. Usseglio-Viretta, Bertan Ozdogru, Peter J. Weddle, Kandler Smith, Volker Schmidt

Abstract: Understanding structure-property relationships of Li-ion battery cathodes is crucial for optimizing rate-performance and cycle-life resilience. However, correlating the morphology of cathode particles, such as in NMC811, and their inner grain architecture with electrode performance is challenging, particularly, due to the significant length-scale difference between grain and particle sizes. Experi… ▽ More Understanding structure-property relationships of Li-ion battery cathodes is crucial for optimizing rate-performance and cycle-life resilience. However, correlating the morphology of cathode particles, such as in NMC811, and their inner grain architecture with electrode performance is challenging, particularly, due to the significant length-scale difference between grain and particle sizes. Experimentally, it is currently not feasible to image such a high number of particles with full granular detail to achieve representivity. A second challenge is that sufficiently high-resolution 3D imaging techniques remain expensive and are sparsely available at research institutions. To address these challenges, a stereological generative adversarial network (GAN)-based model fitting approach is presented that can generate representative 3D information from 2D data, enabling characterization of materials in 3D using cost-effective 2D data. Once calibrated, this multi-scale model is able to rapidly generate virtual cathode particles that are statistically similar to experimental data, and thus is suitable for virtual characterization and materials testing through numerical simulations. A large dataset of simulated particles with inner grain architecture has been made publicly available. △ Less

Submitted 7 July, 2024; originally announced July 2024.

arXiv:2406.09116 [pdf, other]

Injective Flows for parametric hypersurfaces

Authors: Marcello Massimo Negri, Jonathan Aellen, Volker Roth

Abstract: Normalizing Flows (NFs) are powerful and efficient models for density estimation. When modeling densities on manifolds, NFs can be generalized to injective flows but the Jacobian determinant becomes computationally prohibitive. Current approaches either consider bounds on the log-likelihood or rely on some approximations of the Jacobian determinant. In contrast, we propose injective flows for para… ▽ More Normalizing Flows (NFs) are powerful and efficient models for density estimation. When modeling densities on manifolds, NFs can be generalized to injective flows but the Jacobian determinant becomes computationally prohibitive. Current approaches either consider bounds on the log-likelihood or rely on some approximations of the Jacobian determinant. In contrast, we propose injective flows for parametric hypersurfaces and show that for such manifolds we can compute the Jacobian determinant exactly and efficiently, with the same cost as NFs. Furthermore, we show that for the subclass of star-like manifolds we can extend the proposed framework to always allow for a Cartesian representation of the density. We showcase the relevance of modeling densities on hypersurfaces in two settings. Firstly, we introduce a novel Objective Bayesian approach to penalized likelihood models by interpreting level-sets of the penalty as star-like manifolds. Secondly, we consider Bayesian mixture models and introduce a general method for variational inference by defining the posterior of mixture weights on the probability simplex. △ Less

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.04230 [pdf, other]

M3LEO: A Multi-Modal, Multi-Label Earth Observation Dataset Integrating Interferometric SAR and RGB Data

Authors: Matthew J Allen, Francisco Dorr, Joseph Alejandro Gallego Mejia, Laura Martínez-Ferrer, Anna Jungbluth, Freddie Kalaitzis, Raúl Ramos-Pollán

Abstract: Satellite-based remote sensing has revolutionised the way we address global challenges in a rapidly evolving world. Huge quantities of Earth Observation (EO) data are generated by satellite sensors daily, but processing these large datasets for use in ML pipelines is technically and computationally challenging. Specifically, different types of EO data are often hosted on a variety of platforms, wi… ▽ More Satellite-based remote sensing has revolutionised the way we address global challenges in a rapidly evolving world. Huge quantities of Earth Observation (EO) data are generated by satellite sensors daily, but processing these large datasets for use in ML pipelines is technically and computationally challenging. Specifically, different types of EO data are often hosted on a variety of platforms, with differing availability for Python preprocessing tools. In addition, spatial alignment across data sources and data tiling can present significant technical hurdles for novice users. While some preprocessed EO datasets exist, their content is often limited to optical or near-optical wavelength data, which is ineffective at night or in adverse weather conditions. Synthetic Aperture Radar (SAR), an active sensing technique based on microwave length radiation, offers a viable alternative. However, the application of machine learning to SAR has been limited due to a lack of ML-ready data and pipelines, particularly for the full diversity of SAR data, including polarimetry, coherence and interferometry. We introduce M3LEO, a multi-modal, multi-label EO dataset that includes polarimetric, interferometric, and coherence SAR data derived from Sentinel-1, alongside Sentinel-2 RGB imagery and a suite of labelled tasks for model evaluation. M3LEO spans 17.5TB and contains approximately 10M data chips across six geographic regions. The dataset is complemented by a flexible PyTorch Lightning framework, with configuration management using Hydra. We provide tools to process any dataset available on popular platforms such as Google Earth Engine for integration with our framework. Initial experiments validate the utility of our data and framework, showing that SAR imagery contains information additional to that extractable from RGB data. Data at huggingface.co/M3LEO, and code at github.com/spaceml-org/M3LEO. △ Less

Submitted 6 June, 2024; originally announced June 2024.

Comments: 9 pages, 2 figures

ACM Class: I.4; I.4.6; I.4.8; I.4.9; I.5; I.5.4

arXiv:2405.05596 [pdf, other]

Measuring Strategization in Recommendation: Users Adapt Their Behavior to Shape Future Content

Authors: Sarah H. Cen, Andrew Ilyas, Jennifer Allen, Hannah Li, Aleksander Madry

Abstract: Most modern recommendation algorithms are data-driven: they generate personalized recommendations by observing users' past behaviors. A common assumption in recommendation is that how a user interacts with a piece of content (e.g., whether they choose to "like" it) is a reflection of the content, but not of the algorithm that generated it. Although this assumption is convenient, it fails to captur… ▽ More Most modern recommendation algorithms are data-driven: they generate personalized recommendations by observing users' past behaviors. A common assumption in recommendation is that how a user interacts with a piece of content (e.g., whether they choose to "like" it) is a reflection of the content, but not of the algorithm that generated it. Although this assumption is convenient, it fails to capture user strategization: that users may attempt to shape their future recommendations by adapting their behavior to the recommendation algorithm. In this work, we test for user strategization by conducting a lab experiment and survey. To capture strategization, we adopt a model in which strategic users select their engagement behavior based not only on the content, but also on how their behavior affects downstream recommendations. Using a custom music player that we built, we study how users respond to different information about their recommendation algorithm as well as to different incentives about how their actions affect downstream outcomes. We find strong evidence of strategization across outcome metrics, including participants' dwell time and use of "likes." For example, participants who are told that the algorithm mainly pays attention to "likes" and "dislikes" use those functions 1.9x more than participants told that the algorithm mainly pays attention to dwell time. A close analysis of participant behavior (e.g., in response to our incentive conditions) rules out experimenter demand as the main driver of these trends. Further, in our post-experiment survey, nearly half of participants self-report strategizing "in the wild," with some stating that they ignore content they actually like to avoid over-recommendation of that content in the future. Together, our findings suggest that user strategization is common and that platforms cannot ignore the effect of their algorithms on user behavior. △ Less

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2402.06831 [pdf]

What We Know About Using Non-Engagement Signals in Content Ranking

Authors: Tom Cunningham, Sana Pandey, Leif Sigerson, Jonathan Stray, Jeff Allen, Bonnie Barrilleaux, Ravi Iyer, Smitha Milli, Mohit Kothari, Behnam Rezaei

Abstract: Many online platforms predominantly rank items by predicted user engagement. We believe that there is much unrealized potential in including non-engagement signals, which can improve outcomes both for platforms and for society as a whole. Based on a daylong workshop with experts from industry and academia, we formulate a series of propositions and document each as best we can from public evidence,… ▽ More Many online platforms predominantly rank items by predicted user engagement. We believe that there is much unrealized potential in including non-engagement signals, which can improve outcomes both for platforms and for society as a whole. Based on a daylong workshop with experts from industry and academia, we formulate a series of propositions and document each as best we can from public evidence, including quantitative results where possible. There is strong evidence that ranking by predicted engagement is effective in increasing user retention. However retention can be further increased by incorporating other signals, including item "quality" proxies and asking users what they want to see with "item-level" surveys. There is also evidence that "diverse engagement" is an effective quality signal. Ranking changes can alter the prevalence of self-reported experiences of various kinds (e.g. harassment) but seldom have large enough effects on attitude measures like user satisfaction, well-being, polarization etc. to be measured in typical experiments. User controls over ranking often have low usage rates, but when used they do correlate well with quality and item-level surveys. There was no strong evidence on the impact of transparency/explainability on retention. There is reason to believe that generative AI could be used to create better quality signals and enable new kinds of user controls. △ Less

Submitted 9 February, 2024; originally announced February 2024.

ACM Class: H.3.3; H.4.3

arXiv:2311.00197 [pdf, other]

Design, Modeling, and Control of a Low-Cost and Rapid Response Soft-Growing Manipulator for Orchard Operations

Authors: Ryan Dorosh, Justin Allen, Zixuan He, Christopher Ninatanta, Jack Coleman, Jack Spieker, Ethan Tuck, Jordan Kurtz, Qin Zhang, Matthew D. Whiting, Jiecai Luo, Manoj Karkee, Ming Luo

Abstract: Tree fruit growers around the world are facing labor shortages for critical operations, including harvest and pruning. There is a great interest in developing robotic solutions for these labor-intensive tasks, but current efforts have been prohibitively costly, slow, or require a reconfiguration of the orchard in order to function. In this paper, we introduce an alternative approach to robotics us… ▽ More Tree fruit growers around the world are facing labor shortages for critical operations, including harvest and pruning. There is a great interest in developing robotic solutions for these labor-intensive tasks, but current efforts have been prohibitively costly, slow, or require a reconfiguration of the orchard in order to function. In this paper, we introduce an alternative approach to robotics using a novel and low-cost soft-growing robotic platform. Our platform features the ability to extend up to 1.2 m linearly at a maximum speed of 0.27 m/s. The soft-growing robotic arm can operate with a terminal payload of up to 1.4 kg (4.4 N), more than sufficient for carrying an apple. This platform decouples linear and steering motions to simplify path planning and the controller design for targeting. We anticipate our platform being relatively simple to maintain compared to rigid robotic arms. Herein we also describe and experimentally verify the platform's kinematic model, including the prediction of the relationship between the steering angle and the angular positions of the three steering motors. Information from the model enables the position controller to guide the end effector to the targeted positions faster and with higher stability than without this information. Overall, our research show promise for using soft-growing robotic platforms in orchard operations. △ Less

Submitted 31 October, 2023; originally announced November 2023.

Comments: International Conference on Intelligent Robots and Systems (IROS) 2023

arXiv:2305.16846 [pdf, other]

Lagrangian Flow Networks for Conservation Laws

Authors: F. Arend Torres, Marcello Massimo Negri, Marco Inversi, Jonathan Aellen, Volker Roth

Abstract: We introduce Lagrangian Flow Networks (LFlows) for modeling fluid densities and velocities continuously in space and time. By construction, the proposed LFlows satisfy the continuity equation, a PDE describing mass conservation in its differentiable form. Our model is based on the insight that solutions to the continuity equation can be expressed as time-dependent density transformations via diffe… ▽ More We introduce Lagrangian Flow Networks (LFlows) for modeling fluid densities and velocities continuously in space and time. By construction, the proposed LFlows satisfy the continuity equation, a PDE describing mass conservation in its differentiable form. Our model is based on the insight that solutions to the continuity equation can be expressed as time-dependent density transformations via differentiable and invertible maps. This follows from classical theory of the existence and uniqueness of Lagrangian flows for smooth vector fields. Hence, we model fluid densities by transforming a base density with parameterized diffeomorphisms conditioned on time. The key benefit compared to methods relying on numerical ODE solvers or PINNs is that the analytic expression of the velocity is always consistent with changes in density. Furthermore, we require neither expensive numerical solvers, nor additional penalties to enforce the PDE. LFlows show higher predictive accuracy in density modeling tasks compared to competing models in 2D and 3D, while being computationally efficient. As a real-world application, we model bird migration based on sparse weather radar measurements. △ Less

Submitted 13 December, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

arXiv:2303.15604 [pdf, other]

HD-Bind: Encoding of Molecular Structure with Low Precision, Hyperdimensional Binary Representations

Authors: Derek Jones, Jonathan E. Allen, Xiaohua Zhang, Behnam Khaleghi, Jaeyoung Kang, Weihong Xu, Niema Moshiri, Tajana S. Rosing

Abstract: Publicly available collections of drug-like molecules have grown to comprise 10s of billions of possibilities in recent history due to advances in chemical synthesis. Traditional methods for identifying ``hit'' molecules from a large collection of potential drug-like candidates have relied on biophysical theory to compute approximations to the Gibbs free energy of the binding interaction between t… ▽ More Publicly available collections of drug-like molecules have grown to comprise 10s of billions of possibilities in recent history due to advances in chemical synthesis. Traditional methods for identifying ``hit'' molecules from a large collection of potential drug-like candidates have relied on biophysical theory to compute approximations to the Gibbs free energy of the binding interaction between the drug to its protein target. A major drawback of the approaches is that they require exceptional computing capabilities to consider for even relatively small collections of molecules. Hyperdimensional Computing (HDC) is a recently proposed learning paradigm that is able to leverage low-precision binary vector arithmetic to build efficient representations of the data that can be obtained without the need for gradient-based optimization approaches that are required in many conventional machine learning and deep learning approaches. This algorithmic simplicity allows for acceleration in hardware that has been previously demonstrated for a range of application areas. We consider existing HDC approaches for molecular property classification and introduce two novel encoding algorithms that leverage the extended connectivity fingerprint (ECFP) algorithm. We show that HDC-based inference methods are as much as 90 times more efficient than more complex representative machine learning methods and achieve an acceleration of nearly 9 orders of magnitude as compared to inference with molecular docking. We demonstrate multiple approaches for the encoding of molecular data for HDC and examine their relative performance on a range of challenging molecular property prediction and drug-protein binding classification tasks. Our work thus motivates further investigation into molecular representation learning to develop ultra-efficient pre-screening tools. △ Less

Submitted 27 March, 2023; originally announced March 2023.

arXiv:2302.06829 [pdf, other]

The Role of Semantic Parsing in Understanding Procedural Text

Authors: Hossein Rajaby Faghihi, Parisa Kordjamshidi, Choh Man Teng, James Allen

Abstract: In this paper, we investigate whether symbolic semantic representations, extracted from deep semantic parsers, can help reasoning over the states of involved entities in a procedural text. We consider a deep semantic parser~(TRIPS) and semantic role labeling as two sources of semantic parsing knowledge. First, we propose PROPOLIS, a symbolic parsing-based procedural reasoning framework. Second, we… ▽ More In this paper, we investigate whether symbolic semantic representations, extracted from deep semantic parsers, can help reasoning over the states of involved entities in a procedural text. We consider a deep semantic parser~(TRIPS) and semantic role labeling as two sources of semantic parsing knowledge. First, we propose PROPOLIS, a symbolic parsing-based procedural reasoning framework. Second, we integrate semantic parsing information into state-of-the-art neural models to conduct procedural reasoning. Our experiments indicate that explicitly incorporating such semantic knowledge improves procedural understanding. This paper presents new metrics for evaluating procedural reasoning tasks that clarify the challenges and identify differences among neural, symbolic, and integrated models. △ Less

Submitted 17 May, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

Comments: 9 pages, Appected in EACL2023

arXiv:2212.03646 [pdf, other]

Towards Automatic Cetacean Photo-Identification: A Framework for Fine-Grain, Few-Shot Learning in Marine Ecology

Authors: Cameron Trotter, Nick Wright, A. Stephen McGough, Matt Sharpe, Barbara Cheney, Mònica Arso Civil, Reny Tyson Moore, Jason Allen, Per Berggren

Abstract: Photo-identification (photo-id) is one of the main non-invasive capture-recapture methods utilised by marine researchers for monitoring cetacean (dolphin, whale, and porpoise) populations. This method has historically been performed manually resulting in high workload and cost due to the vast number of images collected. Recently automated aids have been developed to help speed-up photo-id, althoug… ▽ More Photo-identification (photo-id) is one of the main non-invasive capture-recapture methods utilised by marine researchers for monitoring cetacean (dolphin, whale, and porpoise) populations. This method has historically been performed manually resulting in high workload and cost due to the vast number of images collected. Recently automated aids have been developed to help speed-up photo-id, although they are often disjoint in their processing and do not utilise all available identifying information. Work presented in this paper aims to create a fully automatic photo-id aid capable of providing most likely matches based on all available information without the need for data pre-processing such as cropping. This is achieved through a pipeline of computer vision models and post-processing techniques aimed at detecting cetaceans in unedited field imagery before passing them downstream for individual level catalogue matching. The system is capable of handling previously uncatalogued individuals and flagging these for investigation thanks to catalogue similarity comparison. We evaluate the system against multiple real-life photo-id catalogues, achieving mAP@IOU[0.5] = 0.91, 0.96 for the task of dorsal fin detection on catalogues from Tanzania and the UK respectively and 83.1, 97.5% top-10 accuracy for the task of individual classification on catalogues from the UK and USA. △ Less

Submitted 7 December, 2022; originally announced December 2022.

Comments: 8 pages, 8 figures, 3 tables. Submitted and accepted to IEEE Big Data 2022 Conference

arXiv:2210.17043 [pdf, other]

Evaluating Point-Prediction Uncertainties in Neural Networks for Drug Discovery

Authors: Ya Ju Fan, Jonathan E. Allen, Kevin S. McLoughlin, Da Shi, Brian J. Bennion, Xiaohua Zhang, Felice C. Lightstone

Abstract: Neural Network (NN) models provide potential to speed up the drug discovery process and reduce its failure rates. The success of NN models require uncertainty quantification (UQ) as drug discovery explores chemical space beyond the training data distribution. Standard NN models do not provide uncertainty information. Methods that combine Bayesian models with NN models address this issue, but are d… ▽ More Neural Network (NN) models provide potential to speed up the drug discovery process and reduce its failure rates. The success of NN models require uncertainty quantification (UQ) as drug discovery explores chemical space beyond the training data distribution. Standard NN models do not provide uncertainty information. Methods that combine Bayesian models with NN models address this issue, but are difficult to implement and more expensive to train. Some methods require changing the NN architecture or training procedure, limiting the selection of NN models. Moreover, predictive uncertainty can come from different sources. It is important to have the ability to separately model different types of predictive uncertainty, as the model can take assorted actions depending on the source of uncertainty. In this paper, we examine UQ methods that estimate different sources of predictive uncertainty for NN models aiming at drug discovery. We use our prior knowledge on chemical compounds to design the experiments. By utilizing a visualization method we create non-overlapping and chemically diverse partitions from a collection of chemical compounds. These partitions are used as training and test set splits to explore NN model uncertainty. We demonstrate how the uncertainties estimated by the selected methods describe different sources of uncertainty under different partitions and featurization schemes and the relationship to prediction error. △ Less

Submitted 30 October, 2022; originally announced October 2022.

arXiv:2207.14709 [pdf, other]

Robust Quantitative Susceptibility Mapping via Approximate Message Passing with Parameter Estimation

Authors: Shuai Huang, James J. Lah, Jason W. Allen, Deqiang Qiu

Abstract: Purpose: For quantitative susceptibility mapping (QSM), the lack of ground-truth in clinical settings makes it challenging to determine suitable parameters for the dipole inversion. We propose a probabilistic Bayesian approach for QSM with built-in parameter estimation, and incorporate the nonlinear formulation of the dipole inversion to achieve a robust recovery of the susceptibility maps. Theo… ▽ More Purpose: For quantitative susceptibility mapping (QSM), the lack of ground-truth in clinical settings makes it challenging to determine suitable parameters for the dipole inversion. We propose a probabilistic Bayesian approach for QSM with built-in parameter estimation, and incorporate the nonlinear formulation of the dipole inversion to achieve a robust recovery of the susceptibility maps. Theory: From a Bayesian perspective, the image wavelet coefficients are approximately sparse and modelled by the Laplace distribution. The measurement noise is modelled by a Gaussian-mixture distribution with two components, where the second component is used to model the noise outliers. Through probabilistic inference, the susceptibility map and distribution parameters can be jointly recovered using approximate message passing (AMP). Methods: We compare our proposed AMP with built-in parameter estimation (AMP-PE) to the state-of-the-art L1-QSM, FANSI and MEDI approaches on the simulated and in vivo datasets, and perform experiments to explore the optimal settings of AMP-PE. Reproducible code is available at https://github.com/EmoryCN2L/QSM_AMP_PE Results: On the simulated Sim2Snr1 dataset, AMP-PE achieved the lowest NRMSE, DFCM and the highest SSIM, while MEDI achieved the lowest HFEN. On the in vivo datasets, AMP-PE is robust and successfully recovers the susceptibility maps using the estimated parameters, whereas L1-QSM, FANSI and MEDI typically require additional visual fine-tuning to select or double-check working parameters. Conclusion: AMP-PE provides automatic and adaptive parameter estimation for QSM and avoids the subjectivity from the visual fine-tuning step, making it an excellent choice for the clinical setting. △ Less

Submitted 30 May, 2023; v1 submitted 29 July, 2022; originally announced July 2022.

Comments: Keywords: Approximate message passing, Compressive sensing, Outlier modelling, Parameter estimation, Quantitative susceptibility mapping

arXiv:2205.11566 [pdf, other]

doi 10.1103/PhysRevLett.132.077402

Compressing the chronology of a temporal network with graph commutators

Authors: Andrea J. Allen, Cristopher Moore, Laurent Hébert-Dufresne

Abstract: Studies of dynamics on temporal networks often represent the network as a series of "snapshots," static networks active for short durations of time. We argue that successive snapshots can be aggregated if doing so has little effect on the overlying dynamics. We propose a method to compress network chronologies by progressively combining pairs of snapshots whose matrix commutators have the smallest… ▽ More Studies of dynamics on temporal networks often represent the network as a series of "snapshots," static networks active for short durations of time. We argue that successive snapshots can be aggregated if doing so has little effect on the overlying dynamics. We propose a method to compress network chronologies by progressively combining pairs of snapshots whose matrix commutators have the smallest dynamical effect. We apply this method to epidemic modeling on real contact tracing data and find that it allows for significant compression while remaining faithful to the epidemic dynamics. △ Less

Submitted 29 March, 2024; v1 submitted 23 May, 2022; originally announced May 2022.

Journal ref: Phys. Rev. Lett. 132, 077402 (2024)

arXiv:2205.04321 [pdf, other]

Evaluating the Fairness Impact of Differentially Private Synthetic Data

Authors: Blake Bullwinkel, Kristen Grabarz, Lily Ke, Scarlett Gong, Chris Tanner, Joshua Allen

Abstract: Differentially private (DP) synthetic data is a promising approach to maximizing the utility of data containing sensitive information. Due to the suppression of underrepresented classes that is often required to achieve privacy, however, it may be in conflict with fairness. We evaluate four DP synthesizers and present empirical results indicating that three of these models frequently degrade fairn… ▽ More Differentially private (DP) synthetic data is a promising approach to maximizing the utility of data containing sensitive information. Due to the suppression of underrepresented classes that is often required to achieve privacy, however, it may be in conflict with fairness. We evaluate four DP synthesizers and present empirical results indicating that three of these models frequently degrade fairness outcomes on downstream binary classification tasks. We draw a connection between fairness and the proportion of minority groups present in the generated synthetic data, and find that training synthesizers on data that are pre-processed via a multi-label undersampling method can promote more fair outcomes without degrading accuracy. △ Less

Submitted 20 June, 2022; v1 submitted 9 May, 2022; originally announced May 2022.

arXiv:2204.12903 [pdf, other]

Spending Privacy Budget Fairly and Wisely

Authors: Lucas Rosenblatt, Joshua Allen, Julia Stoyanovich

Abstract: Differentially private (DP) synthetic data generation is a practical method for improving access to data as a means to encourage productive partnerships. One issue inherent to DP is that the "privacy budget" is generally "spent" evenly across features in the data set. This leads to good statistical parity with the real data, but can undervalue the conditional probabilities and marginals that are c… ▽ More Differentially private (DP) synthetic data generation is a practical method for improving access to data as a means to encourage productive partnerships. One issue inherent to DP is that the "privacy budget" is generally "spent" evenly across features in the data set. This leads to good statistical parity with the real data, but can undervalue the conditional probabilities and marginals that are critical for predictive quality of synthetic data. Further, loss of predictive quality may be non-uniform across the data set, with subsets that correspond to minority groups potentially suffering a higher loss. In this paper, we develop ensemble methods that distribute the privacy budget "wisely" to maximize predictive accuracy of models trained on DP data, and "fairly" to bound potential disparities in accuracy across groups and reduce inequality. Our methods are based on the insights that feature importance can inform how privacy budget is allocated, and, further, that per-group feature importance and fairness-related performance objectives can be incorporated in the allocation. These insights make our methods tunable to social contexts, allowing data owners to produce balanced synthetic data for predictive analysis. △ Less

Submitted 27 April, 2022; originally announced April 2022.

arXiv:2203.09986 [pdf, other]

GiNGR: Generalized Iterative Non-Rigid Point Cloud and Surface Registration Using Gaussian Process Regression

Authors: Dennis Madsen, Jonathan Aellen, Andreas Morel-Forster, Thomas Vetter, Marcel Lüthi

Abstract: In this paper, we unify popular non-rigid registration methods for point sets and surfaces under our general framework, GiNGR. GiNGR builds upon Gaussian Process Morphable Models (GPMM) and hence separates modeling the deformation prior from model adaptation for registration. In addition, it provides explainable hyperparameters, multi-resolution registration, trivial inclusion of expert annotation… ▽ More In this paper, we unify popular non-rigid registration methods for point sets and surfaces under our general framework, GiNGR. GiNGR builds upon Gaussian Process Morphable Models (GPMM) and hence separates modeling the deformation prior from model adaptation for registration. In addition, it provides explainable hyperparameters, multi-resolution registration, trivial inclusion of expert annotation, and the ability to use and combine analytical and statistical deformation priors. But more importantly, the reformulation allows for a direct comparison of registration methods. Instead of using a general solver in the optimization step, we show how Gaussian process regression (GPR) iteratively can warp a reference onto a target, leading to smooth deformations following the prior for any dense, sparse, or partial estimated correspondences in a principled way. We show how the popular CPD and ICP algorithms can be directly explained with GiNGR. Furthermore, we show how existing algorithms in the GiNGR framework can perform probabilistic registration to obtain a distribution of different registrations instead of a single best registration. This can be used to analyze the uncertainty e.g. when registering partial observations. GiNGR is publicly available and fully modular to allow for domain-specific prior construction. △ Less

Submitted 18 March, 2022; originally announced March 2022.

arXiv:2202.06692 [pdf, other]

TRIP: Trust-Limited Coercion-Resistant In-Person Voter Registration

Authors: Louis-Henri Merino, Simone Colombo, Rene Reyes, Alaleh Azhir, Haoqian Zhang, Jeff Allen, Bernhard Tellenbach, Vero Estrada-Galiñanes, Bryan Ford

Abstract: Remote electronic voting is convenient and flexible, but presents risks of coercion and vote buying. One promising mitigation strategy enables voters to give a coercer fake voting credentials, which silently cast votes that do not count. However, current proposals make problematic assumptions during credential issuance, such as relying on a trustworthy registrar, on trusted hardware, or on voters… ▽ More Remote electronic voting is convenient and flexible, but presents risks of coercion and vote buying. One promising mitigation strategy enables voters to give a coercer fake voting credentials, which silently cast votes that do not count. However, current proposals make problematic assumptions during credential issuance, such as relying on a trustworthy registrar, on trusted hardware, or on voters interacting with multiple registrars. We present TRIP, the first voter registration scheme that addresses these challenges by leveraging the physical security of in-person interaction. Voters use a kiosk in a privacy booth to print real and fake paper credentials, which appear indistinguishable to others. Voters interact with only one authority, need no trusted hardware during credential issuance, and need not trust the registrar except when actually under coercion. For verifiability, each credential includes an interactive zero-knowledge proof, which is sound in real credentials and unsound in fake credentials. Voters learn the difference by observing the order of printing steps, and need not understand the technical details. We prove formally that TRIP satisfies coercion-resistance and verifiability. In a user study with 150 participants, 83% successfully used TRIP. △ Less

Submitted 17 March, 2024; v1 submitted 14 February, 2022; originally announced February 2022.

Comments: 21 pages

arXiv:2112.08692 [pdf, other]

Lacuna Reconstruction: Self-supervised Pre-training for Low-Resource Historical Document Transcription

Authors: Nikolai Vogler, Jonathan Parkes Allen, Matthew Thomas Miller, Taylor Berg-Kirkpatrick

Abstract: We present a self-supervised pre-training approach for learning rich visual language representations for both handwritten and printed historical document transcription. After supervised fine-tuning of our pre-trained encoder representations for low-resource document transcription on two languages, (1) a heterogeneous set of handwritten Islamicate manuscript images and (2) early modern English prin… ▽ More We present a self-supervised pre-training approach for learning rich visual language representations for both handwritten and printed historical document transcription. After supervised fine-tuning of our pre-trained encoder representations for low-resource document transcription on two languages, (1) a heterogeneous set of handwritten Islamicate manuscript images and (2) early modern English printed documents, we show a meaningful improvement in recognition accuracy over the same supervised model trained from scratch with as few as 30 line image transcriptions for training. Our masked language model-style pre-training strategy, where the model is trained to be able to identify the true masked visual representation from distractors sampled from within the same line, encourages learning robust contextualized language representations invariant to scribal writing style and printing noise present across documents. △ Less

Submitted 16 December, 2021; originally announced December 2021.

arXiv:2111.07407 [pdf, other]

A Machine Learning Approach for Recruitment Prediction in Clinical Trial Design

Authors: Jingshu Liu, Patricia J Allen, Luke Benz, Daniel Blickstein, Evon Okidi, Xiao Shi

Abstract: Significant advancements have been made in recent years to optimize patient recruitment for clinical trials, however, improved methods for patient recruitment prediction are needed to support trial site selection and to estimate appropriate enrollment timelines in the trial design stage. In this paper, using data from thousands of historical clinical trials, we explore machine learning methods to… ▽ More Significant advancements have been made in recent years to optimize patient recruitment for clinical trials, however, improved methods for patient recruitment prediction are needed to support trial site selection and to estimate appropriate enrollment timelines in the trial design stage. In this paper, using data from thousands of historical clinical trials, we explore machine learning methods to predict the number of patients enrolled per month at a clinical trial site over the course of a trial's enrollment duration. We show that these methods can reduce the error that is observed with current industry standards and propose opportunities for further improvement. △ Less

Submitted 14 November, 2021; originally announced November 2021.

Comments: Machine Learning for Health (ML4H) - Extended Abstract

arXiv:2109.03328 [pdf]

Predicting Process Name from Network Data

Authors: Justin Allen, David Knapp, Kristine Monteith

Abstract: The ability to identify applications based on the network data they generate could be a valuable tool for cyber defense. We report on a machine learning technique capable of using netflow-like features to predict the application that generated the traffic. In our experiments, we used ground-truth labels obtained from host-based sensors deployed in a large enterprise environment; we applied random… ▽ More The ability to identify applications based on the network data they generate could be a valuable tool for cyber defense. We report on a machine learning technique capable of using netflow-like features to predict the application that generated the traffic. In our experiments, we used ground-truth labels obtained from host-based sensors deployed in a large enterprise environment; we applied random forests and multilayer perceptrons to the tasks of browser vs. non-browser identification, browser fingerprinting, and process name prediction. For each of these tasks, we demonstrate how machine learning models can achieve high classification accuracy using only netflow-like features as the basis for classification. △ Less

Submitted 3 September, 2021; originally announced September 2021.

Comments: Presented at 1st International Workshop on Adaptive Cyber Defense, 2021 (arXiv:2108.08476)

Report number: IJCAI-ACD/2021/104

arXiv:2105.11538 [pdf]

doi 10.1108/JSBED-08-2015-0110

The power of reciprocal knowledge sharing relationships for startup success

Authors: T. J. Allen, P. Gloor, A. Fronzetti Colladon, S. L. Woerner, O. Raz

Abstract: Purpose: The purpose of this paper is to examine the innovative capabilities of biotech start-ups in relation to geographic proximity and knowledge sharing interaction in the R&D network of a major high-tech cluster. Design-methodology-approach: This study compares longitudinal informal communication networks of researchers at biotech start-ups with company patent applications in subsequent year… ▽ More Purpose: The purpose of this paper is to examine the innovative capabilities of biotech start-ups in relation to geographic proximity and knowledge sharing interaction in the R&D network of a major high-tech cluster. Design-methodology-approach: This study compares longitudinal informal communication networks of researchers at biotech start-ups with company patent applications in subsequent years. For a year, senior R&D staff members from over 70 biotech firms located in the Boston biotech cluster were polled and communication information about interaction with peers, universities and big pharmaceutical companies was collected, as well as their geolocation tags. Findings: Location influences the amount of communication between firms, but not their innovation success. Rather, what matters is communication intensity and recollection by others. In particular, there is evidence that rotating leadership - changing between a more active and passive communication style - is a predictor of innovative performance. Practical implications: Expensive real-estate investments can be replaced by maintaining social ties. A more dynamic communication style and more diverse social ties are beneficial to innovation. Originality-value: Compared to earlier work that has shown a connection between location, network and firm performance, this paper offers a more differentiated view; including a novel measure of communication style, using a unique data set and providing new insights for firms who want to shape their communication patterns to improve innovation, independently of their location. △ Less

Submitted 20 May, 2021; originally announced May 2021.

ACM Class: J.4

Journal ref: Journal of Small Business and Enterprise Development 23(3), 636-651 (2016)

arXiv:2105.07140 [pdf, other]

NeuroGen: activation optimized image synthesis for discovery neuroscience

Authors: Zijin Gu, Keith W. Jamison, Meenakshi Khosla, Emily J. Allen, Yihan Wu, Thomas Naselaris, Kendrick Kay, Mert R. Sabuncu, Amy Kuceyeski

Abstract: Functional MRI (fMRI) is a powerful technique that has allowed us to characterize visual cortex responses to stimuli, yet such experiments are by nature constructed based on a priori hypotheses, limited to the set of images presented to the individual while they are in the scanner, are subject to noise in the observed brain responses, and may vary widely across individuals. In this work, we propos… ▽ More Functional MRI (fMRI) is a powerful technique that has allowed us to characterize visual cortex responses to stimuli, yet such experiments are by nature constructed based on a priori hypotheses, limited to the set of images presented to the individual while they are in the scanner, are subject to noise in the observed brain responses, and may vary widely across individuals. In this work, we propose a novel computational strategy, which we call NeuroGen, to overcome these limitations and develop a powerful tool for human vision neuroscience discovery. NeuroGen combines an fMRI-trained neural encoding model of human vision with a deep generative network to synthesize images predicted to achieve a target pattern of macro-scale brain activation. We demonstrate that the reduction of noise that the encoding model provides, coupled with the generative network's ability to produce images of high fidelity, results in a robust discovery architecture for visual neuroscience. By using only a small number of synthetic images created by NeuroGen, we demonstrate that we can detect and amplify differences in regional and individual human brain response patterns to visual stimuli. We then verify that these discoveries are reflected in the several thousand observed image responses measured with fMRI. We further demonstrate that NeuroGen can create synthetic images predicted to achieve regional response patterns not achievable by the best-matching natural images. The NeuroGen framework extends the utility of brain encoding models and opens up a new avenue for exploring, and possibly precisely controlling, the human visual system. △ Less

Submitted 15 May, 2021; originally announced May 2021.

arXiv:2104.04547 [pdf, other]

High-Throughput Virtual Screening of Small Molecule Inhibitors for SARS-CoV-2 Protein Targets with Deep Fusion Models

Authors: Garrett A. Stevenson, Derek Jones, Hyojin Kim, W. F. Drew Bennett, Brian J. Bennion, Monica Borucki, Feliza Bourguet, Aidan Epstein, Magdalena Franco, Brooke Harmon, Stewart He, Max P. Katz, Daniel Kirshner, Victoria Lao, Edmond Y. Lau, Jacky Lo, Kevin McLoughlin, Richard Mosesso, Deepa K. Murugesh, Oscar A. Negrete, Edwin A. Saada, Brent Segelke, Maxwell Stefan, Marisa W. Torres, Dina Weilhammer , et al. (7 additional authors not shown)

Abstract: Structure-based Deep Fusion models were recently shown to outperform several physics- and machine learning-based protein-ligand binding affinity prediction methods. As part of a multi-institutional COVID-19 pandemic response, over 500 million small molecules were computationally screened against four protein structures from the novel coronavirus (SARS-CoV-2), which causes COVID-19. Three enhanceme… ▽ More Structure-based Deep Fusion models were recently shown to outperform several physics- and machine learning-based protein-ligand binding affinity prediction methods. As part of a multi-institutional COVID-19 pandemic response, over 500 million small molecules were computationally screened against four protein structures from the novel coronavirus (SARS-CoV-2), which causes COVID-19. Three enhancements to Deep Fusion were made in order to evaluate more than 5 billion docked poses on SARS-CoV-2 protein targets. First, the Deep Fusion concept was refined by formulating the architecture as one, coherently backpropagated model (Coherent Fusion) to improve binding-affinity prediction accuracy. Secondly, the model was trained using a distributed, genetic hyper-parameter optimization. Finally, a scalable, high-throughput screening capability was developed to maximize the number of ligands evaluated and expedite the path to experimental evaluation. In this work, we present both the methods developed for machine learning-based high-throughput screening and results from using our computational pipeline to find SARS-CoV-2 inhibitors. △ Less

Submitted 31 May, 2021; v1 submitted 9 April, 2021; originally announced April 2021.

arXiv:2103.14035 [pdf, other]

U.S. Broadband Coverage Data Set: A Differentially Private Data Release

Authors: Mayana Pereira, Allen Kim, Joshua Allen, Kevin White, Juan Lavista Ferres, Rahul Dodhia

Abstract: Broadband connectivity is a key metric in today's economy. In an era of rapid expansion of the digital economy, it directly impacts GDP. Furthermore, with the COVID-19 guidelines of social distancing, internet connectivity became necessary to everyday activities such as work, learning, and staying in touch with family and friends. This paper introduces a publicly available U.S. Broadband Coverage… ▽ More Broadband connectivity is a key metric in today's economy. In an era of rapid expansion of the digital economy, it directly impacts GDP. Furthermore, with the COVID-19 guidelines of social distancing, internet connectivity became necessary to everyday activities such as work, learning, and staying in touch with family and friends. This paper introduces a publicly available U.S. Broadband Coverage data set that reports broadband coverage percentages at a zip code-level. We also explain how we used differential privacy to guarantee that the privacy of individual households is preserved. Our data set also contains error ranges estimates, providing information on the expected error introduced by differential privacy per zip code. We describe our error range calculation method and show that this additional data metric does not induce any privacy losses. △ Less

Submitted 1 April, 2021; v1 submitted 24 March, 2021; originally announced March 2021.

arXiv:2011.05537 [pdf, other]

Differentially Private Synthetic Data: Applied Evaluations and Enhancements

Authors: Lucas Rosenblatt, Xiaoyan Liu, Samira Pouyanfar, Eduardo de Leon, Anuj Desai, Joshua Allen

Abstract: Machine learning practitioners frequently seek to leverage the most informative available data, without violating the data owner's privacy, when building predictive models. Differentially private data synthesis protects personal details from exposure, and allows for the training of differentially private machine learning models on privately generated datasets. But how can we effectively assess the… ▽ More Machine learning practitioners frequently seek to leverage the most informative available data, without violating the data owner's privacy, when building predictive models. Differentially private data synthesis protects personal details from exposure, and allows for the training of differentially private machine learning models on privately generated datasets. But how can we effectively assess the efficacy of differentially private synthetic data? In this paper, we survey four differentially private generative adversarial networks for data synthesis. We evaluate each of them at scale on five standard tabular datasets, and in two applied industry scenarios. We benchmark with novel metrics from recent literature and other standard machine learning tools. Our results suggest some synthesizers are more applicable for different privacy budgets, and we further demonstrate complicating domain-based tradeoffs in selecting an approach. We offer experimental learning on applied machine learning scenarios with private internal data to researchers and practioners alike. In addition, we propose QUAIL, an ensemble-based modeling approach to generating synthetic data. We examine QUAIL's tradeoffs, and note circumstances in which it outperforms baseline differentially private supervised learning models under the same budget constraint. △ Less

Submitted 10 November, 2020; originally announced November 2020.

Comments: Under Review

arXiv:2009.10861 [pdf, other]

doi 10.1109/IRI49571.2020.00021

Distributed Differentially Private Mutual Information Ranking and Its Applications

Authors: Ankit Srivastava, Samira Pouyanfar, Joshua Allen, Ken Johnston, Qida Ma

Abstract: Computation of Mutual Information (MI) helps understand the amount of information shared between a pair of random variables. Automated feature selection techniques based on MI ranking are regularly used to extract information from sensitive datasets exceeding petabytes in size, over millions of features and classes. Series of one-vs-all MI computations can be cascaded to produce n-fold MI results,… ▽ More Computation of Mutual Information (MI) helps understand the amount of information shared between a pair of random variables. Automated feature selection techniques based on MI ranking are regularly used to extract information from sensitive datasets exceeding petabytes in size, over millions of features and classes. Series of one-vs-all MI computations can be cascaded to produce n-fold MI results, rapidly pinpointing informative relationships. This ability to quickly pinpoint the most informative relationships from datasets of billions of users creates privacy concerns. In this paper, we present Distributed Differentially Private Mutual Information (DDP-MI), a privacy-safe fast batch MI, across various scenarios such as feature selection, segmentation, ranking, and query expansion. This distributed implementation is protected with global model differential privacy to provide strong assurances against a wide range of privacy attacks. We also show that our DDP-MI can substantially improve the efficiency of MI calculations compared to standard implementations on a large-scale public dataset. △ Less

Submitted 22 September, 2020; originally announced September 2020.

Journal ref: 2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI) (pp. 90-96)

arXiv:2008.01806 [pdf, other]

Fast Nonconvex $T_2^*$ Mapping Using ADMM

Authors: Shuai Huang, James J. Lah, Jason W. Allen, Deqiang Qiu

Abstract: Magnetic resonance (MR)-$T_2^*$ mapping is widely used to study hemorrhage, calcification and iron deposition in various clinical applications, it provides a direct and precise mapping of desired contrast in the tissue. However, the long acquisition time required by conventional 3D high-resolution $T_2^*$ mapping method causes discomfort to patients and introduces motion artifacts to reconstructed… ▽ More Magnetic resonance (MR)-$T_2^*$ mapping is widely used to study hemorrhage, calcification and iron deposition in various clinical applications, it provides a direct and precise mapping of desired contrast in the tissue. However, the long acquisition time required by conventional 3D high-resolution $T_2^*$ mapping method causes discomfort to patients and introduces motion artifacts to reconstructed images, which limits its wider applicability. In this paper we address this issue by performing $T_2^*$ mapping from undersampled data using compressive sensing (CS). We formulate the reconstruction as a nonconvex problem that can be decomposed into two subproblems. They can be solved either separately via the standard approach or jointly via the alternating direction method of multipliers (ADMM). Compared to previous CS-based approaches that only apply sparse regularization on the spin density $\boldsymbol X_0$ and the relaxation rate $\boldsymbol R_2^*$, our formulation enforces additional sparse priors on the $T_2^*$-weighted images at multiple echoes to improve the reconstruction performance. We performed convergence analysis of the proposed algorithm, evaluated its performance on in vivo data, and studied the effects of different sampling schemes. Experimental results showed that the proposed joint-recovery approach generally outperforms the state-of-the-art method, especially in the low-sampling rate regime, making it a preferred choice to perform fast 3D $T_2^*$ mapping in practice. The framework adopted in this work can be easily extended to other problems arising from MR or other imaging modalities with non-linearly coupled variables. △ Less

Submitted 4 August, 2020; originally announced August 2020.

arXiv:2007.12327 [pdf, other]

Stochastic Dynamic Information Flow Tracking Game using Supervised Learning for Detecting Advanced Persistent Threats

Authors: Shana Moothedath, Dinuka Sahabandu, Joey Allen, Linda Bushnell, Wenke Lee, Radha Poovendran

Abstract: Advanced persistent threats (APTs) are organized prolonged cyberattacks by sophisticated attackers. Although APT activities are stealthy, they interact with the system components and these interactions lead to information flows. Dynamic Information Flow Tracking (DIFT) has been proposed as one of the effective ways to detect APTs using the information flows. However, wide range security analysis u… ▽ More Advanced persistent threats (APTs) are organized prolonged cyberattacks by sophisticated attackers. Although APT activities are stealthy, they interact with the system components and these interactions lead to information flows. Dynamic Information Flow Tracking (DIFT) has been proposed as one of the effective ways to detect APTs using the information flows. However, wide range security analysis using DIFT results in a significant increase in performance overhead and high rates of false-positives and false-negatives generated by DIFT. In this paper, we model the strategic interaction between APT and DIFT as a non-cooperative stochastic game. The game unfolds on a state space constructed from an information flow graph (IFG) that is extracted from the system log. The objective of the APT in the game is to choose transitions in the IFG to find an optimal path in the IFG from an entry point of the attack to an attack target. On the other hand, the objective of DIFT is to dynamically select nodes in the IFG to perform security analysis for detecting APT. Our game model has imperfect information as the players do not have information about the actions of the opponent. We consider two scenarios of the game (i) when the false-positive and false-negative rates are known to both players and (ii) when the false-positive and false-negative rates are unknown to both players. Case (i) translates to a game model with complete information and we propose a value iteration-based algorithm and prove the convergence. Case (ii) translates to a game with unknown transition probabilities. In this case, we propose Hierarchical Supervised Learning (HSL) algorithm that integrates a neural network, to predict the value vector of the game, with a policy iteration algorithm to compute an approximate equilibrium. We implemented our algorithms on real attack datasets and validated the performance of our approach. △ Less

Submitted 25 June, 2021; v1 submitted 23 July, 2020; originally announced July 2020.

arXiv:2007.02670 [pdf]

A Broad-Coverage Deep Semantic Lexicon for Verbs

Authors: James Allen, Hannah An, Ritwik Bose, Will de Beaumont, Choh Man Teng

Abstract: Progress on deep language understanding is inhibited by the lack of a broad coverage lexicon that connects linguistic behavior to ontological concepts and axioms. We have developed COLLIE-V, a deep lexical resource for verbs, with the coverage of WordNet and syntactic and semantic details that meet or exceed existing resources. Bootstrapping from a hand-built lexicon and ontology, new ontological… ▽ More Progress on deep language understanding is inhibited by the lack of a broad coverage lexicon that connects linguistic behavior to ontological concepts and axioms. We have developed COLLIE-V, a deep lexical resource for verbs, with the coverage of WordNet and syntactic and semantic details that meet or exceed existing resources. Bootstrapping from a hand-built lexicon and ontology, new ontological concepts and lexical entries, together with semantic role preferences and entailment axioms, are automatically derived by combining multiple constraints from parsing dictionary definitions and examples. We evaluated the accuracy of the technique along a number of different dimensions and were able to obtain high accuracy in deriving new concepts and lexical entries. COLLIE-V is publicly available. △ Less

Submitted 6 July, 2020; originally announced July 2020.

Comments: Draft of LREC-2020 paper. Proceedings of The 12th Language Resources and Evaluation Conference. 2020

ACM Class: I.2.7

arXiv:2007.00076 [pdf, other]

A Reinforcement Learning Approach for Dynamic Information Flow Tracking Games for Detecting Advanced Persistent Threats

Authors: Dinuka Sahabandu, Shana Moothedath, Joey Allen, Linda Bushnell, Wenke Lee, Radha Poovendran

Abstract: Advanced Persistent Threats (APTs) are stealthy attacks that threaten the security and privacy of sensitive information. Interactions of APTs with victim system introduce information flows that are recorded in the system logs. Dynamic Information Flow Tracking (DIFT) is a promising detection mechanism for detecting APTs. DIFT taints information flows originating at system entities that are suscept… ▽ More Advanced Persistent Threats (APTs) are stealthy attacks that threaten the security and privacy of sensitive information. Interactions of APTs with victim system introduce information flows that are recorded in the system logs. Dynamic Information Flow Tracking (DIFT) is a promising detection mechanism for detecting APTs. DIFT taints information flows originating at system entities that are susceptible to an attack, tracks the propagation of the tainted flows, and authenticates the tainted flows at certain system components according to a pre-defined security policy. Deployment of DIFT to defend against APTs in cyber systems is limited by the heavy resource and performance overhead associated with DIFT. In this paper, we propose a resource-efficient model for DIFT by incorporating the security costs, false-positives, and false-negatives associated with DIFT. Specifically, we develop a game-theoretic framework and provide an analytical model of DIFT that enables the study of trade-off between resource efficiency and the effectiveness of detection. Our game model is a nonzero-sum, infinite-horizon, average reward stochastic game. Our model incorporates the information asymmetry between players that arises from DIFT's inability to distinguish malicious flows from benign flows and APT's inability to know the locations where DIFT performs a security analysis. Additionally, the game has incomplete information as the transition probabilities (false-positive and false-negative rates) are unknown. We propose a multiple-time scale stochastic approximation algorithm to learn an equilibrium solution of the game. We prove that our algorithm converges to an average reward Nash equilibrium. We evaluate our proposed model and algorithm on a real-world ransomware dataset and validate the effectiveness of the proposed approach. △ Less

Submitted 28 June, 2021; v1 submitted 30 June, 2020; originally announced July 2020.

Comments: 15

arXiv:2006.15942 [pdf, other]

Hinting Semantic Parsing with Statistical Word Sense Disambiguation

Authors: Ritwik Bose, Siddharth Vashishtha, James Allen

Abstract: The task of Semantic Parsing can be approximated as a transformation of an utterance into a logical form graph where edges represent semantic roles and nodes represent word senses. The resulting representation should be capture the meaning of the utterance and be suitable for reasoning. Word senses and semantic roles are interdependent, meaning errors in assigning word senses can cause errors in a… ▽ More The task of Semantic Parsing can be approximated as a transformation of an utterance into a logical form graph where edges represent semantic roles and nodes represent word senses. The resulting representation should be capture the meaning of the utterance and be suitable for reasoning. Word senses and semantic roles are interdependent, meaning errors in assigning word senses can cause errors in assigning semantic roles and vice versa. While statistical approaches to word sense disambiguation outperform logical, rule-based semantic parsers for raw word sense assignment, these statistical word sense disambiguation systems do not produce the rich role structure or detailed semantic representation of the input. In this work, we provide hints from a statistical WSD system to guide a logical semantic parser to produce better semantic type assignments while maintaining the soundness of the resulting logical forms. We observe an improvement of up to 10.5% in F-score, however we find that this improvement comes at a cost to the structural integrity of the parse △ Less

Submitted 6 July, 2020; v1 submitted 29 June, 2020; originally announced June 2020.

Comments: Longer version of AAAI2020 student abstract

ACM Class: I.2.7

arXiv:2006.12327 [pdf, other]

Dynamic Information Flow Tracking for Detection of Advanced Persistent Threats: A Stochastic Game Approach

Authors: Shana Moothedath, Dinuka Sahabandu, Joey Allen, Andrew Clark, Linda Bushnell, Wenke Lee, Radha Poovendran

Abstract: Advanced Persistent Threats (APTs) are stealthy customized attacks by intelligent adversaries. This paper deals with the detection of APTs that infiltrate cyber systems and compromise specifically targeted data and/or infrastructures. Dynamic information flow tracking is an information trace-based detection mechanism against APTs that taints suspicious information flows in the system and generates… ▽ More Advanced Persistent Threats (APTs) are stealthy customized attacks by intelligent adversaries. This paper deals with the detection of APTs that infiltrate cyber systems and compromise specifically targeted data and/or infrastructures. Dynamic information flow tracking is an information trace-based detection mechanism against APTs that taints suspicious information flows in the system and generates security analysis for unauthorized use of tainted data. In this paper, we develop an analytical model for resource-efficient detection of APTs using an information flow tracking game. The game is a nonzero-sum, turn-based, stochastic game with asymmetric information as the defender cannot distinguish whether an incoming flow is malicious or benign and hence has only partial state observation. We analyze equilibrium of the game and prove that a Nash equilibrium is given by a solution to the minimum capacity cut set problem on a flow-network derived from the system, where the edge capacities are obtained from the cost of performing security analysis. Finally, we implement our algorithm on the real-world dataset for a data exfiltration attack augmented with false-negative and false-positive rates and compute an optimal defender strategy. △ Less

Submitted 25 June, 2021; v1 submitted 22 June, 2020; originally announced June 2020.

arXiv:2005.07704 [pdf, other]

Improved Protein-ligand Binding Affinity Prediction with Structure-Based Deep Fusion Inference

Authors: Derek Jones, Hyojin Kim, Xiaohua Zhang, Adam Zemla, Garrett Stevenson, William D. Bennett, Dan Kirshner, Sergio Wong, Felice Lightstone, Jonathan E. Allen

Abstract: Predicting accurate protein-ligand binding affinity is important in drug discovery but remains a challenge even with computationally expensive biophysics-based energy scoring methods and state-of-the-art deep learning approaches. Despite the recent advances in the deep convolutional and graph neural network based approaches, the model performance depends on the input data representation and suffer… ▽ More Predicting accurate protein-ligand binding affinity is important in drug discovery but remains a challenge even with computationally expensive biophysics-based energy scoring methods and state-of-the-art deep learning approaches. Despite the recent advances in the deep convolutional and graph neural network based approaches, the model performance depends on the input data representation and suffers from distinct limitations. It is natural to combine complementary features and their inference from the individual models for better predictions. We present fusion models to benefit from different feature representations of two neural network models to improve the binding affinity prediction. We demonstrate effectiveness of the proposed approach by performing experiments with the PDBBind 2016 dataset and its docking pose complexes. The results show that the proposed approach improves the overall prediction compared to the individual neural network models with greater computational efficiency than related biophysics based energy scoring functions. We also discuss the benefit of the proposed fusion inference with several example complexes. The software is made available as open source at https://github.com/llnl/fast. △ Less

Submitted 17 May, 2020; originally announced May 2020.

arXiv:2005.07225 [pdf, other]

SAGE: Sequential Attribute Generator for Analyzing Glioblastomas using Limited Dataset

Authors: Padmaja Jonnalagedda, Brent Weinberg, Jason Allen, Taejin L. Min, Shiv Bhanu, Bir Bhanu

Abstract: While deep learning approaches have shown remarkable performance in many imaging tasks, most of these methods rely on availability of large quantities of data. Medical image data, however, is scarce and fragmented. Generative Adversarial Networks (GANs) have recently been very effective in handling such datasets by generating more data. If the datasets are very small, however, GANs cannot learn th… ▽ More While deep learning approaches have shown remarkable performance in many imaging tasks, most of these methods rely on availability of large quantities of data. Medical image data, however, is scarce and fragmented. Generative Adversarial Networks (GANs) have recently been very effective in handling such datasets by generating more data. If the datasets are very small, however, GANs cannot learn the data distribution properly, resulting in less diverse or low-quality results. One such limited dataset is that for the concurrent gain of 19 and 20 chromosomes (19/20 co-gain), a mutation with positive prognostic value in Glioblastomas (GBM). In this paper, we detect imaging biomarkers for the mutation to streamline the extensive and invasive prognosis pipeline. Since this mutation is relatively rare, i.e. small dataset, we propose a novel generative framework - the Sequential Attribute GEnerator (SAGE), that generates detailed tumor imaging features while learning from a limited dataset. Experiments show that not only does SAGE generate high quality tumors when compared to standard Deep Convolutional GAN (DC-GAN) and Wasserstein GAN with Gradient Penalty (WGAN-GP), it also captures the imaging biomarkers accurately. △ Less

Submitted 3 June, 2022; v1 submitted 14 May, 2020; originally announced May 2020.

arXiv:1911.05211 [pdf, other]

AMPL: A Data-Driven Modeling Pipeline for Drug Discovery

Authors: Amanda J. Minnich, Kevin McLoughlin, Margaret Tse, Jason Deng, Andrew Weber, Neha Murad, Benjamin D. Madej, Bharath Ramsundar, Tom Rush, Stacie Calad-Thomson, Jim Brase, Jonathan E. Allen

Abstract: One of the key requirements for incorporating machine learning into the drug discovery process is complete reproducibility and traceability of the model building and evaluation process. With this in mind, we have developed an end-to-end modular and extensible software pipeline for building and sharing machine learning models that predict key pharma-relevant parameters. The ATOM Modeling PipeLine,… ▽ More One of the key requirements for incorporating machine learning into the drug discovery process is complete reproducibility and traceability of the model building and evaluation process. With this in mind, we have developed an end-to-end modular and extensible software pipeline for building and sharing machine learning models that predict key pharma-relevant parameters. The ATOM Modeling PipeLine, or AMPL, extends the functionality of the open source library DeepChem and supports an array of machine learning and molecular featurization tools. We have benchmarked AMPL on a large collection of pharmaceutical datasets covering a wide range of parameters. As a result of these comprehensive experiments, we have found that physicochemical descriptors and deep learning-based graph representations significantly outperform traditional fingerprints in the characterization of molecular features. We have also found that dataset size is directly correlated to prediction performance, and that single-task deep learning models only outperform shallow learners if there is sufficient data. Likewise, dataset size has a direct impact on model predictivity, independent of comprehensive hyperparameter model tuning. Our findings point to the need for public dataset integration or multi-task/transfer learning approaches. Lastly, we found that uncertainty quantification (UQ) analysis may help identify model error; however, efficacy of UQ to filter predictions varies considerably between datasets and featurization/model types. AMPL is open source and available for download at http://github.com/ATOMconsortium/AMPL. △ Less

Submitted 13 November, 2019; v1 submitted 12 November, 2019; originally announced November 2019.

arXiv:1909.09064 [pdf, other]

Human-In-The-Loop Learning of Qualitative Preference Models

Authors: Joseph Allen, Ahmed Moussa, Xudong Liu

Abstract: In this work, we present a novel human-in-the-loop framework to help the human user understand the decision making process that involves choosing preferred options. We focus on qualitative preference models over alternatives from combinatorial domains. This framework is interactive: the user provides her behavioral data to the framework, and the framework explains the learned model to the user. It… ▽ More In this work, we present a novel human-in-the-loop framework to help the human user understand the decision making process that involves choosing preferred options. We focus on qualitative preference models over alternatives from combinatorial domains. This framework is interactive: the user provides her behavioral data to the framework, and the framework explains the learned model to the user. It is iterative: the framework collects feedback on the learned model from the user and tries to improve it accordingly till the user terminates the iteration. In order to communicate the learned preference model to the user, we develop visualization of intuitive and explainable graphic models, such as lexicographic preference trees and forests, and conditional preference networks. To this end, we discuss key aspects of our framework for lexicographic preference models. △ Less

Submitted 19 September, 2019; originally announced September 2019.

Comments: Published in the Proceedings of the 32nd International Florida Artificial Intelligence Research Society Conference, 2019

arXiv:1901.11152 [pdf, other]

Distinguishing between Normal and Cancer Cells Using Autoencoder Node Saliency

Authors: Ya Ju Fan, Jonathan E. Allen, Sam Ade Jacobs, Brian C. Van Essen

Abstract: Gene expression profiles have been widely used to characterize patterns of cellular responses to diseases. As data becomes available, scalable learning toolkits become essential to processing large datasets using deep learning models to model complex biological processes. We present an autoencoder to capture nonlinear relationships recovered from gene expression profiles. The autoencoder is a nonl… ▽ More Gene expression profiles have been widely used to characterize patterns of cellular responses to diseases. As data becomes available, scalable learning toolkits become essential to processing large datasets using deep learning models to model complex biological processes. We present an autoencoder to capture nonlinear relationships recovered from gene expression profiles. The autoencoder is a nonlinear dimension reduction technique using an artificial neural network, which learns hidden representations of unlabeled data. We train the autoencoder on a large collection of tumor samples from the National Cancer Institute Genomic Data Commons, and obtain a generalized and unsupervised latent representation. We leverage a HPC-focused deep learning toolkit, Livermore Big Artificial Neural Network (LBANN) to efficiently parallelize the training algorithm, reducing computation times from several hours to a few minutes. With the trained autoencoder, we generate latent representations of a small dataset, containing pairs of normal and cancer cells of various tumor types. A novel measure called autoencoder node saliency (ANS) is introduced to identify the hidden nodes that best differentiate various pairs of cells. We compare our findings of the best classifying nodes with principal component analysis and the visualization of t-distributed stochastic neighbor embedding. We demonstrate that the autoencoder effectively extracts distinct gene features for multiple learning tasks in the dataset. △ Less

Submitted 30 January, 2019; originally announced January 2019.

Comments: Second Workshop on HPC Applications in Precision Medicine, June 2018

arXiv:1811.05622 [pdf, other]

A Game Theoretic Approach for Dynamic Information Flow Tracking to Detect Multi-Stage Advanced Persistent Threats

Authors: Shana Moothedath, Dinuka Sahabandu, Joey Allen, Andrew Clark, Linda Bushnell, Wenke Lee, Radha Poovendran

Abstract: Advanced Persistent Threats (APTs) infiltrate cyber systems and compromise specifically targeted data and/or resources through a sequence of stealthy attacks consisting of multiple stages. Dynamic information flow tracking has been proposed to detect APTs. In this paper, we develop a dynamic information flow tracking game for resource-efficient detection of APTs via multi-stage dynamic games. The… ▽ More Advanced Persistent Threats (APTs) infiltrate cyber systems and compromise specifically targeted data and/or resources through a sequence of stealthy attacks consisting of multiple stages. Dynamic information flow tracking has been proposed to detect APTs. In this paper, we develop a dynamic information flow tracking game for resource-efficient detection of APTs via multi-stage dynamic games. The game evolves on an information flow graph, whose nodes are processes and objects (e.g. file, network endpoints) in the system and the edges capture the interaction between different processes and objects. Each stage of the game has pre-specified targets which are characterized by a set of nodes of the graph and the goal of the APT is to evade detection and reach a target node of that stage. The goal of the defender is to maximize the detection probability while minimizing performance overhead on the system. The resource costs of the players are different and the information structure is asymmetric resulting in a nonzero-sum imperfect information game. We first calculate the best responses of the players and characterize the set of Nash equilibria for single stage attacks. Subsequently, we provide a polynomial-time algorithm to compute a correlated equilibrium for the multi-stage attack case. Finally, we experiment our model and algorithms on real-world nation state attack data obtained from Refinable Attack Investigation system. △ Less

Submitted 13 November, 2018; originally announced November 2018.

Comments: 16

arXiv:1807.00736 [pdf, other]

An Algorithmic Framework For Differentially Private Data Analysis on Trusted Processors

Authors: Joshua Allen, Bolin Ding, Janardhan Kulkarni, Harsha Nori, Olga Ohrimenko, Sergey Yekhanin

Abstract: Differential privacy has emerged as the main definition for private data analysis and machine learning. The {\em global} model of differential privacy, which assumes that users trust the data collector, provides strong privacy guarantees and introduces small errors in the output. In contrast, applications of differential privacy in commercial systems by Apple, Google, and Microsoft, use the {\em l… ▽ More Differential privacy has emerged as the main definition for private data analysis and machine learning. The {\em global} model of differential privacy, which assumes that users trust the data collector, provides strong privacy guarantees and introduces small errors in the output. In contrast, applications of differential privacy in commercial systems by Apple, Google, and Microsoft, use the {\em local model}. Here, users do not trust the data collector, and hence randomize their data before sending it to the data collector. Unfortunately, local model is too strong for several important applications and hence is limited in its applicability. In this work, we propose a framework based on trusted processors and a new definition of differential privacy called {\em Oblivious Differential Privacy}, which combines the best of both local and global models. The algorithms we design in this framework show interesting interplay of ideas from the streaming algorithms, oblivious algorithms, and differential privacy. △ Less

Submitted 26 October, 2019; v1 submitted 2 July, 2018; originally announced July 2018.

Comments: Accepted at NeurIPS 2019

arXiv:1807.00412 [pdf, other]

Learning to Drive in a Day

Authors: Alex Kendall, Jeffrey Hawke, David Janz, Przemyslaw Mazur, Daniele Reda, John-Mark Allen, Vinh-Dieu Lam, Alex Bewley, Amar Shah

Abstract: We demonstrate the first application of deep reinforcement learning to autonomous driving. From randomly initialised parameters, our model is able to learn a policy for lane following in a handful of training episodes using a single monocular image as input. We provide a general and easy to obtain reward: the distance travelled by the vehicle without the safety driver taking control. We use a cont… ▽ More We demonstrate the first application of deep reinforcement learning to autonomous driving. From randomly initialised parameters, our model is able to learn a policy for lane following in a handful of training episodes using a single monocular image as input. We provide a general and easy to obtain reward: the distance travelled by the vehicle without the safety driver taking control. We use a continuous, model-free deep reinforcement learning algorithm, with all exploration and optimisation performed on-vehicle. This demonstrates a new framework for autonomous driving which moves away from reliance on defined logical rules, mapping, and direct supervision. We discuss the challenges and opportunities to scale this approach to a broader range of autonomous driving tasks. △ Less

Submitted 11 September, 2018; v1 submitted 1 July, 2018; originally announced July 2018.

Comments: Further results and demo videos can be viewed at: https://wayve.ai/blog/l2diad

arXiv:1803.09027 [pdf, other]

Comparing Population Means under Local Differential Privacy: with Significance and Power

Authors: Bolin Ding, Harsha Nori, Paul Li, Joshua Allen

Abstract: A statistical hypothesis test determines whether a hypothesis should be rejected based on samples from populations. In particular, randomized controlled experiments (or A/B testing) that compare population means using, e.g., t-tests, have been widely deployed in technology companies to aid in making data-driven decisions. Samples used in these tests are collected from users and may contain sensiti… ▽ More A statistical hypothesis test determines whether a hypothesis should be rejected based on samples from populations. In particular, randomized controlled experiments (or A/B testing) that compare population means using, e.g., t-tests, have been widely deployed in technology companies to aid in making data-driven decisions. Samples used in these tests are collected from users and may contain sensitive information. Both the data collection and the testing process may compromise individuals' privacy. In this paper, we study how to conduct hypothesis tests to compare population means while preserving privacy. We use the notation of local differential privacy (LDP), which has recently emerged as the main tool to ensure each individual's privacy without the need of a trusted data collector. We propose LDP tests that inject noise into every user's data in the samples before collecting them (so users do not need to trust the data collector), and draw conclusions with bounded type-I (significance level) and type-II errors (1 - power). Our approaches can be extended to the scenario where some users require LDP while some are willing to provide exact data. We report experimental results on real-world datasets to verify the effectiveness of our approaches. △ Less

Submitted 23 March, 2018; originally announced March 2018.

Comments: Full version of an AAAI 2018 conference paper

arXiv:1708.07785 [pdf, other]

Integral Curvature Representation and Matching Algorithms for Identification of Dolphins and Whales

Authors: Hendrik J. Weideman, Zachary M. Jablons, Jason Holmberg, Kiirsten Flynn, John Calambokidis, Reny B. Tyson, Jason B. Allen, Randall S. Wells, Krista Hupman, Kim Urian, Charles V. Stewart

Abstract: We address the problem of identifying individual cetaceans from images showing the trailing edge of their fins. Given the trailing edge from an unknown individual, we produce a ranking of known individuals from a database. The nicks and notches along the trailing edge define an individual's unique signature. We define a representation based on integral curvature that is robust to changes in viewpo… ▽ More We address the problem of identifying individual cetaceans from images showing the trailing edge of their fins. Given the trailing edge from an unknown individual, we produce a ranking of known individuals from a database. The nicks and notches along the trailing edge define an individual's unique signature. We define a representation based on integral curvature that is robust to changes in viewpoint and pose, and captures the pattern of nicks and notches in a local neighborhood at multiple scales. We explore two ranking methods that use this representation. The first uses a dynamic programming time-warping algorithm to align two representations, and interprets the alignment cost as a measure of similarity. This algorithm also exploits learned spatial weights to downweight matches from regions of unstable curvature. The second interprets the representation as a feature descriptor. Feature keypoints are defined at the local extrema of the representation. Descriptors for the set of known individuals are stored in a tree structure, which allows us to perform queries given the descriptors from an unknown trailing edge. We evaluate the top-k accuracy on two real-world datasets to demonstrate the effectiveness of the curvature representation, achieving top-1 accuracy scores of approximately 95% and 80% for bottlenose dolphins and humpback whales, respectively. △ Less

Submitted 25 August, 2017; originally announced August 2017.

Comments: To appear in ICCV 2017 First Workshop on Visual Wildlife Monitoring

arXiv:1604.01696 [pdf, other]

A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories

Authors: Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, James Allen

Abstract: Representation and learning of commonsense knowledge is one of the foundational problems in the quest to enable deep language understanding. This issue is particularly challenging for understanding casual and correlational relationships between events. While this topic has received a lot of interest in the NLP community, research has been hindered by the lack of a proper evaluation framework. This… ▽ More Representation and learning of commonsense knowledge is one of the foundational problems in the quest to enable deep language understanding. This issue is particularly challenging for understanding casual and correlational relationships between events. While this topic has received a lot of interest in the NLP community, research has been hindered by the lack of a proper evaluation framework. This paper attempts to address this problem with a new framework for evaluating story understanding and script learning: the 'Story Cloze Test'. This test requires a system to choose the correct ending to a four-sentence story. We created a new corpus of ~50k five-sentence commonsense stories, ROCStories, to enable this evaluation. This corpus is unique in two ways: (1) it captures a rich set of causal and temporal commonsense relations between daily events, and (2) it is a high quality collection of everyday life stories that can also be used for story generation. Experimental evaluation shows that a host of baselines and state-of-the-art models based on shallow language understanding struggle to achieve a high score on the Story Cloze Test. We discuss these implications for script and story learning, and offer suggestions for deeper language understanding. △ Less

Submitted 6 April, 2016; originally announced April 2016.

Comments: In Proceedings of the 2016 North American Chapter of the ACL (NAACL HLT), 2016

arXiv:1412.1419 [pdf]

CloudQTL: Evolving a Bioinformatics Application to the Cloud

Authors: John Allen, David Scott, Malcolm Illingworth, Bartek Dobrzelecki, Davy Virdee, Steve Thorn, Sara Knott

Abstract: A timeline is presented which shows the stages involved in converting a bioinformatics software application from a set of standalone algorithms through to a simple web based tool then to a web based portal harnessing Grid technologies and on to its latest inception as a Cloud based bioinformatics web tool. The nature of the software is discussed together with a description of its development at va… ▽ More A timeline is presented which shows the stages involved in converting a bioinformatics software application from a set of standalone algorithms through to a simple web based tool then to a web based portal harnessing Grid technologies and on to its latest inception as a Cloud based bioinformatics web tool. The nature of the software is discussed together with a description of its development at various stages including a detailed account of the Cloud service. An outline of user results is also included. △ Less

Submitted 3 December, 2014; originally announced December 2014.

Comments: 12 pages, 3 figures, EGI conference Madrid 2013

arXiv:1303.5731 [pdf]

A Language for Planning with Statistics

Authors: Nathaniel G. Martin, James F. Allen

Abstract: When a planner must decide whether it has enough evidence to make a decision based on probability, it faces the sample size problem. Current planners using probabilities need not deal with this problem because they do not generate their probabilities from observations. This paper presents an event based language in which the planner's probabilities are calculated from the binomial random variabl… ▽ More When a planner must decide whether it has enough evidence to make a decision based on probability, it faces the sample size problem. Current planners using probabilities need not deal with this problem because they do not generate their probabilities from observations. This paper presents an event based language in which the planner's probabilities are calculated from the binomial random variable generated by the observed ratio of one type of event to another. Such probabilities are subject to error, so the planner must introspect about their validity. Inferences about the probability of these events can be made using statistics. Inferences about the validity of the approximations can be made using interval estimation. Interval estimation allows the planner to avoid making choices that are only weakly supported by the planner's evidence. △ Less

Submitted 20 March, 2013; originally announced March 2013.

Comments: Appears in Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence (UAI1991)

Report number: UAI-P-1991-PG-220-227

arXiv:1206.5333 [pdf, ps, other]

TempEval-3: Evaluating Events, Time Expressions, and Temporal Relations

Authors: Naushad UzZaman, Hector Llorens, James Allen, Leon Derczynski, Marc Verhagen, James Pustejovsky

Abstract: We describe the TempEval-3 task which is currently in preparation for the SemEval-2013 evaluation exercise. The aim of TempEval is to advance research on temporal information processing. TempEval-3 follows on from previous TempEval events, incorporating: a three-part task structure covering event, temporal expression and temporal relation extraction; a larger dataset; and single overall task quali… ▽ More We describe the TempEval-3 task which is currently in preparation for the SemEval-2013 evaluation exercise. The aim of TempEval is to advance research on temporal information processing. TempEval-3 follows on from previous TempEval events, incorporating: a three-part task structure covering event, temporal expression and temporal relation extraction; a larger dataset; and single overall task quality scores. △ Less

Submitted 25 May, 2014; v1 submitted 22 June, 2012; originally announced June 2012.

arXiv:cmp-lg/9801002 [pdf, ps, other]

Identifying Discourse Markers in Spoken Dialog

Authors: Peter A. Heeman, Donna Byron, James F. Allen

Abstract: In this paper, we present a method for identifying discourse marker usage in spontaneous speech based on machine learning. Discourse markers are denoted by special POS tags, and thus the process of POS tagging can be used to identify discourse markers. By incorporating POS tagging into language modeling, discourse markers can be identified during speech recognition, in which the timeliness of th… ▽ More In this paper, we present a method for identifying discourse marker usage in spontaneous speech based on machine learning. Discourse markers are denoted by special POS tags, and thus the process of POS tagging can be used to identify discourse markers. By incorporating POS tagging into language modeling, discourse markers can be identified during speech recognition, in which the timeliness of the information can be used to help predict the following words. We contrast this approach with an alternative machine learning approach proposed by Litman (1996). This paper also argues that discourse markers can be used to help the hearer predict the role that the upcoming utterance plays in the dialog. Thus discourse markers should provide valuable evidence for automatic dialog act prediction. △ Less

Submitted 16 January, 1998; originally announced January 1998.

Comments: 8 pages, uses psfig

Journal ref: AAAI 1998 Spring Symposium on Applying Machine Learning to Discourse Processing

arXiv:cmp-lg/9705014 [pdf, ps, other]

Incorporating POS Tagging into Language Modeling

Authors: Peter A. Heeman, James F. Allen

Abstract: Language models for speech recognition tend to concentrate solely on recognizing the words that were spoken. In this paper, we redefine the speech recognition problem so that its goal is to find both the best sequence of words and their syntactic role (part-of-speech) in the utterance. This is a necessary first step towards tightening the interaction between speech recognition and natural langua… ▽ More Language models for speech recognition tend to concentrate solely on recognizing the words that were spoken. In this paper, we redefine the speech recognition problem so that its goal is to find both the best sequence of words and their syntactic role (part-of-speech) in the utterance. This is a necessary first step towards tightening the interaction between speech recognition and natural language understanding. △ Less

Submitted 22 May, 1997; originally announced May 1997.

Comments: 5 pages, 2 postscript figures

Journal ref: In proceedings of Eurospeech'97

arXiv:cmp-lg/9704008 [pdf, ps, other]

Intonational Boundaries, Speech Repairs and Discourse Markers: Modeling Spoken Dialog

Authors: Peter A. Heeman, James F. Allen

Abstract: To understand a speaker's turn of a conversation, one needs to segment it into intonational phrases, clean up any speech repairs that might have occurred, and identify discourse markers. In this paper, we argue that these problems must be resolved together, and that they must be resolved early in the processing stream. We put forward a statistical language model that resolves these problems, doe… ▽ More To understand a speaker's turn of a conversation, one needs to segment it into intonational phrases, clean up any speech repairs that might have occurred, and identify discourse markers. In this paper, we argue that these problems must be resolved together, and that they must be resolved early in the processing stream. We put forward a statistical language model that resolves these problems, does POS tagging, and can be used as the language model of a speech recognizer. We find that by accounting for the interactions between these tasks that the performance on each task improves, as does POS tagging and perplexity. △ Less

Submitted 23 April, 1997; originally announced April 1997.

Comments: 8 pages, 3 postscript figures

Journal ref: In proceedings of ACL/EACL'97

Showing 1–50 of 53 results for author: Allen, J