Showing 1–17 of 17 results for author: Tan, Y S

Searching in archive cs.
  1. arXiv:2406.19958  [pdf, other]

    stat.ML cs.LG math.ST

    The Computational Curse of Big Data for Bayesian Additive Regression Trees: A Hitting Time Analysis

    Authors: Yan Shuo Tan, Omer Ronen, Theo Saarinen, Bin Yu

    Abstract: Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression model that is commonly used in causal inference and beyond. Its strong predictive performance is supported by theoretical guarantees that its posterior distribution concentrates around the true regression function at optimal rates under various data generative settings and for appropriate prior choices. In th…

    Submitted 28 June, 2024; originally announced June 2024.

    MSC Class: 62G08; 65C40

  2. arXiv:2309.09880  [pdf, ps, other]

    stat.ML cs.LG

    Error Reduction from Stacked Regressions

    Authors: Xin Chen, Jason M. Klusowski, Yan Shuo Tan

    Abstract: Stacking regressions is an ensemble technique that forms linear combinations of different regression estimators to enhance predictive accuracy. The conventional approach uses cross-validation data to generate predictions from the constituent estimators, and least-squares with nonnegativity constraints to learn the combination weights. In this paper, we learn these weights analogously by minimizing…

    Submitted 26 September, 2023; v1 submitted 18 September, 2023; originally announced September 2023.

  3. arXiv:2307.01932  [pdf, other]

    stat.ME cs.AI cs.LG stat.ML

    MDI+: A Flexible Random Forest-Based Feature Importance Framework

    Authors: Abhineet Agarwal, Ana M. Kenney, Yan Shuo Tan, Tiffany M. Tang, Bin Yu

    Abstract: Mean decrease in impurity (MDI) is a popular feature importance measure for random forests (RFs). We show that the MDI for a feature $X_k$ in each tree in an RF is equivalent to the unnormalized $R^2$ value in a linear regression of the response on the collection of decision stumps that split on $X_k$. We use this interpretation to propose a flexible feature importance framework called MDI+. Speci…

    Submitted 4 July, 2023; originally announced July 2023.

  4. arXiv:2304.04816  [pdf, other]

    cs.CV

    Multi-Object Tracking by Iteratively Associating Detections with Uniform Appearance for Trawl-Based Fishing Bycatch Monitoring

    Authors: Cheng-Yen Yang, Alan Yu Shyang Tan, Melanie J. Underwood, Charlotte Bodie, Zhongyu Jiang, Steve George, Karl Warr, Jenq-Neng Hwang, Emma Jones

    Abstract: The aim of in-trawl catch monitoring for use in fishing operations is to detect, track and classify fish targets in real-time from video footage. Information gathered could be used to release unwanted bycatch in real-time. However, traditional multi-object tracking (MOT) methods have limitations, as they are developed for tracking vehicles or pedestrians with linear motions and diverse appearances…

    Submitted 10 April, 2023; originally announced April 2023.

  5. arXiv:2210.09352  [pdf, other]

    stat.ML cs.AI cs.LG math.ST

    A Mixing Time Lower Bound for a Simplified Version of BART

    Authors: Omer Ronen, Theo Saarinen, Yan Shuo Tan, James Duncan, Bin Yu

    Abstract: Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression algorithm. The posterior is a distribution over sums of decision trees, and predictions are made by averaging approximate samples from the posterior. The combination of strong predictive performance and the ability to provide uncertainty measures has led BART to be commonly used in the social sciences, bios…

    Submitted 17 October, 2022; originally announced October 2022.

  6. arXiv:2202.00858  [pdf, other]

    cs.LG cs.AI stat.AP stat.ME stat.ML

    Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods

    Authors: Abhineet Agarwal, Yan Shuo Tan, Omer Ronen, Chandan Singh, Bin Yu

    Abstract: Tree-based models such as decision trees and random forests (RF) are a cornerstone of modern machine-learning practice. To mitigate overfitting, trees are typically regularized by a variety of techniques that modify their structure (e.g. pruning). We introduce Hierarchical Shrinkage (HS), a post-hoc algorithm that does not modify the tree structure, and instead regularizes the tree by shrinking th…

    Submitted 1 February, 2022; originally announced February 2022.

  7. arXiv:2201.11931  [pdf, other]

    cs.LG cs.AI stat.AP stat.ME stat.ML

    Fast Interpretable Greedy-Tree Sums

    Authors: Yan Shuo Tan, Chandan Singh, Keyan Nasseri, Abhineet Agarwal, James Duncan, Omer Ronen, Matthew Epland, Aaron Kornblith, Bin Yu

    Abstract: Modern machine learning has achieved impressive prediction performance, but often sacrifices interpretability, a critical consideration in high-stakes domains such as medicine. In such settings, practitioners often use highly interpretable decision tree models, but these suffer from inductive bias against additive structure. To overcome this bias, we propose Fast Interpretable Greedy-Tree Sums (FI…

    Submitted 8 July, 2023; v1 submitted 27 January, 2022; originally announced January 2022.

  8. arXiv:2110.09626  [pdf, other]

    stat.ML cs.IT cs.LG

    A cautionary tale on fitting decision trees to data from additive models: generalization lower bounds

    Authors: Yan Shuo Tan, Abhineet Agarwal, Bin Yu

    Abstract: Decision trees are important both as interpretable models amenable to high-stakes decision-making, and as building blocks of ensemble methods such as random forests and gradient boosting. Their statistical properties, however, are not well understood. The most cited prior works have focused on deriving pointwise consistency guarantees for CART in a classical nonparametric regression setting. We ta…

    Submitted 18 October, 2021; originally announced October 2021.

  9. arXiv:2011.14709  [pdf]

    physics.optics cs.NE physics.app-ph

    Monadic Pavlovian associative learning in a backpropagation-free photonic network

    Authors: James Y. S. Tan, Zengguang Cheng, Johannes Feldmann, Xuan Li, Nathan Youngblood, Utku E. Ali, C. David Wright, Wolfram H. P. Pernice, Harish Bhaskaran

    Abstract: Over a century ago, Ivan P. Pavlov, in a classic experiment, demonstrated how dogs can learn to associate a ringing bell with food, thereby causing a ring to result in salivation. Today, it is rare to find the use of Pavlovian type associative learning for artificial intelligence (AI) applications even though other learning concepts, in particular backpropagation on artificial neural networks (ANN…

    Submitted 5 August, 2022; v1 submitted 30 November, 2020; originally announced November 2020.

    Comments: 24 pages, 5 figures

    Journal ref: Optica, Volume 9, Issue 7, pp. 792-802 (2022)

  10. arXiv:2008.10109  [pdf, other]

    stat.ME cs.LG stat.AP

    Stable discovery of interpretable subgroups via calibration in causal studies

    Authors: Raaz Dwivedi, Yan Shuo Tan, Briton Park, Mian Wei, Kevin Horgan, David Madigan, Bin Yu

    Abstract: Building on Yu and Kumbier's PCS framework and for randomized experiments, we introduce a novel methodology for Stable Discovery of Interpretable Subgroups via Calibration (StaDISC), with large heterogeneous treatment effects. StaDISC was developed during our re-analysis of the 1999-2000 VIGOR study, an 8076 patient randomized controlled trial (RCT), that compared the risk of adverse events from a…

    Submitted 28 September, 2020; v1 submitted 23 August, 2020; originally announced August 2020.

    Comments: Raaz Dwivedi and Yan Shuo Tan are joint first authors and contributed equally to this work. 52 pages, 8 Figures, 9 Tables. To appear in International Statistical Review, 2020

  11. Curating a COVID-19 data repository and forecasting county-level death counts in the United States

    Authors: Nick Altieri, Rebecca L. Barter, James Duncan, Raaz Dwivedi, Karl Kumbier, Xiao Li, Robert Netzorg, Briton Park, Chandan Singh, Yan Shuo Tan, Tiffany Tang, Yu Wang, Chao Zhang, Bin Yu

    Abstract: As the COVID-19 outbreak evolves, accurate forecasting continues to play an extremely important role in informing policy decisions. In this paper, we present our continuous curation of a large data repository containing COVID-19 information from a range of sources. We use this data to develop predictions and corresponding prediction intervals for the short-term trajectory of COVID-19 cumulative de…

    Submitted 9 August, 2020; v1 submitted 16 May, 2020; originally announced May 2020.

    Comments: Authors ordered alphabetically. All authors contributed significantly to this work. All collected data, modeling code, forecasts, and visualizations are updated daily and available at https://github.com/Yu-Group/covid19-severity-prediction

    Journal ref: Published in Harvard Data Science Review, 2020

  12. arXiv:1910.12837  [pdf, other]

    stat.ML cs.IT cs.LG math.NA math.OC

    Online Stochastic Gradient Descent with Arbitrary Initialization Solves Non-smooth, Non-convex Phase Retrieval

    Authors: Yan Shuo Tan, Roman Vershynin

    Abstract: In recent literature, a general two step procedure has been formulated for solving the problem of phase retrieval. First, a spectral technique is used to obtain a constant-error initial estimate, following which, the estimate is refined to arbitrary precision by first-order optimization of a non-convex loss function. Numerical experiments, however, seem to suggest that simply running the iterative…

    Submitted 28 October, 2019; originally announced October 2019.

    MSC Class: 65K10

    Journal ref: Journal of Machine Learning Research, 24(58), 1-47 (2023)

  13. arXiv:1712.04106  [pdf, ps, other]

    cs.IT cs.LG math.ST

    Sparse Phase Retrieval via Sparse PCA Despite Model Misspecification: A Simplified and Extended Analysis

    Authors: Yan Shuo Tan

    Abstract: We consider the problem of high-dimensional misspecified phase retrieval. This is where we have an $s$-sparse signal vector $\mathbf{x}_*$ in $\mathbb{R}^n$, which we wish to recover using sampling vectors $\textbf{a}_1,\ldots,\textbf{a}_m$, and measurements $y_1,\ldots,y_m$, which are related by the equation $f(\left<\textbf{a}_i,\textbf{x}_*\right>) = y_i$. Here, $f$ is an unknown link function…

    Submitted 12 December, 2017; v1 submitted 11 December, 2017; originally announced December 2017.

    Comments: Edited formatting for abstract

    MSC Class: 94A12; 60D05; 90C25

  14. arXiv:1709.04744  [pdf, ps, other]

    cs.CV cs.LG stat.ML

    Subspace Clustering using Ensembles of $K$-Subspaces

    Authors: John Lipor, David Hong, Yan Shuo Tan, Laura Balzano

    Abstract: Subspace clustering is the unsupervised grouping of points lying near a union of low-dimensional linear subspaces. Algorithms based directly on geometric properties of such data tend to either provide poor empirical performance, lack theoretical guarantees, or depend heavily on their initialization. We present a novel geometric approach to the subspace clustering problem that leverages ensembles o…

    Submitted 6 January, 2021; v1 submitted 14 September, 2017; originally announced September 2017.

  15. arXiv:1706.09993  [pdf, other]

    math.NA cs.IT cs.LG math.PR math.ST

    Phase Retrieval via Randomized Kaczmarz: Theoretical Guarantees

    Authors: Yan Shuo Tan, Roman Vershynin

    Abstract: We consider the problem of phase retrieval, i.e. that of solving systems of quadratic equations. A simple variant of the randomized Kaczmarz method was recently proposed for phase retrieval, and it was shown numerically to have a computational edge over state-of-the-art Wirtinger flow methods. In this paper, we provide the first theoretical guarantee for the convergence of the randomized Kaczmarz…

    Submitted 13 January, 2018; v1 submitted 29 June, 2017; originally announced June 2017.

    Comments: Revised after comments from referees

    MSC Class: 65K10

  16. arXiv:1704.01041  [pdf, ps, other]

    cs.LG math.PR stat.ML

    Polynomial Time and Sample Complexity for Non-Gaussian Component Analysis: Spectral Methods

    Authors: Yan Shuo Tan, Roman Vershynin

    Abstract: The problem of Non-Gaussian Component Analysis (NGCA) is about finding a maximal low-dimensional subspace $E$ in $\mathbb{R}^n$ so that data points projected onto $E$ follow a non-gaussian distribution. Although this is an appropriate model for some real world data analysis problems, there has been little progress on this problem over the last decade. In this paper, we attempt to address this st…

    Submitted 4 April, 2017; originally announced April 2017.

    MSC Class: 68Q87

  17. arXiv:1612.06343  [pdf, ps, other]

    math.PR cs.IT

    Energy optimization for distributions on the sphere and improvement to the Welch bounds

    Authors: Yan Shuo Tan

    Abstract: For any Borel probability measure on $\mathbb{R}^n$, we may define a family of eccentricity tensors. This new notion, together with a tensorization trick, allows us to prove an energy minimization property for rotationally invariant probability measures. We use this theory to give a new proof of the Welch bounds, and to improve upon them for collections of real vectors. In addition, we are able to…

    Submitted 13 January, 2018; v1 submitted 19 December, 2016; originally announced December 2016.

    MSC Class: 60E15; 52A40; 15A69

    Journal ref: Electron. Commun. Probab. Volume 22 (2017), paper no. 43, 12 pp