Showing 1–50 of 1,424 results for author: Sun

Searching in archive stat.

  1. arXiv:2409.04140  [pdf, other]

    stat.ML cs.LG

    Half-VAE: An Encoder-Free VAE to Bypass Explicit Inverse Mapping

    Authors: Yuan-Hao Wei, Yan-Jie Sun, Chen Zhang

    Abstract: Inference and inverse problems are closely related concepts, both fundamentally involving the deduction of unknown causes or parameters from observed data. Bayesian inference, a powerful class of methods, is often employed to solve a variety of problems, including those related to causal inference. Variational inference, a subset of Bayesian inference, is primarily used to efficiently approximate…

    Submitted 6 September, 2024; originally announced September 2024.

  2. arXiv:2409.01017  [pdf, other]

    stat.ME

    Linear spline index regression model: Interpretability, nonlinearity and dimension reduction

    Authors: Lianqiang Qu, Long Lv, Meiling Hao, Liuquan Sun

    Abstract: Inspired by the complexity of certain real-world datasets, this article introduces a novel flexible linear spline index regression model. The model posits piecewise linear effects of an index on the response, with continuous changes occurring at knots. Significantly, it possesses the interpretability of linear models, captures nonlinear effects similar to nonparametric models, and achieves dimensi…

    Submitted 2 September, 2024; originally announced September 2024.

    Comments: 84 pages, 4 figures

  3. arXiv:2408.08998  [pdf, other]

    stat.ML cs.LG

    A Confidence Interval for the $\ell_2$ Expected Calibration Error

    Authors: Yan Sun, Pratik Chaudhari, Ian J. Barnett, Edgar Dobriban

    Abstract: Recent advances in machine learning have significantly improved prediction accuracy in various applications. However, ensuring the calibration of probabilistic predictions remains a significant challenge. Despite efforts to enhance model calibration, the rigorous statistical evaluation of model calibration remains less explored. In this work, we develop confidence intervals for the $\ell_2$ Expected C…

    Submitted 3 September, 2024; v1 submitted 16 August, 2024; originally announced August 2024.

  4. arXiv:2408.07094  [pdf]

    cs.LG stat.ML

    Overcoming Imbalanced Safety Data Using Extended Accident Triangle

    Authors: Kailai Sun, Tianxiang Lan, Yang Miang Goh, Yueng-Hsiang Huang

    Abstract: There is growing interest in using safety analytics and machine learning to support the prevention of workplace incidents, especially in high-risk industries like construction and trucking. Although existing safety analytics studies have made remarkable progress, they suffer from imbalanced datasets, a common problem in safety analytics, resulting in prediction inaccuracies. This can lead to manag…

    Submitted 11 August, 2024; originally announced August 2024.

  5. arXiv:2408.06263  [pdf, other]

    stat.ME

    Optimal Integrative Estimation for Distributed Precision Matrices with Heterogeneity Adjustment

    Authors: Yinrui Sun, Yin Xia

    Abstract: Distributed learning offers a practical solution for the integrative analysis of multi-source datasets, especially under privacy or communication constraints. However, addressing prospective distributional heterogeneity and ensuring communication efficiency pose significant challenges on distributed statistical analysis. In this article, we focus on integrative estimation of distributed heterogene…

    Submitted 12 August, 2024; originally announced August 2024.

  6. arXiv:2408.05788  [pdf, other]

    cs.LG cs.AI stat.ML

    Continual Learning of Nonlinear Independent Representations

    Authors: Boyang Sun, Ignavier Ng, Guangyi Chen, Yifan Shen, Qirong Ho, Kun Zhang

    Abstract: Identifying the causal relations between interested variables plays a pivotal role in representation learning as it provides deep insights into the dataset. Identifiability, as the central theme of this approach, normally hinges on leveraging data from multiple distributions (intervention, distribution shift, time series, etc.). Despite the exciting development in this field, a practical but often…

    Submitted 11 August, 2024; originally announced August 2024.

    Comments: 9 pages, 5 Figures

  7. arXiv:2408.05428  [pdf, other]

    cs.LG stat.ME stat.ML

    Generalized Encouragement-Based Instrumental Variables for Counterfactual Regression

    Authors: Anpeng Wu, Kun Kuang, Ruoxuan Xiong, Xiangwei Chen, Zexu Sun, Fei Wu, Kun Zhang

    Abstract: In causal inference, encouragement designs (EDs) are widely used to analyze causal effects, when randomized controlled trials (RCTs) are impractical or compliance to treatment cannot be perfectly enforced. Unlike RCTs, which directly allocate treatments, EDs randomly assign encouragement policies that positively motivate individuals to engage in a specific treatment. These random encouragements ac…

    Submitted 10 August, 2024; originally announced August 2024.

  8. arXiv:2408.04440  [pdf, other]

    stat.CO

    Boosting Earth System Model Outputs And Saving PetaBytes in their Storage Using Exascale Climate Emulators

    Authors: Sameh Abdulah, Allison H. Baker, George Bosilca, Qinglei Cao, Stefano Castruccio, Marc G. Genton, David E. Keyes, Zubair Khalid, Hatem Ltaief, Yan Song, Georgiy L. Stenchikov, Ying Sun

    Abstract: We present the design and scalable implementation of an exascale climate emulator for addressing the escalating computational and storage requirements of high-resolution Earth System Model simulations. We utilize the spherical harmonic transform to stochastically model spatio-temporal variations in climate data. This provides tunable spatio-temporal resolution and significantly improves the fideli…

    Submitted 11 August, 2024; v1 submitted 8 August, 2024; originally announced August 2024.

  9. arXiv:2407.21622  [pdf, other]

    stat.ML cs.LG math.ST

    Extended Fiducial Inference: Toward an Automated Process of Statistical Inference

    Authors: Faming Liang, Sehwan Kim, Yan Sun

    Abstract: While fiducial inference was widely considered a big blunder by R.A. Fisher, the goal he initially set --`inferring the uncertainty of model parameters on the basis of observations' -- has been continually pursued by many statisticians. To this end, we develop a new statistical inference method called extended Fiducial inference (EFI). The new method achieves the goal of fiducial inference by leve…

    Submitted 31 July, 2024; originally announced July 2024.

  10. arXiv:2407.21154  [pdf, other]

    stat.ME

    Bayesian thresholded modeling for integrating brain node and network predictors

    Authors: Zhe Sun, Wanwan Xu, Tianxi Li, Jian Kang, Gregorio Alanis-Lobato, Yize Zhao

    Abstract: Progress in neuroscience has provided unprecedented opportunities to advance our understanding of brain alterations and their correspondence to phenotypic profiles. With data collected from various imaging techniques, studies have integrated different types of information ranging from brain structure, function, or metabolism. More recently, an emerging way to categorize imaging traits is through a…

    Submitted 30 July, 2024; originally announced July 2024.

    Comments: 57 pages, 6 figures

    MSC Class: 62C10; 92B15; 62P10

  11. arXiv:2407.20177  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs

    Authors: Feiyang Kang, Yifan Sun, Bingbing Wen, Si Chen, Dawn Song, Rafid Mahmood, Ruoxi Jia

    Abstract: To ensure performance on a diverse set of downstream tasks, LLMs are pretrained via data mixtures over different domains. In this work, we demonstrate that the optimal data composition for a fixed compute budget varies depending on the scale of the training data, suggesting that the common practice of empirically determining an optimal composition using small-scale experiments will not yield the o…

    Submitted 29 July, 2024; originally announced July 2024.

  12. arXiv:2407.20057  [pdf]

    physics.ao-ph cs.LG stat.AP

    Reconstructing Global Daily CO2 Emissions via Machine Learning

    Authors: Tao Li, Lixing Wang, Zihan Qiu, Philippe Ciais, Taochun Sun, Matthew W. Jones, Robbie M. Andrew, Glen P. Peters, Piyu Ke, Xiaoting Huang, Robert B. Jackson, Zhu Liu

    Abstract: High temporal resolution CO2 emission data are crucial for understanding the drivers of emission changes; however, the current emission dataset is only available on a yearly basis. Here, we extended a global daily CO2 emissions dataset backwards in time to 1970 using a machine learning algorithm, which was trained to predict historical daily emissions on national scales based on relationships between da…

    Submitted 29 July, 2024; originally announced July 2024.

  13. arXiv:2407.19078  [pdf, other]

    cs.LG stat.ML

    Practical Marketplace Optimization at Uber Using Causally-Informed Machine Learning

    Authors: Bobby Chen, Siyu Chen, Jason Dowlatabadi, Yu Xuan Hong, Vinayak Iyer, Uday Mantripragada, Rishabh Narang, Apoorv Pandey, Zijun Qin, Abrar Sheikh, Hongtao Sun, Jiaqi Sun, Matthew Walker, Kaichen Wei, Chen Xu, Jingnan Yang, Allen T. Zhang, Guoqing Zhang

    Abstract: Budget allocation of marketplace levers, such as incentives for drivers and promotions for riders, has long been a technical and business challenge at Uber; understanding lever budget changes' impact and estimating cost efficiency to achieve predefined budgets is crucial, with the goal of optimal allocations that maximize business value; we introduce an end-to-end machine learning and optimization…

    Submitted 26 July, 2024; originally announced July 2024.

    Comments: To be published in the 2nd Workshop on Causal Inference and Machine Learning in Practice, KDD 2024, August 25 to 29, 2024, Barcelona, Spain, 10 pages

    MSC Class: 62J99

  14. arXiv:2407.18377  [pdf, other]

    stat.AP

    Bayesian Nowcasting Data Breach IBNR Incidents

    Authors: Maochao Xu, Hong Sun, Peng Zhao

    Abstract: The reporting delay in data breach incidents poses a formidable challenge for Incurred But Not Reported (IBNR) studies, complicating reserve estimation for actuarial professionals. This work presents a novel Bayesian nowcasting model designed to accurately model and predict the number of IBNR data breach incidents. Leveraging a Bayesian modeling framework, the model integrates time and heterogeneo…

    Submitted 25 July, 2024; originally announced July 2024.

  15. arXiv:2407.16975  [pdf, other]

    cs.LG stat.ME

    On the Parameter Identifiability of Partially Observed Linear Causal Models

    Authors: Xinshuai Dong, Ignavier Ng, Biwei Huang, Yuewen Sun, Songyao Jin, Roberto Legaspi, Peter Spirtes, Kun Zhang

    Abstract: Linear causal models are important tools for modeling causal dependencies and yet in practice, only a subset of the variables can be observed. In this paper, we examine the parameter identifiability of these models by investigating whether the edge coefficients can be recovered given the causal structure and partially observed data. Our setting is more general than that of prior research - we allo…

    Submitted 23 July, 2024; originally announced July 2024.

  16. arXiv:2407.16870  [pdf, other]

    stat.ME

    CoCA: Cooperative Component Analysis

    Authors: Daisy Yi Ding, Alden Green, Min Woo Sun, Robert Tibshirani

    Abstract: We propose Cooperative Component Analysis (CoCA), a new method for unsupervised multi-view analysis: it identifies the component that simultaneously captures significant within-view variance and exhibits strong cross-view correlation. The challenge of integrating multi-view data is particularly important in biology and medicine, where various types of "-omic" data, ranging from genomics to proteom…

    Submitted 23 July, 2024; originally announced July 2024.

  17. arXiv:2407.11678  [pdf, other]

    cs.LG math.ST stat.ML

    Theoretical Insights into CycleGAN: Analyzing Approximation and Estimation Errors in Unpaired Data Generation

    Authors: Luwei Sun, Dongrui Shen, Han Feng

    Abstract: In this paper, we focus on analyzing the excess risk of the unpaired data generation model, called CycleGAN. Unlike classical GANs, CycleGAN not only transforms data between two unpaired distributions but also ensures the mappings are consistent, which is encouraged by the cycle-consistency term unique to CycleGAN. The increasing complexity of model structure and the addition of the cycle-consiste…

    Submitted 16 July, 2024; originally announced July 2024.

  18. arXiv:2407.10448  [pdf, other]

    cs.LG stat.ML

    Spectral Representation for Causal Estimation with Hidden Confounders

    Authors: Tongzheng Ren, Haotian Sun, Antoine Moulin, Arthur Gretton, Bo Dai

    Abstract: We address the problem of causal effect estimation where hidden confounders are present, with a focus on two settings: instrumental variable regression with additional observed confounders, and proxy causal learning. Our approach uses a singular value decomposition of a conditional expectation operator, followed by a saddle-point optimization problem, which, in the context of IV regression, can be…

    Submitted 15 July, 2024; originally announced July 2024.

  19. arXiv:2407.07873  [pdf, other]

    cs.LG math.DS math.OC math.PR stat.ML

    Dynamical Measure Transport and Neural PDE Solvers for Sampling

    Authors: Jingtong Sun, Julius Berner, Lorenz Richter, Marius Zeinhofer, Johannes Müller, Kamyar Azizzadenesheli, Anima Anandkumar

    Abstract: The task of sampling from a probability density can be approached as transporting a tractable density function to the target, known as dynamical measure transport. In this work, we tackle it through a principled unified framework using deterministic or stochastic evolutions described by partial differential equations (PDEs). This framework incorporates prior trajectory-based sampling methods, such…

    Submitted 10 July, 2024; originally announced July 2024.

  20. arXiv:2407.05895  [pdf, other]

    cs.LG stat.ML

    Link Representation Learning for Probabilistic Travel Time Estimation

    Authors: Chen Xu, Qiang Wang, Lijun Sun

    Abstract: Travel time estimation is a crucial application in navigation apps and web mapping services. Current deterministic and probabilistic methods primarily focus on modeling individual trips, assuming independence among trips. However, in real-world scenarios, we often observe strong inter-trip correlations due to factors such as weather conditions, traffic management, and road works. In this paper, we…

    Submitted 8 July, 2024; originally announced July 2024.

  21. arXiv:2407.03082  [pdf, other]

    cs.LG stat.ML

    Stable Heterogeneous Treatment Effect Estimation across Out-of-Distribution Populations

    Authors: Yuling Zhang, Anpeng Wu, Kun Kuang, Liang Du, Zixun Sun, Zhi Wang

    Abstract: Heterogeneous treatment effect (HTE) estimation is vital for understanding the change of treatment effect across individuals or subgroups. Most existing HTE estimation methods focus on addressing selection bias induced by imbalanced distributions of confounders between treated and control units, but ignore distribution shifts across populations. Thereby, their applicability has been limited to the…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: Accepted by ICDE'2024

  22. arXiv:2407.01004  [pdf, other]

    cs.LG stat.ME

    CURLS: Causal Rule Learning for Subgroups with Significant Treatment Effect

    Authors: Jiehui Zhou, Linxiao Yang, Xingyu Liu, Xinyue Gu, Liang Sun, Wei Chen

    Abstract: In causal inference, estimating heterogeneous treatment effects (HTE) is critical for identifying how different subgroups respond to interventions, with broad applications in fields such as precision medicine and personalized advertising. Although HTE estimation methods aim to improve accuracy, how to provide explicit subgroup descriptions remains unclear, hindering data interpretation and strateg…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: 12 pages, 3 figures

  23. arXiv:2407.00791  [pdf, other]

    stat.ME stat.CO

    inlabru: software for fitting latent Gaussian models with non-linear predictors

    Authors: Finn Lindgren, Fabian Bachl, Janine Illian, Man Ho Suen, Håvard Rue, Andrew E. Seaton

    Abstract: The integrated nested Laplace approximation (INLA) method has become a popular approach for computationally efficient approximate Bayesian computation. In particular, by leveraging sparsity in random effect precision matrices, INLA is commonly used in spatial and spatio-temporal applications. However, the speed of INLA comes at the cost of restricting the user to the family of latent Gaussian mode…

    Submitted 30 June, 2024; originally announced July 2024.

    MSC Class: 62-04

  24. arXiv:2406.14808  [pdf, other]

    math.ST cs.LG stat.ME stat.ML

    On the estimation rate of Bayesian PINN for inverse problems

    Authors: Yi Sun, Debarghya Mukherjee, Yves Atchade

    Abstract: Solving partial differential equations (PDEs) and their inverse problems using Physics-informed neural networks (PINNs) is a rapidly growing approach in the physics and machine learning community. Although several architectures exist for PINNs that work remarkably in practice, our theoretical understanding of their performances is somewhat limited. In this work, we study the behavior of a Bayesian…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: 35 Pages, 3 figures, and 2 tables

  25. arXiv:2406.14784  [pdf, other]

    cs.LG stat.OT

    Active Learning for Fair and Stable Online Allocations

    Authors: Riddhiman Bhattacharya, Thanh Nguyen, Will Wei Sun, Mohit Tawarmalani

    Abstract: We explore an active learning approach for dynamic fair resource allocation problems. Unlike previous work that assumes full feedback from all agents on their allocations, we consider feedback from a select subset of agents at each epoch of the online resource allocation process. Despite this restriction, our proposed algorithms provide regret bounds that are sub-linear in number of time-periods f…

    Submitted 20 June, 2024; originally announced June 2024.

  26. arXiv:2406.10917  [pdf, other]

    cs.LG stat.ML

    Bayesian Intervention Optimization for Causal Discovery

    Authors: Yuxuan Wang, Mingzhou Liu, Xinwei Sun, Wei Wang, Yizhou Wang

    Abstract: Causal discovery is crucial for understanding complex systems and informing decisions. While observational data can uncover causal relationships under certain assumptions, it often falls short, making active interventions necessary. Current methods, such as Bayesian and graph-theoretical approaches, do not prioritize decision-making and often rely on ideal conditions or information gain, which is…

    Submitted 16 June, 2024; originally announced June 2024.

  27. arXiv:2406.05372  [pdf, ps, other]

    stat.ML cs.LG

    Bridging the Gap: Rademacher Complexity in Robust and Standard Generalization

    Authors: Jiancong Xiao, Ruoyu Sun, Qi Long, Weijie J. Su

    Abstract: Training Deep Neural Networks (DNNs) with adversarial examples often results in poor generalization to test-time adversarial data. This paper investigates this issue, known as adversarially robust generalization, through the lens of Rademacher complexity. Building upon the studies by Khim and Loh (2018); Yin et al. (2019), numerous works have been dedicated to this problem, yet achieving a satisfa…

    Submitted 8 June, 2024; originally announced June 2024.

    Comments: COLT 2024

  28. arXiv:2406.03849  [pdf]

    cs.LG stat.AP stat.ML

    A Noise-robust Multi-head Attention Mechanism for Formation Resistivity Prediction: Frequency Aware LSTM

    Authors: Yongan Zhang, Junfeng Zhao, Jian Li, Xuanran Wang, Youzhuang Sun, Yuntian Chen, Dongxiao Zhang

    Abstract: The prediction of formation resistivity plays a crucial role in the evaluation of oil and gas reservoirs, identification and assessment of geothermal energy resources, groundwater detection and monitoring, and carbon capture and storage. However, traditional well logging techniques fail to measure accurate resistivity in cased boreholes, and the transient electromagnetic method for cased borehole…

    Submitted 6 June, 2024; originally announced June 2024.

  29. arXiv:2406.02701  [pdf, other]

    stat.CO

    MPCR: Multi- and Mixed-Precision Computations Package in R

    Authors: Mary Lai O. Salvana, Sameh Abdulah, Minwoo Kim, David Helmy, Ying Sun, Marc G. Genton

    Abstract: Computational statistics has traditionally utilized double-precision (64-bit) data structures and full-precision operations, resulting in higher-than-necessary accuracy for certain applications. Recently, there has been a growing interest in exploring low-precision options that could reduce computational complexity while still achieving the required level of accuracy. This trend has been amplified…

    Submitted 4 June, 2024; originally announced June 2024.

  30. arXiv:2406.01799  [pdf, other]

    cs.LG math.OC stat.ML

    Online Control in Population Dynamics

    Authors: Noah Golowich, Elad Hazan, Zhou Lu, Dhruv Rohatgi, Y. Jennifer Sun

    Abstract: The study of population dynamics originated with early sociological works but has since extended into many fields, including biology, epidemiology, evolutionary game theory, and economics. Most studies on population dynamics focus on the problem of prediction rather than control. Existing mathematical models for control in population dynamics are often restricted to specific, noise-free dynamics,…

    Submitted 6 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

  31. arXiv:2406.01335  [pdf, other]

    quant-ph q-fin.ST stat.ML

    Statistics-Informed Parameterized Quantum Circuit via Maximum Entropy Principle for Data Science and Finance

    Authors: Xi-Ning Zhuang, Zhao-Yun Chen, Cheng Xue, Xiao-Fan Xu, Chao Wang, Huan-Yu Liu, Tai-Ping Sun, Yun-Jie Wang, Yu-Chun Wu, Guo-Ping Guo

    Abstract: Quantum machine learning has demonstrated significant potential in solving practical problems, particularly in statistics-focused areas such as data science and finance. However, challenges remain in preparing and learning statistical models on a quantum processor due to issues with trainability and interpretability. In this letter, we utilize the maximum entropy principle to design a statistics-i…

    Submitted 18 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: 19 pages, 5 figures

  32. arXiv:2406.01252  [pdf, other]

    cs.CL cs.AI stat.ML

    Towards Scalable Automated Alignment of LLMs: A Survey

    Authors: Boxi Cao, Keming Lu, Xinyu Lu, Jiawei Chen, Mengjie Ren, Hao Xiang, Peilin Liu, Yaojie Lu, Ben He, Xianpei Han, Le Sun, Hongyu Lin, Bowen Yu

    Abstract: Alignment is the most critical step in building large language models (LLMs) that meet human needs. With the rapid development of LLMs gradually surpassing human capabilities, traditional alignment methods based on human-annotation are increasingly unable to meet the scalability demands. Therefore, there is an urgent need to explore new sources of automated alignment signals and technical approach…

    Submitted 3 September, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: Paper List: https://github.com/cascip/awesome-auto-alignment

  33. arXiv:2405.20447  [pdf, other]

    stat.ML cs.CY cs.LG

    Algorithmic Fairness in Performative Policy Learning: Escaping the Impossibility of Group Fairness

    Authors: Seamus Somerstep, Ya'acov Ritov, Yuekai Sun

    Abstract: In many prediction problems, the predictive model affects the distribution of the prediction target. This phenomenon is known as performativity and is often caused by the behavior of individuals with vested interests in the outcome of the predictive model. Although performativity is generally problematic because it manifests as distribution shifts, we develop algorithmic fairness practices that le…

    Submitted 30 May, 2024; originally announced May 2024.

  34. arXiv:2405.18782  [pdf, other]

    eess.IV cs.CV stat.ML

    Principled Probabilistic Imaging using Diffusion Models as Plug-and-Play Priors

    Authors: Zihui Wu, Yu Sun, Yifan Chen, Bingliang Zhang, Yisong Yue, Katherine L. Bouman

    Abstract: Diffusion models (DMs) have recently shown outstanding capability in modeling complex image distributions, making them expressive image priors for solving Bayesian inverse problems. However, most existing DM-based methods rely on approximations in the generative process to be generic to different inverse problems, leading to inaccurate sample distributions that deviate from the target posterior de…

    Submitted 29 May, 2024; originally announced May 2024.

  35. arXiv:2405.18563  [pdf, other]

    cs.LG stat.ME

    Counterfactual Explanations for Multivariate Time-Series without Training Datasets

    Authors: Xiangyu Sun, Raquel Aoki, Kevin H. Wilson

    Abstract: Machine learning (ML) methods have experienced significant growth in the past decade, yet their practical application in high-impact real-world domains has been hindered by their opacity. When ML methods are responsible for making critical decisions, stakeholders often require insights into how to alter these decisions. Counterfactual explanations (CFEs) have emerged as a solution, offering interp…

    Submitted 28 May, 2024; originally announced May 2024.

  36. arXiv:2405.17216  [pdf, other]

    cs.LG cs.AI cs.LO stat.ML

    Autoformalizing Euclidean Geometry

    Authors: Logan Murphy, Kaiyu Yang, Jialiang Sun, Zhaoyu Li, Anima Anandkumar, Xujie Si

    Abstract: Autoformalization involves automatically translating informal math into formal theorems and proofs that are machine-verifiable. Euclidean geometry provides an interesting and controllable domain for studying autoformalization. In this paper, we introduce a neuro-symbolic framework for autoformalizing Euclidean geometry, which combines domain knowledge, SMT solvers, and large language models (LLMs)…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: Accepted to ICML 2024. The first two authors contributed equally

  37. arXiv:2405.17202  [pdf, other]

    cs.CL cs.AI cs.LG stat.ML

    Efficient multi-prompt evaluation of LLMs

    Authors: Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, Mikhail Yurochkin

    Abstract: Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs' abilities and can affect the reproducibility of results on leaderboards. Many recent works empirically verify prompt sensitivity and advocate for changes in LLM evaluation. In this paper, we consider the problem of estimating the performance distribution across many prompt va…

    Submitted 7 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

  38. arXiv:2405.16236  [pdf, ps, other]

    stat.ML cs.LG

    A statistical framework for weak-to-strong generalization

    Authors: Seamus Somerstep, Felipe Maia Polo, Moulinath Banerjee, Ya'acov Ritov, Mikhail Yurochkin, Yuekai Sun

    Abstract: Modern large language model (LLM) alignment techniques rely on human feedback, but it is unclear whether the techniques fundamentally limit the capabilities of aligned LLMs. In particular, it is unclear whether it is possible to align (stronger) LLMs with superhuman capabilities with (weaker) human feedback without degrading their capabilities. This is an instance of the weak-to-strong generalizat…

    Submitted 25 May, 2024; originally announced May 2024.

  39. arXiv:2405.15505  [pdf, other]

    cs.LG cs.AI stat.ML

    Revisiting Counterfactual Regression through the Lens of Gromov-Wasserstein Information Bottleneck

    Authors: Hao Yang, Zexu Sun, Hongteng Xu, Xu Chen

    Abstract: As a promising individualized treatment effect (ITE) estimation method, counterfactual regression (CFR) maps individuals' covariates to a latent space and predicts their counterfactual outcomes. However, the selection bias between control and treatment groups often imbalances the two groups' latent distributions and negatively impacts this method's performance. In this study, we revisit counterfac…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: 19 pages

  40. arXiv:2405.15172  [pdf, other]

    stat.ML cs.LG

    Learning the Distribution Map in Reverse Causal Performative Prediction

    Authors: Daniele Bracale, Subha Maity, Moulinath Banerjee, Yuekai Sun

    Abstract: In numerous predictive scenarios, the predictive model affects the sampling distribution; for example, job applicants often meticulously craft their resumes to navigate through a screening system. Such shifts in distribution are particularly prevalent in the realm of social computing, yet the strategies to learn these shifts from data remain remarkably limited. Inspired by a microeconomic model…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: 17 pages, 4 figures

  41. arXiv:2405.14982  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    In-context Time Series Predictor

    Authors: Jiecheng Lu, Yan Sun, Shihao Yang

    Abstract: Recent Transformer-based large language models (LLMs) demonstrate in-context learning ability to perform various functions based solely on the provided context, without updating model parameters. To fully utilize the in-context capabilities in time series forecasting (TSF) problems, unlike previous Transformer-based or LLM-based time series forecasting methods, we reformulate "time series forecast…

    Submitted 23 May, 2024; originally announced May 2024.

  42. arXiv:2405.14892  [pdf, other]

    cs.DC stat.CO

    Parallel Approximations for High-Dimensional Multivariate Normal Probability Computation in Confidence Region Detection Applications

    Authors: Xiran Zhang, Sameh Abdulah, Jian Cao, Hatem Ltaief, Ying Sun, Marc G. Genton, David E. Keyes

    Abstract: Addressing the statistical challenge of computing the multivariate normal (MVN) probability in high dimensions holds significant potential for enhancing various applications. One common way to compute high-dimensional MVN probabilities is the Separation-of-Variables (SOV) algorithm. This algorithm is known for its high computational complexity of O(n^3) and space complexity of O(n^2), mainly due t…

    Submitted 18 May, 2024; originally announced May 2024.

  43. arXiv:2405.13346  [pdf, other]

    math.OC stat.ML

    Convergence of the Deep Galerkin Method for Mean Field Control Problems

    Authors: William Hofgard, Jingruo Sun, Asaf Cohen

    Abstract: We establish the convergence of the deep Galerkin method (DGM), a deep learning-based scheme for solving high-dimensional nonlinear PDEs, for Hamilton-Jacobi-Bellman (HJB) equations that arise from the study of mean field control problems (MFCPs). Based on a recent characterization of the value function of the MFCP as the unique viscosity solution of an HJB equation on the simplex, we establish bo…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: 27 pages, 6 figures

    MSC Class: 91A07; 35Q89; 68T07; 49L12; 49N10; 35A35; 60J27

  44. arXiv:2405.11547  [pdf, other]

    stat.ML cs.CR cs.LG

    Certified Robust Accuracy of Neural Networks Are Bounded due to Bayes Errors

    Authors: Ruihan Zhang, Jun Sun

    Abstract: Adversarial examples pose a security threat to many critical systems built on neural networks. While certified training improves robustness, it also decreases accuracy noticeably. Despite various proposals for addressing this issue, the significant accuracy drop remains. More importantly, it is not clear whether there is a certain fundamental limit on achieving robustness whilst maintaining accura…

    Submitted 20 June, 2024; v1 submitted 19 May, 2024; originally announced May 2024.

    Comments: accepted by CAV 2024

  45. arXiv:2405.08668  [pdf, other]

    cs.CV cs.AI cs.LG stat.AP

    Promoting AI Equity in Science: Generalized Domain Prompt Learning for Accessible VLM Research

    Authors: Qinglong Cao, Yuntian Chen, Lu Lu, Hao Sun, Zhenzhong Zeng, Xiaokang Yang, Dongxiao Zhang

    Abstract: Large-scale Vision-Language Models (VLMs) have demonstrated exceptional performance in natural vision tasks, motivating researchers across domains to explore domain-specific VLMs. However, the construction of powerful domain-specific VLMs demands vast amounts of annotated data, substantial electrical energy, and computing resources, primarily accessible to industry, yet hindering VLM research in a…

    Submitted 14 May, 2024; originally announced May 2024.

  46. arXiv:2405.04904  [pdf, other]

    stat.ME stat.AP

    Dependence-based fuzzy clustering of functional time series

    Authors: Angel Lopez-Oriona, Ying Sun, Han Lin Shang

    Abstract: Time series clustering is an important data mining task with a wide variety of applications. While most methods focus on time series taking values on the real line, very few works consider functional time series. However, functional objects frequently arise in many fields, such as actuarial science, demography or finance. Functional time series are indexed collections of infinite-dimensional curve…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: 43 pages, 5 figures, 10 tables. arXiv admin note: substantial text overlap with arXiv:2402.08687

    MSC Class: 62R10

  47. arXiv:2405.04254  [pdf, ps, other]

    stat.ME

    Distributed variable screening for generalized linear models

    Authors: Tianbo Diao, Lianqiang Qu, Bo Li, Liuquan Sun

    Abstract: In this article, we develop a distributed variable screening method for generalized linear models. This method is designed to handle situations where both the sample size and the number of covariates are large. Specifically, the proposed method selects relevant covariates by using a sparsity-restricted surrogate likelihood estimator. It takes into account the joint effects of the covariates rather…

    Submitted 7 May, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

  48. arXiv:2405.00859  [pdf, other]

    stat.AP

    WATCH: A Workflow to Assess Treatment Effect Heterogeneity in Drug Development for Clinical Trial Sponsors

    Authors: Konstantinos Sechidis, Sophie Sun, Yao Chen, Jiarui Lu, Cong Zang, Mark Baillie, David Ohlssen, Marc Vandemeulebroecke, Rob Hemmings, Stephen Ruberg, Björn Bornkamp

    Abstract: This paper proposes a Workflow for Assessing Treatment effeCt Heterogeneity (WATCH) in clinical drug development targeted at clinical trial sponsors. The workflow is designed to address the challenges of investigating treatment effect heterogeneity (TEH) in randomized clinical trials, where sample size and multiplicity limit the reliability of findings. The proposed workflow includes four steps: A…

    Submitted 1 May, 2024; originally announced May 2024.

  49. arXiv:2405.00675  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    Self-Play Preference Optimization for Language Model Alignment

    Authors: Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, Quanquan Gu

    Abstract: Traditional reinforcement learning from human feedback (RLHF) approaches relying on parametric models like the Bradley-Terry model fall short in capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences, enabling more flexible and accurate language mo…

    Submitted 14 June, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

    Comments: 27 pages, 4 figures, 5 tables

  50. arXiv:2405.00581  [pdf, other]

    stat.ME stat.CO

    Conformalized Tensor Completion with Riemannian Optimization

    Authors: Hu Sun, Yang Chen

    Abstract: Tensor data, or multi-dimensional array, is a data format popular in multiple fields such as social network analysis, recommender systems, and brain imaging. It is not uncommon to observe tensor data containing missing values and tensor completion aims at estimating the missing values given the partially observed tensor. Sufficient efforts have been spared on devising scalable tensor completion al…

    Submitted 1 May, 2024; originally announced May 2024.