
Showing 1–14 of 14 results for author: Scornet, E

Searching in archive cs.
  1. arXiv:2402.03819  [pdf, other]

    stat.ML cs.LG

    Do we need rebalancing strategies? A theoretical and empirical study around SMOTE and its variants

    Authors: Abdoulaye Sakho, Emmanuel Malherbe, Erwan Scornet

    Abstract: Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing strategy for handling imbalanced tabular data sets. However, few works analyze SMOTE theoretically. In this paper, we prove that SMOTE (with default parameter) simply copies the original minority samples asymptotically. We also prove that SMOTE exhibits boundary artifacts, thus justifying existing SMOTE variants. Then we int…

    Submitted 3 June, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

  2. arXiv:2209.15283  [pdf, other]

    stat.ML cs.LG

    Sparse tree-based initialization for neural networks

    Authors: Patrick Lutz, Ludovic Arnould, Claire Boyer, Erwan Scornet

    Abstract: Dedicated neural network (NN) architectures have been designed to handle specific data types (such as CNN for images or RNN for text), which ranks them among state-of-the-art methods for dealing with these data. Unfortunately, no architecture has been found for dealing with tabular data yet, for which tree ensemble methods (tree boosting, random forests) usually show the best predictive performanc…

    Submitted 30 September, 2022; originally announced September 2022.

  3. arXiv:2202.01463  [pdf, other]

    stat.ML cs.LG

    Minimax rate of consistency for linear models with missing values

    Authors: Alexis Ayme, Claire Boyer, Aymeric Dieuleveut, Erwan Scornet

    Abstract: Missing values arise in most real-world data sets due to the aggregation of multiple sources and intrinsically missing information (sensor failure, unanswered questions in surveys...). In fact, the very nature of missing values usually prevents us from running standard learning algorithms. In this paper, we focus on the extensively-studied linear models, but in presence of missing values, which tu…

    Submitted 3 February, 2022; originally announced February 2022.

  4. arXiv:2106.00311  [pdf, other]

    stat.ML cs.AI cs.LG

    What's a good imputation to predict with missing values?

    Authors: Marine Le Morvan, Julie Josse, Erwan Scornet, Gaël Varoquaux

    Abstract: How to learn a good predictor on data with missing values? Most efforts focus on first imputing as well as possible and second learning on the completed data to predict the outcome. Yet, this widespread practice has no theoretical grounding. Here we show that for almost all imputation functions, an impute-then-regress procedure with a powerful learner is Bayes optimal. This result holds for all mi…

    Submitted 30 November, 2021; v1 submitted 1 June, 2021; originally announced June 2021.

  5. arXiv:2105.11724  [pdf, other]

    stat.ML cs.LG

    SHAFF: Fast and consistent SHApley eFfect estimates via random Forests

    Authors: Clément Bénard, Gérard Biau, Sébastien da Veiga, Erwan Scornet

    Abstract: Interpretability of learning algorithms is crucial for applications involving critical decisions, and variable importance is one of the main interpretation tools. Shapley effects are now widely used to interpret both tree ensembles and neural networks, as they can efficiently handle dependence and interactions in the data, as opposed to most other variable importance measures. However, estimating…

    Submitted 2 February, 2022; v1 submitted 25 May, 2021; originally announced May 2021.

  6. arXiv:2102.13347  [pdf, other]

    stat.ML cs.LG

    MDA for random forests: inconsistency, and a practical solution via the Sobol-MDA

    Authors: Clément Bénard, Sébastien da Veiga, Erwan Scornet

    Abstract: Variable importance measures are the main tools to analyze the black-box mechanisms of random forests. Although the mean decrease accuracy (MDA) is widely accepted as the most efficient variable importance measure for random forests, little is known about its statistical properties. In fact, the definition of MDA varies across the main random forest software. In this article, our objective is to r…

    Submitted 1 March, 2022; v1 submitted 26 February, 2021; originally announced February 2021.

  7. arXiv:2010.15690  [pdf, other]

    cs.LG math.ST

    Analyzing the tree-layer structure of Deep Forests

    Authors: Ludovic Arnould, Claire Boyer, Erwan Scornet, Sorbonne Lpsm

    Abstract: Random forests on the one hand, and neural networks on the other hand, have met great success in the machine learning community for their predictive performance. Combinations of both have been proposed in the literature, notably leading to the so-called deep forests (DF) (Zhou & Feng, 2019). In this paper, our aim is not to benchmark DF performances but to investigate instead their underlying mech…

    Submitted 14 October, 2021; v1 submitted 29 October, 2020; originally announced October 2020.

  8. arXiv:2007.01627  [pdf, other]

    cs.LG cs.AI stat.ML

    NeuMiss networks: differentiable programming for supervised learning with missing values

    Authors: Marine Le Morvan, Julie Josse, Thomas Moreau, Erwan Scornet, Gaël Varoquaux

    Abstract: The presence of missing values makes supervised learning much more challenging. Indeed, previous work has shown that even when the response is a linear function of the complete data, the optimal predictor is a complex function of the observed entries and the missingness indicator. As a result, the computational or sample complexities of consistent approaches depend on the number of missing pattern…

    Submitted 4 November, 2020; v1 submitted 3 July, 2020; originally announced July 2020.

    Journal ref: Advances in Neural Information Processing Systems 33, Dec 2020, Vancouver, Canada

  9. arXiv:2004.14841  [pdf, other]

    stat.ML cs.AI cs.LG

    Interpretable Random Forests via Rule Extraction

    Authors: Clément Bénard, Gérard Biau, Sébastien da Veiga, Erwan Scornet

    Abstract: We introduce SIRUS (Stable and Interpretable RUle Set) for regression, a stable rule learning algorithm which takes the form of a short and simple list of rules. State-of-the-art learning algorithms are often referred to as "black boxes" because of the high number of operations involved in their prediction process. Despite their powerful predictivity, this lack of interpretability may be highly re…

    Submitted 10 February, 2021; v1 submitted 29 April, 2020; originally announced April 2020.

  10. arXiv:2002.00658  [pdf, other]

    cs.LG cs.AI stat.ML

    Linear predictor on linearly-generated data with missing values: non consistency and solutions

    Authors: Marine Le Morvan, Nicolas Prost, Julie Josse, Erwan Scornet, Gaël Varoquaux

    Abstract: We consider building predictors when the data have missing values. We study the seemingly-simple case where the target to predict is a linear function of the fully-observed data and we show that, in the presence of missing values, the optimal predictor may not be linear. In the particular Gaussian case, it can be written as a linear function of multiway interactions between the observed data and t…

    Submitted 12 May, 2020; v1 submitted 3 February, 2020; originally announced February 2020.

    Journal ref: Proceedings of Machine Learning Research, PMLR, In press

  11. arXiv:1908.06852  [pdf, other]

    stat.ML cs.LG math.ST

    SIRUS: Stable and Interpretable RUle Set for Classification

    Authors: Clément Bénard, Gérard Biau, Sébastien da Veiga, Erwan Scornet

    Abstract: State-of-the-art learning algorithms, such as random forests or neural networks, are often qualified as "black-boxes" because of the high number and complexity of operations involved in their prediction mechanism. This lack of interpretability is a strong limitation for applications involving critical decisions, typically the analysis of production processes in the manufacturing industry. In such…

    Submitted 16 December, 2020; v1 submitted 19 August, 2019; originally announced August 2019.

  12. arXiv:1906.10529  [pdf, other]

    stat.ML cs.LG math.ST

    AMF: Aggregated Mondrian Forests for Online Learning

    Authors: Jaouad Mourtada, Stéphane Gaïffas, Erwan Scornet

    Abstract: Random Forests (RF) is one of the algorithms of choice in many supervised learning applications, be it classification or regression. The appeal of such tree-ensemble methods comes from a combination of several characteristics: a remarkable accuracy in a variety of tasks, a small number of parameters to tune, robustness with respect to features scaling, a reasonable computational cost for training…

    Submitted 15 May, 2020; v1 submitted 25 June, 2019; originally announced June 2019.

  13. arXiv:1902.06931  [pdf, other]

    stat.ML cs.LG math.ST

    On the consistency of supervised learning with missing values

    Authors: Julie Josse, Jacob M. Chen, Nicolas Prost, Erwan Scornet, Gaël Varoquaux

    Abstract: In many application settings, the data have missing entries which make analysis challenging. An abundant literature addresses missing values in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data. We show the consistency of two appr…

    Submitted 21 March, 2024; v1 submitted 19 February, 2019; originally announced February 2019.

  14. arXiv:1604.07143  [pdf, other]

    stat.ML cs.LG math.ST

    Neural Random Forests

    Authors: Gérard Biau, Erwan Scornet, Johannes Welbl

    Abstract: Given an ensemble of randomized regression trees, it is possible to restructure them as a collection of multilayered neural networks with particular connection weights. Following this principle, we reformulate the random forest method of Breiman (2001) into a neural network setting, and in turn propose two new hybrid procedures that we call neural random forests. Both predictors exploit prior know…

    Submitted 3 April, 2018; v1 submitted 25 April, 2016; originally announced April 2016.