Online cross-validation-based ensemble learning

Stat Med. 2018 Jan 30;37(2):249-260. doi: 10.1002/sim.7320. Epub 2017 May 4.

Abstract

Online estimators update a current estimate with a new incoming batch of data without having to revisit past data, thereby providing streaming estimates that are scalable to big data. We develop flexible, ensemble-based online estimators of an infinite-dimensional target parameter, such as a regression function, in the setting where data are generated sequentially by a common conditional data distribution given summary measures of the past. This setting encompasses a wide range of time-series models and, as a special case, models for independent and identically distributed data. Our estimator considers a large library of candidate online estimators and uses online cross-validation to identify the algorithm with the best performance. We show that by basing estimates on the cross-validation-selected algorithm, we are asymptotically guaranteed to perform as well as the true, unknown best-performing algorithm. We provide extensions of this approach, including online estimation of the optimal ensemble of candidate online estimators. We illustrate excellent performance of our methods using simulations and a real data example where we make streaming predictions of infectious disease incidence using data from a large database. Copyright © 2017 John Wiley & Sons, Ltd.
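
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes squared-error loss, a simulated linear data stream, and three hypothetical candidates (a running mean and two stochastic gradient descent fits with different step sizes); all class and function names are invented for illustration. Each incoming batch is first used to score every candidate's current fit, and only afterwards to update the candidates, so past data are never revisited.

```python
import random


class RunningMean:
    """Hypothetical candidate: ignores x and predicts the running mean of y."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict(self, x):
        return self.mean

    def update(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


class SGDLinear:
    """Hypothetical candidate: fits y ~ a + b*x by SGD on squared error."""

    def __init__(self, step):
        self.step = step
        self.a = 0.0
        self.b = 0.0

    def predict(self, x):
        return self.a + self.b * x

    def update(self, x, y):
        err = self.predict(x) - y
        self.a -= self.step * err
        self.b -= self.step * err * x


def online_cv_select(stream, candidates):
    """Return the index of the candidate with the smallest cumulative
    out-of-sample loss, together with all cumulative losses."""
    cum_loss = [0.0] * len(candidates)
    for batch in stream:
        # Score first: the batch has not yet been seen by any candidate.
        for k, cand in enumerate(candidates):
            cum_loss[k] += sum((cand.predict(x) - y) ** 2 for x, y in batch)
        # Then update every candidate with the same batch.
        for cand in candidates:
            for x, y in batch:
                cand.update(x, y)
    best = min(range(len(candidates)), key=cum_loss.__getitem__)
    return best, cum_loss


def simulate_stream(n_batches, batch_size, seed=0):
    """Assumed toy data: y = 1 + 2x + Gaussian noise, x uniform on [-1, 1]."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            x = rng.uniform(-1.0, 1.0)
            batch.append((x, 1.0 + 2.0 * x + rng.gauss(0.0, 0.1)))
        yield batch


if __name__ == "__main__":
    library = [RunningMean(), SGDLinear(0.1), SGDLinear(0.001)]
    best, losses = online_cv_select(simulate_stream(200, 10), library)
    print("selected candidate index:", best)
    print("cumulative losses:", losses)
```

In this sketch the selector simply reports the candidate with the lowest accumulated loss; the ensemble extension mentioned in the abstract would instead learn weights over the candidates' predictions, which is not shown here.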

Keywords: cross-validation; dependent data; ensemble learning; machine learning; online estimation; stochastic gradient descent; time-series.

Publication types

  • Validation Study

MeSH terms

  • Algorithms*
  • Biostatistics
  • Communicable Diseases / epidemiology
  • Computer Simulation
  • Databases, Factual / statistics & numerical data
  • Humans
  • Incidence
  • Likelihood Functions
  • Machine Learning / statistics & numerical data*
  • Models, Statistical
  • Online Systems
  • Regression Analysis
  • Stochastic Processes