Predictive modeling is the development of a model that is best able to predict an outcome based on given input variables. Model algorithms are the procedures used to define the functions that transform the data within a model. Common algorithms include logistic regression (LR), linear discriminant analysis (LDA), classification and regression trees (CART), naïve Bayes (NB), and k-nearest neighbors (KNN). Data preprocessing options, such as feature extraction and reduction, and model algorithms are commonly selected empirically in epidemiological studies even though these decisions can significantly affect model performance. Accordingly, full model selection (FMS) methods were developed to provide a systematic approach to selecting predictive modeling methods; however, current limitations of FMS, such as its dependency on user-selected hyperparameters, have prevented its routine incorporation into analyses for model performance optimization. Here we present the use of regression trees as an innovative method to apply FMS. Regression tree FMS (rtFMS) requires the development of a model for every combination of predictive modeling method options under consideration. The iterated cross-validation performances of these models are then passed through a regression tree for selection of a final model. We demonstrate the benefits of rtFMS using a milk Fourier transform infrared spectroscopy dataset, wherein we build prediction models for two blood metabolic health parameters in dairy cows, nonesterified fatty acids (NEFA) and β-hydroxybutyric acid (BHBA). The goal of building NEFA and BHBA prediction models is to provide a milk-based screening tool for metabolic health in dairy cattle that can be incorporated automatically into milk analysis routines. These models could be used in conjunction with physical exams, cow-side tests, and other indications to initiate medical intervention.
In contrast to previously reported FMS methods, rtFMS is not a black box, is simple to implement and interpret, has no hyperparameters, and illustrates the relative importance of modeling options. Additionally, rtFMS allows for indirect comparisons among models developed using different datasets. Finally, rtFMS eliminates user bias due to personal preference for certain methods and removes the dependency on published comparisons of methods. Thus, rtFMS provides clear benefits over the empirical selection of data preprocessing options and model algorithms.
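The rtFMS procedure described above (one model per combination of options, with the repeated cross-validation performances then passed through a regression tree) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the option grid `OPTIONS`, the placeholder `toy_cv_score` function that stands in for real cross-validated performance (its numbers are invented, not results from the study), the hand-written squared-error tree, and the rule of picking the leaf with the highest mean score. The abstract does not specify how the tree is grown or how the final model is read from it.

```python
from itertools import product
from statistics import mean

# Hypothetical option grid; option names and levels are placeholders.
OPTIONS = {
    "preprocessing": ["none", "derivative", "pca"],
    "algorithm": ["LR", "LDA", "KNN"],
}

def toy_cv_score(combo, repeat):
    """Stand-in for a real cross-validated performance; NOT real data."""
    base = {"none": 0.60, "derivative": 0.75, "pca": 0.65}[combo["preprocessing"]]
    bonus = {"LR": 0.02, "LDA": 0.05, "KNN": 0.00}[combo["algorithm"]]
    return base + bonus + 0.01 * (repeat % 2)  # small deterministic wobble

def performance_table(repeats=4):
    """One row per (option combination, cross-validation repeat)."""
    rows = []
    for levels in product(*OPTIONS.values()):
        combo = dict(zip(OPTIONS, levels))
        for rep in range(repeats):
            rows.append({**combo, "score": toy_cv_score(combo, rep)})
    return rows

def sse(scores):
    """Sum of squared deviations from the mean."""
    m = mean(scores)
    return sum((s - m) ** 2 for s in scores)

def best_split(rows):
    """Binary split 'option == level' that most reduces the squared error."""
    parent = sse([r["score"] for r in rows])
    best = None
    for option, levels in OPTIONS.items():
        for level in levels:
            left = [r for r in rows if r[option] == level]
            right = [r for r in rows if r[option] != level]
            if not left or not right:
                continue
            gain = (parent - sse([r["score"] for r in left])
                    - sse([r["score"] for r in right]))
            if best is None or gain > best[0]:
                best = (gain, option, level, left, right)
    return best

def grow(rows, min_gain=1e-9):
    """Recursively split; return the leaves (stopping rule is an assumption)."""
    split = best_split(rows)
    if split is None or split[0] <= min_gain:
        combos = sorted({tuple(r[o] for o in OPTIONS) for r in rows})
        return [{"combos": combos, "mean": mean(r["score"] for r in rows)}]
    _, _, _, left, right = split
    return grow(left, min_gain) + grow(right, min_gain)

rows = performance_table()
root = best_split(rows)            # first split: most influential option
leaves = grow(rows)
best = max(leaves, key=lambda leaf: leaf["mean"])
print("first split:", root[1], "==", root[2])
print("selected options:", best["combos"])
```

In this toy table the first split falls on the preprocessing option, which is how a tree of this kind can convey the relative importance of modeling options. With real, noisy performances a library implementation of regression trees would likely be preferable to this hand-written one.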
Keywords: Fourier-transform infrared spectra; Full model selection; Prediction model; Preprocessing; Regression tree.
Copyright © 2018 Elsevier B.V. All rights reserved.