Sensitivity Analysis
Boosting and Lassoing Cancer Risk Factors: New Prostate Cancer SNP Risk Factors
Keywords: variable selection, boosting, lasso, risk factors, prostate cancer
Venugopal Gopalakrishna-Remani
Matthew Cooper
Fiona Green
Margaret Rayman
Abstract: We begin by arguing that the often used algorithm for the discovery and use of disease
risk factors, stepwise logistic regression, is unstable. We then argue that there are
other algorithms available that are much more stable and reliable (e.g. the lasso and
gradient boosting). We then propose a protocol for the discovery and use of risk
factors using lasso or boosting variable selection. We then illustrate the use of the
protocol with a set of prostate cancer data and show that it recovers known risk factors.
Finally, we use the protocol to identify new SNP based risk factors for prostate cancer.
Kenneth Berk
[email protected]
expert in statistical variable selection, Fellow of American Statistical Association,
Winner of ASA/ASTM Youden Award
Felix Offodile
[email protected]
expert statistician, many publications in area
1 M&IS Dept., Kent State University, Kent OH 44242
2 Dept. of Management, University of Texas-Tyler, Tyler TX 75799
3 Dept. of Internal Medicine, Washington University School of Medicine, St. Louis MO 63110
4 University of Manchester, Div. of Cardiovascular Sciences, School of Medical Sciences, Faculty
of Biology, Medicine and Health, Manchester, UK
5 Dept. of Nutritional Sciences, University of Surrey, Guildford GU2 7XH, UK
Corresponding Author: David E. Booth, Professor Emeritus, Kent State University, 595 Martinique
Circle, Stow OH 44224; ph. 330-805-0239; email: [email protected]
1. Introduction
As Austin and Tu [1] remark, researchers as well as physicians are often interested in determining
the independent predictors of a disease state. These predictors, often called risk factors, are important
in disease diagnosis, prognosis and general patient management as the attending physician tries to
optimize patient care. In addition, knowledge of these risk factors helps researchers evaluate new
treatment modalities and therapies as well as help make comparisons across different hospitals [1].
Because risk factors are so important in patient care, it behooves us to do the best job possible in the
discovery and use of disease risk factors. Because new statistical methods [2], [3], [4], [5], [6], [7],
[8], [9] have been and are being developed, it is important for risk factor researchers to be aware
of these new methods and to adjust their protocols for the discovery and use of risk factors as necessary.
In this paper, we argue that now is such a time. For a number of years in risk factor research, a method
of automatic variable selection called stepwise regression, along with its variants forward selection and
backward elimination ([10], chapter 9), has been used even as new methods have become available
(see [11], [12], [13], [14], [15], [16], [17] and many others). The last three cited are risk factor
studies. We do not argue for a change of protocols in risk factor discovery and use simply because
newer methods are available. As the literature shows [1], the older methods are often unreliable and the
newer methods are much less so. The purpose of this paper is the following:
1. To summarize some of the studies that show that stepwise regression and its
variants, as now used more often than they should be in risk factor studies, are
unreliable and in fact may cause some of the irreproducibility of life sciences
2. To argue on the basis of current research that there are methods available that are
3. To propose a modern statistical protocol for the discovery and use of risk factors
4. To illustrate the use of the protocol developed in 3 using a set of prostate cancer data
[19].
5. To report the finding of new prostate cancer risk factors using the modern
procedures.
We further note that nothing in the way of statistical methods is new in this paper. What is new is the
introduction of a clear protocol to identify and use disease risk factors that involve much less problematic
methods than stepwise regression. We then use the proposed methodology to identify a known prostate
cancer risk factor and then discover new prostate cancer risk factors.
1.2. What then should replace these automatic variable selection methods?
From the references in Section 1, we see that the shrinkage methods have done well when
compared to the current stepwise and all subsets methods, and thus we follow the suggestion of
Steyerberg et al. [4] and look at shrinkage methods. The question then becomes: which shrinkage method
should we choose as the next variable selection method? We are impressed by the work of Ayers and
Cordell [2] in this regard. First we note that shrinkage estimators are also called penalized estimators. In
particular, the lasso [7] as defined by Zou [20] can be considered. We note that the factor lambda is said to
be the penalty.
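For reference, the lasso criterion mentioned above can be written as follows. This is a sketch in standard notation rather than a reproduction of the original display: y is the response, x_j the j-th predictor, p the number of predictors, and \ell(\beta) the logistic log-likelihood. The second line is the usual logistic regression analogue and is included here as an assumption about the intended setting.

```latex
% Lasso for a linear model, in the form commonly attributed to Tibshirani [7] and used by Zou [20]:
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta}\;
    \Bigl\| y - \sum_{j=1}^{p} x_j \beta_j \Bigr\|^2
    + \lambda \sum_{j=1}^{p} |\beta_j| ,
\qquad \lambda \ge 0 .

% Analogue for logistic regression, with \ell(\beta) the log-likelihood:
\hat{\beta}^{\text{lasso}}_{\text{logistic}}
  = \arg\min_{\beta}\;
    \Bigl\{ -\ell(\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \Bigr\} .
```

Here \lambda = 0 gives the unpenalized fit, and larger \lambda shrinks more coefficients exactly to zero, which is what makes the lasso usable for variable selection.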
Now Ayers and Cordell [2] studied “the performance of penalizations in selecting SNPs as
predictors in genetic association studies”, where SNP stands for single nucleotide polymorphism. Their
conclusion is: “Results show that penalized methods outperform single marker analysis, with the main
difference being that penalized methods allow the simultaneous inclusion of a number of markers, and
generally do not allow correlated variables to enter the model in which most of the identified explanatory
markers are accounted for”, as shown by Tibshirani [7]. In addition, lasso prevents overfitting the model
([9], p. 304). At this point, penalized estimators (i.e. shrinkage) look very attractive in risk factor type
studies ([9], chapter 16), especially given the relationship between lasso and boosting ([9], p. 320).
Another paper [20] helps us make our final decision. Zou [20] considers a procedure called
adaptive lasso in which different values of the parameter λ are allowed for each of the regression
coefficients. Furthermore, Zou shows that an adaptive lasso procedure is an oracle procedure such that
Zou then extends these results to the adaptive lasso for logistic regression. Wang and Leng
[21] developed an approximate adaptive lasso (i.e. a different λ for each β is allowed) by least
squares approximation for many types of regression. Boos [22] shows how easily this procedure
can be implemented in the statistical language R for logistic regression. Thus, we choose
to use the least squares approximation to their adaptive lasso logistic regression in the next
section. We note here that a special variant of lasso, group lasso [23] is needed for
In the next section, we propose and discuss a protocol for the discovery and use of risk factors in
logistic regression models. In the following section we illustrate the use of the protocol using the
data of Cooper et al [19] to look at some risk factors for prostate cancer. We will show that
currently known risk factors can be identified as well as new risk factors discovered using these
methods.
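The adaptive lasso and its least squares approximation discussed earlier in this section can be sketched in symbols as follows. This is our reading of the general idea, not a transcription of [20] or [21]: \tilde{\beta} denotes an initial unpenalized estimate, \hat{\Sigma} its estimated covariance matrix, and \gamma a user-chosen exponent; all three symbols are introduced here for illustration.

```latex
% Adaptive lasso: coefficient-specific penalties built from an initial estimate \tilde{\beta}
\hat{\beta}^{\text{alasso}}
  = \arg\min_{\beta}\;
    \Bigl\{ -\ell(\beta) + \lambda \sum_{j=1}^{p} \hat{w}_j\,|\beta_j| \Bigr\},
\qquad \hat{w}_j = \frac{1}{|\tilde{\beta}_j|^{\gamma}}, \quad \gamma > 0 .

% Least squares approximation: replace -\ell(\beta) by a quadratic around \tilde{\beta}
(\beta - \tilde{\beta})^{\top}\,\hat{\Sigma}^{-1}\,(\beta - \tilde{\beta})
  + \sum_{j=1}^{p} \lambda_j\,|\beta_j| ,
\qquad \lambda_j = \lambda\,\hat{w}_j .
```

Coefficients with a small initial estimate receive a large penalty \lambda_j and are removed first, while coefficients with a large initial estimate are shrunk only lightly.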
In addition, a second new method of variable selection called gradient boosting has been
developed ([24], [25], [26] (Chapter 8), [27], [9] (Chapter 17)). This method has some of the same
advantages as lasso and we add it to the protocol and test it as well. The boosting method makes
2.1. A suggested protocol for using logistic type regression to discover and use disease risk
factors.
Our suggested protocol is shown below. We discuss the protocol in this section and illustrate its
use with prostate cancer risk factors in the following section. This protocol uses the R statistical
language. R was chosen because of its power and the fact that all of the required algorithms are
available in R.
2. Input to R.
risk factor) as described by Harrell [28] (Chapter 10) for logistic type regression.
4. Select a set of potential risk factors. If an X variable is continuous, we suggest use of the
Bianco-Yohai robust (outlier resistant, see [30]) estimator and further suggest putting outliers
aside for further analysis as they may give rise to extra information [30].
5. Now build a full risk factor prediction model as described by Shmueli [39].
6. Use potential risk factors (Xs) to form a full model with the appropriate dependent variable
(as in 3).
7. If any variables are continuous, repeat step 4 using the entire potential full prediction model.
8. With any outliers set aside for further study, regress the dependent variable on the logistic
regression type full model using the adaptive lasso method, least squares approximation, as
risk factor based reduced model. If categorical risk factors are present use group lasso
regression [23]. Use graphs like Fig. 1 in [23] to identify the zero lasso regression
11. Validate the reduced model using bootstrap cross validation or 10-fold cross validation [28],
validating the full model of step 6 in the same way if there is any doubt about variables
discarded from it, and then check the usual model diagnostics [29] for lasso, boosting,
or both.
12. Predict with the reduced model containing the appropriate risk factors as described in Harrell
A. We note that for the genome-wide case of predictors one should refer to [33] and [34].
B. All logistic regression assumptions should be checked and satisfied as in Pregibon [29].
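The 10-fold cross validation named in step 11 can be sketched as below. This is a minimal illustration in Python using only the standard library, not the R workflow of [28]; the function names, the toy data, and the majority-class stand-in model are all hypothetical and introduced only for this example.

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Split row indices 0..n-1 into k disjoint folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_error(rows, labels, fit, predict, k=10, seed=0):
    """Average misclassification rate over k held-out folds.

    `fit(train_rows, train_labels)` returns a model and
    `predict(model, row)` returns 0 or 1; both are supplied by the caller.
    """
    folds = kfold_indices(len(rows), k, seed)
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(rows)) if i not in held]
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        wrong = sum(predict(model, rows[i]) != labels[i] for i in held_out)
        errors.append(wrong / len(held_out))
    return sum(errors) / len(errors)

# Toy usage with a majority-class "model" (hypothetical data, not the study data).
rows = [[i % 3] for i in range(50)]
labels = [1 if i % 5 == 0 else 0 for i in range(50)]
fit = lambda X, y: int(sum(y) * 2 >= len(y))   # predicts the more common class
predict = lambda model, row: model
cv_error = cross_validated_error(rows, labels, fit, predict, k=10)
print(f"10-fold cross-validated error: {cv_error:.3f}")
```

In practice `fit` and `predict` would wrap the reduced logistic model, so that variable selection is repeated inside every fold rather than once on the full data.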
3. Results
This example is taken from Cooper et al. [19], where the data (including all sample sizes) and
biological system are described. The data set used in this paper is a subset of the Cooper et al. data set
with all observations containing missing values of model variables removed. Because all potential
predictor variables are categorical, no imputation was performed. The coding assignments and
the variable definitions are given in the Appendix. The simple and multiple logistic regressions are
carried out as described in [28]. Robust logistic regressions, when needed, are carried out as described in
[30]. Variable selection is carried out using the adaptive lasso [20] with the least squares approximation
of Wang and Leng [21] for continuous independent variables and by group lasso [23] for categorical
independent variables. Gradient boosting is carried out using R Package gbm [24] as described by [25],
[31], [27]. All computations are carried out using the R statistical language. The R functions for variable
selection (adaptive lasso and group lasso) along with the papers are available from Boos [22], and used as
described there. The use of the group lasso R function is covered in R help for packages grplasso and
grpreg. The data sets and R programs are available from the authors (DEB). The variables studied as
potential risk factors are listed in the X column of Table 1. The dependent variable is current status.
We now follow the protocol and explain each step in detail. We begin by considering the one
predictor logistic regressions in Table 1. First note that all potential risk factors in this data set are
categorical (factors) so we do not have to consider the Bianco-Yohai [35] estimator of protocol Step 4 for
this data. We note that this is often not the case. Cooper et al [19] hypothesize a SNP-SNP interaction as a
risk factor for prostate cancer where SNP denotes a single nucleotide polymorphism. We now test this
hypothesis and attempt to answer the question: is there such an interaction? In order to answer this
question, we first note that the answer is not completely contained in Table 1. Second, we recall that we
have a gene-gene interaction of two genes if both affect the final phenotype of the individual together. To
be specific, we now consider the two genes representing the relevant alleles of the SEPP1 and SOD2
genes. If there is a gene-gene interaction, we must see the following statistically. The relevant alleles of
the SEPP1 and SOD2 genes must be selected to be in a reasonable prediction equation for the disease
state by the appropriate lasso or boosting algorithm (see Figures 1,2, Tables 3, 4). The appropriate lasso
algorithm here is the group lasso for logistic regression because the predictor variables are categorical.
We now note that in our data set we have four candidate predictor variables from which to search for our
Either observation of the Variable Values or a simple trial shows that we cannot include all four variables
in the model at once because they are pairwise collinear. Hence we have to separate the variables into the
two cases, the models of Figure 1 and Figure 2. We also note that lasso generally does not allow
We now begin our search using lasso with the model of Figure 1. This gives us a candidate for an
interaction. We then perform the group lasso analysis of Figure 1. Here we must determine if the
relevant alleles are included in the group lasso selected prediction equation. Roughly this is the case if
the lasso regression coefficients are not zero at the end of the algorithm’s execution as shown on the
coefficient path plot of Figure 1. By looking at equation (2.2) of [23] we see that 0≤λ<∞ hence as
λ→∞, sλ(β)→0 and thus βi→0 but not uniformly. Hence the question is what value of λ do we choose to
determine if the coefficients are close enough to zero to discard that term from the model as a zero
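The behaviour of a coefficient path as λ grows can be illustrated with a small sketch. It uses the closed-form group shrinkage that holds for a linear model with an orthonormal design, which is a simplifying assumption and not the logistic group lasso algorithm of [23]; the group names and coefficient values are hypothetical, and the group-size scaling used in [23] is omitted.

```python
import math

def group_soft_threshold(b_group, lam):
    """Shrink a whole coefficient group toward zero.

    Returns (1 - lam / ||b||)_+ * b, which is exactly zero once lam >= ||b||.
    """
    norm = math.sqrt(sum(b * b for b in b_group))
    if norm <= lam:
        return [0.0] * len(b_group)
    scale = 1.0 - lam / norm
    return [scale * b for b in b_group]

# Hypothetical unpenalized dummy-variable coefficients for two categorical factors.
unpenalized = {
    "factor_A": [0.60, 0.45],   # strong group, Euclidean norm 0.75
    "factor_B": [0.10, -0.05],  # weak group, Euclidean norm about 0.11
}

lambdas = [0.0, 0.05, 0.2, 0.5, 1.0]
path = {
    name: [group_soft_threshold(b, lam) for lam in lambdas]
    for name, b in unpenalized.items()
}

for lam_index, lam in enumerate(lambdas):
    kept = [name for name in path if any(path[name][lam_index])]
    print(f"lambda={lam}: groups still in the model -> {kept}")
```

The weak group drops out at a much smaller λ than the strong one, which is why the verdict on whether a term is retained depends on the λ at which the path is read.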
Table 1 Simple Logistic Regression Results
Dependent variable: CURRENTSTATUS. Intercepts are not listed.
X                  Level   Coeff.      SE         P
X_STRATUM                  -0.055132   0.005646   <2x10^-16
MnSOD_AD_Final     0       -0.4334     0.1241     0.000477
                   1       -0.2478     0.1157     0.032196
                   2       -0.3140     0.1233     0.010879
SeP_Ad_Final       0        0.21219    0.10309    0.039557
                   1        0.12890    0.10754    0.230675
                   2        0.23484    0.15797    0.137117
MnSOD_DOM_Final    0        0.4334     0.1241     0.000477
                   1        0.2704     0.1126     0.016369
SeP_DOM_Final      0        0.21219    0.10309    0.039557
                   1        0.14445    0.10568    0.171679
Smoke_ever         0       -0.00339    0.08161    0.967
                   1       -0.03791    0.07016    0.589
Alco_ever          0       -0.428943   0.142425   0.0026
                   1        0.002951   0.062317   0.9622
FAMHIST                     0.84619    0.09497    <2x10^-16
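A coefficient from a simple logistic regression such as those in Table 1 is a log odds ratio, so it can be converted to an odds ratio with an approximate Wald interval. The sketch below does this for the FAMHIST row, assuming the usual normal approximation and taking the coefficient as the effect per unit of the predictor as coded; the helper name is hypothetical.

```python
import math

def odds_ratio_with_ci(coef, se, z=1.96):
    """Return (odds ratio, lower, upper) for a logistic coefficient and its SE."""
    return (math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se))

# Coefficient and standard error for FAMHIST as listed in Table 1.
famhist_or, famhist_lo, famhist_hi = odds_ratio_with_ci(0.84619, 0.09497)
print(f"FAMHIST odds ratio: {famhist_or:.2f} "
      f"(approx. 95% Wald interval {famhist_lo:.2f} to {famhist_hi:.2f})")
```

An interval that excludes 1 corresponds to the small p-value reported in the table for the same row.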
Table 2
Note: λmin was computed by package grpreg using a Bayesian Information Criterion; λmax was computed by
package grplasso.
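Choosing λ by a Bayesian Information Criterion, as in the note above, amounts to scoring each point on the path and keeping the minimum. The sketch below uses the common form BIC = deviance + log(n) × df; whether grpreg uses exactly this form is an assumption here, and the path values and sample size are hypothetical.

```python
import math

def bic(deviance, df, n):
    """Bayesian Information Criterion: deviance plus log(n) per fitted parameter."""
    return deviance + math.log(n) * df

def select_lambda_by_bic(path, n):
    """Return the path entry (lam, deviance, df) with the smallest BIC."""
    return min(path, key=lambda fit: bic(fit["deviance"], fit["df"], n))

# Hypothetical path: larger lambda -> fewer nonzero coefficients, worse fit.
hypothetical_path = [
    {"lam": 5.0, "deviance": 560.0, "df": 1},
    {"lam": 2.0, "deviance": 531.0, "df": 3},
    {"lam": 1.0, "deviance": 522.0, "df": 4},
    {"lam": 0.1, "deviance": 520.5, "df": 6},
]
best = select_lambda_by_bic(hypothetical_path, n=400)
print(f"lambda chosen by BIC: {best['lam']}")
```

The criterion trades goodness of fit against the number of retained coefficients, so it stops decreasing once extra terms no longer improve the deviance by more than log(n) each.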
Figure 1 – The Group Lasso Coefficient plot for the logistic regression –
We note that for lambda=λopt none of the paths shrink to zero suggesting that a SNP-SNP interaction,
Figure 2 – Grouplasso Coefficient Plot for Model Containing MnSOD_AD_Final and SeP_DOM_Final
optimal λ to use we choose λ=1.428 to be the cutoff point. Hence we can now apply the condition of the
previous paragraph. We now check Figure 1 to see which if any of these candidate alleles are selected
for the group lasso prediction equation which was our criterion. We now examine the Figure 1 plot at
λopt=1.428. We note that at this λ none of the candidate alleles have coefficients of zero. Hence using
1. We need Figure 1 selection to show interaction. SeP_Ad_Final0 was Ala/Ala so this is one
allele that qualifies. Similarly for SeP_Ad_Final1 and 2 which are Ala/Thr and Thr/Thr
respectively.
2. Both MnSOD_DOM_Final0 and MnSOD_DOM_Final1 (i.e. Ala/Ala and +/Ala) satisfy, so this
shows that for MnSOD the result is +/Ala. Hence the identified interaction alleles are
SEPP1      SOD2
Ala/Ala    +/Ala    (which agrees with the Cooper et al [19] finding on a gene-gene interaction risk factor)
Ala/Thr    +/Ala
Thr/Thr    +/Ala
We now repeat this analysis for the model which contains the other possible candidate alleles. By
our criterion for gene-gene interaction we need βi≠0 for λopt=0.635, from observing Table 2. Now by
Final the 0, 1 and 2 values meet the criteria while for SeP_DOM_Final only the 0 and 1 alleles do. By
SeP_DOM_Final0 is Ala/Ala
MnSOD_AD_Final0 is Val/Val
1 is Val/Ala
2 is Ala/Ala
Hence we conclude that we have additional gene-gene interactions that are risk factors. Since one
SEPP1      SOD2
Ala/Ala    Val/Val
Ala/Ala    Val/Ala
+/Thr      Val/Val
+/Thr      Val/Ala
as risk factors. None of these have been reported in the prior literature as far as we can determine.
We can now make prediction equations using our now known risk factors, which will give our
predicted diagnosis of whether or not a patient is at risk for prostate cancer based on our variable values,
assuming that we use a new observation, not one that is included in our current data set. We
recommend the use of bootstrap cross validation to validate this equation; full details are included in
[28]. As a final reminder, all of the other assumptions of logistic regression need to be checked each and
every time such a model is used. The reader is referred to Pregibon [29] for further details. These new risk
factor results are particularly important since the SEPP1 gene product is in the same metabolic path as a
factor results are particularly important since the SEPP1 gene product is in the same metabolic path as a
We now repeat the analysis using gradient boosting. The results are shown in Tables 3 and 4.
TABLE 3
MnSOD_DOM_Final    68.96
SeP_Ad_Final       31.03
TABLE 4
MnSOD_AD_Final     75.29
SeP_DOM_Final      24.70
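Tables 3 and 4 appear to report a percentage for each predictor that sums to roughly 100, which is how boosting software commonly expresses relative influence. The sketch below shows, under simplifying assumptions, how such percentages can arise: it boosts one-split stumps on binary predictors with the logistic loss and credits each predictor with its share of the total squared-error improvement. It is not the gbm algorithm of [24] (leaf values here are plain mean residuals rather than Newton steps), and the predictor names and cell counts are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def boost_stumps(X, y, n_rounds=100, learning_rate=0.5):
    """Gradient boosting for a 0/1 outcome with one-split stumps on 0/1 predictors.

    Returns the relative influence (percent) of each predictor, measured as its
    share of the total squared-error improvement over all rounds.
    """
    n, names = len(y), list(X[0].keys())
    p0 = sum(y) / n
    scores = [math.log(p0 / (1.0 - p0))] * n   # start from the overall log-odds
    improvement = {name: 0.0 for name in names}

    for _ in range(n_rounds):
        # Negative gradient of the logistic loss: observed minus fitted probability.
        resid = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        overall = sum(resid) / n
        best = None
        for name in names:
            groups = {0: [], 1: []}
            for row, r in zip(X, resid):
                groups[row[name]].append(r)
            means = {g: sum(v) / len(v) for g, v in groups.items()}
            # Reduction in squared error from splitting on this predictor.
            gain = sum(len(v) * means[g] ** 2 for g, v in groups.items()) - n * overall ** 2
            if best is None or gain > best[0]:
                best = (gain, name, means)
        gain, name, means = best
        improvement[name] += gain
        scores = [s + learning_rate * means[row[name]] for s, row in zip(scores, X)]

    total = sum(improvement.values())
    return {name: 100.0 * v / total for name, v in improvement.items()}

# Hypothetical data: (snp_a, snp_b) cell, number of subjects, number of cases.
cells = [((0, 0), 100, 18), ((0, 1), 100, 22), ((1, 0), 100, 68), ((1, 1), 100, 72)]
X, y = [], []
for (a, b), n_cell, n_pos in cells:
    for i in range(n_cell):
        X.append({"snp_a": a, "snp_b": b})
        y.append(1 if i < n_pos else 0)

influence = boost_stumps(X, y)
for name, pct in influence.items():
    print(f"{name}: {pct:.2f}")
```

A predictor with a strong effect is chosen in the early, high-gain rounds and therefore accumulates most of the influence, while a weak predictor enters late and receives a small share.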
4. Discussion
As much as we would like this to be the last word on the discovery and use of disease risk factors
with logistic regression, it is not. We will mention a few possible limitations and our hope for some future
First, Ayers and Cordell [2] mention a limitation of this suggestion, the fact that there is no
known way to get confidence intervals and p-values for lasso estimates, i.e. the lasso regression
coefficients. Fortunately this is changing. Currently, there is a paper by Lockhart et al. [37] entitled “A
significance test for the lasso”. While this is a complicated paper that doesn’t solve all problems, a strong
beachhead has been established. Unfortunately this is not a test on individual lasso regression coefficients.
Next, we discussed the advantages of adaptive lasso earlier (esp. the oracle property) but no
algorithm currently exists to solve the adaptive group lasso problem in the case of logistic regression. We
conjecture based on the results of the linear regression case extended to the logistic case that if we could
extend adaptive lasso to the group lasso for logistic regression cases that the same desirable properties of
Finally, the usual problems of outliers, etc., as always, raise their heads. The Bianco-Yohai
algorithm [35] is a start, but this hasn’t been extended to any penalized shrinkage regression method. We
conclude that there is much work to be done and fully expect to see other papers like this one in the future.
We hope that statistical practice can continue to evolve and that even better solutions can be applied to these
5. Conclusion
We have attempted in this paper to bring up-to-date statistical thinking to the problem of the
identification and use of disease risk factors, where stepwise regression is still too often used. Much
remains to be done, but we hope that the ideas presented here will improve statistical practice in this very
important area. In the process of bringing this thinking up to date, we have shown that we recover a
currently known risk factor and identify new risk factors, which suggests the value of our approach. These
new risk factor results are particularly important since the SEPP1 gene product has recently been shown
to be in the same metabolic pathway as a tumor suppressor for prostate cancer [36].
Appendix
Data Set
Variable
Cases were classified as either non-aggressive at diagnosis (tumor stage 1 and 2, Gleason score < 8,
Differentiation G1-G2, NP/NX, MO/MX, PSA < 100 μg/L; NPC) or aggressive at diagnosis (tumor stage 3-4,
Gleason score ≥ 8, Differentiation G3-G4, N+, M+, PSA ≥ 100 μg/L; APC).
References
1 Austin, P. and Tu, J (2004), Automated Variable Selection Methods for logistic regression produced
unstable models for predicting acute myocardial infarction mortality, J. Clinical Epidemiology 57, 1138-
1146.
2 Ayers, K and Cordell, H (2010), SNP Selection in Genome-Wide and Candidate Gene Studies via
3 Yuan, M. and Lin, Y (2006), Model Selection and Estimation in Regression with Grouped
4 Steyerberg, E, Eijkemans, M, Harrell, Jr, F, Habbema, J. (2000), Prognostic Modeling with logistic
regression analysis: a comparison of selection and estimation methods in small data sets, Statist. Med.
5 Wiegand, R (2009), Performance of Using Multiple Stepwise algorithms for variable selection Statist.
6 Breiman, L (1995), Better Subset Regression Using the Nonnegative garrote, Technometrics 37 (4),
373-384
7 Tibshirani, R (1996), Regression Shrinkage and Selection via the lasso, Journal of the Royal Statistical
8 Dahlgren, J (2010), Alternative Regression Methods are not considered in Murtaugh (2009) or by
9 Efron, B. and Hastie, T. (2016), Computer Age Statistical Inference, Cambridge, Cambridge University
Press.
10 Chatterjee, S and Price B, (1977), Regression Analysis by Example, New York: John Wiley and Sons.
11 Neter J, Wasserman, W and Kutner, M (1983) Applied Linear Regression Models, Homewood:
Richard D. Irwin
12 Kutner, M, Nachtsheim, C, Neter, J, Li, W, (2005) Applied Linear Statistical Models, 5th ed., New
13 Labidi, M, Baillot, R, Dionne, B, LaCasse, Y, Maltais, F and Boulet, L (2009), Pleural Effusions
14 Queiroz, N, Sampaio, D, Santos, E, Bezerra, A. 2012, Logistic model for determining factors,
associated with HIV infection among blood donor candidates at the Fundacao HEMOPE
15 Qiu, L, Cheng, X, Wu, J, Liu, J, Xu, T, Ding, H, Liu, Y, Ge, Z, Wang, Y, Han, H, Liu, J, Zhu, G, 2013,
Prevalence of hyperuricemia and its related risk factors in healthy adults from Northern and Northeastern
16 Guo, L., Guo, X., Chang, Y., Yang, T., Zhang, L., Li, T., and Sun, Y. Prevalence and Risk Factors of
Heart Failure with Preserved Ejection Fraction, Int. J. Environ. Res. Public Health 2016, 13(8), 770
17 Khan, MS, Pervaiz, MK, Javed, I, Biostatistical Study of Clinical Risk Factors in Myocardial
18 Arnaud, D. H. (2014), Confronting Irreproducibility, Chemical and Engineering News 92 (50), 28-30.
19 Cooper, M., Adami, H., Gronberg, H., Wiklund, F., Green, F., Rayman, M. (2008), Interaction
Determines Prostate Cancer Risk, Cancer Res 2008: 68: (24), 10171-10177
20 Zou, H. (2006) The Adaptive lasso and its Oracle Properties, Journal of the American Statistical
21 Wang, H and Leng, C. (2008), A note on adaptive group lasso, Computational Statistics and Data
22 Boos, D. (2014) au., Adaptive Lasso in R, 2/9/2014,
http://www.stat.ncsu.edu/~boos/var.select/lasso.adaptive.html
23 Meier, L, Van der Geer, S, Buhlmann, P (2008), The group lasso for logistic regression J.R. Statist,
accessed 9/1/2016.
26 James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013), An Introduction to Statistical Learning,
N.Y.: Springer.
27 Maloney, K., Schmid, M., Weller, D., Applying Additive Modeling and Gradient Boosting to Assess
the Effects of Watershed and Reach Characteristics on Riverine Assemblages, Methods in Ecology and
28 Harrell, Jr., F., (2001) Regression Modeling Strategies; New York: Springer.
30 Hauser, R. and Booth, D (2011), Predicting Bankruptcy with robust logistic regression, J. Data Sci
9(4), 585-605.
31 Ho, R. (2012), Big Data Machine Learning, DZoneRefCard z #158, Carey NC:DZone Inc
32 Ryan, T (2009), Modern Regression Methods 2nd Ed, Hoboken, NJ: Wiley
33 Li, H, Das, K, Fu, G, Li, R, Wu, R (2011), The Bayesian lasso for genome-wide association studies,
34 Wu, T, Chen, Y F, Hastie T, Sobel, E, Lange, K, 2009 Genome wide association analysis by lasso
35 Bianco, A and Martinez, E. (2009), Robust testing in the logistic regression model, Computational
36 Ansong, E., Ying, Q., Ekoue, D. N., Deaton, R., Hall, A. R., Kajdacsy-Balla, A., Yang, W., Gann, P.
H., Diamond, A. M. (2015) Evidence that Selenium Binding Protein 1 is a Tumor Suppressor in Prostate
37 Lockhart, R, Taylor, J, Tibshirani, R. J, Tibshirani, R, A significance test for the lasso (2012),
38 Elith, J, Leathwick, J, Hastie, T (2008) A Working Guide to Boosted Regression Trees, Journal of