Sensitivity Analysis

Cancer Causes & Control
Boosting and Lassoing Cancer Risk Factors: New Prostate Cancer SNP Risk Factors
--Manuscript Draft--
Manuscript Number:
Full Title: Boosting and Lassoing Cancer Risk Factors: New Prostate Cancer SNP Risk Factors
Article Type: Original Article
Keywords: variable selection, boosting, lasso, risk factors, prostate cancer
Corresponding Author: David Booth, PhD

Kent State University
Stow, OH UNITED STATES
Corresponding Author Secondary Information:
Corresponding Author's Institution: Kent State University
Corresponding Author's Secondary Institution:
First Author: David Booth, PhD
First Author Secondary Information:
Order of Authors: David Booth, PhD
Venugopal Gopalakrishna-Remani
Matthew Cooper
Fiona Green
Margaret Rayman
Order of Authors Secondary Information:
Funding Information:
Abstract: We begin by arguing that the often used algorithm for the discovery and use of disease
risk factors, stepwise logistic regression, is unstable. We then argue that there are
other algorithms available that are much more stable and reliable (e.g. the lasso and
gradient boosting). We then propose a protocol for the discovery and use of risk
factors using lasso or boosting variable selection. We then illustrate the use of the
protocol with a set of prostate cancer data and show that it recovers known risk factors.
Finally, we use the protocol to identify new SNP based risk factors for prostate cancer.
Suggested Reviewers: Thomas Isenhour

[email protected]
Founding editor of ACS Journal of Chemical Information. Earned ACS National Award
in Analytical Chemistry. Provost-Old Dominion University. Published over 200 articles
Kenneth Berk
[email protected]
expert in statistical variable selection, Fellow of American Statistical Association,
Winner of ASA/ASTM Youden Award
Felix Offodile
[email protected]
expert statistician, many publications in area
David E. Booth1, Venugopal Gopalakrishna-Remani2, Matthew Cooper3, Fiona R. Green4, Margaret P. Rayman5
1 M&IS Dept., Kent State University, Kent OH 44242
2 Dept. of Management, University of Texas-Tyler, Tyler TX 75799
3 Dept. of Internal Medicine, Washington University School of Medicine, St. Louis MO 63110
4 University of Manchester, Div. of Cardiovascular Sciences, School of Medical Sciences, Faculty of
Biology, Medicine and Health, Manchester, UK
5 Dept. of Nutritional Sciences, University of Surrey, Guildford GU2 7XH, UK
Short title: SNP Risk Factor, Discovery and Use
Corresponding Author: David E. Booth, Professor Emeritus, Kent State University, 595 Martinique
Circle, Stow OH 44224; ph. 330-805-0239; email: [email protected]
Draft Date: 3/6/18
Draft in Revision: 3/6/18
Not for Quotation. For submission to Cancer Causes and Control
Declarations of Interest: None

Abstract
We begin by arguing that the often used algorithm for the discovery and use of disease risk
factors, stepwise logistic regression, is unstable. We then argue that there are other algorithms available
that are much more stable and reliable (e.g. the lasso and gradient boosting). We then propose a protocol
for the discovery and use of risk factors using lasso or boosting variable selection. We then illustrate the
use of the protocol with a set of prostate cancer data and show that it recovers known risk factors.
Finally, we use the protocol to identify new SNP based risk factors for prostate cancer.
Keywords: variable selection, boosting, lasso, risk factors, prostate cancer
1. Introduction
As Austin and Tu [1] remark, researchers as well as physicians are often interested in determining
the independent predictors of a disease state. These predictors, often called risk factors, are important
in disease diagnosis, prognosis and general patient management as the attending physician tries to
optimize patient care. In addition, knowledge of these risk factors helps researchers evaluate new
treatment modalities and therapies and helps make comparisons across different hospitals [1].
Because risk factors are so important in patient care, it behooves us to do the best job possible in the
discovery and use of disease risk factors. Because new statistical methods [2], [3], [4], [5], [6], [7],
[8], [9] have been and are being developed, it is important for risk factor researchers to be aware
of these new methods and to adjust their protocols for the discovery and use of risk factors as necessary.
In this paper, we argue that now is such a time. For a number of years in risk factor research, a method
of automatic variable selection called stepwise regression and its variants, forward selection and
backward elimination ([10], Chapter 9), have been used even as new methods have become available
(see [11], [12], [13], [14], [15], [16], [17] and many others). The last three cited are risk factor
studies. We do not argue for a change of protocols in risk factor discovery and use simply because
newer methods are available. Rather, as the literature shows [1], the older methods are often unreliable
and the newer methods are much less so. The purposes of this paper are the following:
1. To summarize some of the studies showing that stepwise regression and its
variants, which are still used in risk factor studies more often than they should be, are
unreliable and may in fact contribute to the irreproducibility of life sciences
research described in [18], a point we return to later.
2. To argue on the basis of current research that there are methods available that are
considerably more reliable.
3. To propose a modern statistical protocol for the discovery and use of risk factors
when using logistic regression as is commonly done.
4. To illustrate the use of the protocol developed in 3 using a set of prostate cancer data
[19].
5. To report the finding of new prostate cancer risk factors using the modern
procedures.
We further note that nothing in the way of statistical methods is new in this paper. What is new is the
introduction of a clear protocol to identify and use disease risk factors that involve much less problematic
methods than stepwise regression. We then use the proposed methodology to identify a known prostate
cancer risk factor and then discover new prostate cancer risk factors.
1.2. What then should replace these automatic variable selection methods?
From the references in Section 1, we see that shrinkage methods have done well when
compared to the current stepwise and all-subsets methods, and thus we follow the suggestion of
Steyerberg et al. [4] and look at shrinkage methods. The question then becomes which shrinkage method
to choose as the next variable selection method. We are impressed by the work of Ayers and
Cordell [2] in this regard. First we note that shrinkage estimators are also called penalized estimators. In
particular, the lasso [7] as defined by Zou [20] can be considered. We note that the factor λ (lambda) is
referred to as the penalty.
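For orientation, the lasso criterion for a linear model can be written as below. This is the standard textbook form, given here as a reminder rather than as a quotation from [7] or [20]; the symbols y (response), x_j (predictor columns), p (number of predictors) and β_j (coefficients) are our notation.

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta}\;
    \Bigl\| y - \sum_{j=1}^{p} x_j \beta_j \Bigr\|^{2}
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert ,
  \qquad \lambda \ge 0 .
```

Larger values of λ shrink more coefficients exactly to zero, which is what makes the lasso a variable selection method as well as an estimation method.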
Ayers and Cordell [2] studied “the performance of penalizations in selecting SNPs as
predictors in genetic association studies”, where SNP stands for single nucleotide polymorphism. Their
conclusion is: “Results show that penalized methods outperform single marker analysis, with the main
difference being that penalized methods allow the simultaneous inclusion of a number of markers, and
generally do not allow correlated variables to enter the model in which most of the identified explanatory
markers are accounted for.”, as shown by Tibshirani [7]. In addition, the lasso prevents overfitting of the
model ([9], p. 304). At this point, penalized (i.e. shrinkage) estimators look very attractive for risk factor
type studies ([9], Chapter 16), especially given the relationship between the lasso and boosting ([9], p. 320).
Another paper [20] helps us make our final decision. Zou [20] considers a procedure called the
adaptive lasso, in which a different value of the parameter λ is allowed for each of the regression
coefficients. Furthermore, Zou shows that the adaptive lasso is an oracle procedure, meaning that its
estimator β̂(δ) asymptotically has the following properties:
a) It identifies the right subset model and
b) It has the optimal estimation rate.
Zou then extends these results to the adaptive lasso for logistic regression. Wang and Leng
[21] developed an approximate adaptive lasso (i.e. a different λ is allowed for each β) by least
squares approximation for many types of regression. Boos [22] shows how easily this method can be
implemented for logistic regression in the statistical language R. Thus, we choose
to use the least squares approximation to the adaptive lasso for logistic regression in the next
section. We note here that a special variant of the lasso, the group lasso [23], is needed for
categorical predictor variables.
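As a reminder of what the adaptive lasso and the two oracle properties look like in symbols, the following block gives the usual linear-model statement. It is written from memory of the standard formulation and should be checked against [20]; the weights ŵ_j, the exponent γ, the initial estimate β̂^init, the true coefficient vector β*, its support 𝒜 and the limiting covariance Σ* are notation introduced here.

```latex
% Adaptive lasso: one penalty level \lambda_n, rescaled per coefficient by data-driven weights
\hat{\beta}^{\text{alasso}}
  = \arg\min_{\beta}\;
    \Bigl\| y - \sum_{j=1}^{p} x_j \beta_j \Bigr\|^{2}
    + \lambda_n \sum_{j=1}^{p} \hat{w}_j \lvert \beta_j \rvert ,
  \qquad
  \hat{w}_j = \frac{1}{\lvert \hat{\beta}^{\text{init}}_j \rvert^{\gamma}}, \quad \gamma > 0 .

% Oracle properties, with \mathcal{A} = \{ j : \beta^{*}_j \neq 0 \}
\text{(a)}\quad \Pr\bigl( \{ j : \hat{\beta}^{\text{alasso}}_j \neq 0 \} = \mathcal{A} \bigr) \to 1 ,
\qquad
\text{(b)}\quad \sqrt{n}\,\bigl( \hat{\beta}^{\text{alasso}}_{\mathcal{A}} - \beta^{*}_{\mathcal{A}} \bigr)
  \xrightarrow{\;d\;} N(0, \Sigma^{*}) .
```

The product λ_n ŵ_j plays the role of the coefficient-specific λ mentioned above. For logistic regression the squared-error term is replaced by the negative log-likelihood.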
In the next section, we propose and discuss a protocol for the discovery and use of risk factors in
logistic regression models. In the following section we illustrate the use of the protocol using the
data of Cooper et al [19] to look at some risk factors for prostate cancer. We will show that
currently known risk factors can be identified as well as new risk factors discovered using these
methods.
In addition, a second new method of variable selection called gradient boosting has been
developed ([24]; [25]; [26], Chapter 8; [27]; [9], Chapter 17). This method has some of the same
advantages as the lasso, and we add it to the protocol and test it as well. The boosting method makes
use of regression trees. A readable introduction can be found in [38].
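To make the idea of boosting with regression trees concrete, the following is a minimal sketch in plain Python, not the R package gbm used later in this paper. It assumes a binary outcome, predictors coded as small integers, one-split trees (stumps), and a plain gradient step on the logistic loss; the function names and the toy data are hypothetical. The relative influence is computed in the spirit of Friedman's measure (squared-error improvement credited to the split variable, scaled to sum to 100) and may differ in detail from what gbm reports.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, r):
    """Find the single split that best reduces squared error on residuals r.
    Assumes at least one predictor takes two or more distinct values."""
    n = len(r)
    mean_r = sum(r) / n
    sse_before = sum((v - mean_r) ** 2 for v in r)
    best = None
    for j in range(len(X[0])):
        levels = sorted(set(row[j] for row in X))
        for t in levels[:-1]:  # skip the largest level so both sides are non-empty
            left = [r[i] for i in range(n) if X[i][j] <= t]
            right = [r[i] for i in range(n) if X[i][j] > t]
            m_left, m_right = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - m_left) ** 2 for v in left)
                   + sum((v - m_right) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, m_left, m_right)
    sse, j, t, m_left, m_right = best
    return j, t, m_left, m_right, sse_before - sse

def boost(X, y, n_trees=50, rate=0.1):
    """Gradient boosting on the logistic loss with stumps as base learners.
    Assumes y contains both 0s and 1s so the starting log-odds is finite."""
    p0 = sum(y) / len(y)
    f0 = math.log(p0 / (1.0 - p0))           # start from the overall log-odds
    scores = [f0] * len(y)
    influence = [0.0] * len(X[0])
    stumps = []
    for _ in range(n_trees):
        # negative gradient of the logistic loss: observed minus fitted probability
        resid = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, t, m_left, m_right, gain = fit_stump(X, resid)
        influence[j] += gain                  # credit the improvement to variable j
        stumps.append((j, t, m_left, m_right))
        scores = [s + rate * (m_left if row[j] <= t else m_right)
                  for s, row in zip(scores, X)]
    total = sum(influence)
    relative = [100.0 * v / total for v in influence]
    return {"f0": f0, "rate": rate, "stumps": stumps, "relative_influence": relative}

def predict_prob(model, row):
    score = model["f0"]
    for j, t, m_left, m_right in model["stumps"]:
        score += model["rate"] * (m_left if row[j] <= t else m_right)
    return sigmoid(score)

# Hypothetical toy data: predictor 0 is strongly related to y, predictor 1 only weakly.
X = [[0, 0]] * 4 + [[0, 1]] * 4 + [[1, 0]] * 4 + [[1, 1]] * 4
y = [0, 0, 0, 0,  0, 0, 0, 1,  1, 1, 1, 0,  1, 1, 1, 1]
model = boost(X, y)
print(model["relative_influence"])
```

In this toy example the first predictor receives most of the relative influence, which is the kind of ranking that the boosting step of the protocol uses for variable selection.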
2. Materials and Methods
2.1. A suggested protocol for using logistic type regression to discover and use disease risk
factors.
Our suggested protocol is shown below. We discuss the protocol in this section and illustrate its
use with prostate cancer risk factors in the following section. This protocol uses the R statistical
language. R was chosen because of its power and the fact that all of the required algorithms are
available in R.
Protocol for use with Risk Factors
1. Ready data for analysis.
2. Input to R.
3. Regress a suitable dependent variable (say, 0 = control, 1 = has disease) on X (a potential
risk factor) as described by Harrell [28] (Chapter 10) for logistic type regression.
4. Select a set of potential risk factors. If an X variable is continuous, we suggest use of the
Bianco-Yohai robust (outlier resistant, see [30]) estimator and further suggest putting outliers
aside for further analysis as they may give rise to extra information [30].
5. Now build a full risk factor prediction model as described by Shmueli [39].
6. Use potential risk factors (Xs) to form a full model with the appropriate dependent variable
(as in 3).
7. If any variables are continuous, repeat Step 4 using the entire potential full prediction model.
8. With any outliers set aside for further study, regress the dependent variable on the logistic
regression type full model using the adaptive lasso method with the least squares approximation, as
described by Boos [22]; this is easiest in R.
9. Using a Bayesian Information Criterion (BIC) or, alternatively, an Akaike Information
Criterion (AIC), select the variables with nonzero lasso regression coefficients to be predictors in a
risk factor based reduced model. If categorical risk factors are present, use group lasso
regression [23]. Use graphs like Fig. 1 in [23] to identify any zero lasso regression
coefficients for the categorical variables.
10. Repeat Step 8 for gradient boosting as described by Kendziorski [25] or Ho [31].
11. Validate the reduced model using bootstrap cross validation or 10-fold cross validation [28],
and validate the full model of Step 6 in the same way if there is any doubt about the variables
discarded from it. Then check the usual model diagnostics [29] for lasso, boosting
or both.
12. Predict with the reduced model containing the appropriate risk factors as described in Harrell
[28], Chapter 11 and Ryan [32], Chapter 9.
Notes to the protocol.
A. We note that for the genome wide case of predictors one should refer to [33] and [34].
B. All logistic regression assumptions should be checked and satisfied as in Pregibon [29].
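Step 3 of the protocol can be illustrated for the simplest case, a single two-level predictor, where the logistic regression fit has a closed form: the slope is the log odds ratio of the 2x2 table and its standard error is the square root of the sum of reciprocal cell counts. The sketch below is in plain Python rather than R, and the function name and the counts in the usage line are hypothetical; it is meant only to show what the columns Coeff., SE and P of a table of one-predictor regressions contain.

```python
import math

def simple_logistic_2x2(n00, n01, n10, n11):
    """Closed-form logistic regression of a 0/1 outcome on one 0/1 predictor.

    n_xy is the number of subjects with predictor value x and outcome y
    (y = 1 means disease). All four counts are assumed to be positive.
    """
    intercept = math.log(n01 / n00)                  # log odds of disease when x = 0
    slope = math.log((n11 * n00) / (n10 * n01))      # log odds ratio for x = 1 versus x = 0
    se = math.sqrt(1 / n00 + 1 / n01 + 1 / n10 + 1 / n11)  # Wald standard error of the slope
    z = slope / se
    p_value = math.erfc(abs(z) / math.sqrt(2))       # two-sided normal p-value
    return intercept, slope, se, z, p_value

# Hypothetical counts, for illustration only.
b0, b1, se, z, p = simple_logistic_2x2(n00=40, n01=10, n10=20, n11=30)
print(f"slope={b1:.4f}  SE={se:.4f}  z={z:.2f}  p={p:.2g}")
```

For predictors with more than two levels, or for several predictors at once, the estimates no longer have this closed form and an iterative fit, such as the one described in [28], is needed.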
3. Results
3.1 The prostate cancer example
This example is taken from Cooper et al. [19], where the data (including all sample sizes) and the
biological system are described. The data set used in this paper is a subset of the Cooper et al. data set
from which all observations with missing values of model variables have been removed; because all
potential predictor variables are categorical, no imputation was performed. The coding assignments and
the variable definitions are given in the Appendix. The simple and multiple logistic regressions are
carried out as described in [28]. Robust logistic regressions, when needed, are carried out as described in
[30]. Variable selection is carried out using the adaptive lasso [20] with the least squares approximation
of Wang and Leng [21] for continuous independent variables and by the group lasso [23] for categorical
independent variables. Gradient boosting is carried out using the R package gbm [24] as described in [25],
[31], [27]. All computations are carried out in the R statistical language. The R functions for variable
selection (adaptive lasso and group lasso), along with the papers, are available from Boos [22] and are used
as described there. The use of the group lasso R functions is covered in the R help for the packages
grplasso and grpreg. The data sets and R programs are available from the authors (DEB). The variables
studied as potential risk factors are listed in the X column of Table 1. The dependent variable is current
status.
We now follow the protocol and explain each step in detail. We begin by considering the
one-predictor logistic regressions in Table 1. First note that all potential risk factors in this data set are
categorical (factors), so we do not have to consider the Bianco-Yohai [35] estimator of protocol Step 4 for
these data. We note that this is often not the case. Cooper et al. [19] hypothesize a SNP-SNP interaction as
a risk factor for prostate cancer, where SNP denotes a single nucleotide polymorphism. We now test this
hypothesis and attempt to answer the question: is there such an interaction? To answer this
question, we first note that the answer is not completely contained in Table 1. Second, we recall that we
have a gene-gene interaction of two genes if both together affect the final phenotype of the individual. To
be specific, we now consider the relevant alleles of the SEPP1 and SOD2
genes. If there is a gene-gene interaction, we must see the following statistically: the relevant alleles of
the SEPP1 and SOD2 genes must be selected into a reasonable prediction equation for the disease
state by the appropriate lasso or boosting algorithm (see Figures 1 and 2 and Tables 3 and 4). The
appropriate lasso algorithm here is the group lasso for logistic regression because the predictor variables
are categorical. We now note that in our data set we have four candidate predictor variables among which to
search for our gene-gene interaction: MnSOD_DOM_Final, SeP_Ad_Final, MnSOD_AD_Final and SeP_DOM_Final.
Either inspection of the variable values or a simple trial shows that we cannot include all four variables
in the model at once because they are pairwise collinear. Hence we have to separate the variables into
two cases, the models of Figure 1 and Figure 2. We also note that the lasso generally does not allow
correlated variables to enter the model [2], [7].
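For reference, the criterion minimized by the group lasso for logistic regression can be written as follows. This is our recollection of the form given as equation (2.2) in [23] and should be checked against that paper; G (number of groups), df_g (number of coefficients in group g) and η_β(x_i) (the linear predictor for observation i) are the symbols as we recall them.

```latex
S_{\lambda}(\beta)
  = -\,\ell(\beta)
    + \lambda \sum_{g=1}^{G} s(df_g)\, \lVert \beta_g \rVert_{2} ,
  \qquad
  \ell(\beta) = \sum_{i=1}^{n}
    \Bigl[ y_i\, \eta_{\beta}(x_i) - \log\bigl( 1 + e^{\eta_{\beta}(x_i)} \bigr) \Bigr] ,
  \qquad
  s(df_g) = \sqrt{df_g} .
```

Each categorical predictor contributes one group β_g containing the coefficients of its dummy variables, so the penalty removes or retains all levels of a factor together.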
Table 1 Simple Logistic Regression Results
Dependent variable: CURRENTSTATUS. Intercepts are not listed.
X                 Level   Coeff.      SE        P
X_STRATUM                 -0.055132   0.005646  <2×10^-16
MnSOD_AD_Final    0       -0.4334     0.1241    0.000477
                  1       -0.2478     0.1157    0.032196
                  2       -0.3140     0.1233    0.010879
SeP_Ad_Final      0        0.21219    0.10309   0.039557
                  1        0.12890    0.10754   0.230675
                  2        0.23484    0.15797   0.137117
MnSOD_DOM_Final   0        0.4334     0.1241    0.000477
                  1        0.2704     0.1126    0.016369
SeP_DOM_Final     0        0.21219    0.10309   0.039557
                  1        0.14445    0.10568   0.171679
Smoke_ever        0       -0.00339    0.08161   0.967
                  1       -0.03791    0.07016   0.589
Alco_ever         0       -0.428943   0.142425  0.0026
                  1        0.002951   0.062317  0.9622
FAMHIST                    0.84619    0.09497   <2×10^-16
Table 2 Optimal λs Computed from R Packages grplasso and grpreg for Indicated Models
Predictors in Model               λmin    λmax    λopt
MnSOD_AD_Final, SeP_DOM_Final     0.009   70.55   0.635
MnSOD_DOM_Final, SeP_Ad_Final     0.017   83.99   1.428
Note: λmin was computed by package grpreg using a Bayesian Information Criterion; λmax was computed by
package grplasso.
Figure 1. Group lasso coefficient plot for the logistic regression model containing MnSOD_DOM_Final
and SeP_Ad_Final. We note that for λ=λopt none of the paths shrink to zero, suggesting that a SNP-SNP
interaction, as reported in [19], exists.
Figure 2. Group lasso coefficient plot for the model containing MnSOD_AD_Final and SeP_DOM_Final.
We now begin our search using the lasso with the model of Figure 1. This gives us a candidate for an
interaction. We then perform the group lasso analysis of Figure 1. Here we must determine whether the
relevant alleles are included in the prediction equation selected by the group lasso. Roughly, this is the
case if the lasso regression coefficients are not zero at the end of the algorithm’s execution, as shown on
the coefficient path plot of Figure 1. From equation (2.2) of [23] we see that 0≤λ<∞; hence as
λ→∞, Sλ(β)→0 and thus βi→0, but not uniformly. Hence the question is what value of λ to choose in order
to decide whether a coefficient is close enough to zero for its term to be discarded from the model as a
zero coefficient. Based on Table 2, where we compute the optimal λ to use, we choose λ=1.428 to be the
cutoff point. Hence we can now apply the condition of the previous paragraph. We now check Figure 1 to see
which, if any, of the candidate alleles are selected for the group lasso prediction equation, which was
our criterion. Examining the Figure 1 plot at λopt=1.428, we note that none of the candidate alleles have
coefficients of zero. Hence, using our criterion, we can summarize as follows:
1. We need selection in Figure 1 to show interaction. SeP_Ad_Final0 is Ala/Ala, so this is one
allele that qualifies. The same holds for SeP_Ad_Final1 and SeP_Ad_Final2, which are Ala/Thr and Thr/Thr
respectively.
2. Both MnSOD_DOM_Final0 and MnSOD_DOM_Final1 (i.e. Ala/Ala and +/Ala) satisfy the criterion, so this
shows that for MnSOD the result is +/Ala. Hence the identified interaction alleles are
SEPP1     SOD2
Ala/Ala   +/Ala
which agrees with the Cooper et al. [19] finding of a gene-gene interaction risk factor.
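The selection criterion used above, that a variable is retained when its coefficients are nonzero at the chosen λ, rests on the fact that the group lasso penalty sets whole groups of coefficients exactly to zero once λ is large enough. The following minimal sketch shows this mechanism in the simplest setting we know of, least squares with an orthonormal design, where the solution for each group is a soft-thresholding of the unpenalized estimate. It is not the algorithm implemented in grplasso or grpreg for logistic regression; the function names, group names and numbers are hypothetical.

```python
import math

def group_soft_threshold(z, lam):
    """Shrink the coefficient vector z of one group toward zero.

    The threshold is lam * sqrt(group size), matching a penalty that weights
    each group by the square root of its number of coefficients. If the length
    of z does not exceed the threshold, the whole group is set exactly to zero.
    """
    norm = math.sqrt(sum(v * v for v in z))
    threshold = lam * math.sqrt(len(z))
    if norm <= threshold:
        return [0.0] * len(z)
    scale = 1.0 - threshold / norm
    return [scale * v for v in z]

def selected_groups(groups, lam):
    """Names of the groups that still have a nonzero coefficient at penalty lam."""
    return [name for name, z in groups.items()
            if any(v != 0.0 for v in group_soft_threshold(z, lam))]

# Hypothetical unpenalized estimates for two groups of dummy-variable coefficients.
groups = {"group_a": [3.0, 4.0], "group_b": [0.5, 0.5]}
for lam in (0.1, 1.0, 4.0):
    print(lam, selected_groups(groups, lam))
```

Reading a coefficient path plot at a fixed λ amounts to asking, for each group, whether its curve has already been thresholded to zero at that value.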
3.2 New Risk Factors
Similarly we have from SeP_Ad_Final 1 and 2
Ala/Thr +/Ala
Thr/Thr +/Ala
which are also risk factors.
We now repeat this analysis for the model which contains the other possible candidate alleles. By
our criterion for gene-gene interaction we need βi≠0 at λopt=0.635, from Table 2. Observing
Figure 2, we see that for MnSOD_AD_Final the 0, 1 and 2 values meet the criterion, while for
SeP_DOM_Final only the 0 and 1 alleles do. By consulting the Appendix we see that
SeP_DOM_Final1 is Ala/Thr and Thr/Thr
SeP_DOM_Final0 is Ala/Ala
MnSOD_AD_Final0 is Val/Val
MnSOD_AD_Final1 is Val/Ala
MnSOD_AD_Final2 is Ala/Ala
Hence we conclude that we have additional gene-gene interactions that are risk factors. Since one
combination was identified using the first model, we now have
SEPP1     SOD2
Ala/Ala   Val/Val
Ala/Ala   Val/Ala
+/Thr     Val/Val
+/Thr     Val/Ala
as risk factors. None of these have been reported in the prior literature as far as we can determine.
We can now construct prediction equations using our now known risk factors. These give a
predicted diagnosis of whether or not a patient is at risk for prostate cancer based on the variable values,
assuming that we use a new observation and not one which is included in our current data set. We
recommend the use of bootstrap cross validation to validate this equation; full details are included in
[28]. As a final reminder, all of the other assumptions of logistic regression need to be checked each and
every time such a model is used. The reader is referred to Pregibon [29] for further details. These new risk
factor results are particularly important since the SEPP1 gene product is in the same metabolic pathway as a
tumor suppressor for prostate cancer [36].
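The idea of validating a prediction equation on observations that were not used to fit it can be sketched as follows. This is one simple bootstrap variant, an out-of-bag error estimate, written in plain Python; it is not necessarily the procedure described in [28], which should be consulted for the recommended bootstrap validation. The stand-in model simply predicts disease when the observed disease proportion at a predictor level is at least 0.5, which is what a logistic regression on a single factor would do with a 0.5 cutoff. All names and data are hypothetical.

```python
import random

def fit(data):
    """Stand-in model: disease proportion at each level of one categorical predictor."""
    counts = {}
    for x, y in data:
        n, s = counts.get(x, (0, 0))
        counts[x] = (n + 1, s + y)
    overall = sum(y for _, y in data) / len(data)
    rates = {x: s / n for x, (n, s) in counts.items()}
    return rates, overall

def predict(model, x):
    rates, overall = model
    # fall back to the overall rate for a level not seen in the training sample
    return 1 if rates.get(x, overall) >= 0.5 else 0

def bootstrap_oob_error(data, n_boot=200, seed=0):
    """Average misclassification rate on observations left out of each bootstrap sample."""
    rng = random.Random(seed)
    n = len(data)
    errors = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]      # resample with replacement
        chosen = set(idx)
        oob = [data[i] for i in range(n) if i not in chosen]
        if not oob:
            continue                                    # no held-out observations this round
        model = fit([data[i] for i in idx])
        wrong = sum(predict(model, x) != y for x, y in oob)
        errors.append(wrong / len(oob))
    return sum(errors) / len(errors)

# Hypothetical data: (predictor level, outcome) pairs.
data = [(0, 0)] * 32 + [(0, 1)] * 8 + [(1, 0)] * 8 + [(1, 1)] * 32
err = bootstrap_oob_error(data)
print(round(err, 3))
```

Because each model is scored only on observations it did not see, the resulting error rate is a more honest guide to performance on a new patient than the error computed on the fitting data.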
We now repeat the analysis using gradient boosting. The results are shown in Tables 3 and 4.
The results are identical to the lasso results.
Table 3 Boosting Results, R Package gbm (AdaBoost), Corresponding to Figure 1
Variable            Relative Influence
MnSOD_DOM_Final     68.96
SeP_Ad_Final        31.03
Table 4 Boosting Results, Same Conditions as Table 3, Corresponding to Figure 2
Variable            Relative Influence
MnSOD_AD_Final      75.29
SeP_DOM_Final       24.70
4. Discussion
4.1 Limitations of the proposed Protocol and Future Research
As much as we would like this to be the last word on the discovery and use of disease risk factors
with logistic regression, it is not. We mention a few possible limitations and our hopes for future
work, by us or by others.
First, Ayers and Cordell [2] mention a limitation of this suggestion: there is no
known way to obtain confidence intervals and p-values for lasso estimates, i.e. the lasso regression
coefficients. Fortunately this is changing. Lockhart et al. [37] have written a paper entitled “A
significance test for the lasso”. While this is a complicated paper that does not solve all problems, a strong
beachhead has been established. Unfortunately, this is not a test on individual lasso regression coefficients.
Next, we discussed the advantages of the adaptive lasso earlier (especially the oracle property), but no
algorithm currently exists to solve the adaptive group lasso problem in the case of logistic regression. We
conjecture, based on extending the results of the linear regression case to the logistic case, that if the
adaptive lasso could be extended to the group lasso for logistic regression, the same desirable properties of
the adaptive lasso would hold, especially the oracle property.
Finally, the usual problems of outliers, etc., as always, raise their heads. The Bianco-Yohai
algorithm [35] is a start, but it has not been extended to any penalized shrinkage regression method. We
conclude that there is much work to be done. We fully expect to see other papers like this one in the future,
and we hope that statistical practice will continue to evolve so that even better solutions can be applied
to these interesting and important problems.
5. Conclusion
We have attempted in this paper to bring up-to-date statistical thinking to the problem of the
identification and use of disease risk factors, where stepwise regression is still too often used. Much
remains to be done, but we hope that the ideas presented here will improve statistical practice in this very
important area. In the process, we have shown that we recover a
currently known risk factor and identify new risk factors, which suggests the value of our approach. These
new risk factor results are particularly important since the SEPP1 gene product has recently been shown
to be in the same metabolic pathway as a tumor suppressor for prostate cancer [36].
Appendix
Data Set
Variable
Cases were classified as either non-aggressive at diagnosis (tumor stage 1 and 2, Gleason score < 8,
Differentiation G1-G2, N0/NX, M0/MX, PSA < 100 μg/L; NPC) or aggressive at diagnosis (tumor stage 3-4,
Gleason score ≥ 8, Differentiation G3-G4, N+, M+, PSA ≥ 100 μg/L; APC).
Declaration of Conflicting Interests:

The authors declare that there is no conflict of interest.
References
1 Austin, P. and Tu, J (2004), Automated Variable Selection Methods for logistic regression produced
unstable models for predicting acute myocardial infarction mortality, J. Clinical Epidemiology 57, 1138-
1146.
2 Ayers, K and Cordell, H (2010), SNP Selection in Genome-Wide and Candidate Gene Studies via
Penalized Logistic Regression, Genetic Epidemiology 34: 879-891.
3 Yuan, M. and Lin, Y (2006), Model Selection and Estimation in Regression with Grouped
Variables, J. Royal Statistical Society: Series B 68(1), 49-67
4 Steyerberg, E, Eijkemans, M, Harrell, Jr, F, Habbema, J. (2000), Prognostic Modeling with logistic
regression analysis: a comparison of selection and estimation methods in small data sets, Statist. Med.
2000: 19: 1059-1079
5 Wiegand, R (2009), Performance of Using Multiple Stepwise algorithms for variable selection Statist.
Med. 2010, 29, 1647-1659
6 Breiman, L (1995), Better Subset Regression Using the Nonnegative garrote, Technometrics 37 (4),
373-384
7 Tibshirani, R (1996), Regression Shrinkage and Selection via the lasso, Journal of the Royal Statistical
Society: Series B 58 (1), 267-288
8 Dahlgren, J (2010), Alternative Regression Methods are not considered in Murtaugh (2009) or by
ecologists in general, Ecology Letters (2010) 13: E7-E9.
9 Efron, B. and Hastie, T. (2016), Computer Age Statistical Inference, Cambridge, Cambridge University
Press.
10 Chatterjee, S and Price B, (1977), Regression Analysis by Example, New York: John Wiley and Sons.
11 Neter J, Wasserman, W and Kutner, M (1983) Applied Linear Regression Models, Homewood:
Richard D. Irwin
12 Kutner, M, Nachtsheim, C, Neter, J, Li, W, (2005) Applied Linear Statistical Models, 5th ed., New
York; McGraw-Hill Irwin.
13 Labidi, M, Baillot, R, Dionne, B, LaCasse, Y, Maltais, F and Boulet, L (2009), Pleural Effusions
following Cardiac Surgery, Chest 2009; 136: 1604-1611
14 Queiroz, N, Sampaio, D, Santos, E, Bezerra, A. 2012, Logistic model for determining factors,
associated with HIV infection among blood donor candidates at the Fundacao HEMOPE
Rev Bras Hematologia Hemoterapia, 2012; 34(3): 217-21
15 Qiu, L, Cheng, X, Wu, J, Liu, J, Xu, T, Ding, H, Liu, Y, Ge, Z, Wang, Y, Han, H, Liu, J, Zhu, G, 2013,
Prevalence of hyperuricemia and its related risk factors in healthy adults from Northern and Northeastern
Chinese Provinces, BMC Public Health, 2013, 13:664
16 Guo, L., Guo, X., Chang, Y., Yang, T., Zhang, L., Li, T., and Sun, Y. Prevalence and Risk Factors of
Heart Failure with Preserved Ejection Fraction, Int. J. Environ. Res. Public Health 2016, 13(8), 770
17 Khan, MS, Pervaiz, MK, Javed, I, Biostatistical Study of Clinical Risk Factors in Myocardial
Infarction, PAFMJ 2016; 66(3): 354-360.
18 Arnaud, D. H. (2014), Confronting Irreproducibility, Chemical and Engineering News 92 (50), 28-30.
19 Cooper, M., Adami, H., Gronberg, H., Wiklund, F., Green, F., Rayman, M. (2008), Interaction
between Single Nucleotide Polymorphisms in Selenoprotein P and Mitochondrial Superoxide Dismutase
Determines Prostate Cancer Risk, Cancer Res 2008: 68: (24), 10171-10177
20 Zou, H. (2006) The Adaptive lasso and its Oracle Properties, Journal of the American Statistical
Association 101:476, 1418-1429
21 Wang, H and Leng, C. (2008), A note on adaptive group lasso, Computational Statistics and Data
Analysis 52 (2008), 5277-5286
22 Boos, D. (2014) au., Adaptive Lasso in R, 2/9/2014,
http://www.stat.ncsu.edu/~boos/var.select/lasso.adaptive.html
23 Meier, L, Van der Geer, S, Buhlmann, P (2008), The group lasso for logistic regression, J. R. Statist.
Soc. B, 70, part 1, 53-71
24 Ridgeway, G. (2015), Package ‘gbm’, http://cran.r-project.org 9/17/2016.
25 Kendziorski, C. (2016), https://www.biostat.wisc.edu/~Kendzior/stat877/illustration.pdf
accessed 9/1/2016.
26 James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013), An Introduction to Statistical Learning,
N.Y.: Springer.
27 Maloney, K., Schmid, M., Weller, D., Applying Additive Modeling and Gradient Boosting to Assess
the Effects of Watershed and Reach Characteristics on Riverine Assemblages, Methods in Ecology and
Evolution, 2012, 3, 116-128.
28 Harrell, Jr., F., (2001) Regression Modeling Strategies; New York: Springer.
29 Pregibon, D (1981), Logistic Regression Diagnostics, Annals of Statistics 9: 705-721
30 Hauser, R. and Booth, D (2011), Predicting Bankruptcy with robust logistic regression, J. Data Sci
9(4), 585-605.
31 Ho, R. (2012), Big Data Machine Learning, DZoneRefCard z #158, Carey NC:DZone Inc
32 Ryan, T (2009), Modern Regression Methods 2nd Ed, Hoboken, NJ: Wiley
33 Li, H, Das, K, Fu, G, Li, R, Wu, R (2011), The Bayesian lasso for genome-wide association studies,
Bioinformatics (2011) 27 (4), 516-523
34 Wu, T, Chen, Y F, Hastie T, Sobel, E, Lange, K, 2009 Genome wide association analysis by lasso
penalized logistic regression, Bioinformatics 25: 714-721
35 Bianco, A and Martinez, E. (2009), Robust testing in the logistic regression model, Computational
Statistics and Data Analysis 53, 4095-4105.
36 Ansong, E., Ying, Q., Ekoue, D. N., Deaton, R., Hall, A. R., Kajdacsy-Balla, A., Yang, W., Gann, P.
H., Diamond, A. M. (2015) Evidence that Selenium Binding Protein 1 is a Tumor Suppressor in Prostate
Cancer. PLoS ONE 10(5); e0127295. doi:10.1371/journal.pone.0127295
37 Lockhart, R, Taylor, J, Tibshirani, R. J, Tibshirani, R, A significance test for the lasso (2012),
Department of Statistics, paper 131, http://repository.cmu.edu/statistics/131
38 Elith, J, Leathwick, J, Hastie, T (2008) A Working Guide to Boosted Regression Trees, Journal of
Animal Ecology 77, 802-813.
39 Shmueli, G, To Explain or To Predict? Statistical Science 2010 25(3) 289-310
