
Cancer Causes & Control

Boosting and Lassoing Cancer Risk Factors: New Prostate Cancer SNP Risk Factors
--Manuscript Draft--

Manuscript Number:

Full Title: Boosting and Lassoing Cancer Risk Factors: New Prostate Cancer SNP Risk Factors

Article Type: Original Article

Keywords: variable selection, boosting, lasso, risk factors, prostate cancer

Corresponding Author: David Booth, PhD


Kent State University
Stow, OH UNITED STATES

Corresponding Author Secondary


Information:

Corresponding Author's Institution: Kent State University

Corresponding Author's Secondary


Institution:

First Author: David Booth, PhD

First Author Secondary Information:

Order of Authors: David Booth, PhD

Venugopal Gopalakrishna-Remani

Matthew Cooper

Fiona Green

Margaret Rayman

Order of Authors Secondary Information:

Funding Information:

Abstract: We begin by arguing that the often used algorithm for the discovery and use of disease
risk factors, stepwise logistic regression, is unstable. We then argue that there are
other algorithms available that are much more stable and reliable (e.g. the lasso and
gradient boosting). We then propose a protocol for the discovery and use of risk
factors using lasso or boosting variable selection. We then illustrate the use of the
protocol with a set of prostate cancer data and show that it recovers known risk factors.
Finally, we use the protocol to identify new SNP based risk factors for prostate cancer.

Suggested Reviewers: Thomas Isenhour


[email protected]
Founding editor of ACS Journal of Chemical Information. Earned ACS National Award
in Analytical Chemistry. Provost-Old Dominion University. Published over 200 articles

Kenneth Berk
[email protected]
expert in statistical variable selection, Fellow of American Statistical Association,
Winner of ASA/ASTM Youden Award

Felix Offodile
[email protected]
expert statistician, many publications in area


Boosting and Lassoing Cancer Risk Factors: New Prostate Cancer SNP Risk Factors

David E. Booth¹, Venugopal Gopalakrishna-Remani², Matthew Cooper³, Fiona R. Green⁴, Margaret P. Rayman⁵

¹M&IS Dept., Kent State University, Kent OH 44242; ²Dept. of Management, University of Texas-Tyler, Tyler TX 75799; ³Dept. of Internal Medicine, Washington University School of Medicine, St. Louis MO 63110; ⁴University of Manchester, Div. of Cardiovascular Sciences, School of Medical Sciences, Faculty of Biology, Medicine and Health, Manchester, UK; ⁵Dept. of Nutritional Sciences, University of Surrey, Guildford GU2 7XH, UK

Short title: SNP Risk Factor, Discovery and Use

Corresponding Author: David E. Booth, Professor Emeritus, Kent State University, 595 Martinique
Circle, Stow OH 44224; ph. 330-805-0239; email: [email protected]

Draft Date: 3/6/18

Draft in Revision: 3/6/18

Not for Quotation. For submission to Cancer Causes and Control

Declarations of Interest: None



Abstract

We begin by arguing that stepwise logistic regression, an often-used algorithm for the discovery and use of disease risk factors, is unstable. We then argue that other available algorithms, such as the lasso and gradient boosting, are much more stable and reliable. We propose a protocol for the discovery and use of risk factors based on lasso or boosting variable selection, illustrate it with a set of prostate cancer data, and show that it recovers known risk factors. Finally, we use the protocol to identify new SNP-based risk factors for prostate cancer.

Keywords: variable selection, boosting, lasso, risk factors, prostate cancer

1. Introduction

As Austin and Tu [1] remark, researchers as well as physicians are often interested in determining the independent predictors of a disease state. These predictors, often called risk factors, are important in disease diagnosis, prognosis and general patient management as the attending physician tries to optimize patient care. In addition, knowledge of these risk factors helps researchers evaluate new treatment modalities and therapies and helps make comparisons across different hospitals [1]. Because risk factors are so important in patient care, it behooves us to do the best job possible in their discovery and use. Because new statistical methods [2], [3], [4], [5], [6], [7], [8], [9] have been and are being developed, it is important for risk factor researchers to be aware of these methods and to adjust their protocols for the discovery and use of risk factors as necessary. In this paper, we argue that now is such a time. For a number of years in risk factor research, a method of automatic variable selection called stepwise regression, together with its variants forward selection and backward elimination ([10], Chapter 9), has been used even as new methods have become available (see [11], [12], [13], [14], [15], [16], [17] and many others). The last three cited are risk factor studies. We do not argue for a change of protocols in risk factor discovery and use simply because newer methods are available. As the literature shows [1], the older methods are often unreliable and the newer methods are much less so. The purposes of this paper are the following:

1. To summarize some of the studies showing that stepwise regression and its variants, which are used more often than they should be in risk factor studies, are unreliable and may in fact cause some of the irreproducibility of life sciences research discussed in [18].

2. To argue on the basis of current research that considerably more reliable methods are available.

3. To propose a modern statistical protocol for the discovery and use of risk factors when using logistic regression, as is commonly done.

4. To illustrate the use of the protocol developed in 3 using a set of prostate cancer data [19].

5. To report the finding of new prostate cancer risk factors using the modern procedures.

We further note that nothing in the way of statistical methods is new in this paper. What is new is the introduction of a clear protocol for identifying and using disease risk factors that involves much less problematic methods than stepwise regression. We then use the proposed methodology to identify a known prostate cancer risk factor and to discover new prostate cancer risk factors.

1.2. What then should replace these automatic variable selection methods?

From the references in Section 1, we see that shrinkage methods have done well when compared to the current stepwise and all-subsets methods, and thus we follow the suggestion of Steyerberg et al. [4] and look at shrinkage methods. The question then becomes which shrinkage method we might choose as the next variable selection method. We are impressed by the work of Ayers and Cordell [2] in this regard. First, we note that shrinkage estimators are also called penalized estimators. In particular, the lasso [7], as defined by Zou [20], can be considered. In the lasso, the factor λ (lambda) that multiplies the penalty term controls the amount of shrinkage and is referred to as the penalty.
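For readers who want the penalty written out, one common way to state the lasso for logistic regression is sketched below. The notation (n observations, p predictors, binary response y_i, linear predictor η_i, intercept β_0 and coefficient vector β) is an assumption made here for illustration and is not taken from the original text; [7] and [20] give the authoritative definitions.

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta_0,\,\beta}
    \left\{ -\sum_{i=1}^{n} \Bigl[ y_i \eta_i - \log\bigl(1 + e^{\eta_i}\bigr) \Bigr]
            + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\},
  \qquad \eta_i = \beta_0 + x_i^{\top}\beta, \quad \lambda \ge 0 .
```

Setting λ = 0 gives the ordinary maximum likelihood fit, and larger values of λ shrink more coefficients exactly to zero.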

Ayers and Cordell [2] studied “the performance of penalizations in selecting SNPs as predictors in genetic association studies”, where SNP stands for single nucleotide polymorphism. Their conclusion is: “Results show that penalized methods outperform single marker analysis, with the main difference being that penalized methods allow the simultaneous inclusion of a number of markers, and generally do not allow correlated variables to enter the model in which most of the identified explanatory markers are accounted for”, as shown by Tibshirani [7]. In addition, the lasso helps prevent overfitting the model ([9], p. 304). At this point, penalized estimators (i.e. shrinkage) look very attractive in risk factor studies ([9], Chapter 16), especially given the relationship between the lasso and boosting ([9], p. 320).

Another paper [20] helps us make our final decision. Zou [20] considers a procedure called the adaptive lasso, in which a different value of the penalty parameter λ is allowed for each regression coefficient. Furthermore, Zou shows that the adaptive lasso is an oracle procedure; that is, its estimator β̂(δ), where δ denotes the fitting procedure, asymptotically has the following properties:

a) It identifies the right subset model, and

b) It has the optimal estimation rate.

Zou then extends these results to the adaptive lasso for logistic regression. Wang and Leng [21] developed an approximate adaptive lasso (i.e. a different λ for each β is allowed) by least squares approximation for many types of regression. Boos [22] shows how easy it is to implement this method in the statistical language R for logistic regression. Thus, we choose to use the least squares approximation to the adaptive lasso logistic regression in the next section. We note here that a special variant of the lasso, the group lasso [23], is needed for categorical predictor variables.
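To make the idea of a separate penalty for each coefficient concrete, here is a minimal sketch in Python. It is not the R code of Boos [22] or the algorithm of Wang and Leng [21]. It assumes, purely for illustration, that the least squares approximation has a diagonal weight matrix, so that each coefficient can be solved in closed form by soft thresholding. The function names, the `precision_diag` argument and the weight choice λ_j = λ / |β̃_j|^γ are assumptions introduced here.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; return exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def adaptive_lasso_diagonal(beta_mle, precision_diag, lam, gamma=1.0):
    """Minimise sum_j [ w_j * (b_j - bt_j)**2 + lam_j * |b_j| ] one coefficient at a time.

    beta_mle       : unpenalised estimates bt_j (hypothetical inputs)
    precision_diag : positive weights w_j (assumed diagonal approximation)
    lam, gamma     : overall penalty and adaptive exponent; lam_j = lam / |bt_j|**gamma
    """
    shrunk = []
    for bt, w in zip(beta_mle, precision_diag):
        if bt == 0.0:
            shrunk.append(0.0)  # the adaptive penalty is infinite, so the term stays out
            continue
        lam_j = lam / abs(bt) ** gamma  # small initial estimates get large penalties
        shrunk.append(soft_threshold(bt, lam_j / (2.0 * w)))
    return shrunk


# Hypothetical unpenalised estimates: one strong effect and one weak effect.
estimates = adaptive_lasso_diagonal([2.0, 0.1], [1.0, 1.0], lam=0.2)
print(estimates)
```

In this toy call the strong effect is shrunk only slightly while the weak effect is set exactly to zero, which is the behaviour that lets the adaptive lasso act as a variable selector.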

In the next section, we propose and discuss a protocol for the discovery and use of risk factors in logistic regression models. In the following section, we illustrate the use of the protocol using the data of Cooper et al. [19] to look at some risk factors for prostate cancer. We will show that currently known risk factors can be identified, and new risk factors discovered, using these methods.

In addition, a second new method of variable selection called gradient boosting has been developed ([24], [25], [26] (Chapter 8), [27], [9] (Chapter 17)). This method has some of the same advantages as the lasso, and we add it to the protocol and test it as well. The boosting method makes use of regression trees. A readable introduction can be found in [38].

2. Materials and Methods

2.1. A suggested protocol for using logistic type regression to discover and use disease risk factors.

Our suggested protocol is shown below. We discuss the protocol in this section and illustrate its

use with prostate cancer risk factors in the following section. This protocol uses the R statistical

language. R was chosen because of its power and the fact that all of the required algorithms are

available in R.

Protocol for use with Risk Factors

1. Ready data for analysis.

2. Input to R.

3. Regress a suitable dependent variable (say, 0 = control, 1 = has disease) on X (a potential risk factor), as described by Harrell [28] (Chapter 10) for logistic type regression.

4. Select a set of potential risk factors. If an X variable is continuous, we suggest use of the Bianco-Yohai robust (outlier-resistant; see [30]) estimator and further suggest putting outliers aside for further analysis, as they may give rise to extra information [30].

5. Now build a full risk factor prediction model as described by Shmueli [39].

6. Use the potential risk factors (Xs) to form a full model with the appropriate dependent variable (as in 3).

7. If any variables are continuous, repeat 4 using the entire potential full prediction model.

8. With any outliers set aside for further study, regress the dependent variable on the full logistic regression type model using the adaptive lasso method with the least squares approximation, as described by Boos [22], which is easiest in R.

9. Using a Bayesian Information Criterion (BIC) or, alternatively, an Akaike Information Criterion (AIC), select the variables with nonzero lasso regression coefficients to be predictors in a risk factor based reduced model. If categorical risk factors are present, use group lasso regression [23]. Use graphs like Fig. 1 in [23] to identify any zero lasso regression coefficients for the categorical variables.

10. Repeat Step 8 for gradient boosting as described by Kendziorski [25] or Ho [31].

11. Validate the reduced model using bootstrap cross validation or 10-fold cross validation [28], with a similar validation of the full model of Step 6 if there is any doubt about variables discarded from the full model, and then check the usual model diagnostics [29] for lasso, boosting or both.

12. Predict with the reduced model containing the appropriate risk factors, as described in Harrell [28], Chapter 11 and Ryan [32], Chapter 9.

Notes to the protocol.

A. For the genome-wide case of predictors, one should refer to [33] and [34].

B. All logistic regression assumptions should be checked and satisfied as in Pregibon [29].
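Step 9 of the protocol chooses among penalized fits with an information criterion. The sketch below shows, in Python, how BIC and AIC could be compared across candidate values of λ. It is an illustration only: the tuple layout of `fits`, the function names and the toy numbers are assumptions made here, and counting the nonzero coefficients as the model's degrees of freedom is a common convention rather than something stated in the text.

```python
import math


def log_likelihood(y, probs):
    """Bernoulli log-likelihood of 0/1 outcomes y given fitted probabilities."""
    return sum(math.log(p) if yi == 1 else math.log(1.0 - p)
               for yi, p in zip(y, probs))


def bic(y, probs, n_nonzero):
    """Bayesian Information Criterion: -2 log L + k log n."""
    return -2.0 * log_likelihood(y, probs) + n_nonzero * math.log(len(y))


def aic(y, probs, n_nonzero):
    """Akaike Information Criterion: -2 log L + 2 k."""
    return -2.0 * log_likelihood(y, probs) + 2.0 * n_nonzero


def best_lambda(fits, y, criterion=bic):
    """fits: list of (lam, fitted_probs, n_nonzero), one tuple per candidate lambda.

    Returns the lambda whose fit has the smallest criterion value.
    """
    return min(fits, key=lambda fit: criterion(y, fit[1], fit[2]))[0]


# Hypothetical fits: a light penalty keeping 3 terms, a heavy penalty keeping 1.
y = [1, 0, 1, 0]
fits = [
    (0.1, [0.9, 0.1, 0.9, 0.1], 3),
    (1.0, [0.5, 0.5, 0.5, 0.5], 1),
]
chosen = best_lambda(fits, y)
print(chosen)
```

Passing `criterion=aic` switches the rule without changing anything else, which mirrors the "BIC or, alternatively, AIC" wording of Step 9.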

3. Results

3.1 The prostate cancer example

This example is taken from Cooper et al. [19], where the data (including all sample sizes) and the biological system are described. The data set used in this paper is a subset of the Cooper et al. data set from which all observations with missing values of model variables have been removed. Because all potential predictor variables are categorical, no imputation was performed. The coding assignments and the variable definitions are given in the Appendix. The simple and multiple logistic regressions are carried out as described in [28]. Robust logistic regressions, when needed, are carried out as described in [30]. Variable selection is carried out using the adaptive lasso [20] with the least squares approximation of Wang and Leng [21] for continuous independent variables and by group lasso [23] for categorical independent variables. Gradient boosting is carried out using the R package gbm [24] as described by [25], [31], [27]. All computations are carried out using the R statistical language. The R functions for variable selection (adaptive lasso and group lasso), along with the papers, are available from Boos [22] and are used as described there. The use of the group lasso R functions is covered in the R help for packages grplasso and grpreg. The data sets and R programs are available from the authors (DEB). The variables studied as potential risk factors are listed in the X column of Table 1. The dependent variable is current status.

We now follow the protocol and explain each step in detail. We begin by considering the one-predictor logistic regressions in Table 1. First, note that all potential risk factors in this data set are categorical (factors), so we do not have to consider the Bianco-Yohai [35] estimator of protocol Step 4 for these data. We note that this is often not the case. Cooper et al. [19] hypothesize a SNP-SNP interaction as a risk factor for prostate cancer, where SNP denotes a single nucleotide polymorphism. We now test this hypothesis and attempt to answer the question: is there such an interaction? To answer this question, we first note that the answer is not completely contained in Table 1. Second, we recall that two genes have a gene-gene interaction if both together affect the final phenotype of the individual. To be specific, we now consider the relevant alleles of the SEPP1 and SOD2 genes. If there is a gene-gene interaction, we must see the following statistically: the relevant alleles of the SEPP1 and SOD2 genes must be selected into a reasonable prediction equation for the disease state by the appropriate lasso or boosting algorithm (see Figures 1 and 2 and Tables 3 and 4). The appropriate lasso algorithm here is the group lasso for logistic regression, because the predictor variables are categorical. In our data set we have four candidate predictor variables from which to search for our gene-gene interaction: MnSOD_DOM_Final, SeP_Ad_Final, MnSOD_AD_Final and SeP_DOM_Final. Either inspection of the variable values or a simple trial shows that we cannot include all four variables in the model at once, because they are pairwise collinear. Hence we have to separate the variables into two cases, the models of Figure 1 and Figure 2. We also note that the lasso generally does not allow correlated variables to enter the model [2], [7].
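For reference, the group lasso criterion for logistic regression that underlies this analysis can be sketched as follows. The form below reflects our reading of Meier et al. [23] and should be checked against that paper; the symbols (G groups of coefficients, β_g the coefficients belonging to group g, df_g the number of coefficients in that group, and ℓ the logistic log-likelihood) are stated here as assumptions.

```latex
S_{\lambda}(\beta) = -\ell(\beta) + \lambda \sum_{g=1}^{G} \sqrt{df_g}\, \lVert \beta_g \rVert_2 ,
\qquad
\hat{\beta}_{\lambda} = \arg\min_{\beta} S_{\lambda}(\beta), \quad 0 \le \lambda < \infty .
```

Because the penalty acts on the Euclidean norm of each group, all dummy-variable coefficients of a categorical predictor enter or leave the model together, which is why the group lasso rather than the ordinary lasso is used for factors.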

We now begin our search using the lasso with the model of Figure 1. This gives us a candidate for an interaction. We then perform the group lasso analysis of Figure 1. Here we must determine whether the relevant alleles are included in the prediction equation selected by the group lasso. Roughly, this is the case if the lasso regression coefficients are not zero at the end of the algorithm’s execution, as shown on the coefficient path plot of Figure 1. By looking at equation (2.2) of [23], we see that 0 ≤ λ < ∞; hence, as λ → ∞, the penalty dominates and βi → 0, but not uniformly. The question, then, is what value of λ we choose to determine whether a coefficient is close enough to zero to discard that term from the model as a zero coefficient. Based on Table 2, where we compute the optimal λ to use, we choose λ = 1.428 as the cutoff point. Hence we can now apply the condition of the previous paragraph.

Table 1 Simple Logistic Regression Results
Dependent variable: CURRENTSTATUS. Intercepts are not listed.

X                  Level   Coeff.      SE         P
X_STRATUM                  -0.055132   0.005646   <2 × 10^-16
MnSOD_AD_Final     0       -0.4334     0.1241     0.000477
                   1       -0.2478     0.1157     0.032196
                   2       -0.3140     0.1233     0.010879
SeP_Ad_Final       0        0.21219    0.10309    0.039557
                   1        0.12890    0.10754    0.230675
                   2        0.23484    0.15797    0.137117
MnSOD_DOM_Final    0        0.4334     0.1241     0.000477
                   1        0.2704     0.1126     0.016369
SeP_DOM_Final      0        0.21219    0.10309    0.039557
                   1        0.14445    0.10568    0.171679
Smoke_ever         0       -0.00339    0.08161    0.967
                   1       -0.03791    0.07016    0.589
Alco_ever          0       -0.428943   0.142425   0.0026
                   1        0.002951   0.062317   0.9622
FAMHIST                     0.84619    0.09497    <2 × 10^-16

Table 2 Optimal λs Computed from R Packages grplasso and grpreg for Indicated Models

Predictors in Model                λmin    λmax    λopt
MnSOD_AD_Final, SeP_DOM_Final      0.009   70.55   0.635
MnSOD_DOM_Final, SeP_Ad_Final      0.017   83.99   1.428

Note: λmin was computed by package grpreg using a Bayesian Information Criterion; λmax was computed by package grplasso.

Figure 1 – The Group Lasso Coefficient Plot for the Logistic Regression Model Containing MnSOD_DOM_Final and SeP_Ad_Final. We note that for λ = λopt none of the paths shrink to zero, suggesting that a SNP-SNP interaction, as reported in [19], exists.

Figure 2 – Group Lasso Coefficient Plot for the Model Containing MnSOD_AD_Final and SeP_DOM_Final

We now check Figure 1 to see which, if any, of these candidate alleles are selected for the group lasso prediction equation, which was our criterion. Examining the Figure 1 plot at λopt = 1.428, we note that at this λ none of the candidate alleles has a coefficient of zero. Hence, using our criterion, we can summarize as follows:

1. We need selection in Figure 1 to show interaction. SeP_Ad_Final0 is Ala/Ala, so this is one allele that qualifies. The same holds for SeP_Ad_Final1 and SeP_Ad_Final2, which are Ala/Thr and Thr/Thr, respectively.

2. Both MnSOD_DOM_Final0 and MnSOD_DOM_Final1 (i.e. Ala/Ala and +/Ala) satisfy the criterion, so this shows that for MnSOD the result is +/Ala. Hence the identified interaction alleles are

SEPP1      SOD2
Ala/Ala    +/Ala

which agrees with the Cooper et al. [19] finding on a gene-gene interaction risk factor.

3.2 New Risk Factors

Similarly we have from SeP_Ad_Final 1 and 2

Ala/Thr +/Ala

Thr/Thr +/Ala

which are also risk factors.

We now repeat this analysis for the model that contains the other possible candidate alleles. By our criterion for gene-gene interaction, we need βi ≠ 0 at λopt = 0.635, from Table 2. Observing Figure 2, we see that for MnSOD_AD_Final the 0, 1 and 2 values meet the criterion, while for SeP_DOM_Final only the 0 and 1 alleles do. By consulting the Appendix we see that

SeP_DOM_Final1 is Ala/Thr and Thr/Thr
SeP_DOM_Final0 is Ala/Ala
MnSOD_AD_Final0 is Val/Val
MnSOD_AD_Final1 is Val/Ala
MnSOD_AD_Final2 is Ala/Ala

Hence we conclude that we have additional gene-gene interactions that are risk factors. Since one combination was identified using the first model, we now have

SEPP1      SOD2
Ala/Ala    Val/Val
Ala/Ala    Val/Ala
+/Thr      Val/Val
+/Thr      Val/Ala

as risk factors. None of these has been reported in the prior literature as far as we can determine.

We can now build prediction equations from our now-known risk factors. These give a predicted diagnosis of whether or not a patient is at risk for prostate cancer based on the variable values, assuming that we use a new observation and not one that is included in our current data set. We recommend the use of bootstrap cross validation to validate such an equation; full details are included in [28]. As a final reminder, all of the other assumptions of logistic regression need to be checked each and every time such a model is used. The reader is referred to Pregibon [29] for further details. These new risk factor results are particularly important since the SEPP1 gene product is in the same metabolic pathway as a tumor suppressor for prostate cancer [36].
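The bootstrap validation recommended above can be sketched in a few lines of Python. This is a generic optimism-corrected bootstrap in the spirit of the approach described in [28], not the authors' R code. The toy "model" (majority outcome per predictor level), the accuracy score, the function names and the data are all hypothetical stand-ins chosen so that the example runs with the standard library alone.

```python
import random
from collections import Counter


def fit_majority(rows):
    """Toy 'model': the most common outcome for each predictor level."""
    by_level = {}
    for x, y in rows:
        by_level.setdefault(x, Counter())[y] += 1
    overall = Counter(y for _, y in rows).most_common(1)[0][0]
    table = {x: counts.most_common(1)[0][0] for x, counts in by_level.items()}
    return table, overall


def accuracy(model, rows):
    """Share of rows whose outcome matches the model's prediction."""
    table, overall = model
    hits = sum(1 for x, y in rows if table.get(x, overall) == y)
    return hits / len(rows)


def optimism_corrected(rows, fit, score, n_boot=200, seed=0):
    """Apparent performance minus the average bootstrap optimism."""
    rng = random.Random(seed)
    apparent = score(fit(rows), rows)
    optimism = 0.0
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]  # resample with replacement
        model = fit(sample)
        # Optimism: how much better the model looks on its own training sample
        optimism += score(model, sample) - score(model, rows)
    return apparent - optimism / n_boot


# Hypothetical data: (genotype level, case status) pairs with some noise.
rows = [(0, 0)] * 16 + [(0, 1)] * 4 + [(1, 1)] * 15 + [(1, 0)] * 5
corrected = optimism_corrected(rows, fit_majority, accuracy)
print(round(corrected, 3))
```

Any fitting routine and performance measure with the same call shape could be passed in place of `fit_majority` and `accuracy`; the resampling logic does not depend on the model.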

We now repeat the analysis using gradient boosting. The results are shown in Tables 3 and 4. They agree with the lasso results: in each model both variables receive nonzero relative influence.

Table 3 Boosting Results, Package gbm, AdaBoost, Corresponds to Figure 1

Variable            Relative Influence
MnSOD_DOM_Final     68.96
SeP_Ad_Final        31.03

Table 4 Boosting Results, Same Conditions as Table 3, Corresponds to Figure 2

Variable            Relative Influence
MnSOD_AD_Final      75.29
SeP_DOM_Final       24.70
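Tables 3 and 4 report relative influence. As a rough illustration of where such numbers come from, the following Python sketch boosts one-split trees on categorical predictors, accumulates each variable's squared-error improvement, and rescales the totals to sum to 100. It is a simplified stand-in and not the gbm implementation [24]: it uses the logistic (Bernoulli) loss rather than the AdaBoost loss named in Table 3, leaf values are plain mean residuals, and all function names and the toy data are assumptions made here.

```python
import math
from collections import defaultdict


def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))


def level_means(column, resid):
    """One-split 'tree' on a categorical column: mean residual per level."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, r in zip(column, resid):
        sums[x] += r
        counts[x] += 1
    return {level: sums[level] / counts[level] for level in sums}


def relative_influence(X, y, n_rounds=50, learning_rate=0.1):
    """Boost on logistic-loss residuals; assumes both outcome classes are present."""
    n, p = len(X), len(X[0])
    p_bar = sum(y) / n
    f = [math.log(p_bar / (1.0 - p_bar))] * n  # start from the overall log-odds
    influence = [0.0] * p
    for _ in range(n_rounds):
        resid = [yi - sigmoid(fi) for yi, fi in zip(y, f)]  # negative gradient
        mean_r = sum(resid) / n
        base_sse = sum((r - mean_r) ** 2 for r in resid)
        best = None
        for j in range(p):
            column = [row[j] for row in X]
            means = level_means(column, resid)
            sse = sum((r - means[x]) ** 2 for x, r in zip(column, resid))
            gain = base_sse - sse  # squared-error improvement from splitting on j
            if best is None or gain > best[0]:
                best = (gain, j, means, column)
        gain, j, means, column = best
        influence[j] += max(gain, 0.0)
        f = [fi + learning_rate * means[x] for fi, x in zip(f, column)]
    total = sum(influence)
    return [100.0 * v / total if total > 0 else 0.0 for v in influence]


# Hypothetical data: the outcome follows the first predictor; the second is unrelated.
X = [[a, b] for a in (0, 1) for b in (0, 1)] * 10
y = [row[0] for row in X]
rel = relative_influence(X, y)
print([round(v, 2) for v in rel])
```

In the toy data nearly all of the influence falls on the first predictor. In Tables 3 and 4, by contrast, both variables in each model carry a substantial share, which is what supports treating both as selected.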

4. Discussion

4.1 Limitations of the proposed Protocol and Future Research

As much as we would like this to be the last word on the discovery and use of disease risk factors with logistic regression, it is not. We mention a few possible limitations and our hopes for future work, by us or by others.

First, Ayers and Cordell [2] mention a limitation of this approach: there is no known way to get confidence intervals and p-values for lasso estimates, i.e. the lasso regression coefficients. Fortunately, this is changing. There is now a paper by Lockhart et al. [37] entitled “A significance test for the lasso”. While this is a complicated paper that does not solve all problems, a strong beachhead has been established. Unfortunately, it is not a test on individual lasso regression coefficients.

Next, we discussed the advantages of the adaptive lasso earlier (especially the oracle property), but no algorithm currently exists to solve the adaptive group lasso problem in the case of logistic regression. We conjecture, based on the results for the linear regression case, that if the adaptive lasso could be extended to the group lasso for logistic regression, the same desirable properties of the adaptive lasso would hold, especially the oracle property.

Finally, the usual problems of outliers, etc., as always, raise their heads. The Bianco-Yohai algorithm [35] is a start, but it has not been extended to any penalized shrinkage regression method. We conclude that there is much work to be done and fully expect to see other papers like this one in the future. We hope that statistical practice will continue to evolve and that even better solutions can be applied to these interesting and important problems.

5. Conclusion

We have attempted in this paper to bring up-to-date statistical thinking to the problem of the identification and use of disease risk factors, where stepwise regression is still too often used. Much remains to be done, but we hope that the ideas presented here will improve statistical practice in this very important area. In the process, we have shown that we recover a currently known risk factor and identify new risk factors, which suggests the value of our approach. These new risk factor results are particularly important since the SEPP1 gene product has recently been shown to be in the same metabolic pathway as a tumor suppressor for prostate cancer [36].

Appendix
Data Set
Variable

Cases were classified as either non-aggressive at diagnosis (tumor stage 1 and 2, Gleason score < 8, Differentiation G1-G2, NP/NX, MO/MX, PSA < 100 μg/L; NPC) or aggressive at diagnosis (tumor stage 3-4, Gleason score ≥ 8, Differentiation G3-G4, N+, M+, PSA ≥ 100 μg/L; APC).

Declaration of Conflicting Interests:


The authors declare that there is no conflict of interest.

References

1 Austin, P. and Tu, J (2004), Automated Variable Selection Methods for logistic regression produced

unstable models for predicting acute myocardial infarction mortality, J. Clinical Epidemiology 57, 1138-

1146.

2 Ayers, K and Cordell, H (2010), SNP Selection in Genome-Wide and Candidate Gene Studies via

Penalized Logistic Regression, Genetic Epidemiology 34: 879-891.

3 Yuan, M. and Lin, Y (2006), Model Selection and Estimation in Regression with Grouped

Variables, J. Royal Statistical Society: Series B 68(1), 49-67

4 Steyerberg, E, Eijkemans, M, Harrell, Jr, F, Habbema, J. (2000), Prognostic Modeling with logistic

regression analysis: a comparison of selection and estimation methods in small data sets, Statist. Med.

2000: 19: 1059-1079

5 Wiegand, R (2009), Performance of Using Multiple Stepwise algorithms for variable selection Statist.

Med. 2010, 29, 1647-1659

6 Breiman, L (1995), Better Subset Regression Using the Nonnegative garrote, Technometrics 37 (4),

373-384

7 Tibshirani, R (1996), Regression Shrinkage and Selection via the lasso, Journal of the Royal Statistical

Society: Series B 58 (1), 267-288

8 Dahlgren, J (2010), Alternative Regression Methods are not considered in Murtaugh (2009) or by

ecologists in general, Ecology Letters (2010) 13: E7-E9.

9 Efron, B. and Hastie, T. (2016), Computer Age Statistical Inference, Cambridge, Cambridge University

Press.

10 Chatterjee, S and Price B, (1977), Regression Analysis by Example, New York: John Wiley and Sons.

11 Neter J, Wasserman, W and Kutner, M (1983) Applied Linear Regression Models, Homewood:

Richard D. Irwin

12 Kutner, M, Nachtsheim, C, Neter, J, Li, W, (2005) Applied Linear Statistical Models, 5th ed., New

York; McGraw-Hill Irwin.

13 Labidi, M, Baillot, R, Dionne, B, LaCasse, Y, Maltais, F and Boulet, L (2009), Pleural Effusions

following Cardiac Surgery, Chest 2009: 136: 1604-1611

14 Queiroz, N, Sampaio, D, Santos, E, Bezerra, A. 2012, Logistic model for determining factors,

associated with HIV infection among blood donor candidates at the Fundacao HEMOPE

Rev Bras Hematologia Hemoterapia, 2012; 34(3): 217-21

15 Qiu, L, Cheng, X, Wu, J, Liu, J, Xu, T, Ding, H, Liu, Y, Ge, Z, Wang, Y, Han, H, Liu, J, Zhu, G, 2013,

Prevalence of hyperuricemia and its related risk factors in healthy adults from Northern and Northeastern

Chinese Provinces, BMC Public Health, 2013, 13:664

16 Guo, L., Guo, X., Chang, Y., Yang, T., Zhang, L., Li, T., and Sun, Y. Prevalence and Risk Factors of

Heart Failure with the Preserved Ejection Fraction, Int. J. Environ. Res. Public Health 2016, 13(8), 770

17 Khan, MS, Pervaiz, MK, Javed, I, Biostatistical Study of Clinical Risk Factors in Myocardial

Infarction, PAFMJ 2016; 66(3): 354-360.

18 Arnaud, D. H. (2014), Confronting Irreproducibility, Chemical and Engineering News 92 (50), 28-30.

19 Cooper, M., Adami, H., Gronberg, H., Wiklund, F., Green, F., Rayman, M. (2008), Interaction

between Single Nucleotide Polymorphisms in Selenoprotein P and Mitochondrial Superoxide Dismutase

Determines Prostate Cancer Risk, Cancer Res 2008: 68: (24), 10171-10177

20 Zou, H. (2006) The Adaptive lasso and its Oracle Properties, Journal of the American Statistical

Association 101:476, 1418-1429

21 Wang, H and Leng, C. (2008), A note on adaptive group lasso, Computational Statistics and Data

Analysis 52 (2008), 5277-5286

22 Boos, D. (2014) au., Adaptive Lasso in R, 2/9/2014,

http://www.stat.ncsu.edu/~boos/var.select/lasso.adaptive.html

23 Meier, L, Van der Geer, S, Buhlmann, P (2008), The group lasso for logistic regression J.R. Statist,

Soc B, 70, part 1, 53-71

24 Ridgeway, G. (2015), Package ‘gbm’, http://cran.r-project.org 9/17/2016.

25 Kendziorski, C. (2016), https://www.biostat.wisc.edu/~Kendzior/stat877/illustration.pdf

accessed 9/1/2016.

26 James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013), An Introduction to Statistical Learning,

N.Y.: Springer.

27 Maloney, K., Schmid, M., Weller, D., Applying Additive Modeling and Gradient Boosting to Assess

the Effects of Watershed and Reach Characteristics on Riverine Assemblages, Methods in Ecology and

Evolution, 2012, 3, 116-128.

28 Harrell, Jr., F., (2001) Regression Modeling Strategies; New York: Springer.

29 Pregibon, D (1981), Logistic Regression Diagnostics, Annals of Statistics 9: 705-721

30 Hauser, R. and Booth, D (2011), Predicting Bankruptcy with robust logistic regression, J. Data Sci

9(4), 585-605.

31 Ho, R. (2012), Big Data Machine Learning, DZoneRefCard z #158, Carey NC:DZone Inc

32 Ryan, T (2009), Modern Regression Methods 2nd Ed, Hoboken, NJ: Wiley

33 Li, H, Das, K, Fu, G, Li, R, Wu, R (2011), The Bayesian lasso for genome-wide association studies,

Bioinformatics (2011) 27 (4), 516-523

34 Wu, T, Chen, Y F, Hastie T, Sobel, E, Lange, K, 2009 Genome wide association analysis by lasso

penalized logistic regression, Bioinformatics 25: 714-721

35 Bianco, A and Martinez, E. (2009), Robust testing in the logistic regression model, Computational

Statistics and Data Analysis 53, 4095-4105.

36 Ansong, E., Ying, Q., Ekoue, D. N., Deaton, R., Hall, A. R., Kajdacsy-Balla, A., Yang, W., Gann, P.

H., Diamond, A. M. (2015) Evidence that Selenium Binding Protein 1 is a Tumor Suppressor in Prostate

Cancer. PLoS ONE 10(5); e0127295. doi:10.1371/journal.pone.0127295

37 Lockhart, R, Taylor, J, Tibshirani, R. J, Tibshirani, R, A significance test for the lasso (2012),

Department of Statistics, paper 131, http://repository.cmu.edu/statistics/131

38 Elith, J, Leathwick, J, Hastie, T (2008) A Working Guide to Boosted Regression Trees, Journal of

Animal Ecology 77, 802-813.

39 Shmueli, G, To Explain or To Predict? Statistical Science 2010 25(3) 289-310
