Lasso logistic regression, GSoft and the cyclic coordinate descent algorithm: application to gene expression data

Stat Appl Genet Mol Biol. 2010;9:Article30. doi: 10.2202/1544-6115.1536. Epub 2010 Aug 12.

Abstract

Statistical methods that generate sparse models are of great value in gene expression studies, where the number of covariates (genes) runs into the thousands while sample sizes seldom reach a hundred individuals. For phenotype classification, we propose several lasso logistic regression approaches with gene-specific penalties. These methods are based on a generalized soft-threshold (GSoft) estimator. We also show that a recent algorithm for convex optimization, the cyclic coordinate descent (CCD) algorithm, solves the optimization problem significantly faster than competing methods. Viewing GSoft as an iterative thresholding procedure allows us to derive the asymptotic properties of the resulting estimates in a straightforward manner. Results are obtained for simulated and real data. The leukemia and colon datasets are commonly used to evaluate new statistical approaches, so they are useful for comparisons with similar methods. Furthermore, biological meaning is extracted from the leukemia results and compared with previous studies. In summary, the approaches presented here yield sparse, interpretable models that are competitive with similar methods developed in the field.
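
To make the idea of cyclic coordinate descent with soft-thresholding concrete, here is a minimal Python sketch of L1-penalized logistic regression with a separate penalty per covariate. It is an illustration under stated assumptions, not the GSoft estimator or the implementation used in the paper: the function names (soft_threshold, sigmoid, ccd_lasso_logistic), the 0/1 label coding, the omission of an intercept, and the use of the bound 1/4 on the logistic curvature to size each coordinate step are all choices made here for simplicity.

```python
import math


def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def ccd_lasso_logistic(X, y, penalties, n_sweeps=200, tol=1e-8):
    """Cyclic coordinate descent for logistic loss plus sum_j penalties[j] * |beta_j|.

    X: list of rows (samples x covariates), y: labels in {0, 1},
    penalties: one non-negative penalty per covariate (assumed layout).
    No intercept is fitted in this sketch.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    eta = [0.0] * n  # current linear predictor x_i . beta
    # Upper bound on the second derivative along each coordinate,
    # using p_i * (1 - p_i) <= 1/4.
    curv = [0.25 * sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):  # visit coordinates in a fixed cyclic order
            if curv[j] == 0.0:
                continue  # all-zero column: coefficient stays at 0
            grad = sum(X[i][j] * (sigmoid(eta[i]) - y[i]) for i in range(n))
            new = soft_threshold(beta[j] - grad / curv[j], penalties[j] / curv[j])
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    eta[i] += delta * X[i][j]  # keep predictor in sync
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


# Toy data (made up for illustration): covariate 0 is informative, covariate 1 is not.
X = [[1.0, 0.2], [2.0, -0.1], [-1.0, 0.3], [-2.0, -0.2]]
y = [1, 1, 0, 0]
beta_sparse = ccd_lasso_logistic(X, y, penalties=[0.5, 0.5])
beta_null = ccd_lasso_logistic(X, y, penalties=[10.0, 10.0])
```

On this toy input the uninformative covariate is thresholded to exactly zero, while a sufficiently large penalty zeroes every coefficient. That is the mechanism by which such models become sparse, and supplying a different entry in penalties for each covariate mimics, in a simplified way, the gene-specific penalization described above.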

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Colonic Neoplasms / genetics
  • Databases, Factual
  • Gene Expression Profiling / methods*
  • Gene Expression*
  • Leukemia / genetics
  • Logistic Models
  • Oligonucleotide Array Sequence Analysis