Strengths and limitations of microarray-based phenotype prediction: lessons learned from the IMPROVER Diagnostic Signature Challenge

Adi L Tarca; Mario Lauria; Michael Unger; Erhan Bilal; Stephanie Boue; Kushal Kumar Dey; Julia Hoeng; Heinz Koeppl; Florian Martin; Pablo Meyer; Preetam Nandy; Raquel Norel; Manuel Peitsch; Jeremy J Rice; Roberto Romero; Gustavo Stolovitzky; Marja Talikka; Yang Xiang; Christoph Zechner; IMPROVER DSC Collaborators

doi:10.1093/bioinformatics/btt492

Bioinformatics. 2013 Nov 15;29(22):2892-9. doi: 10.1093/bioinformatics/btt492. Epub 2013 Aug 20.

Collaborators

IMPROVER DSC Collaborators:
Alan Veliz-Cuba, Joe Song, Hien Nguyen, Michael Zeller, Peter Sadowski, Steffen Klamt, Sandra Heise, Mika Gustafsson, Alberto Malovini, Francesca Mulas, Nicola Barbarini, Michael Unger, Preetam Nandy, Kushal Kumar Dey, Christoph Zechner, Heinz Koeppl, Nicoleta Cristea, Tom Petty, Yi Liu, Tiffany Ting Liu, Cheng Zhao, Giovanni Felici, Mario Lauria, Paolo Provero, Esteban G Tabak, Benjamin Haibe-Kains, Simon Papillion-Cavanagh, Nicolas De Jay, Rene Dreos, Fengfeng Zhou, Weixiong Zhang, Zheng Chen, Hu Yuxuan, Kenny Chau, Rita M C de Almeida, Samoel R M da Silva, Gabriel Cury Perrone, David Rossell, Patrick Aloy, Mei-lyn Ong, Calin Voichita, Saied Haidarian, Rebecca Tagett, Lydia Hopp, Bo Li, M Bhattacharjee, Jian-lei Gu, Meena Choi, Christopher Poirel, David Badger, Ahsanur Rahman, Richard Rodrigues, Nikolay Balov, Maria Chikina, Elena Zaslavsky, Adi L Tarca, Roberto Romero, Xianwen Ren, Steve Horvath, Lin Song, Sol Efroni, Rotem Ben-Hamo, Mayte Suarez-Farinas, Suyan Tian, Seunghak Lee, Wei Keat Lim, Xiaoping Liu, Luonan Chen, Tao Zeng, Di Huang, Benjamin Haibe-Kains, Inchi Hu, A-mer Sinan Sarac, Torik Ayoubi, Kai Wang, Ji-Hoon Cho, Alan Lin, Chao Ye, Junfeng Li, Hongfei Cui, Peter Salzman, Marko Sysi-Aho, Sandra Castillo Priego, Matej Oresic, Gopal Peddinti, Barbara Di Camillo, Zeke Maier, Zhenshu Wen, Xin-dong Zhang

Affiliation

¹ Department of Computer Science, Wayne State University, Perinatology Research Branch, NICHD/NIH, Detroit, MI 48201, USA, The Microsoft Research - University of Trento Centre for Computational and Systems Biology, Rovereto 38068, Italy, ETH Zurich, Zurich 8092, Switzerland, IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA and Philip Morris International, Research & Development, Neuchâtel CH-2000, Switzerland.

Abstract

Motivation: More than a decade after microarrays were first used to predict the phenotype of biological samples, real-life applications for disease screening and for identifying the patients who would benefit most from treatment are still emerging. The scientific community's interest in identifying the best approaches for developing such prediction models was reaffirmed in a competition-style international collaboration, the IMPROVER Diagnostic Signature Challenge, whose results we describe herein.

Results: Fifty-four teams used public data to develop prediction models in four disease areas (multiple sclerosis, lung cancer, psoriasis and chronic obstructive pulmonary disease) and made predictions on blinded new data that we generated. Teams were scored using three metrics that captured various aspects of the quality of predictions, and the best performers received awards. This article presents the challenge results and introduces to the community the approaches of the three best overall performers, as well as an R package that implements the approach of the best overall team. The analyses of model performance data submitted in the challenge, as well as additional simulations that we performed, revealed that (i) the quality of predictions depends more on the disease endpoint than on the particular approaches used in the challenge; (ii) the most important modeling factor (e.g. data preprocessing, feature selection and classifier type) is problem dependent; and (iii) for optimal results, datasets and methods have to be carefully matched. Biomedical factors such as disease severity and confidence in the diagnosis were found to be associated with the misclassification rates across the different teams.
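
The three modeling factors named above (data preprocessing, feature selection and classifier type) can be pictured as stages of one pipeline that is fitted on training data and then applied to held-out samples. The following is a minimal sketch on synthetic data, not the method of any challenge team and not the maPredictDSC implementation. Every function name, the z-score scaling, the mean-difference gene ranking, the nearest-centroid classifier and the simulated expression values are assumptions chosen only for illustration.

```python
import random
import statistics

def fit_scaler(X):
    """Preprocessing: learn per-gene mean and standard deviation on training data."""
    cols = list(zip(*X))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard against zero variance
    return means, sds

def apply_scaler(X, means, sds):
    """Scale samples with parameters learned on the training set only."""
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in X]

def select_features(X, y, k):
    """Feature selection: keep the k genes with the largest class-mean difference."""
    scores = []
    for j in range(len(X[0])):
        case = [row[j] for row, lab in zip(X, y) if lab == 1]
        ctrl = [row[j] for row, lab in zip(X, y) if lab == 0]
        scores.append(abs(statistics.fmean(case) - statistics.fmean(ctrl)))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

def fit_centroids(X, y, feats):
    """Classifier: one centroid per class over the selected genes."""
    cents = {}
    for lab in (0, 1):
        rows = [[row[j] for j in feats] for row, l in zip(X, y) if l == lab]
        cents[lab] = [statistics.fmean(c) for c in zip(*rows)]
    return cents

def predict(X, feats, cents):
    """Assign each sample to the class with the nearest centroid."""
    out = []
    for row in X:
        v = [row[j] for j in feats]
        dist = {lab: sum((a - b) ** 2 for a, b in zip(v, c)) for lab, c in cents.items()}
        out.append(min(dist, key=dist.get))
    return out

def make_data(n, n_genes, informative, shift, rng):
    """Hypothetical data: the first `informative` genes are shifted in class 1."""
    X, y = [], []
    for i in range(n):
        lab = i % 2
        X.append([rng.gauss(shift * lab if j < informative else 0.0, 1.0)
                  for j in range(n_genes)])
        y.append(lab)
    return X, y

rng = random.Random(0)
X_train, y_train = make_data(40, 50, 5, 3.0, rng)
X_test, y_test = make_data(20, 50, 5, 3.0, rng)  # stands in for blinded samples

means, sds = fit_scaler(X_train)
Xtr = apply_scaler(X_train, means, sds)
Xte = apply_scaler(X_test, means, sds)
feats = select_features(Xtr, y_train, k=5)
cents = fit_centroids(Xtr, y_train, feats)
pred = predict(Xte, feats, cents)
accuracy = sum(p == t for p, t in zip(pred, y_test)) / len(y_test)
print(f"selected genes: {sorted(feats)}, test accuracy: {accuracy:.2f}")
```

Each stage can be swapped independently (for example, a different normalization or classifier), which is one way to probe which factor matters most for a given dataset. The simulated effect size here is deliberately large, so the resulting accuracy says nothing about real microarray data.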

Availability: The lung cancer dataset is available from Gene Expression Omnibus (accession GSE43580). The maPredictDSC R package implementing the approach of the best overall team is available at www.bioconductor.org or http://bioinformaticsprb.med.wayne.edu/.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Disease / genetics
Gene Expression Profiling / methods*
Humans
Lung Neoplasms / diagnosis
Lung Neoplasms / genetics
Molecular Diagnostic Techniques*
Multiple Sclerosis / diagnosis
Multiple Sclerosis / genetics
Oligonucleotide Array Sequence Analysis / methods*
Phenotype*
Psoriasis / diagnosis
Psoriasis / genetics
Pulmonary Disease, Chronic Obstructive / diagnosis
Pulmonary Disease, Chronic Obstructive / genetics

Associated data

GEO/GSE43580

Grants and funding

N01-HD-2-3342/HD/NICHD NIH HHS/United States