Omics Data Preprocessing for Machine Learning: A Case Study in Childhood Obesity

Álvaro Torres-Martos; Mireia Bustos-Aibar; Alberto Ramírez-Mena; Sofía Cámara-Sánchez; Augusto Anguita-Ruiz; Rafael Alcalá; Concepción M Aguilera; Jesús Alcalá-Fdez

doi:10.3390/genes14020248

Genes (Basel). 2023 Jan 18;14(2):248. doi: 10.3390/genes14020248.

Authors

Álvaro Torres-Martos¹ ² ³, Mireia Bustos-Aibar¹ ² ³, Alberto Ramírez-Mena⁴, Sofía Cámara-Sánchez⁵, Augusto Anguita-Ruiz² ³ ⁶ ⁷, Rafael Alcalá⁵, Concepción M Aguilera¹ ² ³ ⁷, Jesús Alcalá-Fdez⁵

Affiliations

¹ Department of Biochemistry and Molecular Biology II, University of Granada, 18071 Granada, Spain.
² "José Mataix Verdú" Institute of Nutrition and Food Technology (INYTA), Center of Biomedical Research, University of Granada, 18100 Granada, Spain.
³ Biosanitary Research Institute of Granada (IBS.GRANADA), 18012 Granada, Spain.
⁴ Centre for Genomics and Oncological Research (GENYO), 18016 Granada, Spain.
⁵ Department of Computer Science and Artificial Intelligence, Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI), University of Granada, 18071 Granada, Spain.
⁶ Barcelona Institute for Global Health (ISGlobal), 08003 Barcelona, Spain.
⁷ CIBER Physiopathology of Obesity and Nutrition (CIBEROBN), Instituto de Salud Carlos III, 28029 Madrid, Spain.

Abstract

The use of machine learning techniques to build predictive models of disease outcomes from omics and other types of molecular data has gained enormous relevance in the biomedical field over the last few years. Nonetheless, the value of omics studies and machine learning tools depends on the proper application of algorithms and on the appropriate preprocessing and management of the input omics and molecular data. Currently, many of the available approaches that apply machine learning to omics data for predictive purposes make mistakes in several of the following key steps: experimental design, feature selection, data preprocessing, and algorithm selection. For this reason, we propose the current work as a guideline on how to confront the main challenges inherent to multi-omics human data, and we present a series of best practices and recommendations for each of these steps. In particular, we describe the main particularities of each omics data layer, the most suitable preprocessing approaches for each source, and a compilation of best practices and tips for studying the prediction of disease development using machine learning. Using examples of real data, we show how to address the key problems of multi-omics research (e.g., biological heterogeneity, technical noise, high dimensionality, missing values, and class imbalance). Finally, we define proposals for model improvement based on the results found, which serve as the basis for future work.

Keywords: data pre-processing; machine learning; omics.
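
Two of the problems named in the abstract, missing values and class imbalance, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the choice of median imputation with z-score scaling, the inverse-frequency class weights, and the toy values are all assumptions made for illustration. The point it shows is that preprocessing parameters are learned from the training split only and then reused on held-out samples, so no information leaks from the test set.

```python
from collections import Counter
from statistics import mean, median, pstdev


def fit_preprocessor(train_rows):
    """Learn per-feature median, mean and standard deviation from training rows only."""
    n_features = len(train_rows[0])
    params = []
    for j in range(n_features):
        observed = [row[j] for row in train_rows if row[j] is not None]
        med = median(observed)  # raises if a feature has no observed values
        filled = [row[j] if row[j] is not None else med for row in train_rows]
        mu = mean(filled)
        sd = pstdev(filled) or 1.0  # guard against constant features
        params.append((med, mu, sd))
    return params


def transform(rows, params):
    """Impute missing values and z-score each feature with the training parameters."""
    return [
        [((x if x is not None else med) - mu) / sd
         for x, (med, mu, sd) in zip(row, params)]
        for row in rows
    ]


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


# Toy data (hypothetical values, not taken from the study); None marks a missing value.
X_train = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [None, 20.0]]
y_train = [0, 0, 0, 1]
X_test = [[2.0, None]]

params = fit_preprocessor(X_train)            # fitted on the training split only
X_train_scaled = transform(X_train, params)
X_test_scaled = transform(X_test, params)     # test data reuses training parameters
weights = balanced_class_weights(y_train)     # minority class gets the larger weight
```

The resulting weights could be passed to any learner that accepts per-class weights; resampling the minority class is an alternative approach to imbalance that this sketch does not cover.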

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Child
Humans
Machine Learning
Pediatric Obesity*

Grants and funding

This research was funded in part by ERDF/Regional Government of Andalusia/Ministry of Economic Transformation, Industry, Knowledge, and Universities (grant numbers P18-RT-2248 and B-CTS-536-UGR20) and by the ERDF/Health Institute Carlos III/Spanish Ministry of Science, Innovation, and Universities (grant number PI20/00711).