A Workflow for Missing Values Imputation of Untargeted Metabolomics Data

Tariq Faquih; Maarten van Smeden; Jiao Luo; Saskia le Cessie; Gabi Kastenmüller; Jan Krumsiek; Raymond Noordam; Diana van Heemst; Frits R Rosendaal; Astrid van Hylckama Vlieg; Ko Willems van Dijk; Dennis O Mook-Kanamori

doi:10.3390/metabo10120486

Metabolites. 2020 Nov 26;10(12):486. doi: 10.3390/metabo10120486.

Authors

Tariq Faquih¹, Maarten van Smeden², Jiao Luo¹, Saskia le Cessie¹,³, Gabi Kastenmüller⁴,⁵, Jan Krumsiek⁶, Raymond Noordam⁷, Diana van Heemst⁷, Frits R Rosendaal¹, Astrid van Hylckama Vlieg¹, Ko Willems van Dijk⁸,⁹,¹⁰, Dennis O Mook-Kanamori¹,¹¹,¹²

Affiliations

¹ Department of Clinical Epidemiology, Leiden University Medical Center, Postal Zone C7-P, PO Box 9600, 2300 RC Leiden, The Netherlands.
² Julius Center for Health Sciences and Primary Care, University Medical Centre Utrecht, Utrecht University, 8, 3584 Utrecht, The Netherlands.
³ Department of Biomedical Data Sciences, Section Medical Statistics and Bioinformatics, Leiden University Medical Center, 2, 2333 Leiden, The Netherlands.
⁴ Institute of Bioinformatics and Systems Biology, Helmholtz-Zentrum München, 85764 Neuherberg, Germany.
⁵ Institute of Experimental Genetics, Genome Analysis Center, Helmholtz Zentrum München, 85764 Neuherberg, Germany.
⁶ Department of Physiology, Institute for Computational Biomedicine, Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY 10065, USA.
⁷ Department of Internal Medicine, Section of Gerontology and Geriatrics, Leiden University Medical Center, 2333ZA Leiden, The Netherlands.
⁸ Einthoven Laboratory for Experimental Vascular Medicine, Leiden University Medical Center, 2, 2333 Leiden, The Netherlands.
⁹ Department of Internal Medicine, Division of Endocrinology, Leiden University Medical Center, 2, 2333 Leiden, The Netherlands.
¹⁰ Department of Human Genetics, Leiden University Medical Center, 2, 2333 Leiden, The Netherlands.
¹¹ Department of Public Health and Primary Care, Leiden University Medical Center, 2, 233 Leiden, The Netherlands.
¹² Metabolon Inc., Morrisville, NC 27560, USA.

Abstract

Metabolomics studies have grown steadily owing to the development and implementation of affordable, high-quality metabolomics platforms. In large metabolite panels, measurement values are frequently missing and, if neglected or sub-optimally imputed, can bias study results. We provided a publicly available, user-friendly R script to streamline the imputation of missing endogenous, unannotated, and xenobiotic metabolites. We evaluated the two methods implemented in our script, multivariate imputation by chained equations (MICE) and k-nearest neighbors (kNN), by simulations using measured metabolite data from the Netherlands Epidemiology of Obesity (NEO) study (n = 599). We simulated missing values in four unique metabolites from different pathways with different correlation structures, in three sample sizes (599, 150, 50), with three percentages of missing values (15%, 30%, 60%), and under two missingness mechanisms (missing completely at random and missing not at random). Based on the simulations, we found that for MICE, larger sample size was the primary factor decreasing bias and error. For kNN, the primary factor reducing bias and error was the correlation of the metabolite with its predictor metabolites. MICE provided consistently higher performance measures, particularly for larger datasets (n > 50). In conclusion, we presented an imputation workflow in a publicly available R script to impute untargeted metabolomics data. Our simulations provided insight into the effects of sample size, percentage missing, and correlation structure on the accuracy of the two imputation methods.

Keywords: imputation; k-nearest neighbors; metabolon; multiple imputation using chained equations; simulation; untargeted metabolomics; workflow.
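
The simulation design described in the abstract (mask a known metabolite under a chosen missingness mechanism, impute it, then compare against the true values) can be illustrated with a small sketch. This is not the authors' R script and it does not use the NEO data: it is a standard-library Python illustration with synthetic data. The function names, the single synthetic predictor, the correlation of 0.8, the choice of k = 5, and the use of left-censoring (dropping the lowest values) as the missing-not-at-random mechanism are all assumptions made for the example. MICE is not sketched here.

```python
import math
import random


def make_data(n, rho, rng):
    """Synthetic stand-in: a target y correlated (about rho) with one predictor x."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rho * xi + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for xi in x]
    return x, y


def mask_mcar(values, frac, rng):
    """Missing completely at random: hide a uniformly random subset."""
    n_miss = round(frac * len(values))
    missing = set(rng.sample(range(len(values)), n_miss))
    return [None if i in missing else v for i, v in enumerate(values)]


def mask_mnar(values, frac):
    """Missing not at random (assumed form): hide the lowest values,
    as if they fell below a detection limit."""
    n_miss = round(frac * len(values))
    order = sorted(range(len(values)), key=lambda i: values[i])
    missing = set(order[:n_miss])
    return [None if i in missing else v for i, v in enumerate(values)]


def knn_impute(x, y_masked, k=5):
    """Fill each missing y with the mean y of the k observed samples
    whose predictor value x is closest."""
    observed = [(xi, yi) for xi, yi in zip(x, y_masked) if yi is not None]
    filled = []
    for xi, yi in zip(x, y_masked):
        if yi is None:
            nearest = sorted(observed, key=lambda p: abs(p[0] - xi))[:k]
            yi = sum(p[1] for p in nearest) / len(nearest)
        filled.append(yi)
    return filled


def rmse(truth, imputed, masked):
    """Root mean squared error over the positions that were masked."""
    errs = [(t - v) ** 2 for t, v, m in zip(truth, imputed, masked) if m is None]
    return math.sqrt(sum(errs) / len(errs))


rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (599, 150, 50):
    x, y = make_data(n, rho=0.8, rng=rng)
    for frac in (0.15, 0.30, 0.60):
        y_mcar = mask_mcar(y, frac, rng)
        y_mnar = mask_mnar(y, frac)
        err_mcar = rmse(y, knn_impute(x, y_mcar), y_mcar)
        err_mnar = rmse(y, knn_impute(x, y_mnar), y_mnar)
        print(f"n={n:3d} missing={frac:.0%} "
              f"RMSE MCAR={err_mcar:.2f} MNAR={err_mnar:.2f}")
```

Because the true values are known before masking, the error of each imputation can be measured directly, which is the logic behind evaluating methods by simulation. Changing `rho` in this sketch is one way to explore how the correlation between a metabolite and its predictor affects kNN accuracy.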
