Training machine learning models to detect rare inborn errors of metabolism (IEMs) based on GC-MS urinary metabolomics for diseases screening

Haomin Li; Siyuan Gao; Dan Wu; Min Zhu; Zhenzhen Hu; Kexin Fang; Xiuru Chen; Zhou Ni; Jing Li; Beibei Zhao; Xuhui She; Xinwen Huang

doi:10.1016/j.ijmedinf.2024.105765

Int J Med Inform. 2024 Dec 16:195:105765. doi: 10.1016/j.ijmedinf.2024.105765. Online ahead of print.

Authors

Haomin Li¹, Siyuan Gao², Dan Wu³, Min Zhu², Zhenzhen Hu⁴, Kexin Fang⁴, Xiuru Chen⁵, Zhou Ni⁵, Jing Li², Beibei Zhao⁵, Xuhui She⁶, Xinwen Huang⁷

Affiliations

¹ Clinical Data Center, the Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou 310052, China.
² Digital Management Center, Guangzhou KingMed Diagnostics Group Co., Ltd., Guangzhou 510005, China.
³ The College of Biomedical Engineering and Instrument Science, Zhejiang University, Zhejiang, China.
⁴ Department of Genetics and Metabolism, the Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou 310052, China.
⁵ Clinical Mass Spectrometry Center, Guangzhou KingMed Center for Clinical Laboratory Co., Ltd., Guangzhou 510005, China.
⁶ Clinical Mass Spectrometry Center, Guangzhou KingMed Center for Clinical Laboratory Co., Ltd., Guangzhou 510005, China; KingMed College of Laboratory Medicine of Guangzhou Medical University, Guangzhou 510005, China. Electronic address: [email protected].
⁷ Department of Genetics and Metabolism, the Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou 310052, China. Electronic address: [email protected].

PMID: 39705916

Abstract

Background: Gas chromatography-mass spectrometry (GC-MS) has been shown to be a potentially efficient platform for metabolic profiling of urine. However, its widespread use for inborn errors of metabolism (IEM) screening is constrained by the rarity of IEMs in the population and by the specialized expertise needed to interpret GC-MS organic acid profiles.

Methods: Based on 355,197 GC-MS test cases accumulated from 2013 to 2021 in China, a random forest-based machine learning model was proposed, trained, and evaluated. Weighted undersampling or oversampling data processing and staged modeling strategies were used to handle the highly imbalanced data and improve the ability of the model to identify different types of rare IEM cases.
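The rebalancing idea mentioned in the Methods can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not give the weighting scheme, so the helper `rebalance`, the `target_per_class` parameter, and the toy labels below are all hypothetical. The sketch only shows how undersampling a majority class and oversampling a rare class can bring both to a common size before a classifier is trained.

```python
import random
from collections import Counter


def rebalance(labels, target_per_class, seed=0):
    """Return indices resampled so every class has target_per_class items.

    Hypothetical helper for illustration only.
    Classes larger than the target are undersampled without replacement;
    classes smaller than the target are oversampled with replacement.
    """
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    chosen = []
    for idxs in by_class.values():
        if len(idxs) >= target_per_class:
            # Majority class: keep a random subset.
            chosen.extend(rng.sample(idxs, target_per_class))
        else:
            # Rare class: draw with replacement until the target is reached.
            chosen.extend(rng.choices(idxs, k=target_per_class))

    rng.shuffle(chosen)
    return chosen


# Toy, made-up labels: 990 negatives and 10 positives.
labels = ["negative"] * 990 + ["positive"] * 10
indices = rebalance(labels, target_per_class=100)
counts = Counter(labels[i] for i in indices)
```

In a staged setup such as the one described, a resampled index set like this would presumably be built separately for the first-stage screening model and for each second-stage disease-specific model, but that detail is an assumption rather than something the abstract states.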

Results: In the first-stage model, which only identified positive cases without discriminating the specific IEM, the screening sensitivity was 0.938 (or 0.991 if abnormal cases were also included). The average sensitivity of the second-stage models, which classify 11 particular IEMs, was 0.992, with an average specificity and accuracy of 0.944 and 0.969, respectively. The SHAP values visualized for each model explain the basis for the differential diagnosis made by the model.
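For readers unfamiliar with the reported metrics, the following sketch shows how sensitivity, specificity, and accuracy are conventionally computed from binary predictions. The function name `screening_metrics` and the toy labels are hypothetical and are not drawn from the study data.

```python
def screening_metrics(y_true, y_pred, positive=1):
    """Compute sensitivity, specificity, and accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        # Share of true positive cases that the model flags.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of true negative cases that the model clears.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        # Share of all cases classified correctly.
        "accuracy": (tp + tn) / len(pairs) if pairs else float("nan"),
    }


# Toy, made-up example: 4 positive and 6 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = screening_metrics(y_true, y_pred)
```

In a screening setting, sensitivity is usually the metric of most concern, since a missed positive case is more costly than a false alarm that is resolved on follow-up.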

Conclusion: With sufficient high-quality data, machine learning models can provide high-sensitivity GC-MS interpretation and greatly improve the efficiency and quality of GC-MS based IEM screening.

Keywords: Disease screening; GC–MS; Imbalanced classification; Inborn error of metabolism; Machine learning.