A Statistical Approach for Identifying the Best Combination of Normalization and Imputation Methods for Label-Free Proteomics Expression Data

Kabilan Sakthivel; Shashi Bhushan Lal; Sudhir Srivastava; Krishna Kumar Chaturvedi; Yasin Jeshima Khan; Dwijesh Chandra Mishra; Sharanbasappa D Madival; Ramasubramanian Vaidhyanathan; Girish Kumar Jha

doi:10.1021/acs.jproteome.4c00552

J Proteome Res. 2024 Dec 10. doi: 10.1021/acs.jproteome.4c00552. Online ahead of print.

Authors

Kabilan Sakthivel¹,², Shashi Bhushan Lal², Sudhir Srivastava², Krishna Kumar Chaturvedi², Yasin Jeshima Khan³, Dwijesh Chandra Mishra², Sharanbasappa D Madival², Ramasubramanian Vaidhyanathan⁴, Girish Kumar Jha²

Affiliations

¹ The Graduate School, ICAR-Indian Agricultural Research Institute, New Delhi 110012, India.
² Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
³ Division of Genomic Resources, ICAR-National Bureau of Plant Genetic Resources, New Delhi 110012, India.
⁴ Research Systems Management, ICAR-National Academy of Agricultural Research Management, Hyderabad 500030, India.

PMID: 39659155
DOI: 10.1021/acs.jproteome.4c00552

Abstract

Label-free proteomics expression data sets often exhibit data heterogeneity and missing values, necessitating effective normalization and imputation methods. The choice of normalization and imputation methods is inherently data-specific, and selecting the optimal approach from the available options is critical for robust downstream analysis. This study aimed to identify the most suitable combination of these methods for quality control and accurate identification of differentially expressed proteins. We developed nine combinations by pairing three normalization methods, namely locally weighted linear regression (LOESS), variance stabilization normalization (VSN), and robust linear regression (RLR), with three imputation methods, namely k-nearest neighbors (k-NN), local least-squares (LLS), and singular value decomposition (SVD). We used three statistical measures, the pooled coefficient of variation (PCV), pooled estimate of variance (PEV), and pooled median absolute deviation (PMAD), to assess intragroup and intergroup variation. For each measure, the combination yielding the lowest value was selected as a suitable normalization and imputation choice for the data set. The approach was tested on two spiked-in standard label-free proteomics benchmark data sets. The identified combinations returned a low normalized root-mean-square error (NRMSE) and showed better performance in identifying spiked-in proteins. The approach is available through the R package 'lfproQC' and a user-friendly Shiny web application (https://dabiniasri.shinyapps.io/lfproQC and http://omics.icar.gov.in/lfproQC), making it a valuable resource for researchers who wish to apply this method to their own data sets.
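The selection rule described in the abstract can be illustrated with a minimal Python sketch. It is not the lfproQC implementation. The abstract does not give formulas for PCV, PEV, or PMAD, so the definitions below are assumptions: each statistic is computed per protein within a sample group, averaged over proteins, and then averaged over groups. The function names `pooled_measures` and `best_combinations`, the combination labels, and the toy intensity values are all hypothetical.

```python
import numpy as np


def pooled_measures(data, groups):
    """Return assumed PCV, PEV and PMAD for a complete proteins x samples matrix.

    `groups` holds one group label per sample (column). The pooling scheme
    (mean over proteins, then mean over groups) is an assumption made for
    this sketch, not a definition taken from the paper.
    """
    data = np.asarray(data, dtype=float)
    groups = np.asarray(groups)
    cv, var, mad = [], [], []
    for g in np.unique(groups):
        block = data[:, groups == g]          # replicates of one group
        mean = block.mean(axis=1)
        sd = block.std(axis=1, ddof=1)        # sample standard deviation
        med = np.median(block, axis=1, keepdims=True)
        cv.append(np.mean(sd / mean))
        var.append(np.mean(sd ** 2))
        mad.append(np.mean(np.median(np.abs(block - med), axis=1)))
    return {
        "PCV": float(np.mean(cv)),
        "PEV": float(np.mean(var)),
        "PMAD": float(np.mean(mad)),
    }


def best_combinations(results):
    """Map each measure to the combination with the lowest value."""
    return {
        m: min(results, key=lambda name: results[name][m])
        for m in ("PCV", "PEV", "PMAD")
    }


# Toy matrices standing in for the output of two hypothetical
# normalization + imputation combinations (2 proteins, 2 groups x 2 replicates).
groups = ["A", "A", "B", "B"]
processed = {
    "combo_1": [[10.0, 10.1, 12.0, 12.1], [20.0, 20.2, 18.0, 18.1]],
    "combo_2": [[10.0, 11.0, 12.0, 13.5], [20.0, 22.0, 18.0, 16.0]],
}
results = {name: pooled_measures(m, groups) for name, m in processed.items()}
best = best_combinations(results)
print(best)
```

In this toy case the replicates in `combo_1` agree more closely within each group, so it has the lowest value for all three measures. On real data the three measures may point to different combinations, which is why the sketch reports one choice per measure.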

Keywords: bottom-up approach; differential expression analysis; label-free proteomics; missing value imputation; normalization; protein; quality control.