Combining Fourier and lagged k-nearest neighbor imputation for biomedical time series data

Shah Atiqur Rahman; Yuxiao Huang; Jan Claassen; Nathaniel Heintzman; Samantha Kleinberg

doi:10.1016/j.jbi.2015.10.004

J Biomed Inform. 2015 Dec:58:198-207. doi: 10.1016/j.jbi.2015.10.004. Epub 2015 Oct 21.

Authors

Shah Atiqur Rahman¹, Yuxiao Huang², Jan Claassen³, Nathaniel Heintzman⁴, Samantha Kleinberg⁵

Affiliations

¹ Department of Computer Science, Stevens Institute of Technology, NJ, United States.
² Department of Computer Science, Stevens Institute of Technology, NJ, United States.
³ Division of Critical Care Neurology, Department of Neurology, Columbia University, College of Physicians and Surgeons, New York, NY, United States.
⁴ Dexcom Inc., San Diego, CA, United States.
⁵ Department of Computer Science, Stevens Institute of Technology, NJ, United States.

Abstract

Most clinical and biomedical data contain missing values. A patient's record may be split across multiple institutions, devices may fail, and sensors may not be worn at all times. While these missing values are often ignored, this can lead to bias and error when the data are mined. Further, the data are not simply missing at random. Instead, the measurement of a variable such as blood glucose may depend on its prior values as well as those of other variables. These dependencies exist across time as well, but current methods have yet to incorporate these temporal relationships as well as multiple types of missingness. To address this, we propose an imputation method (FLk-NN) that incorporates time-lagged correlations both within and across variables by combining two imputation methods, one based on an extension to k-NN and the other on the Fourier transform. This enables imputation of missing values even when all data at a time point are missing and when there are different types of missingness both within and across variables. In comparison to other approaches on three biological datasets (simulated and actual Type 1 diabetes datasets, and multi-modality neurological ICU monitoring), the proposed method has the highest imputation accuracy. This held when up to half the data were missing and when consecutive missing values made up a significant fraction of the overall time series length.
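
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of combining a Fourier-based estimate with a lagged k-NN estimate, not the authors' implementation. The function names (`fourier_impute`, `lagged_knn_impute`, `combined_impute`), the fixed lag, the iterative low-pass reconstruction, and the simple averaging of the two estimates are all assumptions made for illustration.

```python
import numpy as np

def fourier_impute(x, keep=5, n_iter=50):
    """Fill NaNs in one series with an iterative low-frequency Fourier fit.
    (Illustrative assumption; not necessarily the paper's Fourier step.)"""
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    filled = x.copy()
    filled[missing] = np.nanmean(x)           # initial guess for the gaps
    for _ in range(n_iter):
        spec = np.fft.rfft(filled)
        spec[keep:] = 0                       # keep only the lowest `keep` bins
        recon = np.fft.irfft(spec, n=len(filled))
        filled[missing] = recon[missing]      # update only the missing entries
    return filled

def lagged_knn_impute(data, target, lag=1, k=3):
    """Estimate missing target values from the k time points whose *lagged*
    values of the other variables are most similar (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    n_times, n_vars = data.shape
    others = [c for c in range(n_vars) if c != target]
    # Feature vector at time t = other variables observed at time t - lag.
    feats = np.full((n_times, len(others)), np.nan)
    feats[lag:] = data[:n_times - lag][:, others]
    usable = ~np.isnan(feats).any(axis=1)
    observed = ~np.isnan(data[:, target])
    donors = np.where(usable & observed)[0]
    out = data[:, target].copy()
    if len(donors) == 0:
        return out
    for t in np.where(usable & ~observed)[0]:
        dist = np.linalg.norm(feats[donors] - feats[t], axis=1)
        nearest = donors[np.argsort(dist)[:k]]
        out[t] = data[nearest, target].mean()
    return out

def combined_impute(data, target, lag=1, k=3, keep=5):
    """Average the two estimates where both exist; otherwise fall back to
    the Fourier estimate (the averaging rule is an assumption)."""
    data = np.asarray(data, dtype=float)
    col = data[:, target]
    missing = np.isnan(col)
    f_est = fourier_impute(col, keep=keep)
    k_est = lagged_knn_impute(data, target, lag=lag, k=k)
    both = missing & ~np.isnan(k_est)
    result = col.copy()
    result[missing] = f_est[missing]
    result[both] = 0.5 * (f_est[both] + k_est[both])
    return result

# Usage on synthetic data: variable 1 leads variable 0 by one time step,
# and variable 0 has a block of ten consecutive missing values.
t = np.arange(200)
truth = np.sin(2 * np.pi * t / 50)
demo = np.column_stack([truth, np.sin(2 * np.pi * (t + 1) / 50)])
demo[60:70, 0] = np.nan
estimate = combined_impute(demo, target=0, lag=1, k=3, keep=5)
print("max abs error:", np.max(np.abs(estimate - truth)))
```

The fallback to the Fourier estimate reflects the abstract's point that imputation should still be possible when all variables are missing at a time point, since a within-series method does not need the other variables to be observed.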

Keywords: Biomedical data; Imputation; Missing data; Time series.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Fourier Analysis*
Models, Theoretical
