Biomarkers

Alzheimers Dement. 2024 Dec;20 Suppl 2(Suppl 2):e085143. doi: 10.1002/alz.085143.

Abstract

Background: Diagnostic and prognostic decisions about Alzheimer's disease (AD) are more accurate when based on large data sets. We developed and validated a machine learning (ML) data harmonization tool for aggregating prospective data from neuropsychological tests used to study AD. The online ML-combine application (OML-Combine) lets researchers apply the ML harmonization method to harmonize their own data with data from other large, available databases (e.g., AIBL) and so develop their own neuropsychological models of AD.

Method: The OML-Combine application implements an established neuropsychological test data harmonization method (ref. 1) based on non-parametric multivariate imputation using random forests (missForest; ref. 2). Test data not collected in a cohort are classified as missing and imputed from the data that cohort does have, drawing on information known from other studies (ref. 1). A web-based R Shiny application was developed to facilitate harmonization of data from different cohorts and to visualise outcomes. OML-Combine also calculates the percentage of missing values for each test score across the pooled dataset, which informs decisions about the validity of the harmonized data, and it can harmonize multiple datasets simultaneously.
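
The pooling and missingness steps described in the Method can be illustrated with a minimal Python sketch. It is not the OML-Combine implementation (which is an R Shiny application); the function names, cohort labels, test names (`mmse`, `lm_delayed`, `cvlt_delayed`) and scores are all hypothetical toy values chosen for illustration. The sketch stacks cohorts on the union of their test names, marks tests a cohort did not collect as missing, and reports the percentage of missing values per test across the pooled data. The random-forest imputation itself is not shown.

```python
from typing import Dict, List


def pool_cohorts(cohorts: Dict[str, List[Dict[str, float]]]) -> List[dict]:
    """Stack cohorts on the union of test names; uncollected tests become None."""
    all_tests = sorted({test for rows in cohorts.values() for row in rows for test in row})
    pooled = []
    for cohort_name, rows in cohorts.items():
        for row in rows:
            record = {"cohort": cohort_name}
            for test in all_tests:
                # A test the cohort never administered is recorded as missing.
                record[test] = row.get(test)
            pooled.append(record)
    return pooled


def percent_missing(pooled: List[dict]) -> Dict[str, float]:
    """Percentage of missing values for each test across the pooled records."""
    tests = [key for key in pooled[0] if key != "cohort"]
    n = len(pooled)
    return {t: 100.0 * sum(1 for r in pooled if r[t] is None) / n for t in tests}


# Toy data (hypothetical): each cohort collected a different test battery.
cohorts = {
    "cohort_a": [{"mmse": 28.0, "lm_delayed": 11.0}, {"mmse": 24.0, "lm_delayed": 6.0}],
    "cohort_b": [{"mmse": 29.0, "cvlt_delayed": 12.0}, {"mmse": 22.0, "cvlt_delayed": 5.0}],
}
pooled = pool_cohorts(cohorts)
missing = percent_missing(pooled)
print(missing)
```

In a workflow like the one described, a test with a very high missing percentage would have most of its pooled values imputed, which is the kind of signal the missingness summary is meant to surface before the harmonized scores are trusted.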

Result: The R Shiny package was used to produce an interactive data harmonization tool. Figure 1 displays the interface, showing results from an example harmonization and validation step using simulated data from AIBL (N=1813) and ADNI (N=1945). In the validation tab, users are given a figure showing the distributions of both the raw and the harmonized datasets, including predicted test scores and an accuracy measurement for each score. These can be used to validate the outcomes and to compare them with known relationships established from the raw data for each dataset.
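
The abstract does not state how the per-score accuracy measurement is computed. One common way to assess imputation accuracy, sketched here purely as an assumption, is to hide a fraction of the observed scores, impute them, and compare the predictions with the hidden values. In this sketch `mean_imputer` is a deliberately simple stand-in (it is not missForest), and `masked_rmse`, its parameters, and the choice of root-mean-square error are hypothetical.

```python
import math
import random
from typing import Callable, List, Optional

Scores = List[Optional[float]]


def mean_imputer(values: Scores) -> List[float]:
    """Stand-in imputer: fill gaps with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


def masked_rmse(values: Scores, impute: Callable[[Scores], List[float]],
                mask_fraction: float = 0.2, seed: int = 0) -> float:
    """Hide a fraction of observed scores, impute them, and return the RMSE."""
    observed_idx = [i for i, v in enumerate(values) if v is not None]
    if len(observed_idx) < 2:
        raise ValueError("need at least two observed scores to mask one")
    # Mask at least one score, but always leave at least one observed.
    n_mask = max(1, int(len(observed_idx) * mask_fraction))
    n_mask = min(n_mask, len(observed_idx) - 1)
    hidden = random.Random(seed).sample(observed_idx, n_mask)
    masked = list(values)
    for i in hidden:
        masked[i] = None
    filled = impute(masked)
    squared_errors = [(filled[i] - values[i]) ** 2 for i in hidden]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy scores for one test (hypothetical); None marks a genuinely missing value.
scores = [28.0, 24.0, None, 29.0, 22.0, 26.0, None, 27.0]
error = masked_rmse(scores, mean_imputer)
print(f"masked RMSE: {error:.2f}")
```

Any imputer with the same call shape could be passed in place of `mean_imputer`, so the same check could in principle be applied to a random-forest imputer to obtain an error estimate per test score.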

Conclusion: OML-Combine facilitates the harmonization of neuropsychological test data from established AD cohort studies. Visualizing predicted test scores alongside the original data sets helps users judge the accuracy of the harmonized data and provides a basis for adjusting inputs to optimize models. This allows researchers to combine their own data with data from other currently available studies to improve diagnostic and prognostic models of AD.

References: 1. doi: 10.1002/alz.044302; 2. doi: 10.1093/bioinformatics/btr597.

MeSH terms

  • Alzheimer Disease* / diagnosis
  • Biomarkers*
  • Humans
  • Machine Learning*
  • Neuropsychological Tests* / statistics & numerical data

Substances

  • Biomarkers