CohortDiagnostics: Phenotype evaluation across a network of observational data sources using population-level characterization

Gowtham A Rao; Azza Shoaibi; Rupa Makadia; Jill Hardin; Joel Swerdel; James Weaver; Erica A Voss; Mitchell M Conover; Stephen Fortin; Anthony G Sena; Chris Knoll; Nigel Hughes; James P Gilbert; Clair Blacketer; Alan Andryc; Frank DeFalco; Anthony Molinaro; Jenna Reps; Martijn J Schuemie; Patrick B Ryan

doi:10.1371/journal.pone.0310634

PLoS One. 2025 Jan 16;20(1):e0310634. doi: 10.1371/journal.pone.0310634. eCollection 2025.

Authors

Gowtham A Rao^{1,2}, Azza Shoaibi^{1,2}, Rupa Makadia^{1,2}, Jill Hardin^{1,2}, Joel Swerdel^{1,2}, James Weaver^{1,2}, Erica A Voss^{1,2}, Mitchell M Conover^{1,2}, Stephen Fortin^{1,2}, Anthony G Sena^{1,2}, Chris Knoll^{1,2}, Nigel Hughes^{1,2}, James P Gilbert^{1,2}, Clair Blacketer^{1,2}, Alan Andryc^{1,2}, Frank DeFalco^{1,2}, Anthony Molinaro^{1,2}, Jenna Reps^{1,2}, Martijn J Schuemie^{1,2,3}, Patrick B Ryan^{1,2,4}

Affiliations

¹ Observational Health Data Analytics, Janssen Research and Development, LLC, Titusville, NJ, United States of America.
² OHDSI Collaborators, Observational Health Data Sciences and Informatics (OHDSI), New York, NY, United States of America.
³ Department of Biostatistics, University of California, Los Angeles, CA, United States of America.
⁴ Department of Biomedical Informatics, Columbia University, New York, NY, United States of America.

PMID: 39820599
DOI: 10.1371/journal.pone.0310634

Abstract

Objective: This paper introduces a novel framework for evaluating phenotype algorithms (PAs) using the open-source tool CohortDiagnostics.

Materials and methods: The method is based on several diagnostic criteria to evaluate a patient cohort returned by a PA. Diagnostics include estimates of incidence rate, a breakdown of the codes by which patients enter the cohort on the index date, and the prevalence of all observed clinical events prior to, on, and after the index date. We test our framework by evaluating one PA for systemic lupus erythematosus (SLE) and two PAs for Alzheimer's disease (AD) across 10 different observational data sources.
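As an illustration of the first diagnostic, a stratified incidence rate can be sketched as new cases divided by person-time at risk. The Python below is a minimal sketch under stated assumptions, not the CohortDiagnostics implementation: the function name, the record layout, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def incidence_rate_per_1000_py(records):
    """Incidence rate per 1,000 person-years for each stratum.

    `records` is an iterable of (stratum, person_days_at_risk, is_incident_case)
    tuples. This layout is a hypothetical one chosen for the example.
    """
    cases = defaultdict(int)
    person_days = defaultdict(float)
    for stratum, days, is_case in records:
        person_days[stratum] += days
        cases[stratum] += int(is_case)
    # Convert person-days to person-years, then scale to 1,000 person-years.
    return {
        stratum: 1000.0 * cases[stratum] / (person_days[stratum] / 365.25)
        for stratum in person_days
        if person_days[stratum] > 0
    }


# Made-up toy data for illustration only; not results from the paper.
toy_records = [
    ("female", 365.25, True),
    ("female", 730.5, False),
    ("male", 365.25, False),
    ("male", 730.5, False),
]
rates = incidence_rate_per_1000_py(toy_records)
```

Stratifying by sex, age group, or calendar year in this way lets a reviewer check whether the rates produced by a PA follow the pattern expected for the disease.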

Results: Using CohortDiagnostics, we found that the population-level characteristics of individuals in the SLE cohort closely matched the disease's anticipated clinical profile. Specifically, the incidence rate of SLE was consistently higher among females. Moreover, expected clinical events such as laboratory tests, treatments, and repeated diagnoses were also observed. For AD, although one PA identified considerably fewer patients, the absence of notable differences in clinical characteristics between the two cohorts suggested similar specificity.

Discussion: We provide a practical and data-driven approach to evaluate PAs, using two diseases as examples, across a network of OMOP data sources. CohortDiagnostics can ensure that the subjects identified by a specific PA align with those intended for inclusion in a research study.

Conclusion: Diagnostics based on large-scale population-level characterization can offer insights into the misclassification errors of PAs.

Copyright: © 2025 Rao et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

MeSH terms

Algorithms*
Alzheimer Disease* / diagnosis
Alzheimer Disease* / epidemiology
Cohort Studies
Female
Humans
Incidence
Information Sources
Lupus Erythematosus, Systemic* / diagnosis
Lupus Erythematosus, Systemic* / epidemiology
Male
Phenotype*