Prospective modeling and estimating the epidemiologically informative match rate within large foodborne pathogen genomic databases

Lanlan Yin; James B Pettengill

doi:10.1186/s13104-024-06847-z

BMC Res Notes. 2024 Jul 9;17(1):191. doi: 10.1186/s13104-024-06847-z.

Authors

Lanlan Yin¹, James B Pettengill²

Affiliations

¹ Biostatistics and Bioinformatics Staff, Office of Analytics and Outreach, Center for Food Safety and Applied Nutrition, U.S. Food and Drug Administration, College Park, MD, USA.
² Biostatistics and Bioinformatics Staff, Office of Analytics and Outreach, Center for Food Safety and Applied Nutrition, U.S. Food and Drug Administration, College Park, MD, USA.

Abstract

Objectives: Much has been written about the utility of genomic databases to public health. Within food safety, these databases contain data from two types of isolates: those from patients (i.e., clinical) and those from non-clinical sources (e.g., a food manufacturing environment). A genetic match between isolates from these sources represents a signal of interest. We investigate the match rate within three large genomic databases (Listeria monocytogenes, Escherichia coli, and Salmonella) and the smaller Cronobacter database; the databases are part of the Pathogen Detection project at NCBI (National Center for Biotechnology Information).
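
As an illustration of the quantity being studied, the match rate can be sketched as the fraction of clinical isolates that share a genetic cluster with at least one non-clinical isolate. The sketch below is not the authors' procedure: the `(cluster_id, source)` tuple layout, the source labels, and the cluster-based definition of a match are all assumptions made for this example, and the data are invented.

```python
from collections import defaultdict


def match_rate(isolates):
    """Fraction of clinical isolates whose cluster also holds a non-clinical isolate.

    `isolates` is an iterable of (cluster_id, source) pairs, with source being
    "clinical" or "non-clinical". The layout and the cluster-based notion of a
    match are assumptions for illustration only.
    """
    isolates = list(isolates)

    # Record which source types appear in each cluster.
    sources_by_cluster = defaultdict(set)
    for cluster_id, source in isolates:
        sources_by_cluster[cluster_id].add(source)

    clinical_clusters = [c for c, s in isolates if s == "clinical"]
    if not clinical_clusters:
        return 0.0  # no clinical isolates, so no rate to report

    matched = sum(
        1 for c in clinical_clusters if "non-clinical" in sources_by_cluster[c]
    )
    return matched / len(clinical_clusters)


# Hypothetical toy data: clusters A and D mix both sources, B is clinical only.
toy = [
    ("A", "clinical"), ("A", "non-clinical"),
    ("B", "clinical"), ("B", "clinical"),
    ("C", "non-clinical"),
    ("D", "clinical"), ("D", "non-clinical"),
]
print(f"match rate: {match_rate(toy):.2f}")
```

Counting per clinical isolate rather than per cluster keeps the rate interpretable as the chance that a given patient isolate has a non-clinical match.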

Results: Currently, the match rate of clinical isolates to non-clinical isolates is 33% for L. monocytogenes, 46% for Salmonella, and 7% for E. coli. These match rates are associated with several database features including the diversity of the organism, the database size, and the proportion of non-clinical BioSamples. Modeling match rate via logistic regression showed relatively good performance. Our prediction model illustrates the importance of populating databases with non-clinical isolates to better identify a match for clinical samples. Such information should help public health officials prioritize surveillance strategies and show the critical need to populate fledgling databases (e.g., Cronobacter sakazakii).
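
To make the modeling idea concrete, a minimal logistic regression sketch follows. It assumes a single predictor (the proportion of non-clinical isolates in the database) and a binary outcome (whether a clinical isolate found a match). The abstract does not specify the model's predictors, fitting procedure, or data, so the toy values, the function names, and the gradient-descent fit are all hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=5000):
    """Fit P(match) = sigmoid(b0 + b1 * x) by batch gradient descent.

    A simple stand-in for a statistics package, chosen so the example has no
    third-party dependencies; it is not the authors' fitting procedure.
    """
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # gradient of the log-loss
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


# Invented data: x = proportion of non-clinical isolates in the database,
# y = 1 if the clinical isolate matched a non-clinical isolate, else 0.
xs = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
p_low = sigmoid(b0 + b1 * 0.10)
p_high = sigmoid(b0 + b1 * 0.70)
print(f"intercept={b0:.2f}, slope={b1:.2f}")
print(f"P(match | 10% non-clinical)={p_low:.2f}")
print(f"P(match | 70% non-clinical)={p_high:.2f}")
```

In this toy setting a positive slope means the predicted chance of a match rises as the database holds proportionally more non-clinical isolates, which is the qualitative pattern the Results describe.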

Keywords: Foodborne pathogen; Genomics; Surveillance.

MeSH terms

Databases, Genetic*
Escherichia coli / genetics
Escherichia coli / isolation & purification
Food Microbiology
Foodborne Diseases / epidemiology
Foodborne Diseases / microbiology
Humans
Listeria monocytogenes / genetics
Listeria monocytogenes / isolation & purification
Prospective Studies
Salmonella* / genetics
Salmonella* / isolation & purification