Using genomic data and machine learning to predict antibiotic resistance: A tutorial paper

PLoS Comput Biol. 2024 Dec 30;20(12):e1012579. doi: 10.1371/journal.pcbi.1012579. eCollection 2024 Dec.

Abstract

Antibiotic resistance is a global public health concern. Bacteria have evolved resistance to most antibiotics, which means that for any given bacterial infection, the bacteria may be resistant to one or several antibiotics. It has been suggested that genomic sequencing and machine learning (ML) could make resistance testing more accurate and cost-effective. Given that ML is likely to become an ever more important tool in medicine, we believe that it is important for pre-health students and others in the life sciences to learn to use ML tools. This paper provides a step-by-step tutorial to train four ML models (logistic regression, random forests, extreme gradient-boosted trees, and neural networks) to predict drug resistance for Escherichia coli isolates and to evaluate their performance using different metrics and cross-validation techniques. We also guide the user in how to load and prepare the data used for the ML models. The tutorial is accessible to beginners and requires no software installation because it is based on Google Colab notebooks. It also provides a basic understanding of the different ML models. The tutorial can be used in undergraduate and graduate classes for students in Biology, Public Health, Computer Science, or related fields.
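
The workflow the abstract describes (train several classifiers, then compare them with cross-validation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the tutorial's own notebook code. The data are synthetic stand-ins rather than real E. coli isolates, the feature layout and the resistance rule are hypothetical, and scikit-learn's `GradientBoostingClassifier` is swapped in for extreme gradient-boosted trees so that the example needs only scikit-learn and NumPy. Balanced accuracy is used as one example metric; the abstract does not name the metrics the tutorial uses.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# ASSUMPTION: synthetic stand-in data, not real genomic data.
# Rows are hypothetical isolates; columns are binary presence/absence features.
n_isolates, n_features = 200, 50
X = rng.integers(0, 2, size=(n_isolates, n_features))

# ASSUMPTION: a made-up rule in which resistance depends mostly on the
# first three features, with some added noise.
signal = X[:, 0] + X[:, 1] + X[:, 2]
y = (signal + rng.normal(0, 0.5, n_isolates) >= 2).astype(int)

# Four model families mirroring those named in the abstract.
# GradientBoostingClassifier is a stand-in for extreme gradient-boosted trees.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient-boosted trees": GradientBoostingClassifier(random_state=0),
    "neural network": MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=2000, random_state=0
    ),
}

# Stratified 5-fold cross-validation keeps the resistant/susceptible ratio
# roughly equal across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")
    results[name] = scores.mean()
    print(f"{name}: mean balanced accuracy = {scores.mean():.2f}")
```

Using the same `cv` object for every model means all four are scored on identical folds, which keeps the comparison fair. With real data, `X` and `y` would be replaced by the prepared feature matrix and resistance labels.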

MeSH terms

  • Anti-Bacterial Agents* / pharmacology
  • Computational Biology / methods
  • Drug Resistance, Bacterial / genetics
  • Drug Resistance, Microbial / genetics
  • Escherichia coli* / drug effects
  • Escherichia coli* / genetics
  • Genome, Bacterial / genetics
  • Genomics* / methods
  • Humans
  • Machine Learning*

Substances

  • Anti-Bacterial Agents

Grants and funding

PP was awarded NSF grant 1655212. FTO was supported by an NSF REPS Supplement under NSF grant 1655212. MJH was supported by Bristol-Myers Squibb Black Excellence in STEM Scholars and Genentech Foundation Scholars G-7874540. JMS was supported by NIH MARC T34-GM008574. JA was supported by NIH MS to Bridges Doctorate T32-GM142515. KR was supported by Genentech Foundation Scholars G-7874540. PG was supported by NIH MARC T34-GM008574. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.