Multi-Label Random Forest Model for Tuberculosis Drug Resistance Classification and Mutation Ranking

Samaneh Kouchaki; Yang Yang; Alexander Lachapelle; Timothy M Walker; A Sarah Walker; CRyPTIC Consortium; Timothy E A Peto; Derrick W Crook; David A Clifton

doi:10.3389/fmicb.2020.00667

Front Microbiol. 2020 Apr 22:11:667. doi: 10.3389/fmicb.2020.00667. eCollection 2020.

Authors

Samaneh Kouchaki¹, Yang Yang^{1,2}, Alexander Lachapelle¹, Timothy M Walker^{3,4,5}, A Sarah Walker^{3,4,6}; CRyPTIC Consortium; Timothy E A Peto^{3,4,6}, Derrick W Crook^{3,4,6}, David A Clifton¹

Affiliations

¹ Department of Engineering Science, Institute of Biomedical Engineering, University of Oxford, Oxford, United Kingdom.
² Oxford-Suzhou Centre for Advanced Research, Suzhou, China.
³ Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom.
⁴ National Institute of Health Research Oxford Biomedical Research Centre, John Radcliffe Hospital, Oxford, United Kingdom.
⁵ Oxford University Clinical Research Unit, Ho Chi Minh City, Vietnam.
⁶ NIHR Biomedical Research Centre, Oxford, United Kingdom.

Abstract

Resistance prediction and mutation ranking are important tasks in the analysis of tuberculosis sequence data. Because first-line antibiotics are given in standard regimens, resistance co-occurrence, in which samples are resistant to multiple drugs, is common. Analysing all drugs simultaneously should therefore enable patterns reflecting resistance co-occurrence to be exploited for resistance prediction. Here, multi-label random forest (MLRF) models are compared with single-label random forest (SLRF) models, both for predicting phenotypic resistance from whole genome sequences and for identifying the mutations most important for predicting resistance to four first-line drugs, in a dataset of 13,402 Mycobacterium tuberculosis isolates. Results confirmed that MLRFs can improve performance compared to conventional clinical methods (by 18.10%) and SLRFs (by 0.91%). In addition, we identified a list of candidate mutations that are important for resistance prediction or that are related to resistance co-occurrence. Moreover, we found that restricting our analysis to a subset of top-ranked mutations was sufficient to achieve satisfactory performance. The source code can be found at http://www.robots.ox.ac.uk/~davidc/code.php.

Keywords: MLRF; SLRF; drug resistance; mutation ranking; tuberculosis.
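
The resistance co-occurrence that motivates the multi-label approach can be illustrated with a minimal Python sketch. It is not the authors' code. The toy phenotype rows are hypothetical, the function names are invented for illustration, and the drug abbreviations assume the four first-line drugs are isoniazid, rifampicin, ethambutol and pyrazinamide, which the abstract does not name. The sketch counts how often resistance to two drugs appears in the same isolate and compares a conditional resistance rate with the marginal one; a gap between the two is the kind of signal a multi-label model could exploit and a set of independent single-label models cannot.

```python
from itertools import combinations

# Assumed drug abbreviations; the abstract only says "four first-line drugs".
DRUGS = ["INH", "RIF", "EMB", "PZA"]

# Hypothetical toy phenotypes: one row per isolate, 1 = resistant, 0 = susceptible.
LABELS = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]


def cooccurrence(labels, drugs):
    """Count isolates that are resistant to both drugs of each pair."""
    counts = {}
    for i, j in combinations(range(len(drugs)), 2):
        counts[(drugs[i], drugs[j])] = sum(
            1 for row in labels if row[i] and row[j]
        )
    return counts


def marginal_rate(labels, target):
    """Fraction of all isolates resistant to the target drug."""
    return sum(row[target] for row in labels) / len(labels)


def conditional_rate(labels, given, target):
    """Fraction of isolates resistant to `target` among those resistant to `given`."""
    given_rows = [row for row in labels if row[given]]
    if not given_rows:
        return None  # undefined when no isolate is resistant to `given`
    return sum(row[target] for row in given_rows) / len(given_rows)


inh, rif = DRUGS.index("INH"), DRUGS.index("RIF")
pair_counts = cooccurrence(LABELS, DRUGS)
rif_marginal = marginal_rate(LABELS, rif)
rif_given_inh = conditional_rate(LABELS, inh, rif)

print(pair_counts)
print(f"P(RIF) = {rif_marginal:.2f}, P(RIF | INH) = {rif_given_inh:.2f}")
```

In this toy data the conditional rate exceeds the marginal rate, so knowing the isoniazid label carries information about the rifampicin label. Real data would of course need the actual phenotype matrix rather than these invented rows.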
