Publication

Predicting antimicrobial resistance using conserved genes

Authors

Nguyen, Marcus; Olson, Robert; Shukla, Maulik; VanOeffelen, Margo; Davis, James

Abstract

A growing number of studies are using machine learning models to accurately predict antimicrobial resistance (AMR) phenotypes from bacterial sequence data. Although these studies are showing promise, the models are typically trained using features derived from comprehensive sets of AMR genes or whole genome sequences and may not be suitable for use when genomes are incomplete. In this study, we explore the possibility of predicting AMR phenotypes using incomplete genome sequence data. Models were built from small sets of randomly selected core genes after removing the AMR genes. For Klebsiella pneumoniae, Mycobacterium tuberculosis, Salmonella enterica, and Staphylococcus aureus, we report that it is possible to classify susceptible and resistant phenotypes with average F1 scores ranging from 0.80-0.89 with as few as 100 conserved non-AMR genes, with very major error rates ranging from 0.11-0.23 and major error rates ranging from 0.10-0.20. Models built from core genes have predictive power in cases where the primary AMR mechanisms result from SNPs or horizontal gene transfer.

By randomly sampling non-overlapping sets of core genes, we show that F1 scores and error rates are stable and have little variance between replicates. Although these small core gene models have lower accuracies and higher error rates than models built from the corresponding assembled genomes, the results suggest that sufficient variation exists in the core non-AMR genes of a species for predicting AMR phenotypes.

Author summary

Machine learning models for predicting AMR phenotypes from sequence data are often built using features derived from well-studied sets of AMR genes, or from whole genome sequences. In this study, we build models using core genes that are held in common among the members of a species and that are not known to confer antimicrobial resistance based on their annotations. We find that there is sufficient variation in these core conserved genes to produce models with accuracies greater than or equal to 80% in four species, using as few as 100 genes. However, we note that these models are less accurate than models built from whole genomes or lists of AMR genes. The results of this study suggest that variations relating to, or co-occurring with, AMR are extensive, and that it is possible to use conserved non-AMR genes to predict AMR phenotypes.
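
The sampling and scoring ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `split_into_gene_sets` and `error_rates`, the gene identifiers, and the `"R"`/`"S"` label encoding are all hypothetical. It assumes the conventional definitions, in which a very major error is a resistant genome predicted susceptible and a major error is a susceptible genome predicted resistant, and it computes F1 for the resistant class only, which may differ from the averaging used in the paper.

```python
import random


def split_into_gene_sets(core_genes, set_size, seed=0):
    """Shuffle the core genes and cut them into disjoint sets of `set_size`.

    Each gene lands in at most one set, so the sets are non-overlapping
    replicates. Leftover genes that do not fill a whole set are dropped.
    """
    rng = random.Random(seed)
    shuffled = list(core_genes)
    rng.shuffle(shuffled)
    n_sets = len(shuffled) // set_size
    return [shuffled[i * set_size:(i + 1) * set_size] for i in range(n_sets)]


def error_rates(y_true, y_pred):
    """Score predictions given as "R" (resistant) or "S" (susceptible).

    Assumed definitions (not taken from the abstract itself):
      very major error rate = resistant predicted susceptible / all resistant
      major error rate      = susceptible predicted resistant / all susceptible
      F1                    = computed for the resistant class
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == "R" and p == "R")
    fn = sum(1 for t, p in pairs if t == "R" and p == "S")
    fp = sum(1 for t, p in pairs if t == "S" and p == "R")
    tn = sum(1 for t, p in pairs if t == "S" and p == "S")

    def ratio(num, den):
        # Avoid division by zero when a class is absent.
        return num / den if den else 0.0

    return {
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "very_major_error_rate": ratio(fn, tp + fn),
        "major_error_rate": ratio(fp, fp + tn),
    }


if __name__ == "__main__":
    # Hypothetical gene identifiers, used only to show the sampling step.
    genes = [f"gene_{i}" for i in range(1000)]
    gene_sets = split_into_gene_sets(genes, set_size=100, seed=1)
    print(len(gene_sets), "disjoint sets of", len(gene_sets[0]), "genes")

    # Toy labels, not data from the study.
    truth = ["R", "R", "R", "S", "S", "S", "S"]
    preds = ["R", "R", "S", "S", "S", "S", "R"]
    print(error_rates(truth, preds))
```

In a setup like this, one model would be trained per gene set, and the spread of the resulting scores across sets would indicate how stable the predictions are between replicates.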