Outbreak Analysis

MLST

Results for multilocus sequence typing (MLST) of each organism in each sample are available in the per-sample reports. MLST schemes are mirrored from PubMLST.

Plasmid Typing

Outbreaks of antimicrobial resistance (e.g. carbapenemases) are frequently mediated by plasmids and can span multiple bacterial species. BugSeq performs plasmid detection and typing on each sample. Details on the detected plasmids are available in the per-sample and summary reports.

BugSeq uses MOB-Cluster IDs for plasmid identification and naming. MOB-Clusters are similar to unique taxonomic identifiers (e.g. species) for plasmids, and are stable over time. Plasmids that are not in our plasmid database and appear to be novel are given the identifier novel_{MD5 hash}. Further detail on MOB-Clusters is available in the respective publication.
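As a rough illustration of the novel_{MD5 hash} naming pattern, the sketch below builds such an identifier in Python. The text does not say what input is hashed; hashing the upper-cased nucleotide sequence is purely an assumption here, and the function name novel_plasmid_id is hypothetical, so identifiers produced this way should not be expected to match BugSeq's.

```python
import hashlib


def novel_plasmid_id(sequence):
    """Build a novel_{MD5 hash} style identifier.

    ASSUMPTION: the hash is taken over the upper-cased plasmid sequence.
    The documentation does not specify the hashed input.
    """
    normalized = sequence.strip().upper()
    digest = hashlib.md5(normalized.encode("ascii")).hexdigest()
    return f"novel_{digest}"


example_id = novel_plasmid_id("acgtacgtacgt")
```

Under this assumption, the same sequence always yields the same identifier, which is the property that makes a hash-based name usable across analyses.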

Fine-Grained Outbreak Analysis

BugSeq identifies all bacterial genomes in the submitted sequencing data and performs refMLST (preprint), a publication-pending method to calculate allele distances between samples without a core genome multilocus sequence typing (cgMLST) scheme. Extensive validation demonstrates that refMLST generates reliable allele count distances that are highly concordant with traditional cgMLST/SNP approaches, yet can be applied to any bacterial species.

BugSeq first collects all genomes in a submission (i.e. across samples) and divides them into bins, one per species. If your lab is on a BugSeq subscription, the genomes of each species from each sample are saved across analyses and included in the current and any future analyses.
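The binning step described above can be pictured with a small Python sketch. The records and the function name bin_by_species are hypothetical and only illustrate grouping genomes by species across samples; this is not BugSeq's implementation.

```python
from collections import defaultdict

# Hypothetical records: (sample name, species detected in that sample).
detected = [
    ("Sample 1", "Salmonella enterica"),
    ("Sample 2", "Salmonella enterica"),
    ("Sample 2", "Escherichia coli"),
    ("Sample 3", "Salmonella enterica"),
]


def bin_by_species(records):
    """Group sample names by species, preserving input order."""
    bins = defaultdict(list)
    for sample, species in records:
        bins[species].append(sample)
    return dict(bins)


bins = bin_by_species(detected)
```

Each bin is then analysed on its own, so a sample containing two species contributes one genome to each of two bins.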

Note

Outbreak analysis is not currently available for nanopore R9.4.1 or R10.3 sequencing platforms.

Can BugSeq include our background genomic sequences for epidemiologic comparison?

Yes. For labs on a subscription, please get in touch with us to have your custom genomes added to the background database.

The BugSeq Outbreak Investigation module outputs are discussed in the following sections.

Distance Matrix of Inter-Genome Allele Distances

For each species, BugSeq outputs a matrix of pairwise allele distances between genomes. For example, if three samples containing Salmonella enterica were submitted to BugSeq, a file named Salmonella_enterica.xlsx would be produced with a sheet named Distance Matrix.

	GCF_000006945.2¹	Sample 1	Sample 2	Sample 3
GCF_000006945.2¹	0	3405	3652	3503
Sample 1	3405	0	4	12
Sample 2	3652	4	0	25
Sample 3	3503	12	25	0

¹ GCF_000006945.2: Salmonella enterica reference genome

In this example, Sample 1 is 3405 alleles different from the reference profile, 4 alleles different from Sample 2, and 12 alleles different from Sample 3. If a subsequent BugSeq analysis included Sample 4, this sample would be added to the table in the output of that analysis.
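To make reading the matrix concrete, the following Python sketch copies the example values above into memory and looks up distances. The helper nearest_neighbour is a hypothetical name introduced for illustration; the values are hard-coded so that the example does not depend on a real output file.

```python
# Allele distances copied from the example Distance Matrix sheet above.
labels = ["GCF_000006945.2", "Sample 1", "Sample 2", "Sample 3"]
rows = [
    [0, 3405, 3652, 3503],
    [3405, 0, 4, 12],
    [3652, 4, 0, 25],
    [3503, 12, 25, 0],
]

# Nested dict so that distance[a][b] reads like a cell lookup.
distance = {a: dict(zip(labels, row)) for a, row in zip(labels, rows)}


def nearest_neighbour(name):
    """Return (other_genome, allele_distance) for the closest other genome."""
    others = {k: v for k, v in distance[name].items() if k != name}
    best = min(others, key=others.get)
    return best, others[best]


closest_to_sample_3 = nearest_neighbour("Sample 3")
```

The matrix is symmetric, so distance["Sample 1"]["Sample 2"] and distance["Sample 2"]["Sample 1"] both give 4.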

Cluster Addresses

Cluster addresses are generated for each genome within a species and serve as a method to quickly determine the relatedness of multiple genomes. Cluster addresses follow a nomenclature reflecting the number of allele differences between genomes, and contain seven digits; each digit reflects a different allele distance threshold (1000, 200, 100, 50, 20, 10 and 5 allele differences).

If two genomes are identical or within 5 allele differences of each other, they will share the same cluster address (i.e. all seven digits). If two genomes are between 6 and 10 alleles different from each other, they will share the first six digits. If two genomes are between 11 and 20 alleles different from each other, they will share the first five digits. This pattern continues for all distance thresholds.
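The threshold rule above can be sketched in Python as follows. The function names are hypothetical, and treating every threshold as inclusive (a distance of exactly 10 still shares six digits) is an assumption extrapolated from the "within 5", "6 to 10" and "11 to 20" ranges given in the text.

```python
# Allele distance thresholds, one per address digit, from left to right.
THRESHOLDS = (1000, 200, 100, 50, 20, 10, 5)


def shared_digits(allele_distance):
    """Number of leading address digits implied by an allele distance.

    ASSUMPTION: thresholds are inclusive upper bounds.
    """
    count = 0
    for threshold in THRESHOLDS:
        if allele_distance > threshold:
            break
        count += 1
    return count


def common_prefix_length(address_a, address_b):
    """Number of leading digits two cluster addresses have in common."""
    length = 0
    for a, b in zip(address_a.split("."), address_b.split(".")):
        if a != b:
            break
        length += 1
    return length
```

For instance, shared_digits(4) gives 7 and shared_digits(12) gives 5. Because addresses are assigned relative to the nearest genome (see "How are clusters identified?"), two more distant genomes can share more digits than their pairwise distance alone suggests: in the example sheets on this page, Sample 2 and Sample 3 are 25 alleles apart yet share five digits.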

For example, if the three Salmonella enterica genomes from the above distance matrix were sequenced, cluster addresses would be located in a file named Salmonella_enterica.xlsx, in the sheet Cluster Addresses:

	Date	Cluster Address	Cluster - 1000 alleles	Cluster - 200 alleles	Cluster - 100 alleles	Cluster - 50 alleles	Cluster - 20 alleles	Cluster - 10 alleles	Cluster - 5 alleles
GCF_000006945.2	Feb 3, 2022	1.1.1.1.1.1.1	1	1	1	1	1	1	1
Sample 1	Feb 3, 2022	2.1.1.1.1.1.1	2	1	1	1	1	1	1
Sample 2	Feb 3, 2022	2.1.1.1.1.1.1	2	1	1	1	1	1	1
Sample 3	Feb 3, 2022	2.1.1.1.1.2.1	2	1	1	1	1	2	1

How are clusters identified?

When a genome is added to the BugSeq database, a search is performed to find the nearest genome of the same species already in the database. Cluster addresses are calculated relative to this genome. For example, if the new genome is within 5 alleles of the nearest existing genome, it is given the same cluster address. If the new genome is farther away than this, a new cluster address is designated based on the distance to this genome and the existence of other subclusters. If the new genome is more than 10 but no more than 20 alleles different from the nearest genome, as seen with Sample 3 in the above example, a new subcluster forms starting at the sixth digit. Note that if the 2.1.1.1.1.2.1 cluster already existed in the above example, the new subcluster would be designated 2.1.1.1.1.3.1.
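The assignment procedure described above can be approximated with the following Python sketch. It is a simplified illustration, not BugSeq's implementation: the function names are hypothetical, thresholds are assumed to be inclusive, and a new subcluster is assumed to take the next unused number at the first differing digit, with all later digits set to 1.

```python
THRESHOLDS = (1000, 200, 100, 50, 20, 10, 5)


def shared_digits(allele_distance):
    """Leading digits kept from the nearest genome (inclusive thresholds assumed)."""
    return sum(1 for threshold in THRESHOLDS if allele_distance <= threshold)


def assign_address(existing, nearest_address=None, allele_distance=None):
    """Sketch of assigning a cluster address relative to the nearest genome.

    existing: addresses already in the database for this species.
    """
    if nearest_address is None:
        # First genome of the species: every digit starts at 1.
        return ".".join(["1"] * len(THRESHOLDS))
    keep = shared_digits(allele_distance)
    if keep == len(THRESHOLDS):
        return nearest_address  # within 5 alleles: same address
    prefix = nearest_address.split(".")[:keep]
    # Numbers already used at the first differing digit under this prefix.
    used = {
        int(address.split(".")[keep])
        for address in list(existing) + [nearest_address]
        if address.split(".")[:keep] == prefix
    }
    tail = ["1"] * (len(THRESHOLDS) - keep - 1)
    return ".".join(prefix + [str(max(used) + 1)] + tail)


database = ["1.1.1.1.1.1.1", "2.1.1.1.1.1.1"]
sample_3 = assign_address(database, "2.1.1.1.1.1.1", 12)
```

With the example data, a genome 12 alleles from Sample 1 (2.1.1.1.1.1.1) receives 2.1.1.1.1.2.1, and if that address were already taken it would receive 2.1.1.1.1.3.1, matching the behaviour described in the text.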

Last update: March 21, 2024