Outbreak Analysis

MLST

Results for multilocus sequence typing (MLST) of each organism in each sample are available in the summary and per-sample reports. MLST schemes are mirrored from PubMLST.

BugSeq's MLST analysis looks for both complete and partial ungapped alignments to known full-length alleles. If a single exact match is not found, additional information is provided for the nearest allele(s):

| Symbol | Meaning | Length | Identity |
| --- | --- | --- | --- |
| n | Exact intact allele | 100% | 100% |
| ~n | Novel full-length allele similar to n | 100% | ≥95% |
| n? | Partial match to known allele | ≥10% | ≥95% |
| - | Allele missing | <10% | <95% |
| n & m | Multiple alleles | | |

A ? symbol appears, for example, when the alignment is shorter than the reference allele. This occurs when an indel is present, in which case the indicated allele may be incorrect.
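The thresholds in the table can be read as a small decision rule. The sketch below is illustrative only and is not BugSeq's implementation: the function name `mlst_symbol` and its parameters are hypothetical, it assumes that a locus is reported as missing when either the length or the identity threshold is not met, and it does not cover the "n & m" (multiple alleles) case.

```python
def mlst_symbol(allele: str, length_pct: float, identity_pct: float) -> str:
    """Map one alignment to a report symbol, following the table above.

    allele       -- identifier of the nearest known allele (hypothetical input)
    length_pct   -- alignment length as a percentage of the reference allele
    identity_pct -- percent identity of the alignment
    """
    if length_pct == 100 and identity_pct == 100:
        return allele              # n: exact intact allele
    if length_pct == 100 and identity_pct >= 95:
        return f"~{allele}"        # ~n: novel full-length allele similar to n
    if length_pct >= 10 and identity_pct >= 95:
        return f"{allele}?"        # n?: partial match to a known allele
    return "-"                     # assumed: anything below threshold is "missing"


print(mlst_symbol("5", 100, 100))   # 5
print(mlst_symbol("5", 100, 97.2))  # ~5
print(mlst_symbol("5", 60, 99.0))   # 5?
print(mlst_symbol("5", 5, 99.0))    # -
```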

Plasmid Typing

Outbreaks of antimicrobial resistance (e.g. carbapenemases) are frequently mediated by plasmids and can span multiple bacterial species. BugSeq performs plasmid detection and typing on each sample. Details on the detected plasmids are available in the per-sample and summary reports.

BugSeq uses MOB-Cluster IDs for plasmid identification and naming. MOB-Clusters are similar to unique taxonomic identifiers (e.g. species) for plasmids, and are stable over time. Plasmids that are detected but are not present in our plasmid database are treated as novel and given the identifier novel_{MD5 hash}. Further detail on MOB-Clusters is available in the respective publication.
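To illustrate the shape of a novel_{MD5 hash} identifier, here is a minimal sketch. The documentation does not state what input is hashed, so hashing the uppercased plasmid nucleotide sequence is purely an assumption, and `novel_plasmid_id` is a hypothetical helper name.

```python
import hashlib


def novel_plasmid_id(sequence: str) -> str:
    """Build an identifier of the form novel_<md5 hex digest>.

    Assumption: the digest is taken over the uppercased nucleotide
    sequence. The real input to the hash is not documented here.
    """
    normalized = sequence.strip().upper()
    digest = hashlib.md5(normalized.encode("ascii")).hexdigest()
    return f"novel_{digest}"


plasmid_id = novel_plasmid_id("acgtacgtttgaccatg")
print(plasmid_id)  # "novel_" followed by 32 hexadecimal characters
```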

High-Resolution Outbreak Analysis

BugSeq identifies all bacterial genomes in the submitted sequencing data and performs refMLST (link to publication), a method that calculates allele distances between samples without the need for a core genome multilocus sequence typing (cgMLST) scheme. Extensive validation demonstrates that refMLST generates reliable allele count distances that are highly concordant with traditional cgMLST/SNP approaches, yet can be applied to any bacterial species.
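As a reminder of what an allele distance is, the sketch below counts the loci at which two allele profiles differ. It is not refMLST itself, whose procedure is described in its publication; the function name, the locus names, and the choice to ignore loci missing from either profile are all assumptions made for illustration.

```python
def allele_distance(profile_a: dict, profile_b: dict) -> int:
    """Count loci called in both profiles whose alleles differ.

    Each profile maps a locus name to an allele identifier.
    Assumption: loci absent from either profile are ignored.
    """
    shared_loci = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in shared_loci if profile_a[locus] != profile_b[locus])


sample_1 = {"locus_1": 3, "locus_2": 7, "locus_3": 1, "locus_4": 12}
sample_2 = {"locus_1": 3, "locus_2": 9, "locus_3": 1}

print(allele_distance(sample_1, sample_2))  # 1 (only locus_2 differs)
```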

BugSeq first collects all genomes in a submission (i.e. across samples) and divides them into bins for each species. If your lab is on a BugSeq subscription, each species from each sample will be saved across analyses and included in the most recent and any future analyses.
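Conceptually, the binning step groups genomes from all samples by species. A minimal sketch, assuming each genome is represented as a (sample, species) pair; the function name and data layout are hypothetical:

```python
from collections import defaultdict


def bin_by_species(genomes):
    """Group (sample, species) pairs into a dict of species -> list of samples."""
    bins = defaultdict(list)
    for sample, species in genomes:
        bins[species].append(sample)
    return dict(bins)


submission = [
    ("Sample 1", "Salmonella enterica"),
    ("Sample 2", "Salmonella enterica"),
    ("Sample 2", "Escherichia coli"),
    ("Sample 3", "Salmonella enterica"),
]
bins = bin_by_species(submission)
print(bins["Salmonella enterica"])  # ['Sample 1', 'Sample 2', 'Sample 3']
```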

Note

Outbreak analysis is not currently available for nanopore R9.4.1 or R10.3 sequencing platforms.

Note

Outbreak analysis is only performed on high-quality assemblies with at least 90% completeness and less than 5% contamination. Genomes that do not meet these criteria are not passed to high-resolution outbreak analysis.
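The quality criteria above amount to a simple filter. The following sketch shows the comparison only; the `Assembly` class and its field names are hypothetical, and how completeness and contamination are estimated is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Assembly:
    sample: str
    completeness: float   # percent
    contamination: float  # percent


def passes_outbreak_qc(assembly: Assembly) -> bool:
    """At least 90% completeness and less than 5% contamination."""
    return assembly.completeness >= 90.0 and assembly.contamination < 5.0


assemblies = [
    Assembly("Sample 1", completeness=99.1, contamination=0.4),
    Assembly("Sample 2", completeness=88.0, contamination=0.2),
    Assembly("Sample 3", completeness=97.5, contamination=5.0),
]
eligible = [a.sample for a in assemblies if passes_outbreak_qc(a)]
print(eligible)  # ['Sample 1']
```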

Can BugSeq include our background genomic sequences for epidemiologic comparison?

Yes. For labs on a subscription, please get in touch with us to have your custom genomes added to the background database.

The BugSeq Outbreak Investigation module outputs are discussed in the following sections.

Distance Matrix of Inter-Genome Allele Distances

For each species, BugSeq reports the pairwise allele distances between genomes. For example, if three samples containing Salmonella enterica were submitted to BugSeq, a file named Salmonella_enterica.xlsx would be produced with a sheet named Distance Matrix.

| | GCF_000006945.2 [1] | Sample 1 | Sample 2 | Sample 3 |
| --- | --- | --- | --- | --- |
| GCF_000006945.2 [1] | 0 | 3405 | 3652 | 3503 |
| Sample 1 | 3405 | 0 | 4 | 12 |
| Sample 2 | 3652 | 4 | 0 | 25 |
| Sample 3 | 3503 | 12 | 25 | 0 |

[1] GCF_000006945.2: Salmonella enterica reference genome

In this example, Sample 1 is 3405 alleles different from the reference profile, 4 alleles different from Sample 2, and 12 alleles different from Sample 3. If a subsequent BugSeq analysis included Sample 4, this sample would be added to the table in the output of that analysis.
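Once the matrix is loaded, closely related pairs can be listed by applying a threshold. The sketch below hard-codes the example values instead of reading the spreadsheet; the helper name `pairs_within` and the threshold of 20 alleles are illustrative choices, not a recommended outbreak cutoff.

```python
labels = ["GCF_000006945.2", "Sample 1", "Sample 2", "Sample 3"]
rows = [
    [0, 3405, 3652, 3503],
    [3405, 0, 4, 12],
    [3652, 4, 0, 25],
    [3503, 12, 25, 0],
]
# Nested dict so that matrix[a][b] is the allele distance between a and b.
matrix = {name: dict(zip(labels, row)) for name, row in zip(labels, rows)}


def pairs_within(matrix, threshold):
    """Return (genome_a, genome_b, distance) for each pair at or below threshold."""
    names = list(matrix)
    return [
        (a, b, matrix[a][b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if matrix[a][b] <= threshold
    ]


print(pairs_within(matrix, 20))
# [('Sample 1', 'Sample 2', 4), ('Sample 1', 'Sample 3', 12)]
```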

Cluster Addresses

Cluster addresses are generated for each genome within a species and serve as a method to quickly determine the relatedness of multiple genomes. Cluster addresses follow a nomenclature reflecting the number of allele differences between genomes, and contain seven digits; each digit reflects a different allele distance threshold (1000, 200, 100, 50, 20, 10 and 5 allele differences).

If two genomes are identical or within 5 allele differences of each other, they will share the same cluster address (i.e. all seven digits). If two genomes are between 6 and 10 alleles different from each other, they will share the first six digits. If two genomes are between 11 and 20 alleles different from each other, they will share the first five digits. This pattern continues for all distance thresholds.
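The relationship between an allele distance and the number of shared leading digits can be sketched as follows. The threshold list comes from the text above; the function names are hypothetical.

```python
# Allele distance thresholds, one per digit of the cluster address.
THRESHOLDS = (1000, 200, 100, 50, 20, 10, 5)


def shared_digits(distance: int) -> int:
    """Number of leading address digits two genomes at this distance share."""
    return sum(1 for threshold in THRESHOLDS if distance <= threshold)


def common_prefix_len(address_a: str, address_b: str) -> int:
    """Number of leading digits two cluster addresses have in common."""
    count = 0
    for digit_a, digit_b in zip(address_a.split("."), address_b.split(".")):
        if digit_a != digit_b:
            break
        count += 1
    return count


print(shared_digits(4))     # 7: same cluster address
print(shared_digits(8))     # 6
print(shared_digits(12))    # 5
print(shared_digits(3405))  # 0
print(common_prefix_len("2.1.1.1.1.1.1", "2.1.1.1.1.2.1"))  # 5
```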

For example, if the three Salmonella enterica genomes from the above distance matrix were sequenced, cluster addresses would be located in a file named Salmonella_enterica.xlsx, in the sheet Cluster Addresses:

| Genome | Date | Cluster Address | Cluster - 1000 alleles | Cluster - 200 alleles | Cluster - 100 alleles | Cluster - 50 alleles | Cluster - 20 alleles | Cluster - 10 alleles | Cluster - 5 alleles |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GCF_000006945.2 | Feb 3, 2022 | 1.1.1.1.1.1.1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Sample 1 | Feb 3, 2022 | 2.1.1.1.1.1.1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| Sample 2 | Feb 3, 2022 | 2.1.1.1.1.1.1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| Sample 3 | Feb 3, 2022 | 2.1.1.1.1.2.1 | 2 | 1 | 1 | 1 | 1 | 2 | 1 |

How are clusters identified?

As a genome is added to the BugSeq database, a search is performed to find the nearest genome of the same species already in the database. Cluster addresses are calculated relative to this genome. For example, if the new genome is within 5 alleles of the nearest existing genome, it will be given the same cluster address. If the new genome is farther away than this, a new cluster address will be assigned based on the distance to the nearest genome and on which subclusters already exist. If the new genome is more than 10 but no more than 20 alleles different from the nearest genome, as seen with Sample 3 in the above example, a new subcluster will form starting at the sixth digit. Note that if the 2.1.1.1.1.2.1 cluster already existed in the above example, the new subcluster would be designated 2.1.1.1.1.3.1.
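The assignment rule described above can be sketched as follows. This is an illustration consistent with the worked example, not BugSeq's actual implementation: the function `assign_address`, the tuple representation of addresses, and the rule "next unused number under the shared prefix, remaining digits set to 1" are assumptions inferred from the text.

```python
THRESHOLDS = (1000, 200, 100, 50, 20, 10, 5)


def assign_address(nearest, distance, existing):
    """Assign a cluster address to a new genome.

    nearest  -- address of the nearest existing genome, as a tuple of 7 ints
    distance -- allele distance between the new genome and that genome
    existing -- set of all addresses already in use for the species
    """
    shared = sum(1 for threshold in THRESHOLDS if distance <= threshold)
    if shared == len(THRESHOLDS):
        return nearest  # within 5 alleles: same address
    prefix = nearest[:shared]
    # Numbers already used at the first differing digit under this prefix.
    used = {addr[shared] for addr in existing if addr[:shared] == prefix}
    next_digit = max(used, default=0) + 1
    return prefix + (next_digit,) + (1,) * (len(THRESHOLDS) - shared - 1)


def fmt(address):
    return ".".join(str(digit) for digit in address)


reference = (1, 1, 1, 1, 1, 1, 1)
existing = {reference}

sample_1 = assign_address(reference, 3405, existing)
existing.add(sample_1)
sample_2 = assign_address(sample_1, 4, existing)
existing.add(sample_2)
sample_3 = assign_address(sample_1, 12, existing)
existing.add(sample_3)

print(fmt(sample_1), fmt(sample_2), fmt(sample_3))
# 2.1.1.1.1.1.1 2.1.1.1.1.1.1 2.1.1.1.1.2.1
```

With 2.1.1.1.1.2.1 now in `existing`, another genome 12 alleles from Sample 1 would receive 2.1.1.1.1.3.1 under this sketch, matching the note above.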