Allele Thresholds

General Considerations

Choosing an allele distance threshold is a critical step when using NGS for infection control, surveillance, and outbreak investigations. This threshold determines how genetically similar isolates must be to be considered part of the same transmission cluster.

Selecting an appropriate threshold requires consideration of several biological and epidemiological factors:

1. Species-Specific Diversity

Different bacterial species exhibit varying levels of genetic diversity. Clonal organisms (e.g. Mycobacterium tuberculosis) tend to have lower background allele diversity, so even a small number of allele differences may indicate unrelated isolates. Thresholds should reflect the expected rate of allele variation for the species of interest.

2. Outbreak Time Scale

The time span over which samples are collected directly impacts expected genetic distance:

  • Short outbreaks (days to weeks): Expect minimal variation. A lower allele threshold may be appropriate.
  • Prolonged outbreaks (months to years): More allele differences may accumulate, necessitating a higher threshold.

3. Transmission Mode and Clonality

  • Direct transmission events typically result in smaller allele differences.
  • Indirect or environmental transmission may allow for greater divergence before detection.

4. Sequencing Quality and Consistency

Allele distances can be influenced by technical artifacts. Lower-quality sequencing data can inflate apparent allele distances and may therefore require higher allele thresholds. See our data recommendations to ensure optimal results.

5. Epidemiological and Geographic Context

Allele distance should always be interpreted alongside epidemiological data:

  • A small allele distance between two isolates from the same patient ward within a short time span may suggest direct transmission.
  • The same distance between isolates from different regions or time periods may reflect background population structure rather than a recent transmission event.

6. Empirical Calibration

Where possible, calibrate thresholds using real-world data:

  • Analyze confirmed outbreak datasets to identify maximum allele distances observed within known transmission clusters.
  • Use public outbreak databases or institution-specific retrospective analyses to inform appropriate thresholds.
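
The first calibration step above can be sketched in a few lines of Python. This is a minimal illustration, not a BugSeq feature: the function name, the input layout (a dictionary of pairwise allele distances and a dictionary of known clusters), and the isolate IDs and distances are all hypothetical and chosen only to show the idea.

```python
from itertools import combinations


def max_within_cluster_distance(distances, clusters):
    """Return the largest pairwise allele distance inside each known cluster.

    distances: dict mapping frozenset({isolate_a, isolate_b}) -> allele distance
    clusters:  dict mapping cluster name -> list of isolate IDs
    """
    maxima = {}
    for name, members in clusters.items():
        pairwise = [distances[frozenset(pair)] for pair in combinations(members, 2)]
        # A cluster with a single isolate has no pairs; report 0 for it.
        maxima[name] = max(pairwise, default=0)
    return maxima


# Hypothetical data for illustration only (not taken from any real outbreak).
example_distances = {
    frozenset({"iso1", "iso2"}): 2,
    frozenset({"iso1", "iso3"}): 7,
    frozenset({"iso2", "iso3"}): 5,
    frozenset({"iso4", "iso5"}): 12,
}
example_clusters = {
    "outbreak_A": ["iso1", "iso2", "iso3"],
    "outbreak_B": ["iso4", "iso5"],
}

maxima = max_within_cluster_distance(example_distances, example_clusters)
# The largest value across confirmed clusters is one candidate starting point.
candidate_threshold = max(maxima.values())
```

The largest within-cluster distance is only a starting point; it should still be weighed against the other factors in this section, such as outbreak duration and sequencing quality.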

Converting Allele Thresholds from other cgMLST Schemes to refMLST Thresholds

In Khdhiri et al. (2024), we showed that refMLST and cgMLST produce highly comparable results, with the Spearman correlation of allele distances exceeding 0.98 across bacterial species.

Figure 2 from Khdhiri et al. (2024), demonstrating high correlation between cgMLST and refMLST.

We also showed that refMLST has higher allele distances than cgMLST because it examines more loci, enabling it to resolve differences between more closely related isolates.

Based on this benchmarking data and experience from BugSeq users, we recommend the following formula to convert thresholds from external cgMLST schemes to BugSeq refMLST thresholds:

\[ \text{refMLST threshold} = \text{cgMLST threshold} \times \frac{\text{refMLST total loci}}{\text{cgMLST total loci}} \]

We then recommend rounding up to the nearest refMLST address threshold so that addresses can be used for cluster naming. Address thresholds are described in the Cluster Addresses section and are 5, 10, 20, 50, 100, 200 and 1000 alleles.

The final suggested threshold is therefore:

\[ \text{refMLST threshold} = \left\lceil \text{cgMLST threshold} \times \frac{\text{refMLST total loci}}{\text{cgMLST total loci}} \right\rceil_{\text{nearest address breakpoint}} \]

For example, to use cluster addresses with a 20-allele threshold, use only the first five fields of the address and ignore the last two.

refMLST Scheme Sizes

The number of loci for each species is listed in a spreadsheet found here (note: you must be logged in to access it).

Example

The PulseNet program recommends 10 cgMLST alleles as a threshold for Salmonella enterica. In the above spreadsheet, we see that BugSeq examines 4112 loci for S. enterica; the cgMLST scheme size used here is 3002 loci. Filling in the above formula, we find:

\[ \text{refMLST threshold} = \left\lceil 10 \times \frac{4112}{3002} \right\rceil_{\text{nearest address breakpoint}} \]
\[ \text{refMLST threshold} = \left\lceil 13.7 \right\rceil_{\text{nearest address breakpoint}} \]
\[ \text{refMLST threshold} = 20\text{ alleles} \]

Indeed, in Khdhiri et al., we find the adjusted Rand index of clusters to be 0.92 (very high cluster overlap) between refMLST with a 20-allele threshold and cgMLST with a 10-allele threshold for S. enterica.
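
The conversion worked through above can be sketched in Python. This is an illustrative sketch rather than BugSeq's implementation, and the function names are hypothetical. One step is an assumption: the scaled value is rounded to the nearest whole allele before the next address breakpoint is chosen. That matches the whole-number intermediate values shown in the table of published schemes on this page, but a strict ceiling of the raw product could select a higher breakpoint when the product falls just above a breakpoint.

```python
import math

# Address breakpoints listed on this page, in ascending order.
ADDRESS_BREAKPOINTS = (5, 10, 20, 50, 100, 200, 1000)


def scale_threshold(cgmlst_threshold, refmlst_loci, cgmlst_loci):
    """Scale a cgMLST threshold by the ratio of scheme sizes (unrounded)."""
    if cgmlst_loci <= 0:
        raise ValueError("cgmlst_loci must be positive")
    return cgmlst_threshold * refmlst_loci / cgmlst_loci


def round_up_to_breakpoint(value, breakpoints=ADDRESS_BREAKPOINTS):
    """Return the smallest address breakpoint that is >= value."""
    for breakpoint_value in sorted(breakpoints):
        if breakpoint_value >= value:
            return breakpoint_value
    raise ValueError(f"{value} exceeds the largest address breakpoint")


def convert_threshold(cgmlst_threshold, refmlst_loci, cgmlst_loci):
    """Convert a cgMLST threshold to a refMLST address breakpoint."""
    scaled = scale_threshold(cgmlst_threshold, refmlst_loci, cgmlst_loci)
    # Assumption: round half up to a whole number of alleles first.
    whole_alleles = math.floor(scaled + 0.5)
    return round_up_to_breakpoint(whole_alleles)


# S. enterica values from the example above: threshold 10, 4112 and 3002 loci.
salmonella_threshold = convert_threshold(10, 4112, 3002)
```

With the S. enterica inputs, the scaled value is about 13.7, which rounds to 14 alleles and then up to the 20-allele breakpoint.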

Given four S. enterica samples with the addresses below, each address can be truncated to obtain a cluster code at the desired allele threshold. Using the truncated addresses, the first three samples fall in the same cluster at the 20-allele threshold, while the fourth sample belongs to a separate cluster at this threshold.

| Sample | Full Cluster Address | Truncated Cluster Address to Support 20 Allele Threshold |
| --- | --- | --- |
| Sample 1 | 1.1.1.1.1.1.1 | 1.1.1.1.1 |
| Sample 2 | 1.1.1.1.1.1.2 | 1.1.1.1.1 |
| Sample 3 | 1.1.1.1.1.2.1 | 1.1.1.1.1 |
| Sample 4 | 1.1.1.1.2.1.1 | 1.1.1.1.2 |
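
The truncation shown in the table can be sketched as follows. The function name is hypothetical, and the ordering of address levels is an assumption: it is inferred from the statement that the first five fields correspond to the 20-allele threshold, which implies the fields run from the loosest breakpoint (1000) to the tightest (5). Consult the Cluster Addresses section for the authoritative definition.

```python
from collections import defaultdict

# Assumed order of address fields, from loosest to tightest breakpoint.
ADDRESS_LEVELS = (1000, 200, 100, 50, 20, 10, 5)


def truncate_address(address, threshold, levels=ADDRESS_LEVELS):
    """Keep only the address fields down to the requested breakpoint."""
    if threshold not in levels:
        raise ValueError(f"{threshold} is not an address breakpoint")
    fields = address.split(".")
    if len(fields) != len(levels):
        raise ValueError(f"expected {len(levels)} fields, got {len(fields)}")
    keep = levels.index(threshold) + 1
    return ".".join(fields[:keep])


# The four samples from the table above.
addresses = {
    "Sample 1": "1.1.1.1.1.1.1",
    "Sample 2": "1.1.1.1.1.1.2",
    "Sample 3": "1.1.1.1.1.2.1",
    "Sample 4": "1.1.1.1.2.1.1",
}

# Group samples that share a truncated address at the 20-allele threshold.
clusters_at_20 = defaultdict(list)
for sample, address in addresses.items():
    clusters_at_20[truncate_address(address, 20)].append(sample)
```

Here `clusters_at_20` groups Samples 1 to 3 under `1.1.1.1.1` and places Sample 4 alone under `1.1.1.1.2`, matching the table.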

Published Schemes and Thresholds Converted

| Species | Publication | Published cgMLST Threshold | BugSeq refMLST Threshold | BugSeq refMLST Threshold (Rounded Up to Nearest Address Breakpoint) | cgMLST Scheme Size | refMLST Scheme Size | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Acinetobacter baumannii | https://doi.org/10.1016/j.ijantimicag.2021.106404 | 10 | 14 | 20 | 2390 | 3372 | |
| Clostridioides difficile | https://doi.org/10.3389/fcimb.2023.1109153 | 6 | 8 | 10 | 2469 | 3294 | |
| Clostridioides difficile | https://doi.org/10.1128/jcm.01987-17 | 6 | 9 | 10 | 2270 | 3294 | |
| Enterococcus faecalis | https://doi.org/10.1128/jcm.01686-18 | 7 | 9 | 10 | 1972 | 2502 | |
| Enterococcus faecium | https://doi.org/10.1128/jcm.01946-15 | 20 | 32 | 50 | 1423 | 2271 | |
| Escherichia coli | https://doi.org/10.1099/mgen.0.001126 | 10 | 16 | 20 | 2513 | 3907 | |
| Klebsiella pneumoniae | https://doi.org/10.3389/fmicb.2017.00371 | 10 | 41 | 50 | 1143 | 4633 | |
| Klebsiella pneumoniae | https://doi.org/10.1099/mgen.0.000347 | 4 | 8 | 10 | 2358 | 4633 | |
| Listeria monocytogenes | https://doi.org/10.1186/s12864-022-08437-4 | 7 | 11 | 20 | 1748 | 2770 | |
| Mycobacterium tuberculosis | https://doi.org/10.1128/jcm.00567-14 | 12 | 14 | 20 | 3257 | 3735 | |
| Salmonella enterica | https://doi.org/10.3389/fmicb.2021.649517 | 10 | 14 | 20 | 3000 | 4112 | |
| Staphylococcus aureus | https://doi.org/10.1128/jcm.00029-17 | 8 | 10 | 10 | 1861 | 2346 | “Related” threshold used. “Possibly related” would lead to a 50 refMLST allele threshold. |