Allele Thresholds
General Considerations
Choosing an allele distance threshold is a critical step when using NGS for infection control, surveillance, and outbreak investigations. This threshold determines how genetically similar isolates must be to be considered part of the same transmission cluster.
Selecting an appropriate threshold requires consideration of several biological and epidemiological factors:
1. Species-Specific Diversity
Different bacterial species exhibit varying levels of genetic diversity. Clonal organisms (e.g. Mycobacterium tuberculosis) tend to have lower background allele diversity, so even a small number of allele differences may indicate unrelated isolates. Thresholds should reflect the expected rate of allele variation for the species of interest.
2. Outbreak Time Scale
The time span over which samples are collected directly affects the expected genetic distance:
- Short outbreaks (days to weeks): Expect minimal variation. A lower allele threshold may be appropriate.
- Prolonged outbreaks (months to years): More allele differences may accumulate, necessitating a higher threshold.
3. Transmission Mode and Clonality
- Direct transmission events typically result in smaller allele differences.
- Indirect or environmental transmission may allow for greater divergence before detection.
4. Sequencing Quality and Consistency
Allele distances can be influenced by technical artifacts. Lower-quality sequencing data may require higher allele thresholds. See our data recommendations to ensure optimal results.
5. Epidemiological and Geographic Context
Allele distance should always be interpreted alongside epidemiological data:
- A small allele distance between two isolates from the same patient ward within a short time span may suggest direct transmission.
- The same distance between isolates from different regions or time periods may reflect background population structure rather than a recent transmission event.
6. Empirical Calibration
Where possible, calibrate thresholds using real-world data:
- Analyze confirmed outbreak datasets to identify maximum allele distances observed within known transmission clusters.
- Use public outbreak databases or institution-specific retrospective analyses to inform appropriate thresholds.
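The first calibration step above can be sketched in Python. This is a minimal illustration, not a BugSeq tool: the function name, the isolate IDs, the distances, and the data layout (a dictionary keyed by unordered isolate pairs) are all hypothetical.

```python
from itertools import combinations


def max_within_cluster_distance(distances, clusters):
    """Return the largest pairwise allele distance inside each known cluster.

    distances: dict mapping frozenset({isolate_a, isolate_b}) -> allele distance
    clusters:  dict mapping cluster name -> list of isolate IDs
    """
    result = {}
    for name, isolates in clusters.items():
        pair_distances = [
            distances[frozenset(pair)] for pair in combinations(isolates, 2)
        ]
        # A cluster with a single isolate has no pairs, so report None for it.
        result[name] = max(pair_distances) if pair_distances else None
    return result


# Hypothetical pairwise allele distances from two confirmed outbreaks.
distances = {
    frozenset({"A", "B"}): 3,
    frozenset({"A", "C"}): 7,
    frozenset({"B", "C"}): 5,
    frozenset({"D", "E"}): 2,
}
clusters = {"outbreak_1": ["A", "B", "C"], "outbreak_2": ["D", "E"]}

per_cluster = max_within_cluster_distance(distances, clusters)
```

The largest value across confirmed clusters gives a data-driven starting point for a threshold, which should still be weighed against the other factors listed above.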
Converting Allele Thresholds from other cgMLST Schemes to refMLST Thresholds
In Khdhiri et al., we showed that refMLST and cgMLST produce highly comparable results, with the Spearman correlation of allele distances exceeding 0.98 across bacterial species.
We also showed that refMLST yields higher allele distances than cgMLST because it examines more loci, enabling it to resolve differences between more closely related isolates.
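Based on this benchmarking data and experience from BugSeq users, we recommend the following formula to convert thresholds from external cgMLST schemes to BugSeq refMLST thresholds:

$$T_{\text{refMLST}} = T_{\text{cgMLST}} \times \frac{N_{\text{refMLST}}}{N_{\text{cgMLST}}}$$

where $T$ is the allele threshold and $N$ is the number of loci in the respective scheme. The converted values in the table of published schemes below are rounded to the nearest whole allele.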
We then recommend rounding up to the nearest refMLST address threshold so that addresses can be used for cluster naming. Address thresholds are described in the Cluster Addresses section and are 5, 10, 20, 50, 100, 200 and 1000 alleles.
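The final suggested threshold is therefore:

$$T_{\text{final}} = \min \{\, a \in \{5, 10, 20, 50, 100, 200, 1000\} : a \ge T_{\text{refMLST}} \,\}$$

where $T_{\text{refMLST}}$ is the converted refMLST threshold, rounded to the nearest whole allele.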
To use cluster addresses with a 20 allele threshold, use only the first five positions of the address and ignore the last two.
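The conversion and rounding steps described above can be sketched in Python. The scaling rule used here (multiply the published cgMLST threshold by the ratio of refMLST to cgMLST scheme sizes, then round to the nearest whole allele) is an assumption inferred from the values in the table of published schemes below. The function names are hypothetical and are not part of any BugSeq API.

```python
import bisect

# Address thresholds (in alleles) listed in this section.
ADDRESS_THRESHOLDS = [5, 10, 20, 50, 100, 200, 1000]


def convert_threshold(cg_threshold, cg_scheme_size, ref_scheme_size):
    """Scale a cgMLST threshold by the ratio of scheme sizes (assumed rule)."""
    scaled = cg_threshold * ref_scheme_size / cg_scheme_size
    # Round half up to the nearest whole allele; this rounding step is an
    # assumption that matches the converted values in the published table.
    return int(scaled + 0.5)


def round_up_to_address_threshold(threshold):
    """Return the smallest address threshold that is >= the given threshold."""
    idx = bisect.bisect_left(ADDRESS_THRESHOLDS, threshold)
    if idx == len(ADDRESS_THRESHOLDS):
        raise ValueError("threshold exceeds the largest address threshold")
    return ADDRESS_THRESHOLDS[idx]


# Salmonella enterica row: 10 cgMLST alleles, 3000 cgMLST loci, 4112 refMLST loci.
converted = convert_threshold(10, 3000, 4112)
final = round_up_to_address_threshold(converted)
```

Rounding to a whole allele before rounding up matters in this sketch: the Staphylococcus aureus row scales to just over 10, and only stays at the 10 allele address threshold because the value is first rounded to 10.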
refMLST Scheme Sizes
The number of loci for each species is found here (note: you must be logged in to access it).
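Example
The PulseNet program recommends 10 cgMLST alleles as a threshold for Salmonella enterica. In the above spreadsheet, we see that BugSeq examines 4112 loci for S. enterica. Filling in the above formula with the cgMLST scheme size of 3000 loci listed for S. enterica in the table of published schemes below, we find:

$$10 \times \frac{4112}{3000} \approx 13.7$$

This rounds to 14 alleles, and rounding up to the nearest address threshold gives 20 alleles.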
Indeed, in Khdhiri et al., we find the adjusted Rand index of clusters to be 0.92 (very high cluster overlap) between refMLST with a 20 allele threshold and cgMLST with a 10 allele threshold for S. enterica.
Given four samples of S. enterica with the addresses below, the addresses can be truncated to obtain cluster codes at the desired allele threshold. Using the truncated addresses, the first three samples fall in the same cluster at the 20 allele threshold, while the fourth sample belongs to a separate cluster at this threshold.
Sample | Full Cluster Address | Truncated Cluster Address to Support 20 Allele Threshold |
---|---|---|
Sample 1 | 1.1.1.1.1.1.1 | 1.1.1.1.1 |
Sample 2 | 1.1.1.1.1.1.2 | 1.1.1.1.1 |
Sample 3 | 1.1.1.1.1.2.1 | 1.1.1.1.1 |
Sample 4 | 1.1.1.1.2.1.1 | 1.1.1.1.2 |
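The truncation shown in the table can be sketched in Python. This sketch assumes that address positions run from the coarsest threshold (1000 alleles) to the finest (5 alleles), which is inferred from the statement that the first five positions correspond to the 20 allele threshold. The function names are hypothetical.

```python
from collections import defaultdict

# Assumed order of address positions, coarsest threshold first.
ADDRESS_THRESHOLDS_COARSE_TO_FINE = [1000, 200, 100, 50, 20, 10, 5]


def truncate_address(address, threshold):
    """Keep only the address positions down to the requested threshold."""
    fields = address.split(".")
    if len(fields) != len(ADDRESS_THRESHOLDS_COARSE_TO_FINE):
        raise ValueError(f"expected 7 address positions, got {len(fields)}")
    # list.index raises ValueError if the threshold is not an address threshold.
    keep = ADDRESS_THRESHOLDS_COARSE_TO_FINE.index(threshold) + 1
    return ".".join(fields[:keep])


def group_by_cluster(addresses, threshold):
    """Group sample names by their truncated cluster address."""
    clusters = defaultdict(list)
    for sample, address in addresses.items():
        clusters[truncate_address(address, threshold)].append(sample)
    return dict(clusters)


# Addresses from the table above.
addresses = {
    "Sample 1": "1.1.1.1.1.1.1",
    "Sample 2": "1.1.1.1.1.1.2",
    "Sample 3": "1.1.1.1.1.2.1",
    "Sample 4": "1.1.1.1.2.1.1",
}
groups = group_by_cluster(addresses, 20)
```

With the table's addresses, Samples 1 to 3 share the truncated address 1.1.1.1.1 and Sample 4 is alone under 1.1.1.1.2.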
Published Schemes and Thresholds Converted
Species | Publication | Published cgMLST Threshold | BugSeq refMLST Threshold | BugSeq refMLST Threshold (Rounded Up to Nearest Address Breakpoint) | cgMLST Scheme Size | refMLST Scheme Size | Notes |
---|---|---|---|---|---|---|---|
Acinetobacter baumannii | https://doi.org/10.1016/j.ijantimicag.2021.106404 | 10 | 14 | 20 | 2390 | 3372 | |
Clostridioides difficile | https://doi.org/10.3389/fcimb.2023.1109153 | 6 | 8 | 10 | 2469 | 3294 | |
Clostridioides difficile | https://doi.org/10.1128/jcm.01987-17 | 6 | 9 | 10 | 2270 | 3294 | |
Enterococcus faecalis | https://doi.org/10.1128/jcm.01686-18 | 7 | 9 | 10 | 1972 | 2502 | |
Enterococcus faecium | https://doi.org/10.1128/jcm.01946-15 | 20 | 32 | 50 | 1423 | 2271 | |
Escherichia coli | https://doi.org/10.1099/mgen.0.001126 | 10 | 16 | 20 | 2513 | 3907 | |
Klebsiella pneumoniae | https://doi.org/10.3389/fmicb.2017.00371 | 10 | 41 | 50 | 1143 | 4633 | |
Klebsiella pneumoniae | https://doi.org/10.1099/mgen.0.000347 | 4 | 8 | 10 | 2358 | 4633 | |
Listeria monocytogenes | https://doi.org/10.1186/s12864-022-08437-4 | 7 | 11 | 20 | 1748 | 2770 | |
Mycobacterium tuberculosis | https://doi.org/10.1128/jcm.00567-14 | 12 | 14 | 20 | 3257 | 3735 | |
Salmonella enterica | https://doi.org/10.3389/fmicb.2021.649517 | 10 | 14 | 20 | 3000 | 4112 | |
Staphylococcus aureus | https://doi.org/10.1128/jcm.00029-17 | 8 | 10 | 10 | 1861 | 2346 | “Related” threshold used. “Possibly related” would lead to a 50 refMLST allele threshold. |