Pipeline Change Log
- AMR gene alleles may be incorrectly called if they have an abundance of silent mutations. The AMR gene family should still be correctly called.
- Multiple AMR genes may be called for the same genomic region/sequence if there is a tie to two or more nearest alleles for that sequence.
- AMR genes reported on plasmids may be discordant from AMR genes reported in the bacterial genome. Plasmid AMR gene detection currently relies on a different method from bacterial genome AMR gene detection, and these methods will be harmonized in a future update.
- Mean Phred scores per base in summary report are calculated as a simple average of all base Phred scores, instead of incorporating considerations for a logarithmic scale. Additional details are available here.
- Barcode crosstalk correction for nanopore assembly-based metagenomic abundance calculation.
This change only affects abundance calculation and does not affect assembly or taxonomic binning. Metagenomic bins for taxa deriving from crosstalk, even if corrected to 0% abundance, will still be output. A future update will hide metagenomic bins purely deriving from barcode crosstalk.
- Name of the reference database used is now included in all reports.
- Additional QC metrics included in bacterial isolate summary report, including mean read qualities, assembly N50, assembly L50, percent assembly duplication, and percent identity to reference genome.
- Ability to opt out of outbreak analysis (which adds samples to permanent laboratory database). A check box is now available on the submission page to use this feature.
- Haemophilus influenzae serotype prediction.
- Streptococcus pyogenes (Group A Streptococcus) emm typing.
- Reporting of localization of AMR genes to plasmids. See Known Issues for limitations around this reporting.
- Reporting of additional information on detected plasmids, including coverage, and replicon, relaxase and mate-pair formation typing.
- The median base quality, both within and across reads, is properly calculated in the summary report. Additional details are available here.
- Median read length in the General Statistics table of summary report is properly calculated.
- Zip file of summary results may have included a zip file of summary results.
- Capitalization of some taxon names.
- Clarified text in reports describing plasmid cluster IDs.
- Number of reads reported for Illumina analyses in general statistics table of summary report now accurately reflects the number of all reads passing quality control filters. Read counts may have been underestimated before this fix, as only a subset of reads were counted.
- Text now wraps inside cells of tables on per-sample and summary reports.
- Assembly completeness in the bacterial isolate summary now reflects both single-copy and duplicated single-copy orthologs. The previous metric reflected only single-copy orthologs, and can be derived from the current metrics as:
Unduplicated Assembly Completeness = Assembly Completeness - Assembly Duplication. For pure bacterial isolates, Unduplicated Assembly Completeness should differ from Assembly Completeness by less than 3%.
- Removed minimizer duplication heatmap from summary reports for Illumina read-based metagenomic classification. This plot did not add value to assessing false positive classifications. Future updates will integrate genome coverage metrics into Illumina read-based metagenomic classification for improved classification precision.
- Stringent demultiplexing is now enabled for nanopore sequencing data by default if submitted without barcoding data or if submitted as FAST5 files.
Files submitted with barcode information in the filename or folder name (if a folder was uploaded) will be unaffected.
- Improved coloring of “Genome Completeness” visualizations to reflect the severity of missing, fragmented and duplicated single-copy orthologs.
- Speed optimizations for large Illumina metagenomic samples. Results should not be affected.
- Major AMR database updates. Improvements are made to multiple beta-lactamase groups, along with better consistency across drugs within the same class.
- Additional selective reporting of antimicrobials for specific taxa (eg. Enterobacter and Serratia spp.).
- New 16S analysis for Illumina sequencing data. Individual reads are error-corrected and classified against a reference database. This new analysis brings increased classification recall and precision on internal benchmarks. For users interested in diversity estimation, we advise filtering out taxa with less than 0.1-1% abundance, or analysing at the genus rank.
- Updated MLST database.
- Updated SARS-CoV-2 lineage database.
- New default reference sequence database. This update brings significant accuracy improvements to classification of sequences deriving from plasmids.
- Better visualization of detected genotypic markers in the summary report via heatmap-style table.
- Plot sequencing depth by amplicon for amplicon sequencing experiments.
- Novel plasmids are named by the closest sequence in NCBI. The novel plasmid name format is “Novel_
- Better precision for Illumina read-level classification.
- Improved host read filtering by building a more robust host reference database.
- Improved host read filtering for Illumina sequencing data, resulting in improved performance for bacterial isolates with regions of sequence similarity to the host organism (eg. Neisseria gonorrhoeae).
- Improved quality control reporting for Salmonella serotyping by including descriptive warnings.
- Improved read quality control for nanopore datasets which were not multiplexed.
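The known issue above about mean Phred scores can be illustrated with a minimal sketch. It contrasts a simple average of Phred scores with an average taken over the underlying error probabilities, which respects the logarithmic scale (Q = -10 * log10(p)). The function names and example values are illustrative assumptions and are not taken from the BugSeq pipeline.

```python
import math


def simple_mean_phred(quals):
    """Arithmetic mean of Phred scores (the behaviour described in the known issue)."""
    return sum(quals) / len(quals)


def error_aware_mean_phred(quals):
    """Mean Phred score computed on the probability scale.

    Each score is converted to an error probability, the probabilities are
    averaged, and the mean probability is converted back to a Phred score.
    """
    probs = [10 ** (-q / 10) for q in quals]
    mean_prob = sum(probs) / len(probs)
    return -10 * math.log10(mean_prob)


# Hypothetical base qualities for one read.
example_quals = [10, 20, 30, 40]
print(simple_mean_phred(example_quals))       # 25.0
print(round(error_aware_mean_phred(example_quals), 2))
```

Because low-quality bases dominate the mean error probability, the probability-scale mean is never higher than the simple average of the same scores.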
4.0 - November 7, 2022
- QuAISAR-style coverage calculation available in the bacterial isolate summary spreadsheet.
- Conditional formatting of outbreak analysis distance matrices.
- Filter host and Phi X reads before Illumina assembly.
- Monkeypox consensus sequence generation and clade classification.
- Visualization of nanopore and Illumina read-level metagenomic classification results in summary reports.
Illumina read-level metagenomic classification is currently experimental and functionality may change in the future.
- Barcode crosstalk correction for nanopore sequencing data. We follow our previously published algorithm described and validated by Gauthier et al. (2021).
- Report percentage of host reads in the general statistics table of the summary report for nanopore data.
- BugSeq evaluated a new algorithm to generate our taxonomic classification database. Based on feedback from our users, this algorithm did not generate databases with the performance we had hoped for, and we have reverted the database change.
Analyses run between August 15 and September 6 may have been affected, and we encourage users to resubmit their data if analysed during this period.
- Sort columns by sample name in summary AMR table.
- Erroneous blank line in some tables of PDF reports.
- Fixed cefixime reporting for ESBLs. Cefixime should now be flagged as having a genotypic predictor of resistance if an ESBL is present.
- Fixed ceftriaxone reporting for carbapenemases. Ceftriaxone should now be flagged as having a genotypic predictor of resistance if certain carbapenemases are present.
- Aggregate plasmid table reported a plasmid with name “0” as found when no plasmids were found.
- Barcode trimming is skipped if reads have already had barcodes trimmed.
- Bacterial isolate summary table is now sorted by sample name.
- All databases were updated. New databases include broader taxonomic representation and should therefore provide increased classification accuracy.
- Outbreak analysis module can now handle isolates that have been submitted to BugSeq multiple times with the same name. Each time an isolate is submitted to BugSeq, we now record the date of submission to keep track of resequenced/duplicate isolates.
- Updated AMR database. Major updates are made to OXA-type beta-lactamases, which should more accurately represent the phenotype of individual families and alleles.
- Read count plot has been merged with read filtering plot in summary reports to reduce redundancy of results.
- Nanopore 16S analysis now accepts clusters of 30 reads or larger.
- Stringent demultiplexing for nanopore has temporarily been removed. BugSeq now relies upon the demultiplexing performed by the user. If you would like to analyse stringently demultiplexed data, please perform this before submitting to BugSeq.
- Speed optimizations.
- Keep reads with a greater number of Ns to discard less data.
- Improve nanopore RNA assembly speed and quality.
- Report median read length instead of mean read length in reports.
- Dynamically set contig length suffix (eg. bp, Kbp, Mbp) in reports.
- Filter host (eg. human) reads before metagenomic classification. This improves both speed (fewer reads need to be classified against the full database) and accuracy (some reads may have erroneously been classified to organisms with similar genomes to host).
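As an illustration of the dynamically set contig length suffix mentioned above, a formatter along these lines could pick bp, Kbp or Mbp from the magnitude of the length. This is a minimal sketch: the function name, the thresholds of 1,000 and 1,000,000 bp, and the two-decimal rounding are assumptions, not the pipeline's actual behaviour.

```python
def format_contig_length(length_bp: int) -> str:
    """Return a contig length with a suffix chosen from its magnitude."""
    # Largest unit first, so the first match is the most compact suffix.
    units = [(1_000_000, "Mbp"), (1_000, "Kbp")]
    for factor, suffix in units:
        if length_bp >= factor:
            return f"{length_bp / factor:.2f} {suffix}"
    # Lengths below 1,000 bp are reported as-is.
    return f"{length_bp} bp"


for length in (950, 1_500, 4_600_000):
    print(format_contig_length(length))
```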
3.0 - May 16, 2022
- Flowcell/sequencing run quality control for nanopore sequencing data. If you submit FASTQ files to BugSeq containing sequencing information in the FASTQ headers, this will now be plotted by run ID on the summary report.
- Legionella serogroup prediction.
- Information on read filtering during preprocessing to summary reports for all nanopore sequencing experiments.
- Sample reports now contain additional strain typing information on Salmonella, Klebsiella and Legionella species.
- Bacterial isolate summary report in Excel format.
- Better detection of nanopore 16S experiments by lowering the acceptable median length of 16S reads.
- Plot titles in reports now reflect the content of the plot instead of the tool used to generate them (which was sometimes erroneous).
- Fixed the calculation of the fraction of reference genome covered for assemblies which were close to 95% sequence identity to the reference genome. In this scenario, reference genome coverage was vastly underestimated. Bacterial isolates with >99% sequence identity to the reference genome were unlikely to be affected. This issue is related to this bug in QUAST and BugSeq has implemented an internal fix pending an official fix in QUAST.
- refMLST allele calculation if there was a variant in the first or last base of a locus. Accounting for these variants increases resolution, and increases the distance between isolates by an average of 2 alleles at distances less than 50.
- refMLST clustering of isolates that are equidistant to two separate clusters. Isolates meeting this criterion now cluster with the first cluster observed. Previously, they were assigned randomly to one of the two or more equidistant clusters.
- Platform-specific thresholds for classifying sequencing quality control data as pass/warning/fail.
- Missing tables in PDF reports if they have too many columns. That is now fixed with a message to check the HTML report for the full table.
- Summary reports now show total reads after filtering. This streamlines the summary reports for Illumina paired-end data as individual FASTQ files are no longer reported in the General Statistics table by default. Note that the number of reads after filtering is the sum of both paired-end files for Illumina.
- Krona plots for BugSplit (assembly-level) classification now show all ranks, including intermediary ranks.
- Unclassified contigs (and the reads mapping to them) are now classified as unclassified instead of root in Kraken report formats.
- No longer expose unbinned assemblies. The unbinned assembly may be obtained by concatenating together all of the binned assemblies.
- Removal of Racon/Medaka/Homopolish polishing of nanopore R9.4.1 and R10.3 assemblies for the following reasons:
  - Unfortunately, both Racon and Homopolish had critical issues preventing their widespread use.
  - On internal benchmarks, Medaka may decrease assembly quality if there is a mismatch between basecaller version and Medaka model.
  - With recent improvements in nanopore basecalling accuracy, polishing no longer has a significant impact on assembly quality.
  - Polishing was adding to processing time but was not increasing metagenomic classification or downstream analysis accuracy (eg. AMR). BUSCO completion analysis may be impacted with more fragmented genes detected; however, this does not impact other analyses.
- Faster Illumina analyses by allocating more CPUs to the assembly process.
- Do not bin long-read assembled contigs less than 1000bp in length. This increases accuracy of analysis as these contigs were the most likely to be erroneously classified, and should also not be present in long-read assemblies.
- Expose additional antibiotics on sample reports for Mycobacterium tuberculosis.
- Allow IUPAC characters in input sequencing data.
- Improved taxonomic binning of bacterial isolates.
- Updated AMR and MLST database.
- Bin plasmids to their host bacteria across all sample types.
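The refMLST clustering fix above, where an isolate equidistant to several clusters now joins the first cluster observed instead of a random one, amounts to deterministic tie-breaking. The sketch below shows one way to express that. It assumes that "first observed" means first in input order, and the function and variable names are hypothetical; the actual refMLST implementation is not described in this change log.

```python
def assign_cluster(distances):
    """Pick the nearest cluster, keeping the first one observed on ties.

    `distances` is a list of (cluster_id, allele_distance) pairs in the
    order the clusters were observed.
    """
    best_id, best_dist = None, None
    for cluster_id, dist in distances:
        # Strict "<" means a later cluster at the same distance never
        # replaces an earlier one, so the result is reproducible.
        if best_dist is None or dist < best_dist:
            best_id, best_dist = cluster_id, dist
    return best_id


# Hypothetical isolate equidistant (5 alleles) to clusters A and B.
print(assign_cluster([("A", 5), ("B", 5), ("C", 7)]))
```

With this rule, resubmitting the same data yields the same cluster assignment, which random tie-breaking does not guarantee.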
2.3 - January 25, 2022
- Include the following data in individual sample reports:
  - Antimicrobial resistance table
  - TB spoligotyping
  - Sample name
  - Table of plasmids detected
- Canonical SNP typing for:
  - B. anthracis
  - F. tularensis
  - Y. pestis
  - C. burnetii
- Include the following data in aggregate reports:
  - Plasmids detected across all samples
  - Antimicrobial resistance detection across all samples
  - Read filtering statistics
- PDF generation of reports
- Assembly and processing of Q20+ ONT sequencing data
- Insertion and deletion detection for Bacillus anthracis
- cDNA amplification primer trimming on metatranscriptomic nanopore data
- Error handling if a user submits duplicated reads across multiple files
- Coverage per contig plot in sample reports
- Reporting of first line antimicrobials for Mycobacterium tuberculosis
- Improved detection of plasmid sequences
- Improved assembly of ONT cDNA/RNA-sequencing data
- Improved abundance calculation from Illumina assemblies
- Improved primer scheme detection and trimming for SARS-CoV-2 amplicon sequencing data
- Faster nanopore 16S processing by setting the QIIME2 vsearch --maxrejects flag. On internal evaluations, this has negligible impact on results but drastically speeds up processing.
- Faster overall analyses by being more selective in which pathogen-specific analyses get run
- Gentler quality filtering for Illumina data should result in better and more contiguous assemblies
- Faster read preprocessing by better leveraging parallel processing
- Removed raw text file output of AMR data as it is now contained in reports and much easier to use
2.2 - October 26, 2021
- Plot BUSCO results in reports
- Include MLST results in per-sample report
- Report analysis pipeline version in reports
- Bug handling many Illumina samples in one analysis
- Depth calculations only included reference contigs with mapping assembled contigs. This is now fixed - sequencing depth may be slightly lower than previously reported.
- BugSplit can now report strain-level classifications
- seqkit for faster FASTQ parsing
2.1 - October 14, 2021
- Per-sample HTML report
- Sequencing depth calculation performed for each genome in a sample relative to its reference sequence
- Stringent demultiplexing of user submissions with custom file names
- Visualization of Pangolin SARS-CoV-2 lineage results in the summary report
- Improved taxonomic binning of Illumina sequences using assembly graph information
- Improved Illumina assembly by merging paired-end reads
- Faster internal file transfers, resulting in faster analyses
2.0 - October 7, 2021
- Assembly and polishing of data from all sequencers
- BugSplit module: high-accuracy taxonomic binning (Citation)
- Respect a user’s custom filename
- Only demultiplex nanopore if it does not have a custom filename
1.0.0 - December 3, 2019
- First release!
Last update: March 14, 2023