Pipeline Change Log
- Additional QC metrics included in bacterial isolate summary report, including mean read qualities, assembly N50, assembly L50 and percent identity to reference genome.
- Ability to opt out of outbreak analysis (which adds samples to permanent laboratory database). A check box is now available on the submission page to use this feature.
- The zip file of summary results may have contained a nested copy of the summary results zip file.
- Capitalization of some taxon names.
- Clarified text in reports describing plasmid cluster IDs.
- Number of reads reported for Illumina analyses in the general statistics table of the summary report now accurately reflects the number of all reads passing quality control filters. Read counts may have been underestimated before this fix, as only a subset of reads was counted.
- Major AMR database updates. Improvements are made to multiple beta-lactamase groups, along with better consistency across drugs within the same class.
- New 16S analysis for Illumina sequencing data. Individual reads are error-corrected and classified against a reference database. This new analysis brings increased classification recall and precision on internal benchmarks. For users interested in diversity estimation, we advise filtering out taxa with less than 0.1-1% abundance, or analysing at the genus rank.
- Updated MLST database.
- Updated SARS-CoV-2 lineage database.
- New default reference sequence database. This update brings significant accuracy improvements to classification of sequences deriving from plasmids.
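As an illustration of the abundance filtering suggested above for 16S diversity estimation, here is a minimal sketch. It assumes per-taxon read counts are already available as a dictionary. The function name, the input layout and the example counts are hypothetical and do not reflect BugSeq's output format.

```python
def filter_low_abundance(read_counts, min_fraction=0.01):
    """Keep taxa whose share of all counted reads is at least min_fraction.

    read_counts: dict mapping taxon name -> number of reads (hypothetical layout).
    min_fraction: relative abundance cutoff; 0.01 is 1%, 0.001 is 0.1%.
    """
    total = sum(read_counts.values())
    if total == 0:
        return {}
    return {
        taxon: count
        for taxon, count in read_counts.items()
        if count / total >= min_fraction
    }


# Made-up counts for illustration only.
counts = {"Escherichia": 9000, "Bacteroides": 950, "Listeria": 50}
print(filter_low_abundance(counts, min_fraction=0.01))
```

With a 1% cutoff the taxon at 0.5% abundance is dropped; with `min_fraction=0.001` all three taxa are kept. Which end of the 0.1-1% range is appropriate depends on the experiment.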
4.0 - November 7, 2022
- QuAISAR-style coverage calculation available in the bacterial isolate summary spreadsheet.
- Conditional formatting of outbreak analysis distance matrices.
- Filter host and Phi X reads before Illumina assembly.
- Monkeypox consensus sequence generation and clade classification.
- Visualization of nanopore and Illumina read-level metagenomic classification results in summary reports.
Illumina read-level metagenomic classification is currently experimental and functionality may change in the future.
- Barcode crosstalk correction for nanopore sequencing data. We follow our previously published algorithm described and validated by Gauthier et al. (2021).
- Report percentage of host reads in the general statistics table of the summary report for nanopore data.
- BugSeq evaluated a new algorithm to generate our taxonomic classification database. Based on feedback from our users, the databases generated by this algorithm did not achieve the hoped-for performance, and we have reverted the database change.
Analyses run between August 15 and September 6 may have been affected, and we encourage users to resubmit their data if analysed during this period.
- Sort columns by sample name in summary AMR table.
- Erroneous blank line in some tables of PDF reports.
- Fixed cefixime reporting for ESBLs. Cefixime should now be flagged as having a genotypic predictor of resistance if an ESBL is present.
- Fixed ceftriaxone reporting for carbapenemases. Ceftriaxone should now be flagged as having a genotypic predictor of resistance if certain carbapenemases are present.
- Aggregate plasmid table reported a plasmid with name “0” as found when no plasmids were found.
- Barcode trimming is skipped if reads have already had barcodes trimmed.
- Bacterial isolate summary table is now sorted by sample name.
- All databases were updated. New databases include broader taxonomic representation and should therefore provide increased classification accuracy.
- Outbreak analysis module can now handle isolates that have been submitted to BugSeq multiple times with the same name. Each time an isolate is submitted to BugSeq, we now record the date of submission to keep track of resequenced/duplicate isolates.
- Updated AMR database. Major updates are made to OXA-type beta-lactamases, which should more accurately represent the phenotype of individual families and alleles.
- Read count plot has been merged with read filtering plot in summary reports to reduce redundancy of results.
- Nanopore 16S analysis now accepts clusters of 30 or more reads.
- Stringent demultiplexing for nanopore has temporarily been removed. BugSeq now relies upon the demultiplexing performed by the user. If you would like to analyse stringently demultiplexed data, please perform this before submitting to BugSeq.
- Speed optimizations.
- Keep reads with a greater number of Ns to discard less data.
- Improve nanopore RNA assembly speed and quality.
- Report median read length instead of mean read length in reports.
- Dynamically set contig length suffix (e.g. bp, Kbp, Mbp) in reports.
- Filter host (e.g. human) reads before metagenomic classification. This improves both speed (fewer reads need to be classified against the full database) and accuracy (some reads may previously have been erroneously classified to organisms with genomes similar to the host).
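The dynamic contig length suffix mentioned above can be pictured with a small formatting helper. This is only a sketch: the function name, the cutoffs of 1,000 and 1,000,000 bases, and the one-decimal rounding are assumptions, not the behaviour of the BugSeq reports.

```python
def format_length(num_bases):
    """Return a base-pair count with a bp, Kbp or Mbp suffix.

    The cutoffs (1,000 and 1,000,000) and one-decimal rounding are assumptions.
    """
    if num_bases >= 1_000_000:
        return f"{num_bases / 1_000_000:.1f} Mbp"
    if num_bases >= 1_000:
        return f"{num_bases / 1_000:.1f} Kbp"
    return f"{num_bases} bp"


for length in (950, 1500, 4_600_000):
    print(format_length(length))
```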
3.0 - May 16, 2022
- Flowcell/sequencing run quality control for nanopore sequencing data. If you submit FASTQ files to BugSeq containing sequencing information in the FASTQ headers, this will now be plotted by run ID on the summary report.
- Legionella serogroup prediction.
- Information on read filtering during preprocessing to summary reports for all nanopore sequencing experiments.
- Sample reports now contain additional strain typing information on Salmonella, Klebsiella and Legionella species.
- Bacterial isolate summary report in Excel format.
- Better detection of nanopore 16S experiments by lowering the acceptable median length of 16S reads.
- Plot titles in reports now reflect the content of the plot instead of the tool used to generate them (which was sometimes erroneous).
- Fixed the calculation of the fraction of reference genome covered for assemblies which were close to 95% sequence identity to the reference genome. In this scenario, reference genome coverage was vastly underestimated. Bacterial isolates with >99% sequence identity to the reference genome were unlikely to be affected. This issue is related to a bug in QUAST, and BugSeq has implemented an internal fix pending an official fix in QUAST.
- refMLST allele calculation if there was a variant in the first or last base of a locus. Accounting for these variants increases resolution and the distance between isolates by an average of 2 alleles at distances less than 50.
- refMLST clustering of isolates that are equidistant to two separate clusters. Isolates meeting this criterion now cluster with the first cluster observed. Previously, they were assigned randomly to one of the two or more equidistant clusters.
- Platform-specific thresholds for classifying sequencing quality control data as pass/warning/fail.
- Missing tables in PDF reports if they have too many columns. That is now fixed with a message to check the HTML report for the full table.
- Summary reports now show total reads after filtering. This streamlines the summary reports for Illumina paired-end data as individual FASTQ files are no longer reported in the General Statistics table by default. Note that the number of reads after filtering is the sum of both paired-end files for Illumina.
- Krona plots for BugSplit (assembly-level) classification now show all ranks, including intermediary ranks.
- Unclassified contigs (and the reads mapping to them) are now classified as unclassified instead of root in Kraken report formats.
- No longer expose unbinned assemblies. The unbinned assembly may be obtained by concatenating together all of the binned assemblies.
- Removal of Racon/Medaka/Homopolish polishing of nanopore R9.4.1 and R10.3 assemblies for the following reasons:
  - Unfortunately, both Racon and Homopolish had critical issues preventing their widespread use.
  - On internal benchmarks, Medaka may decrease assembly quality if there is a mismatch between basecaller version and Medaka model.
  - With recent improvements in nanopore basecalling accuracy, polishing no longer has a significant impact on assembly quality.
  - Polishing was adding to processing time but was not increasing metagenomic classification or downstream analysis accuracy (e.g. AMR). BUSCO completion analysis may be impacted with more fragmented genes detected; however, this does not impact other analyses.
- Faster Illumina analyses by allocating more CPUs to the assembly process.
- Do not bin long-read assembled contigs less than 1000bp in length. This increases accuracy of analysis as these contigs were the most likely to be erroneously classified, and should also not be present in long-read assemblies.
- Expose additional antibiotics on sample reports for Mycobacterium tuberculosis.
- Allow IUPAC characters in input sequencing data.
- Improved taxonomic binning of bacterial isolates.
- Updated AMR and MLST database.
- Bin plasmids to their host bacteria across all sample types.
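Since unbinned assemblies are no longer exposed, the concatenation mentioned above can be done with a few lines of code. The sketch below assumes each binned assembly is a plain-text FASTA file; the function name and the file names in the usage comment are hypothetical.

```python
from pathlib import Path


def concatenate_fasta(bin_paths, out_path):
    """Write the contents of every FASTA file in bin_paths to out_path, in order."""
    with open(out_path, "w") as out:
        for path in bin_paths:
            # Reads each file fully into memory; fine for a sketch.
            text = Path(path).read_text()
            if not text:
                continue
            out.write(text)
            # Make sure the next file's header starts on its own line.
            if not text.endswith("\n"):
                out.write("\n")


# Example usage with hypothetical file names:
# concatenate_fasta(sorted(Path("bins").glob("*.fasta")), "unbinned.fasta")
```

The trailing-newline check matters because a bin file that does not end with a newline would otherwise have its last sequence line fused with the next file's header.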
2.3 - January 25, 2022
- Include the following data in individual sample reports:
  - Antimicrobial resistance table
  - TB spoligotyping
  - Sample name
  - Table of plasmids detected
- Canonical SNP typing for:
  - B. anthracis
  - F. tularensis
  - Y. pestis
  - C. burnetii
- Include the following data in aggregate reports:
  - Plasmids detected across all samples
  - Antimicrobial resistance detection across all samples
  - Read filtering statistics
- PDF generation of reports
- Assembly and processing of Q20+ ONT sequencing data
- Insertion and deletion detection for Bacillus anthracis
- cDNA amplification primer trimming on metatranscriptomic nanopore data
- Error handling if a user submits duplicated reads across multiple files
- Coverage per contig plot in sample reports
- Reporting of first line antimicrobials for Mycobacterium tuberculosis
- Improved detection of plasmid sequences
- Improved assembly of ONT cDNA/RNA-sequencing data
- Improved abundance calculation from Illumina assemblies
- Improved primer scheme detection and trimming for SARS-CoV-2 amplicon sequencing data
- Faster nanopore 16S processing by setting the QIIME2 vsearch --maxrejects flag. On internal evaluations, this has negligible impact on results but drastically speeds up processing.
- Faster overall analyses by being more selective in which pathogen-specific analyses get run
- Gentler quality filtering for Illumina data should result in better and more contiguous assemblies
- Faster read preprocessing by better leveraging parallel processing
- Removed raw text file output of AMR data as it is now contained in reports and much easier to use
2.2 - October 26, 2021
- Plot BUSCO results in reports
- Include MLST results in per-sample report
- Report analysis pipeline version in reports
- Bug handling many Illumina samples in one analysis
- Depth calculations only included reference contigs to which assembled contigs mapped. This is now fixed; sequencing depth may be slightly lower than previously reported.
- BugSplit can now report strain-level classifications
- seqkit for faster FASTQ parsing
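To illustrate why the depth fix in this release can lower reported values, consider a simplified picture in which depth is the total number of aligned bases divided by the total reference length. This definition, the function name and the numbers are assumptions for illustration, not BugSeq's actual calculation.

```python
def mean_depth(contigs, only_covered=False):
    """contigs: list of (reference_contig_length, aligned_bases) tuples.

    only_covered=True mimics skipping reference contigs with nothing
    mapped to them, as described for the behaviour before the fix.
    """
    if only_covered:
        contigs = [(length, bases) for length, bases in contigs if bases > 0]
    total_length = sum(length for length, _ in contigs)
    total_bases = sum(bases for _, bases in contigs)
    return total_bases / total_length if total_length else 0.0


# Hypothetical reference: one covered chromosome and one uncovered plasmid.
reference = [(5_000_000, 150_000_000), (100_000, 0)]
print(mean_depth(reference, only_covered=True))  # uncovered contig ignored
print(mean_depth(reference))                     # uncovered contig included
```

Including the uncovered contig enlarges the denominator without adding aligned bases, so the second value is slightly lower than the first.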
2.1 - October 14, 2021
- Per-sample HTML report
- Sequencing depth calculation performed for each genome in a sample relative to its reference sequence
- Stringent demultiplexing of user submissions with custom file names
- Visualization of Pangolin SARS-CoV-2 lineage results in the summary report
- Improved taxonomic binning of Illumina sequences using assembly graph information
- Improved Illumina assembly by merging paired-end reads
- Faster internal file transfers, resulting in faster analyses
2.0 - October 7, 2021
- Assembly and polishing of data from all sequencers
- BugSplit module: high-accuracy taxonomic binning (Citation)
- Respect a user’s custom filename
- Only demultiplex nanopore if it does not have a custom filename
1.0.0 - December 3, 2019
- First release!