Output

BugSeq outputs a set of actionable results and processed data for every analysis. Results files are available under analysis history. Example results, on which this tutorial is based, are available at our demo.

Quality Control

A summary of most quality control data is found in the multiqc.html file. For a general review of all features in the MultiQC report, we refer the reader to the MultiQC documentation.

Sequencing Data

Quality is assessed before any preprocessing. The most important graph is the FastQC: Status Checks.

Automatic quality control interpretation based on experimental design.

As with a stop light, green boxes reflect high-quality data, yellow boxes reflect a warning, and red boxes reflect concern. Thresholds for this graph are tailored to your sequencing platform and experimental design. For example, nanopore sequences are flagged with a warning if the average Per Sequence Quality (measured as a Phred score) is below Q8, whereas Illumina sequences are flagged if it is below Q20.
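To put the Q8 and Q20 thresholds in perspective, the sketch below converts a Phred score into the expected per-base error probability using the standard definition Q = -10 log10(p). The function names are illustrative and are not part of BugSeq or FastQC.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a base call is wrong, given its Phred score Q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Inverse conversion: Phred score for a given error probability."""
    return -10 * math.log10(p)


# Warning thresholds quoted in the text above.
for platform, q in (("nanopore", 8), ("Illumina", 20)):
    p = phred_to_error_probability(q)
    print(f"{platform}: Q{q} corresponds to an error probability of {p:.3f}")
```

Under this definition, Q8 corresponds to roughly a 16% chance of error per base and Q20 to 1%, which is why the acceptable threshold differs so much between platforms.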

For a full description of graph content, please see the FastQC manual.

Metagenomic Classification

Visualization

The easiest way to quickly get a sense of the composition of a sample is via the BugSeq metagenomic visualization, found in the metagenomic_classification-RUN_ID.html file. This file can be opened in any modern web browser, or directly within the BugSeq portal via the Visualize button.

Text Format

In the project results root directory, there’s a file named metagenomic_classification-RUN_ID.tsv. This file contains the raw data visualized in the above file. For example, opening this file in Excel will reveal:

Taxon Name | Taxon NCBI ID | Taxon Rank | 15f5175d_noBarcode Read count at this taxon and below | 15f5175d_noBarcode Read count directly assigned to this taxon | 15f5175d_noBarcode Normalized score | stringent_demultiplex_15f5175d_noBarcode Read count at this taxon and below | stringent_demultiplex_15f5175d_noBarcode Read count directly assigned to this taxon | stringent_demultiplex_15f5175d_noBarcode Normalized score
Root | 1 | no_rank | 2317 | 267 | 46.1 | 2317 | 267 | 46.1
Bacteria | 2 | superkingdom | 2050 | 0 | 51.4 | 2050 | 0 | 51.4
Terrabacteria group | 1783272 | no_rank | 2050 | 0 | 51.4 | 2050 | 0 | 51.4
Firmicutes | 1239 | phylum | 2050 | 0 | 51.4 | 2050 | 0 | 51.4
Bacilli | 91061 | class | 2050 | 0 | 51.4 | 2050 | 0 | 51.4
Bacillales | 1385 | order | 2050 | 0 | 51.4 | 2050 | 0 | 51.4
Listeriaceae | 186820 | family | 2050 | 0 | 51.4 | 2050 | 0 | 51.4
Listeria | 1637 | genus | 2050 | 0 | 51.4 | 2050 | 0 | 51.4
Listeria monocytogenes | 1639 | species | 2050 | 2050 | 51.4 | 2050 | 2050 | 51.4

The first three columns are row labels and reflect taxonomic nodes. BugSeq follows the NCBI taxonomic scheme.

Each sample is then included as three columns:

  • Read count at this taxon and below: This field contains the summed read count for this taxon and all of its descendants. In this example, 2050 reads were assigned to the superkingdom Bacteria or a rank below Bacteria, and 2317 reads were assigned to Root or below.
  • Read count directly assigned to this taxon: These reads could not be assigned to a lower taxonomic node given mapping ambiguity and/or the nature of the taxonomic tree (e.g. if the reads are assigned to the lowest rank in the tree). In this example, 267 reads were assigned directly to Root and could not be assigned to lower nodes such as Listeria monocytogenes.
  • Normalized score: This field is calculated as the minimap2 alignment score of the read, divided by the read’s total length, multiplied by 50, and averaged across all reads assigned to a node. As minimap2 awards 2 points for each base that matches the reference, a score of 100 means the read has 100% sequence identity, and a score of 50 similarly means approximately 50% sequence identity.

Why "approximate" sequence identity?

Minimap2 awards/subtracts a variable number of alignment points for mismatches, insertions and deletions. See the minimap2 manual for details on its calculation of alignment score.
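The normalized score calculation described above can be sketched as follows. This is a minimal illustration of the stated formula only, not BugSeq's implementation; the function names and the example alignment scores are hypothetical, and the match award of 2 points per base is taken from the text.

```python
from statistics import mean


def read_score(alignment_score: float, read_length: int) -> float:
    """Per-read score: alignment score / read length * 50."""
    return alignment_score / read_length * 50


def normalized_score(reads) -> float:
    """Average the per-read score over all reads assigned to a node.

    `reads` is an iterable of (alignment_score, read_length) pairs.
    """
    return mean(read_score(score, length) for score, length in reads)


# Hypothetical reads of 1000 bases each.
perfect = (2000, 1000)  # every base matches: 2 points x 1000 bases
partial = (1000, 1000)  # half the maximum alignment score

print(normalized_score([perfect]))           # 100.0
print(normalized_score([perfect, partial]))  # 75.0
```

Because mismatches and gaps subtract a variable number of points, a per-read score of 50 can arise from different mixes of matches and penalties, which is why the mapping to sequence identity is only approximate.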

What’s a good score?

Good scores largely depend on the experimental design and the presence of the sequenced organisms in BugSeq’s reference database. For experiments with high-accuracy basecalling of DNA and well-represented organisms, scores above 60 are generally considered good. However, we have also seen good classification performance with scores far below this (e.g. 20). We suggest the reader adjust their acceptable score threshold based on anticipated results, with higher thresholds giving greater precision at the cost of lower recall. BugSeq filters (i.e. omits) any classification with a normalized score of 5 or lower by default.
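If you want to apply a stricter threshold than the default to exported results, a simple filter along these lines may help. It is a sketch under assumptions: the (taxon, score) pairs and the function name are hypothetical, and only the "5 or lower is omitted" rule comes from the text above.

```python
def filter_by_score(classifications, min_score=5.0):
    """Keep classifications whose normalized score is strictly above min_score.

    `classifications` is a list of (taxon_name, normalized_score) pairs.
    A score equal to min_score is dropped, mirroring "5 or lower" being omitted.
    """
    return [(taxon, score) for taxon, score in classifications if score > min_score]


# Hypothetical scores, apart from the Listeria value shown in the example table.
results = [
    ("Listeria monocytogenes", 51.4),
    ("hypothetical taxon A", 20.0),
    ("hypothetical taxon B", 5.0),
]

print(filter_by_score(results))                # default-like threshold
print(filter_by_score(results, min_score=30))  # stricter: fewer, more confident calls
```

Raising `min_score` removes lower-scoring taxa first, which illustrates the precision and recall trade-off described above.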

Samples may have an additional three columns if they were demultiplexed by BugSeq. See the Multiplexing section for further information on the BugSeq demultiplexing strategy, and the difference between default and stringent modes. Samples labelled with stringent in their name have undergone demultiplexing using the stringent strategy.
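For scripted analysis, the layout described above (three taxon label columns followed by three columns per sample) can be read positionally. The sketch below assumes a tab-separated file with a single header row and no empty cells; the exact header text in a real file may differ, so the parser ignores header names and relies only on column order. All function and key names are illustrative.

```python
import csv
import io


def parse_classification_tsv(text):
    """Parse the classification table into one dict per taxon.

    Assumes: tab-separated values, one header row, 3 label columns,
    then 3 columns per sample (cumulative reads, direct reads, score).
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    n_samples = (len(header) - 3) // 3
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        name, ncbi_id, rank = record[:3]
        samples = []
        for i in range(n_samples):
            cumulative, direct, score = record[3 + 3 * i : 6 + 3 * i]
            samples.append({
                "cumulative_reads": int(cumulative),
                "direct_reads": int(direct),
                "normalized_score": float(score),
            })
        rows.append({"name": name, "ncbi_id": int(ncbi_id), "rank": rank, "samples": samples})
    return rows


# Two rows from the example table, with placeholder header names.
example = "\n".join("\t".join(cells) for cells in [
    ["Taxon Name", "Taxon NCBI ID", "Taxon Rank", "cumulative", "direct", "score"],
    ["Root", "1", "no_rank", "2317", "267", "46.1"],
    ["Bacteria", "2", "superkingdom", "2050", "0", "51.4"],
])
root, bacteria = parse_classification_tsv(example)

# Root's cumulative count is its direct count plus its only child's cumulative count.
print(root["samples"][0]["cumulative_reads"]
      == root["samples"][0]["direct_reads"] + bacteria["samples"][0]["cumulative_reads"])
```

With the real file, you would pass the file contents instead of `example`; a demultiplexed run simply yields more than one entry in each row's `samples` list.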

Assembly

Assembly Quality Control

General statistics on the assembly, such as total size (in base pairs), number of contigs, and NG50, can be found in the multiqc.html file.
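For readers unfamiliar with NG50, the sketch below shows the commonly used definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the expected genome size. N50 is the same statistic with the total assembly size in place of the genome size. This illustrates the definition only and does not claim to reproduce how the report computes it; the contig lengths are hypothetical.

```python
def ng50(contig_lengths, genome_size):
    """Return NG50, or None if the assembly covers less than half the genome size."""
    half = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return None


def n50(contig_lengths):
    """N50 is NG50 computed against the total assembly size."""
    return ng50(contig_lengths, sum(contig_lengths))


lengths = [100, 80, 50, 30, 20]  # hypothetical contig lengths in base pairs
print(n50(lengths))        # 80
print(ng50(lengths, 400))  # 50
```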

Sequences

Assembled contigs are found in the assembly folder as .fna files. These files contain all organisms with sufficient depth in the submitted sequencing data to be assembled. Assemblies generated from nanopore sequencing data have already undergone polishing using a best-practices pipeline and do not need any further processing before downstream use.
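The .fna files are assumed here to be standard FASTA (header lines beginning with ">" followed by sequence lines). Under that assumption, contig lengths can be tallied with a few lines of standard-library Python; the function name and the example records are illustrative.

```python
def fasta_lengths(lines):
    """Map each contig name (first word of its header) to its sequence length."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            name = (line[1:].split() or [""])[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Hypothetical two-contig FASTA content.
example_fasta = [
    ">contig_1 example description",
    "ACGTACGT",
    "ACGT",
    ">contig_2",
    "GGCC",
]
print(fasta_lengths(example_fasta))  # {'contig_1': 12, 'contig_2': 4}
```

The function accepts any iterable of lines, so an open file handle for a downloaded .fna file can be passed in place of `example_fasta`.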

Classification

Contigs are classified to the most precise taxonomic label possible (publication pending). The mapping from each contig to its taxonomic name and rank is found in the assembly folder for each sample. For example:

/assembly/15f5175d_noBarcode/contig_classification.txt contains:

Contig Classification
contig_1 Root; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Bacillales; Listeriaceae; Listeria; Listeria monocytogenes
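The lineage string shown above can be split into its individual taxa for downstream use. This sketch assumes the file has one header line, that contig names contain no whitespace, and that lineage entries are separated by semicolons, as in the example; the function name is illustrative.

```python
def parse_contig_classification(text):
    """Map each contig name to its lineage as a list of taxon names, root first."""
    lineages = {}
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the "Contig Classification" header row
        if not line.strip():
            continue
        contig, lineage = line.split(None, 1)  # split on the first run of whitespace
        lineages[contig] = [taxon.strip() for taxon in lineage.split(";")]
    return lineages


example_text = (
    "Contig\tClassification\n"
    "contig_1\tRoot; Bacteria; Terrabacteria group; Firmicutes; Bacilli; "
    "Bacillales; Listeriaceae; Listeria; Listeria monocytogenes\n"
)
lineages = parse_contig_classification(example_text)
print(lineages["contig_1"][-1])  # Listeria monocytogenes
```

The last element of each list is the most precise label assigned to that contig.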

Last update: May 14, 2021