Raw Data

BugSeq outputs the following raw data files, depending on analysis:

Isolate/Metagenomic Analyses

Text Format

Under the folder Metagenomic Classification Summary, there’s a file named metagenomic_classification-RUN_ID.tsv. This file contains the raw data visualized in the above file. For example, opening this file in Excel will reveal:

| Taxon Name | Taxon NCBI ID | Taxon Rank | 15f5175d_noBarcode: Read count at this taxon and below | 15f5175d_noBarcode: Read count directly assigned to this taxon | 15f5175d_noBarcode: Normalized score | stringent_demultiplex_15f5175d_noBarcode: Read count at this taxon and below | stringent_demultiplex_15f5175d_noBarcode: Read count directly assigned to this taxon | stringent_demultiplex_15f5175d_noBarcode: Normalized score |
|---|---|---|---|---|---|---|---|---|
| Root | 1 | no_rank | 2317 | 267 | 46.1 | 2317 | 267 | 46.1 |
| Bacteria | 2 | superkingdom | 2050 | 0 | 51.4 | 2050 | 0 | 51.4 |
| Terrabacteria group | 1783272 | no_rank | 2050 | 0 | 51.4 | 2050 | 0 | 51.4 |
| Firmicutes | 1239 | phylum | 2050 | 0 | 51.4 | 2050 | 0 | 51.4 |
| Bacilli | 91061 | class | 2050 | 0 | 51.4 | 2050 | 0 | 51.4 |
| Bacillales | 1385 | order | 2050 | 0 | 51.4 | 2050 | 0 | 51.4 |
| Listeriaceae | 186820 | family | 2050 | 0 | 51.4 | 2050 | 0 | 51.4 |
| Listeria | 1637 | genus | 2050 | 0 | 51.4 | 2050 | 0 | 51.4 |
| Listeria monocytogenes | 1639 | species | 2050 | 2050 | 51.4 | 2050 | 2050 | 51.4 |

The first three columns are row labels and reflect taxonomic nodes. BugSeq follows the NCBI taxonomic scheme.

Each sample is then included as three columns:

  • Read count at this taxon and below: This field contains the summed read count for this taxon and every taxon beneath it. In this example, 2050 reads were assigned to the superkingdom Bacteria or a rank below Bacteria.
  • Read count directly assigned to this taxon: These reads could not be assigned to a lower taxonomic node given mapping ambiguity and/or the nature of the taxonomic tree (e.g. if the reads are assigned to the lowest rank in the tree). In this example, 267 reads were assigned directly to Root and could not be assigned to lower nodes such as Listeria monocytogenes.
  • Normalized score: This field is calculated per read as the minimap2 alignment score divided by the read’s total length and multiplied by 50, then averaged across all reads assigned to a node. As minimap2 awards 2 points for each base that matches the reference, a score of 100 means the read has 100% sequence identity, and a score of 50 similarly means approximately 50% sequence identity.

Why "approximate" sequence identity?

Minimap2 awards/subtracts a variable number of alignment points for mismatches, insertions and deletions. See the minimap2 manual for details on its calculation of alignment score.
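The normalized score calculation described above can be sketched in a few lines of Python. This is an illustration of the stated formula only, not BugSeq's actual implementation; the function names and the example reads are hypothetical.

```python
from statistics import mean


def normalized_score(alignment_score: float, read_length: int) -> float:
    """Per-read score: alignment score / read length * 50."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return alignment_score / read_length * 50


def node_score(reads) -> float:
    """Average the per-read scores of all reads assigned to one node.

    `reads` is an iterable of (alignment_score, read_length) pairs.
    """
    return mean(normalized_score(score, length) for score, length in reads)


# Hypothetical reads of 1000 bases each. With 2 points per matching base,
# a perfect match has an alignment score of 2 * 1000 = 2000.
perfect = normalized_score(2000, 1000)
half = normalized_score(1000, 1000)
node = node_score([(2000, 1000), (1000, 1000)])
```

Here `perfect` is 100, `half` is 50, and `node` is their average. Because mismatches and gaps change the alignment score by variable amounts, real scores only approximate percent identity.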

What’s a good score?

Good scores largely depend on the experimental design and the presence of the sequenced organisms in BugSeq’s reference database. For experiments with high-accuracy basecalling of DNA and well-represented organisms, scores above 60 are generally considered good. However, we have also seen good classification performance with scores far below this (e.g. 20). We suggest the reader adjust their acceptable score threshold based on anticipated results: a higher threshold gives greater precision at the cost of lower recall. BugSeq filters (i.e. omits) any classification with a normalized score of 5 or lower by default.
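If you post-process the table yourself, applying a stricter threshold is a simple filter. The sketch below assumes each classification has already been loaded as a dictionary with a `score` key; that structure, the function name and the two hypothetical taxa are illustrative and not part of BugSeq's output.

```python
DEFAULT_CUTOFF = 5  # per the text, scores of 5 or lower are omitted by default


def filter_by_score(rows, threshold=DEFAULT_CUTOFF):
    """Keep only classifications whose normalized score exceeds the threshold."""
    return [row for row in rows if row["score"] > threshold]


rows = [
    {"name": "Listeria monocytogenes", "score": 51.4},  # from the example table
    {"name": "hypothetical_taxon_a", "score": 18.0},    # made-up value
    {"name": "hypothetical_taxon_b", "score": 4.2},     # made-up value
]

default_kept = filter_by_score(rows)    # mirrors the default cutoff
strict_kept = filter_by_score(rows, 20)  # favours precision over recall
```

Raising the threshold from 5 to 20 drops `hypothetical_taxon_a` as well, which is the precision/recall trade-off described above.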

Samples may have an additional three columns if they were demultiplexed by BugSeq. See the Multiplexing section for further information on the BugSeq demultiplexing strategy, and the difference between default and stringent modes. Samples labelled with stringent in their name have undergone demultiplexing using the stringent strategy.
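Because the layout is positional (three label columns, then three columns per sample), the file can be read without relying on exact header text. The following is a minimal sketch, assuming a tab-separated file with a single header row and no empty cells; the header layout, function name and dictionary keys are assumptions, so adjust `n_header_rows` if your file differs.

```python
import csv
import io

LABEL_COLUMNS = 3       # Taxon Name, Taxon NCBI ID, Taxon Rank
COLUMNS_PER_SAMPLE = 3  # count at/below, count directly assigned, normalized score


def parse_rows(handle, n_header_rows=1):
    """Yield one dictionary per taxon, with one entry per sample."""
    reader = csv.reader(handle, delimiter="\t")
    for _ in range(n_header_rows):
        next(reader)  # skip header row(s)
    for row in reader:
        if not row:
            continue  # ignore blank lines
        name, ncbi_id, rank = row[:LABEL_COLUMNS]
        values = row[LABEL_COLUMNS:]
        samples = []
        for i in range(0, len(values), COLUMNS_PER_SAMPLE):
            below, direct, score = values[i:i + COLUMNS_PER_SAMPLE]
            samples.append(
                {"at_or_below": int(below), "direct": int(direct), "score": float(score)}
            )
        yield {"name": name, "ncbi_id": int(ncbi_id), "rank": rank, "samples": samples}


# Two rows from the example table, with placeholder header names.
demo = io.StringIO(
    "name\tid\trank\ta\tb\tc\td\te\tf\n"
    "Root\t1\tno_rank\t2317\t267\t46.1\t2317\t267\t46.1\n"
    "Listeria monocytogenes\t1639\tspecies\t2050\t2050\t51.4\t2050\t2050\t51.4\n"
)
rows = list(parse_rows(demo))
```

To read a real file, pass an open file handle instead of `demo`, for example `open(path, newline="")`, where `path` points at your copy of metagenomic_classification-RUN_ID.tsv.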

In the project results root directory, there’s also a file named metagenomic_classification-RUN_ID.tsv. It contains the same raw data, in the same layout, as the file described above.

Assembly Bins

Assembled contigs are found in the Assembly folder under Metagenomic Bins as .fna files. These files contain all organisms with sufficient depth in the submitted sequencing data to be assembled. Details on each bin, such as its completeness (e.g. BUSCO count) and antimicrobial resistance profile, are found in the summary and per-sample reports.
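For a quick look at a bin, contig lengths can be tallied with plain Python. This sketch assumes the .fna files are standard FASTA (the usual meaning of that extension); the function name and the demo records are hypothetical.

```python
def contig_lengths(lines):
    """Map each FASTA record ID to its sequence length in bases."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token after '>'.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


# Made-up records standing in for the contents of one bin.
demo = [
    ">contig_1 hypothetical description",
    "ACGTACGT",
    "ACGT",
    ">contig_2",
    "GGCC",
]
lengths = contig_lengths(demo)
total_bases = sum(lengths.values())
```

An open file handle can be passed in place of `demo`, since the function only iterates over lines.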


Last update: March 1, 2022