Metagenomic Classification
How BugSeq’s Metagenomic Classification Works
BugSeq’s metagenomic classification combines the results of multiple algorithms so that each one contributes where it performs best. Peer-reviewed descriptions of the algorithms are listed below:
- BugSplit: BugSplit performs especially well on large datasets that can be assembled into contiguous genomes, but also performs well on fragmented assemblies.
- Original BugSeq: Our original algorithm aligns reads to a reference database and refines the alignments using a Bayesian statistical framework. While originally designed for ONT data, it was subsequently expanded and validated across sequencing platforms. The Original BugSeq algorithm performs especially well when the input data contains too few reads to be assembled.
Tip
The BugSeq platform has undergone large performance improvements since these publications. Users are encouraged to evaluate the performance of the latest BugSeq pipeline.
Interpreting Metagenomic Classification Outputs
Types of Output Files
BugSeq outputs several files to help users interpret their metagenomics data. The key outputs are summarized below:
- Per-Sample Reports: For each sample submitted in a given analysis, BugSeq generates a per-sample HTML and PDF report that provides metagenomic classification results and statistics, antimicrobial resistance prediction, plasmid detection, and quality control statistics.
- Metagenomic Classification Interactive Summary: For each analysis, BugSeq generates an interactive Krona-formatted output that displays the number of reads assigned to each taxonomic rank. Double-clicking a segment of a plot subsets the data to display only reads assigned to the selected rank and below.
- Metagenomic Classification CSV: For each sample submitted in a given analysis, BugSeq generates a CSV file outlining the classification result and taxonomic rank for each read in the sample, as well as whether the classification came from the assembly-based or the read-based classifier.
- Kraken-formatted Reports: For each sample submitted in a given analysis, BugSeq generates Kraken-formatted reports that outline the taxonomic hierarchy of all reads in a given sample.
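As an illustration of working with the Kraken-formatted reports, the sketch below parses a report into rows and filters to species-level taxa. It assumes the standard six-column, tab-separated Kraken report layout (percentage of reads in the clade, reads in the clade, reads assigned directly to the taxon, rank code, NCBI taxonomy ID, and a name indented by two spaces per level); whether BugSeq's files carry extra columns has not been confirmed here. The function name `parse_kraken_report` and the example rows are hypothetical.

```python
import io
from typing import NamedTuple


class KrakenRow(NamedTuple):
    percent: float      # percentage of reads in the clade rooted at this taxon
    clade_reads: int    # reads assigned to this taxon or any descendant
    taxon_reads: int    # reads assigned directly to this taxon
    rank_code: str      # e.g. "G" for genus, "S" for species
    taxid: int          # NCBI taxonomy ID
    name: str           # scientific name with indentation removed
    depth: int          # indentation level (two spaces per level)


def parse_kraken_report(handle):
    """Parse a six-column Kraken-style report (assumed layout) into rows."""
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 6:
            raise ValueError(f"expected at least 6 tab-separated fields: {line!r}")
        raw_name = fields[5]
        name = raw_name.lstrip(" ")
        depth = (len(raw_name) - len(name)) // 2
        rows.append(
            KrakenRow(
                float(fields[0]), int(fields[1]), int(fields[2]),
                fields[3].strip(), int(fields[4]), name, depth,
            )
        )
    return rows


# Hypothetical toy report, not real BugSeq output.
EXAMPLE_REPORT = "\n".join([
    " 10.00\t100\t100\tU\t0\tunclassified",
    " 90.00\t900\t100\tR\t1\troot",
    " 80.00\t800\t20\tG\t561\t  Escherichia",
    " 78.00\t780\t780\tS\t562\t    Escherichia coli",
])

rows = parse_kraken_report(io.StringIO(EXAMPLE_REPORT))
for row in rows:
    if row.rank_code == "S":
        print(row.name, row.clade_reads)
```

To read a downloaded report instead of the toy string, pass an open file handle to `parse_kraken_report`.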
Per-Sample Reports & Result Interpretation
Per-sample reports are a good starting point for investigating which organisms were detected in each sample. At the top of each per-sample report, you will find key information including the analysis name, analysis ID, pipeline version, sample type, and the reference database used to generate the results.
The “General Statistics” table contains a detailed breakdown of the organisms found in a given sample and key metrics associated with each organism (sorted in descending order by pathogenicity or read count by default). Details on each column may be found by hovering the mouse over the column name and are summarized below for important columns:
- Pathogenicity Prediction: For each sample type, BugSeq maintains a database of pathogens associated with infection. The database applied to a sample is determined by the sample type selected when the analysis was submitted. By default, organisms that are “Very Likely” or “Likely” pathogens for that sample type are sorted to the top of the table.
Interpreting pathogen detection results
BugSeq’s pathogenicity prediction does not contain all possible pathogens for a given specimen type. Clinical adjudication is necessary to review the complete list of detected organisms in a given sample. Individual laboratories should validate their own thresholds for sequence data characteristics necessary to call a pathogen “Detected”.
- Read Count: The number of reads assigned to a particular taxon.
- Abundance: The number of reads assigned to a particular taxon divided by the total number of reads in a given sample.
- Unique Read Alignments: The number of de-duplicated (i.e. unique) reads that align to the reference genome for a particular taxon. The reference genome is selected based on RefSeq’s designation of a reference for the taxon. Unique read alignments are calculated only for species-ranked taxa because no reference genomes are designated for taxa ranked at genus or above.
Accounting for reference bias
Although read alignment is performed against the reference genome, BugSeq uses amino acid alignment to overcome reference bias.
- Negative Control Multiplicity (Summed): Reads per million of the row’s taxon and its children in the sample, divided by reads per million of the row’s taxon and its children in the negative control. This calculation accounts for the varying number of reads in the sample and negative control, as well as classification uncertainty and lower read counts at deeper taxonomic ranks. An additional hidden column contains the negative control multiplicity without summing all taxa; see the “Reformatting the General Statistics table” tip below. Negative controls must be specified as a metadata field during data submission for this column to be populated; see the Metadata Docs for details.
Interpreting negative control multiplicity
A negative control multiplicity greater than one indicates the taxon was found more abundantly in the sample compared with the negative control. Conversely, a negative control multiplicity less than one indicates the taxon was found more abundantly in the negative control compared with the sample.
Examples
Enterovirus (genus) is detected at 20% relative abundance in the sample. No classifications are made to the species rank in the sample. Enterovirus A is detected at 0.7% and Enterovirus B at 0.3% relative abundance in the negative control. Summing the genus and its children gives 20% in the sample and 1.0% in the negative control, so the summed negative control multiplicity for Enterovirus is 20.
Escherichia coli is detected in the sample at 1% relative abundance and in the negative control at 2% relative abundance, so its negative control multiplicity is 0.5.
Others in the literature (Simner et al., Miller et al.) have suggested requiring a negative control multiplicity of 10 or greater to report a pathogen.
- Internal Control Multiplicity: The number of reads assigned to a particular taxon divided by the number of reads assigned to the internal control associated with that sample (i.e. reads normalized relative to the internal control). BugSeq maintains a database of common internal process controls that are automatically detected. Please contact support if an internal control was expected but not detected in your samples.
- Assembly Length/N50: These columns are populated when BugSeq was able to generate an assembly for a given taxon. Assembly length and quality may be used when interpreting the likelihood of detection for a given organism and should be interpreted in the context of the sample preparation workflow on a laboratory-by-laboratory basis.
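The read-count metrics defined above (abundance, summed negative control multiplicity, and internal control multiplicity) can be sketched in a few lines of Python. This is a minimal illustration of the definitions as written, not BugSeq's implementation: it does not model classification uncertainty, all function names and read counts are hypothetical, and returning infinity when a taxon is absent from the negative control is an assumption. The toy counts are chosen to reproduce the two negative control examples above.

```python
def reads_per_million(reads, total_reads):
    """Normalize a read count to reads per million total reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return reads * 1_000_000 / total_reads


def summed_reads(counts, children, taxon):
    """Reads assigned to a taxon plus all of its descendants."""
    total = counts.get(taxon, 0)
    for child in children.get(taxon, []):
        total += summed_reads(counts, children, child)
    return total


def abundance(counts, total_reads, taxon):
    """Reads assigned to the taxon divided by total reads in the sample."""
    return counts.get(taxon, 0) / total_reads


def negative_control_multiplicity(sample_counts, sample_total,
                                  control_counts, control_total,
                                  children, taxon):
    """Summed sample RPM divided by summed negative control RPM."""
    sample_rpm = reads_per_million(
        summed_reads(sample_counts, children, taxon), sample_total)
    control_rpm = reads_per_million(
        summed_reads(control_counts, children, taxon), control_total)
    if control_rpm == 0:
        return float("inf")  # assumption: taxon absent from the control
    return sample_rpm / control_rpm


def internal_control_multiplicity(counts, taxon, internal_control):
    """Taxon reads divided by reads assigned to the internal control."""
    return counts[taxon] / counts[internal_control]


# Hypothetical taxonomy and read counts matching the examples above.
CHILDREN = {"Enterovirus": ["Enterovirus A", "Enterovirus B"]}
SAMPLE_TOTAL = 1_000_000
SAMPLE_COUNTS = {"Enterovirus": 200_000, "Escherichia coli": 10_000}
CONTROL_TOTAL = 100_000
CONTROL_COUNTS = {"Enterovirus A": 700, "Enterovirus B": 300,
                  "Escherichia coli": 2_000}

for taxon in ("Enterovirus", "Escherichia coli"):
    ratio = negative_control_multiplicity(
        SAMPLE_COUNTS, SAMPLE_TOTAL, CONTROL_COUNTS, CONTROL_TOTAL,
        CHILDREN, taxon)
    print(taxon, ratio)
# Enterovirus 20.0
# Escherichia coli 0.5
```

Normalizing to reads per million before dividing is what makes the ratio comparable when the sample and the negative control were sequenced to different depths, as in the toy data here.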
Reformatting the General Statistics table
Certain columns are displayed in each per-sample General Statistics table by default; however, selecting “Configure Columns” at the top of the table enables additional columns to be displayed or hidden depending on the intended use. Users can hold Shift while clicking column titles to sort the General Statistics table based on two separate columns. A CSV file of the General Statistics table can be generated by clicking the button on the top right of the table.