Optimizing data submission & BugSeq results

Data compression

Uploading gzip compressed files will reduce transfer time of large files to BugSeq.
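As a minimal sketch of compressing a file before upload, the following uses only Python's standard library. The function name `gzip_file` and the choice to write the `.gz` copy next to the original are illustrative assumptions, not part of any BugSeq tooling.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src: Path) -> Path:
    """Write a gzip-compressed copy of `src` next to it and return the new path.

    Hypothetical helper for illustration; the original file is left untouched.
    """
    dst = src.with_name(src.name + ".gz")
    # Stream in binary mode so large FASTQ files are never fully loaded into memory.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, `gzip_file(Path("reads.fastq"))` would produce `reads.fastq.gz` in the same directory.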

Optimizing data for upload

We are piloting a tool to validate and optimize input files for submission to BugSeq. It processes data locally, without the need for data upload. Try it at https://tools.bugseq.bio.

Sequencing depth

Whole Genome Sequencing | Metagenomics | Amplicon

For all sequencing platforms, BugSeq recommends a minimum median sequencing depth of 40X to enable accurate genome assembly, strain characterization and antimicrobial resistance prediction. This recommendation is based on the following:

According to the FDA, “Currently, we believe thresholds at…20X depth [at every position across the entire assembled genome] are sufficient to apply these genomes for diagnostic purposes within bounded use cases.”¹ A depth of 20X at every position often corresponds to a median depth of about 40X, depending on how much sequencing depth varies across the genome.
For short read sequencing technologies, assembly quality plateaus at 40X coverage.²
Inter-laboratory studies demonstrate strong reproducibility at 30X coverage and above.³
For ONT, “near-finished microbial reference genomes can be obtained from R10.4 data alone at a coverage of approximately 40-fold”.⁴
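The relationship between a per-position threshold and a median depth can be checked on your own data. The sketch below assumes you already have a list of per-position depths (for example, parsed from a depth report); the function name `depth_summary` and the toy numbers are hypothetical and are not BugSeq outputs.

```python
from statistics import median


def depth_summary(depths, min_depth=20):
    """Summarize per-position depths: median, minimum, and share of positions
    at or above `min_depth`. Illustrative helper, not BugSeq's calculation."""
    if not depths:
        raise ValueError("depths must not be empty")
    return {
        "median": median(depths),
        "min": min(depths),
        "fraction_at_or_above": sum(d >= min_depth for d in depths) / len(depths),
    }


# Toy example: every position is at least 20X while the median is about 40X.
summary = depth_summary([22, 35, 40, 41, 48, 55, 60])
print(summary)
```

A genome with uneven coverage needs a higher median to keep its lowest-covered positions above the threshold, which is why the two numbers differ.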

Talk to our team of experts to find the right depth of sequencing for your experimental goals.

Get In Touch

Differences in reported coverage from BugSeq compared with alternative approaches

For Illumina paired-end data, BugSeq users have reported that the median sequencing depth in BugSeq reports differs from the value given by alternative approaches such as samtools depth. These differences are often a result of BugSeq calculating the depth of paired-end reads correctly: where the two reads of a pair overlap, BugSeq counts the overlapping region once, whereas many tools ignore paired-end overlaps and therefore double-count these regions. The bioinformatics community and peer-reviewed literature agree that these overlaps should only be counted once. BugSeq has filed requests to change the default behavior of SAMtools and CDC pipelines, but these requests have not been accepted or acknowledged.
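To illustrate why the two approaches disagree, here is a toy sketch that computes depth for read pairs both ways. It is not BugSeq's or SAMtools' implementation; the function name `pair_depth`, the half-open interval convention, and the example coordinates are assumptions made for illustration.

```python
def pair_depth(length, pairs, count_overlap_once=True):
    """Per-position depth over a reference of `length` bases.

    `pairs` is a list of ((start1, end1), (start2, end2)) half-open intervals,
    one tuple per read pair. Toy model for illustration only.
    """
    depth = [0] * length
    for (s1, e1), (s2, e2) in pairs:
        if count_overlap_once:
            # Each pair (one sequenced fragment) adds at most 1 per position.
            for pos in set(range(s1, e1)) | set(range(s2, e2)):
                depth[pos] += 1
        else:
            # Each read is counted independently, so overlaps are counted twice.
            for start, end in ((s1, e1), (s2, e2)):
                for pos in range(start, end):
                    depth[pos] += 1
    return depth


# One pair whose reads overlap at positions 4 and 5.
pairs = [((0, 6), (4, 10))]
print(pair_depth(10, pairs, count_overlap_once=True))   # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(pair_depth(10, pairs, count_overlap_once=False))  # [1, 1, 1, 1, 2, 2, 1, 1, 1, 1]
```

With short inserts, most of each pair overlaps, so the double-counting approach can report a noticeably higher median than the fragment-based one.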

Optimizing nanopore outputs for BugSeq

Basecalling

BugSeq recommends using the latest SUP (super-accuracy) basecalling model from ONT when possible. SUP basecalling enables the most reproducible and accurate BugSeq results; for applications that require real-time or faster basecalling, HAC (high-accuracy) may be an acceptable alternative. FAST basecalling may lead to high levels of gene fragmentation and should generally be avoided.

FASTQ file size

To achieve maximal upload and runtime performance, prefer larger per-barcode files over thousands of small files.

The number of reads per FASTQ file can be specified in the MinKNOW GUI when configuring the sequencing run. At the final step, where the output format is specified, users can modify the number of reads (records) per FASTQ file. For the fastest analysis, we recommend selecting a value of at least 100,000, and possibly as high as 500,000 (much larger than the default of 4,000 reads per FASTQ file).
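To check how many records an existing run wrote per file, a small script can count them. This sketch assumes standard four-line FASTQ records and detects gzip only by the `.gz` extension; `count_fastq_records` is a hypothetical helper name.

```python
import gzip
from pathlib import Path


def count_fastq_records(path: Path) -> int:
    """Count records in a FASTQ file, assuming exactly four lines per record.

    Illustrative helper; files ending in .gz are read through gzip.
    """
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as handle:
        line_count = sum(1 for _ in handle)
    return line_count // 4
```

Files that report only a few thousand records each suggest the run used a small per-file setting, and the run folder will contain many small files.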

Modifying the MinKNOW settings is the best option, as the resulting files will still conform to ONT’s file naming conventions, which BugSeq can parse. You can also manually concatenate Oxford Nanopore FASTQ files into a smaller number of files per sample. If you do, please be careful to adhere to BugSeq’s input filename requirements.

Demultiplexing

BugSeq automatically performs demultiplexing and adapter trimming on FASTQ nanopore sequencing data. BAM inputs should already be demultiplexed into separate files, for example with dorado demux.

For datasets which have already been demultiplexed, run and barcode information are automatically extracted from file names and FASTQ headers.
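As a rough illustration of reading barcode information from a FASTQ header, the sketch below splits a header line into its read ID and any `key=value` fields. It assumes a header in the style written by ONT basecallers, where fields such as `runid=` and `barcode=` follow the read ID; the example header is made up, and this is not BugSeq's parser.

```python
def parse_fastq_header(header: str) -> dict:
    """Split a FASTQ header into the read ID and its key=value fields.

    Assumes whitespace-separated fields after the read ID; fields without
    an '=' are ignored. Illustrative only.
    """
    fields = header.strip().lstrip("@").split()
    if not fields:
        raise ValueError("empty FASTQ header")
    info = {"read_id": fields[0]}
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        if sep:
            info[key] = value
    return info


# Hypothetical header for illustration.
header = "@read-0001 runid=abc123 ch=12 barcode=barcode01"
print(parse_fastq_header(header).get("barcode"))  # barcode01
```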

Strict (dual-barcode) nanopore demultiplexing

BugSeq parses FASTQ headers for barcoding data. Users often want BugSeq to perform strict demultiplexing of nanopore data, which looks for barcodes on both ends of each read. Strict demultiplexing reduces the incidence of barcode crosstalk and leads to more accurate results. Users should either perform strict demultiplexing before submitting to BugSeq, or perform no demultiplexing at all before submitting. Files that have already been demultiplexed with default (single-ended barcode) demultiplexing won’t be demultiplexed further by BugSeq.

Custom sample naming and aggregating barcodes

FASTQ files can also be submitted as one FASTQ file per sample. In this case, the sample name is extracted from the file name, which lets submitters use custom sample names. For example, submitting test.fastq.gz will name the sample test.
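To preview the sample name a given file would produce, the mapping in the example above can be sketched as follows. The list of recognized suffixes is an assumption made for illustration; BugSeq's input filename requirements remain the authoritative source.

```python
from pathlib import Path

# Assumed FASTQ suffixes, longest first so ".fastq.gz" is matched before ".fastq".
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def sample_name(filename: str) -> str:
    """Return the file name with any directory and FASTQ suffix removed.

    Hypothetical helper mirroring the test.fastq.gz -> test example.
    """
    name = Path(filename).name
    for suffix in FASTQ_SUFFIXES:
        if name.lower().endswith(suffix):
            return name[: -len(suffix)]
    return name


print(sample_name("test.fastq.gz"))  # test
```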

Custom sample naming also facilitates combining barcodes into a single sample. FASTQ files (gzipped or not) can be combined on Mac/Linux with cat and Windows with type.
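The same concatenation can be done in a platform-independent way with a short script. This sketch copies the input files byte for byte into one output file, which is what cat does; the function name `concatenate` and the file names in the usage note are hypothetical. Concatenated gzip files remain a valid gzip stream, but gzipped and uncompressed inputs should not be mixed in one output.

```python
import shutil
from pathlib import Path


def concatenate(parts, dest: Path) -> Path:
    """Append each file in `parts`, in order, to a new file at `dest`.

    Works for plain FASTQ files or for gzipped FASTQ files, as long as all
    inputs are of the same kind. Illustrative helper.
    """
    with open(dest, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return dest
```

For example, `concatenate([Path("barcode01.fastq.gz"), Path("barcode02.fastq.gz")], Path("patient_A.fastq.gz"))` would yield a single file whose sample name is taken from `patient_A.fastq.gz`.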