Description

Graphical overview over the processes of the quicksand pipeline

This section describes the processes implemented in the quicksand pipeline. See quicksand-build for a description of the quicksand-build pipeline

Workflow

The basic workflow of the quicksand pipline consists of a metagenomic classification, a mapping and a basic analysis of the alignment. The pipeline produces taxonomic profiles for the analyzed input files, as well as the binned bam files, grouped by redgroup and taxonomy for all levels of the pipeline - as shown in the figure above. Each process is in detail described in the following chapters

EstimateCrossContamination

WIP

fastq2bam

All fastq files entering the pipeline are converted to unpaired and unmapped bam files using samtools import -0. In case of paired end fastq files, make sure to merge them before using the pipeline.

fiterBam

Bam files are filtered based on the provided filterflag using samtools view -F. By default paired-end reads are removed from the bam file.

filterLength

Using a custom bam-lengthfilter package, sequences are removed from the bam file that fall below the --bamfilter_length_cutoff threshold (default: 35bp)

toFasta

As preparation for the krakenuniq run, bam files are converted to fasta files

runKrakenUniq

Sequences contained in the fasta file are classified by the metagenomic classifier krakenuniq. Quicksand uses krakenuniq with a precompiled database created from the current non-redundant mtDNA RefSeq database with a default kmer size of 22 (see quicksand-build or Create datastructure). The speed of krakenuniq allows for a quick sorting of sequences into families. To filter out false-positive assignments, families are removed from the assignment falling below the minimum number of reads (--krakenuniq_min_reads) and the minimal number of kmers (--krakenuniq_min_kmers) on the family node. The result of this process is a taxonomic profile of the readgroup.

findBestNode

This process parses the kraken-reports. For each assigned family reported by krakenuniq, the node with the highest number of assigned unique kmers is picked as the taxon representative for that family.

extractBam

Using the kraken-report and the length-filtered bam file, this process collects all sequences assigned to one clade into a new bam file. Extraction happens either on the family or order-level, as specified with the --taxlvl flag, using the custom bamfilter package.

mapBWA

The extracted sequences are mapped against all the reference genomes of species belonging to the 'bestNode' found in the 'findBestNode' process using the bwa bam2bam command of the network-aware fork of BWA with ancient parameters (n 0.01 -o 2 -l 16500). Unmapped sequences or sequences with a mapping quality of less than 25 are removed from the alignment

filterMappedBam

Mapped bam files are filtered for the set alignment quality score

dedupBam

Exact PCR-duplicates are collapsed into unique sequences using bam-rmdup based on the sharing of identical alignment start and end coordinates. From all mapped genomes, the one with the highest numbers of basepairs covered is picked as _the_ representative species for the subsequent steps.

runIntersectBed

The deduped alignments are then depleted of reads that overlap sites marked as non-informative by dustmasker. That step is skipped for families with a fixed reference genome (see --fixed flag)

analyzeDeamination

The final step(s) in the pipeline look for C to T substitutions in the query sequences in respect to the aligned reference genome. Ancient DNA shows characteristic C to T substitutions at the 3’ and 5’ ends of DNA fragments - a degradation pattern used to identify ancient DNA. Families which sequences show more than 10% of terminal C bases in the reference genome replaced by a T are reported as being ancient (++).

extractDeaminatedReads

For families with a fixed reference genome (see --fixed flag), extract deaminated reads into two different .bam files. One file with sequences that show C to T substitutions on one of the terminal base pairs and one file that looks at the first and last 3 terminal base positions.

maskDeamination

For families with a fixed reference genome (see --fixed flag), take the extracted deaminated reads and set the alignment quality score of the last three terminal T bases to 15.