Description =========== .. image:: images/pipeline_overview_v1.6.png :width: 800 :align: center :alt: Graphical overview over the processes of the quicksand pipeline This section describes the processes implemented in the **quicksand** pipeline. See :ref:`quicksand_build-page` for a description of the quicksand-build pipeline Workflow -------- The basic workflow of the quicksand pipline consists of a **metagenomic classification**, a **mapping** and a basic **analysis** of the alignment. The pipeline produces taxonomic profiles for the analyzed input files, as well as the binned :file:`bam` files, grouped by redgroup and taxonomy for all levels of the pipeline - as shown in the figure above. Each process is in detail described in the following chapters EstimateCrossContamination """""""""""""""""""""""""" WIP fastq2bam """""""""" All :file:`fastq` files entering the pipeline are converted to unpaired and unmapped :file:`bam` files using :code:`samtools import -0`. In case of paired end :file:`fastq` files, make sure to merge them before using the pipeline. fiterBam """"""""" :file:`Bam` files are filtered based on the provided filterflag using :code:`samtools view -F`. By default paired-end reads are removed from the :file:`bam` file. filterLength """""""""""" Using a custom :code:`bam-lengthfilter` package, sequences are removed from the :file:`bam` file that fall below the :code:`--bamfilter_length_cutoff` threshold (default: 35bp) toFasta """""""" As preparation for the krakenuniq run, :file:`bam` files are converted to :file:`fasta` files runKrakenUniq """""""""""""" Sequences contained in the :file:`fasta` file are classified by the metagenomic classifier :code:`krakenuniq`. Quicksand uses krakenuniq with a precompiled database created from the current non-redundant mtDNA RefSeq database with a default kmer size of 22 (see :ref:`quicksand_build-page` or :ref:`setup`). The speed of krakenuniq allows for a quick sorting of sequences into families. To filter out false-positive assignments, families are removed from the assignment falling below the minimum number of reads (:code:`--krakenuniq_min_reads`) and the minimal number of kmers (:code:`--krakenuniq_min_kmers`) on the family node. The result of this process is a taxonomic profile of the readgroup. findBestNode """""""""""" This process parses the kraken-reports. For each assigned family reported by krakenuniq, the node with the highest number of assigned unique kmers is picked as the taxon representative for that family. extractBam """""""""" Using the kraken-report and the length-filtered :file:`bam` file, this process collects all sequences assigned to one clade into a new :file:`bam` file. Extraction happens either on the family or order-level, as specified with the :code:`--taxlvl` flag, using the custom :code:`bamfilter` package. mapBWA """""" The extracted sequences are mapped against all the reference genomes of species belonging to the 'bestNode' found in the 'findBestNode' process using the :code:`bwa bam2bam` command of the `network-aware fork `_ of BWA with ancient parameters (:code:`n 0.01 -o 2 -l 16500`). Unmapped sequences or sequences with a mapping quality of less than 25 are removed from the alignment filterMappedBam """""""""""""""" Mapped bam files are filtered for the set alignment quality score dedupBam """""""" Exact PCR-duplicates are collapsed into unique sequences using `bam-rmdup `_ based on the sharing of identical alignment start and end coordinates. From all mapped genomes, the one with the highest numbers of basepairs covered is picked as _the_ representative species for the subsequent steps. runIntersectBed """""""""""""""" The deduped alignments are then depleted of reads that overlap sites marked as non-informative by :code:`dustmasker`. That step is skipped for families with a fixed reference genome (see :code:`--fixed` flag) analyzeDeamination """""""""""""""""" The final step(s) in the pipeline look for C to T substitutions in the query sequences in respect to the aligned reference genome. Ancient DNA shows characteristic C to T substitutions at the 3’ and 5’ ends of DNA fragments - a degradation pattern used to identify ancient DNA. Families which sequences show more than 10% of terminal C bases in the reference genome replaced by a T are reported as being ancient (++). extractDeaminatedReads """""""""""""""""""""" For families with a fixed reference genome (see :code:`--fixed` flag), extract deaminated reads into two different :code:`.bam` files. One file with sequences that show C to T substitutions on one of the terminal base pairs and one file that looks at the first and last 3 terminal base positions. maskDeamination """""""""""""""" For families with a fixed reference genome (see :code:`--fixed` flag), take the extracted deaminated reads and set the alignment quality score of the last three terminal T bases to 15. createMpileups """"""""""""""