Usage

This section explains the input and output of the pipeline, as well as the commands and flags to run it

Required Flags

Datastructure flags

Flag

Input type

Description

--db

PATH

The directory containing the preindexed Kraken or KrakenUniq database (see: quicksand-build):

Input:

refseq
  ├── kraken
  │    └── Mito_db_kmer22
  │           ├── taxonomy
  │           ├── ...
  │           └── database.kdb


Example:

--db Path/to/refseq/kraken/Mito_db_kmer22

--genomes

PATH

The directory containing the indexed FASTA-FILES of the reference genomes that were used to build
the kraken database. Format PATH/${family}/${species}.fasta (see: quicksand-build)
Input:

refseq
  ├── genomes
  │    └── Hominidae
  │           ├── Homo_sapiens.fasta
  │           ├── Homo_sapiens.fasta.fai
  │           └── ...


Example:

--genomes Path/to/refseq/genomes

--bedfiles

PATH

The directory containing the dustmasked BED-FILES of the reference genomes FASTA-FILES
Format DIR/${species}_masked.bed (see: quicksand-build)
Input:

refseq
  ├── masked
  │      ├── Homo_sapiens_masked.bed
  │      └── ...


Example:

--bedfiles Path/to/refseq/masked
Input flags

Note

The pipeline contains an optional initial demultiplexing process. However, that demultiplexer works only on bam-files provided by the MPI EVA Core Unit as the result of library preparation and sequencing. For the processing of data coming from the MPI EVA run quicksand with the --bam PATH and --rg PATH flags as alternative to the --split PATH parameter.

Flag

Input type

Description

--split

PATH

Standard input
A directory containing the demultiplexed, adapter trimmed (and overlap-merged) input files
Input:

split
  ├── RG1.bam
  ├── RG2.fastq
  ├── RG3.fastq.gz
  ├── RG4.fq.gz
  └── ...

Example:

--split Path/to/split/

--bam

PATH

Use together with the --rg flag
The multiplexed BAM-FILE, as provided by the MPI EVA Core Unit containing
adapter-trimmed and overlap-merged sequencing reads
Example:

--bam Path/to/input.bam

--rg

PATH

Use together with the --bam flag
A TSV-FILE, containing library information for the demultiplexing step.
Provide the readgroups and respective primer combinations contained in the BAM FILE
Input (index.tsv):

#Index library ID     primer_P7       primer_P5
RG1   1113    1137
RG2   1114    1138

Example:

--rg Path/to/index.tsv

Optional Flags

Flag

Input type

Description

--fixed

PATH

Provide a config TSV file
Map Kraken-assigned family reads (extractedReads) to the specified genome instead of the 'best taxa' found by the pipeline.
The species tag is used for the 'Species' column in the reports and the filenames.
Input (fixed.tsv):

Family    Species(tag)  Genome
Hominidae Homo_sapiens  /absolute/path/to/genome.fasta

Example:

--fixed Path/to/fixed.tsv

--rerun

Run the pipeline in an already processed folder
Works together with the --fixed flag
Map already extracted reads of assigned families to all (but only) respective species assigned in the
--fixed references file
Example:

nextflow run mpieva/quicksand --fixed Path/to/fixed.tsv --rerun

--taxlvl

[o,f]

Default: f
Bin assigned reads by the specified taxon level (family or order level): e.g. Homindae or Primates.
These binned reads (extractedReads) are mapped against the genome(s) of each families 'best taxa'.
or --specmap taxa. e.g. Map all reads assigned to Primates to the Homo_sapiens genome
Note: For the order-level bins, extractedReads are mapped several times to different (family) genomes.
Example:

--taxlvl o

Process parameters

Flag

Input type

Description

--bamfilterflag

N

For process filterBam
Filter the BAM file based on the provided SAMTOOLS FLAG (default: 1 = filter paired reads).
see HERE to find a desired filterflag
Example:

--bamfilterflag 5

--bamfilter_length_cutoff

N

For process filterLength
Filter out reads below the given length cutoff (default: 35).
Example:

--bamfilter_length_cutoff 35

--bamfilter_quality_cutoff

N

For process mapBwa
Filter out reads with a mapping quality below the given quality cutoff (default: 25).
Example:

--bamfilter_quality_cutoff 25

--krakenuniq_min_kmers

N

For process runKrakenUniq
A filter to reduce the number of false-positive krakenuniq assignments
Filter out assigned families from the krakenuniq classification with a kmer-count < N (default: 129).
Example:

--krakenuniq_min_kmers 129

--krakenuniq_min_reads

N

For process runKrakenUniq
A filter to reduce the number of false-positive krakenuniq assignments
Filter out families from the krakenuniq classification with < N reads assigned (default: 3).
Example:

--krakenuniq_min_reads 3

--compression_level

[0-9]

For the BAM output processes
Set BGZF compression level (default: 0)
Example:

--compression_level 9

Profiles

Quicksand includes several profiles that can be used with the -profile flag (Be aware: only one dash -)
delimit multiple profiles by comma

Profile

Description

singularity

Use Singularity as container software

docker

Use Docker as container software

test

Use testdata to run the pipeline nextflow run mpieva/quicksand -profile test,singularity

debug

Keep intermediate files in the work directory

Input

The pipeline uses as input .fastq or .bam files that contain demultiplexed, merged, and adapter trimmed reads. Use the --split flag to point to the directory that contains these files. The pipeline refers to the name of the files as readgroups. The reads within the files are assigned, processed and structured by readgroups. .bam and .fastq files can be mixed:

splitdir/
    readgroup1.fastq
    readgroup2.fastq
    readgroup3.bam

Note

Quicksand will process all the .fastq,fq,fastq.gz,fq.gz and .bam-files within the --split directory and ignore all remaining files. Be sure to name/rename your files accordingly

Run the pipeline

There are several ways to run the pipeline. Please see to the installation to set up the required underying databases and the input section to know about the required input.

Attention

The pipeline writes its output into the current working directory!

Run from local repository

To run the pipeline locally, pull the code from github:

git clone git@github.com:mpieva/quicksand

and executed the pipeline by pointing nextflow to the main.nf file in the cloned repository:

nextflow run quicksand/main.nf \
    --split      /path/to/split/ \
    --genomes    /path/to/refseq/genomes \
    --db         /path/to/refseq/kraken/Mito_db_kmer22 \
    --bedfiles   /path/to/refseq/masked \
    -profile     singularity

Run from remote repository

Instead of cloning the repository, run the pipeline directly from github:

nextflow run quicksand/main.nf [-r v1.6] \
    --split      /path/to/split/ \
    --genomes    /path/to/refseq/genomes \
    --db         /path/to/refseq/kraken/Mito_db_kmer22 \
    --bedfiles   /path/to/refseq/masked \
    -profile     singularity

Specify a version (or branch) by providing the -r <branch/tag> flag. By default the master branch is pulled and stored locally in the hidden ~/.nextflow/assets/mpieva/quicksand directory. If the remote pipeline is updated, make sure to pull the updates by running:

nextflow pull mpieva/quicksand

To reduce the number of flags manually handed over to the pipeline, see the configuration section on how to set up an individual configuration of the pipeline.

Output

Based on the --byrg flag, structure the output by either readgroup or family. By default, the output is structured by Taxon

Structured by Taxon

Several directories and files should appear after the run. The 'taxon' corresponds to either the family or the order level name:

RunDir
├── out
│    └── {taxon}
│         ├── 1-extracted
│         │    {RG}_extractedReads-{taxon}.bam
│         ├── best // (for families not in --fixed)
│         │    ├── 2-aligned
│         │    │     └── {RG}.{family}.{species}.bam
│         │    ├── 3-deduped
│         │    │     └── {RG}.{family}.{species}_deduped.bam
│         │    └── 4-bedfiltered
│         │          └── {RG}.{family}.{species}_deduped_bedfiltered.bam
│         └── fixed // (for families in --fixed)
│              ├── 2-aligned
│              │     └── {RG}.{family}.{species}.bam
│              ├── 3-deduped
│              │     └── {RG}.{family}.{species}_deduped.bam
│              ├── 5-deaminated
│              │     ├── {RG}.{family}.{species}_deduped_deaminated_1term.bam
│              │     └── {RG}.{family}.{species}_deduped_deaminated_3term.bam
│              └── 6-mpileups
│                    ├── {RG}.{family}.{species}_term1_mpiled.tsv
│                    ├── {RG}.{family}.{species}_term3_mpiled.tsv
│                    └── {RG}.{family}.{species}_all_mpiled.tsv
├── stats
│    ├── splitcounts.tsv
│    ├── {RG}.kraken.report
│    ├── {RG}.kraken.translate
│    ├── {RG}_00_extracted.tsv
│    ├── {RG}_01_mapped.tsv
│    ├── {RG}_02_deduped.tsv
│    ├── {RG}_03_bedfiltered.tsv
│    └── {RG}_04_deamination.tsv
├── nextflow
│    ├── report.html
│    ├── timeline.html
│    └── trace.tsv
├── work
│    └── ...
├── cc_estimates.tsv
└── final_report.tsv

Files explained

The content of the files is explained here:

out/

File

Description

${RG}.extractedReads-${taxon}.bam

Contains all DNA sequences from one readgroup assigned by krakenuniq to one taxon (family or order).

aligned/${RG}.${family}.${species}.bam

An alignemnt file. The result of mapping all extractedReads (see above) against the genome of the assigned 'best' species

aligend/${RG}.${family}.${species}_deduped.bam

The same alignment file, but depleted of PCR duplicates - reads with the same start and end coordinates in the alignment

bed/${RG}.${family}.${species}_deduped_bedfiltered.bam

The same aligned and deduped bamfile, but additionally depleted of reads overlapping the low-complexity regions specified in the --bedfiles for the given species

deaminated/${RG}.${family}.${species}_deduped_deaminated_1term.bam

The aligned and deduped bamfile, filtered for reads that show a C to T
substitution at one of the terminal basepair positions in respect to the reference genome

deaminated/${RG}.${family}.${species}_deduped_deaminated_3term.bam

The aligned and deduped bamfile, filtered for reads that show a C to T
substitution at one of the terminal three basepair positions in respect to the reference genome

deaminated/${RG}.${family}.${species}_all_mpiled.tsv

The aligned and deduped bamfile, but in mpileup format

deaminated/${RG}.${family}.${species}_1term_mpiled.tsv

The deaminated 1term bamfile, with masked terminal T bases - in mpileup format

deaminated/${RG}.${family}.${species}_3term_mpiled.tsv

The deaminated 3term bamfile, with masked terminal T bases - in mpileup format

stats/

:widths: 20 80 :header-rows: 1

File

Description

${RG}.report

The standard krakenuniq report

${RG}.translate

The read-wise human readable kraken report in mpa-format

stats/splitcounts.tsv

For each readgroup (RG), show the number of reads before (raw) and after the filterBam process, as
well as the number of reads after the bam-lengthfilter process
RG          ReadsRaw      ReadsFiltered ReadsLengthfiltered
test1       235           235           230
test2       235           235           230
test3       235           235           230

${RG}_00_extracted.tsv

Shows the number of reads extracted for each assigned taxon based on the kraken assignments

Taxon       ReadsExtracted
Hominidae   235

${RG}_01_mapped.tsv

For each readgroup (RG) show the number of reads mapped against the reference genome, if the reference genome was fixed
(see --fixed flag) and the proportion of mapped reads (from the number of extracted reads for this family)
Order     Family      Species       Reference    ReadsMapped   ProportionMapped
Primates  Hominidae   Homo_sapiens  fixed        235           0.913

${RG}_02_deduped.tsv

For each readgroup (RG) report the number of unique (deduplicated) reads mapped against the reference genome, the duplication rate
and the number of basepairs covered in the reference genome by the reads
Order     Family      Species       Reference  ReadsDeduped  DuplicationRate  CoveredBP
Primates  Hominidae   Homo_sapiens  fixed      98            2.31             4216

${RG}_03_bedfiltered.tsv

For each readgroup (RG) show the number of reads remaining in the bam-file after bedfiltering as well as the number of covered basepairs
in the reference genome
Order     Family      Species       Reference  ReadsBedfiltered PostBedCoveredBP
Primates  Hominidae   Homo_sapiens  fixed      97               4177

${RG}_04_deamination.tsv

For each readgroup (RG) show the deamination stats for the mapped bam-file after bedfiltering

Ancientness:  ++  = more than 9.5% of the reads that show a terminal C in both the 5' and 3' position in the reference genome, carry a T
              +   = more than 9.5% of the reads that show a terminal C in either the 5' or 3' position in the reference genome, carry a T
              -   = no signs for DNA deamination patterns

ReadsDeam(1term): The number of reads (after deduplication and bedfiltering) that show a deamination in the terminal base positions
ReadsDeam(3term): The number of reads (after deduplication and bedfiltering) that show a deamination in the three terminal base positions
Deam5(95ci):      For the terminal 5' end, the percentage of C to T substitutions (and the 95% confidence interval)
Deam3(95ci):      For the terminal 3' end, the percentage of C to T substitutions (and the 95% confidence interval)
Deam5Cond(95ci):  Taken only 3' deaminated sequences, report the percentage of C to T substitutions (and the 95% confidence interval) at the 5' terminal base
Deam3Cond(95ic):  Taken only 5' deaminated sequences, report the percentage of C to T substitutions (and the 95% confidence interval) at the 3' terminal base

final report:

File

Description

final_report.tsv

A summary of all the files in the stats dir plus additional information gathered from processes during the pipeline run:

RG                  The analyzed readgroup
ReadsRaw            The total number of reads in the split' file (by RG)
ReadsFiltered       The total number of reads in the split' file after the bamfilter process - removing paired reads. (by RG)
ReadsLengthfiltered The total number of reads in the split' file after the filterLength process - removing reads <35bp length (by RG)
FamilyKmers         The total number of kmers assigned to the family node by krakenuniq
KmerCoverage        The proportion of family kmers assigned by the number of kmers present in the database for that family
KmerDupRate         The average duplication rate of kmers used for the assignment of the given family
ExtractLVL          The taxon level used for extraction of reads (f or o)
ReadsExtracted      The number of reads extracted/assigned to the taxon by krakenuniq
Order               The name of the Order assigned
Family              The name of the Family assigned
Species             The name of the reference species used for mapping assigned taxon reads against
Reference           'fixed' if reference genome is set for that family, 'best' if inferred from the kraken assignments
ReadsMapped         The number of reads mapped against the Species genome
ProportionMapped    The proportion of mapped/extracted reads
ReadsDeduped        The number of deduplicated(unique) reads in the alignment file
DuplicationRate     The duplication rate of mapped reads --> mapped/deduped reads
CoveredBP           The number of basepairs in the reference genome covered by the aligned reads
ReadsBedfiltered    The number of reads not overlapping low-complexity regions
PostBedCoveredBP    The number of basepairs in the reference genome covered by the aligned reads - after bedfiltering
FamPercentage       Taken all bedfiltered reads of the RG, report the percentage of bedfiltered reads for the given family
Ancientness         One of ++,+ or - --> See above for an explanation of the symbols
ReadsDeam(1term)    The number of reads showing a C to T substitution on either of the 5' or 3' ends in respect to the reference
ReadsDeam(3term)    The number of reads showing a C to T substitution in the terminal 3 basepairs in respect to the reference
Deam5(95ci)         For the given family, the percentage of C to T substitutions (and the 95% confidence interval) on the terminal 5' end
Deam3(95ci)         For the given family, the percentage of C to T substitutions (and the 95% confidence interval) on the terminal 3' end
Deam5Cond(95ci)     Taken only 3' deaminated sequences, report the percentage of C to T substitutions (and the 95% confidence interval) at the 5' terminal base
Deam3Cond(95ci)     Taken only 5' deaminated sequences, report the percentage of C to T substitutions (and the 95% confidence interval) at the 3' terminal base
The cc_estimates.tsv files contains information about index-hopping and cross contamintaion
The nextflow directory contains nextflow specific information about the run
the work directory can be deleted after the run - it contains nextflow specific intermediate files