.. role:: bold
.. role:: mono
.. _usage-page:
Usage
=====
This section explains the input and output of the pipeline, as well as the commands and flags to run it
Required Flags
--------------
| **Datastructure flags**
.. list-table::
:widths: 10 10 60
:header-rows: 1
* - Flag
- Input type
- Description
* - :mono:`--db`
- PATH
- The :bold:`directory` containing the preindexed :file:`Kraken` or :file:`KrakenUniq` database (see: :ref:`quicksand_build-page`)::
Input:
refseq
├── kraken
│ └── Mito_db_kmer22
│ ├── taxonomy
│ ├── ...
│ └── database.kdb
Example:
--db Path/to/refseq/kraken/Mito_db_kmer22
* - :mono:`--genomes`
- PATH
- | The :bold:`directory` containing the indexed FASTA-FILES of the reference genomes that were used to build
| the kraken database. Format :file:`PATH/$\\{family\\}/$\\{species\\}.fasta` (see: :ref:`quicksand_build-page`)
::
Input:
refseq
├── genomes
│ └── Hominidae
│ ├── Homo_sapiens.fasta
│ ├── Homo_sapiens.fasta.fai
│ └── ...
Example:
--genomes Path/to/refseq/genomes
* - :mono:`--bedfiles`
- PATH
- | The :bold:`directory` containing the dustmasked BED-FILES of the reference genomes FASTA-FILES
| Format :file:`DIR/$\\{species\\}_masked.bed` (see: :ref:`quicksand_build-page`)
::
Input:
refseq
├── masked
│ ├── Homo_sapiens_masked.bed
│ └── ...
Example:
--bedfiles Path/to/refseq/masked
| **Input flags**
.. note::
The pipeline contains an optional initial demultiplexing process. However, that demultiplexer works only on bam-files
provided by the `MPI EVA Core Unit `_ as the result of library preparation and sequencing.
For the processing of data coming from the MPI EVA run quicksand with the :code:`--bam PATH` and :code:`--rg PATH` flags as alternative
to the :code:`--split PATH` parameter.
.. list-table::
:widths: 10 10 60
:header-rows: 1
* - Flag
- Input type
- Description
* - :mono:`--split`
- PATH
- | Standard input
| A **directory** containing the demultiplexed, adapter trimmed (and overlap-merged) input files
::
Input:
split
├── RG1.bam
├── RG2.fastq
├── RG3.fastq.gz
├── RG4.fq.gz
└── ...
Example:
--split Path/to/split/
* - :mono:`--bam`
- PATH
- | Use together with the :code:`--rg` flag
| The multiplexed BAM-FILE, as provided by the `MPI EVA Core Unit `_ containing
| adapter-trimmed and overlap-merged sequencing reads
::
Example:
--bam Path/to/input.bam
* - :mono:`--rg`
- PATH
- | Use together with the :code:`--bam` flag
| A TSV-FILE, containing library information for the demultiplexing step.
| Provide the readgroups and respective primer combinations contained in the BAM FILE
::
Input (index.tsv):
#Index library ID primer_P7 primer_P5
RG1 1113 1137
RG2 1114 1138
Example:
--rg Path/to/index.tsv
Optional Flags
--------------
.. list-table::
:widths: 10 10 60
:header-rows: 1
* - Flag
- Input type
- Description
* - :mono:`--fixed`
- PATH
- | Provide a config TSV file
| Map Kraken-assigned family reads (:file:`extractedReads`) to the specified genome instead of the 'best taxa' found by the pipeline.
| The species tag is used for the 'Species' column in the reports and the filenames.
::
Input (fixed.tsv):
Family Species(tag) Genome
Hominidae Homo_sapiens /absolute/path/to/genome.fasta
Example:
--fixed Path/to/fixed.tsv
* - :mono:`--rerun`
- -
- | Run the pipeline in an already processed folder
| Works together with the :code:`--fixed` flag
| Map already extracted reads of assigned families to all (but only) respective species assigned in the
| :code:`--fixed` references file
::
Example:
nextflow run mpieva/quicksand --fixed Path/to/fixed.tsv --rerun
* - :mono:`--taxlvl`
- [o,f]
- | Default: f
| Bin assigned reads by the specified taxon level (family or order level): e.g. Homindae or Primates.
| These binned reads (:file:`extractedReads`) are mapped against the genome(s) of each families 'best taxa'.
| or :code:`--specmap` taxa. e.g. Map all reads assigned to Primates to the Homo_sapiens genome
| **Note:** For the order-level bins, :file:`extractedReads` are mapped several times to different (family) genomes.
::
Example:
--taxlvl o
**Process parameters**
.. list-table::
:widths: 10 10 60
:header-rows: 1
* - Flag
- Input type
- Description
* - :mono:`--bamfilterflag`
- N
- | For process filterBam
| Filter the BAM file based on the provided SAMTOOLS FLAG (default: 1 = filter paired reads).
| see `HERE `_ to find a desired filterflag
::
Example:
--bamfilterflag 5
* - :mono:`--bamfilter_length_cutoff`
- N
- | For process filterLength
| Filter out reads below the given length cutoff (default: 35).
::
Example:
--bamfilter_length_cutoff 35
* - :mono:`--bamfilter_quality_cutoff`
- N
- | For process mapBwa
| Filter out reads with a mapping quality below the given quality cutoff (default: 25).
::
Example:
--bamfilter_quality_cutoff 25
* - :mono:`--krakenuniq_min_kmers`
- N
- | For process runKrakenUniq
| A filter to reduce the number of false-positive krakenuniq assignments
| Filter out assigned families from the krakenuniq classification with a kmer-count < N (default: 129).
::
Example:
--krakenuniq_min_kmers 129
* - :mono:`--krakenuniq_min_reads`
- N
- | For process runKrakenUniq
| A filter to reduce the number of false-positive krakenuniq assignments
| Filter out families from the krakenuniq classification with < N reads assigned (default: 3).
::
Example:
--krakenuniq_min_reads 3
* - :mono:`--compression_level`
- [0-9]
- | For the BAM output processes
| Set BGZF compression level (default: 0)
::
Example:
--compression_level 9
Profiles
--------
| Quicksand includes several profiles that can be used with the :code:`-profile` flag (Be aware: only one dash -)
| delimit multiple profiles by comma
.. list-table::
:widths: 20 60
:header-rows: 1
* - Profile
- Description
* - singularity
- Use Singularity as container software
* - docker
- Use Docker as container software
* - test
- Use testdata to run the pipeline :code:`nextflow run mpieva/quicksand -profile test,singularity`
* - debug
- Keep intermediate files in the :file:`work` directory
.. _input:
Input
-----
The pipeline uses as input :file:`.fastq` or :file:`.bam` files that contain demultiplexed, merged, and adapter trimmed reads.
Use the :code:`--split` flag to point to the directory that contains these files. The pipeline refers to the name of the files
as readgroups. The reads within the files are assigned, processed and structured by readgroups.
:file:`.bam` and :file:`.fastq` files can be mixed::
splitdir/
readgroup1.fastq
readgroup2.fastq
readgroup3.bam
.. note::
Quicksand will process all the :file:`.fastq,fq,fastq.gz,fq.gz` and :file:`.bam`-files within the :code:`--split` directory and ignore all remaining files.
Be sure to name/rename your files accordingly
Run the pipeline
----------------
There are several ways to run the pipeline.
Please see to the :ref:`installation ` to set up the required underying databases and
the :ref:`input section ` to know about the required input.
.. attention::
The pipeline writes its output into the current working directory!
Run from local repository
"""""""""""""""""""""""""
To run the pipeline locally, pull the code from github::
git clone git@github.com:mpieva/quicksand
and executed the pipeline by pointing nextflow to the :file:`main.nf` file in the cloned repository::
nextflow run quicksand/main.nf \
--split /path/to/split/ \
--genomes /path/to/refseq/genomes \
--db /path/to/refseq/kraken/Mito_db_kmer22 \
--bedfiles /path/to/refseq/masked \
-profile singularity
Run from remote repository
""""""""""""""""""""""""""
Instead of cloning the repository, run the pipeline directly from github::
nextflow run quicksand/main.nf [-r v1.6.1] \
--split /path/to/split/ \
--genomes /path/to/refseq/genomes \
--db /path/to/refseq/kraken/Mito_db_kmer22 \
--bedfiles /path/to/refseq/masked \
-profile singularity
Specify a version (or branch) by providing the :code:`-r ` flag. By default the :code:`master` branch is pulled
and stored locally in the hidden :file:`~/.nextflow/assets/mpieva/quicksand` directory. If the remote pipeline is updated, make sure to pull the updates by
running::
nextflow pull mpieva/quicksand
To reduce the number of flags manually handed over to the pipeline, see the :ref:`configuration section ` on how to set up an individual configuration of the pipeline.
.. _output:
Output
------
Based on the :code:`--byrg` flag, structure the output by either readgroup or family. By default, the output is structured by Taxon
Structured by Taxon
""""""""""""""""""""
Several directories and files should appear after the run. The 'taxon' corresponds to either the family or the order level name::
RunDir
├── out
│ └── {taxon}
│ ├── 1-extracted
│ │ {RG}_extractedReads-{taxon}.bam
│ ├── best // (for families not in --fixed)
│ │ ├── 2-aligned
│ │ │ └── {RG}.{family}.{species}.bam
│ │ ├── 3-deduped
│ │ │ └── {RG}.{family}.{species}_deduped.bam
│ │ └── 4-bedfiltered
│ │ └── {RG}.{family}.{species}_deduped_bedfiltered.bam
│ └── fixed // (for families in --fixed)
│ ├── 2-aligned
│ │ └── {RG}.{family}.{species}.bam
│ ├── 3-deduped
│ │ └── {RG}.{family}.{species}_deduped.bam
│ ├── 5-deaminated
│ │ ├── {RG}.{family}.{species}_deduped_deaminated_1term.bam
│ │ └── {RG}.{family}.{species}_deduped_deaminated_3term.bam
│ └── 6-mpileups
│ ├── {RG}.{family}.{species}_term1_mpiled.tsv
│ ├── {RG}.{family}.{species}_term3_mpiled.tsv
│ └── {RG}.{family}.{species}_all_mpiled.tsv
├── stats
│ ├── splitcounts.tsv
│ ├── {RG}.kraken.report
│ ├── {RG}.kraken.translate
│ ├── {RG}_00_extracted.tsv
│ ├── {RG}_01_mapped.tsv
│ ├── {RG}_02_deduped.tsv
│ ├── {RG}_03_bedfiltered.tsv
│ └── {RG}_04_deamination.tsv
├── nextflow
│ ├── report.html
│ ├── timeline.html
│ └── trace.tsv
├── work
│ └── ...
├── cc_estimates.tsv
└── final_report.tsv
.. _files:
Files explained
"""""""""""""""
The content of the files is explained here:
**out/**
.. list-table::
:widths: 20 80
:header-rows: 1
* - File
- Description
* - :file:`$\\{RG\\}.extractedReads-$\\{taxon\\}.bam`
- Contains all DNA sequences from one readgroup assigned by krakenuniq to one taxon (family or order).
* - :file:`aligned/$\\{RG\\}.$\\{family\\}.$\\{species\\}.bam`
- An alignemnt file. The result of mapping all extractedReads (see above) against the genome of the assigned 'best' species
* - :file:`aligend/$\\{RG\\}.$\\{family\\}.$\\{species\\}_deduped.bam`
- The same alignment file, but depleted of PCR duplicates - reads with the same start and end coordinates in the alignment
* - :file:`bed/$\\{RG\\}.$\\{family\\}.$\\{species\\}_deduped_bedfiltered.bam`
- The same aligned and deduped bamfile, but additionally depleted of reads overlapping the low-complexity regions specified in the :code:`--bedfiles` for the given species
* - :file:`deaminated/$\\{RG\\}.$\\{family\\}.$\\{species\\}_deduped_deaminated_1term.bam`
- | The aligned and deduped bamfile, filtered for reads that show a C to T
| substitution at one of the terminal basepair positions in respect to the reference genome
* - :file:`deaminated/$\\{RG\\}.$\\{family\\}.$\\{species\\}_deduped_deaminated_3term.bam`
- | The aligned and deduped bamfile, filtered for reads that show a C to T
| substitution at one of the terminal three basepair positions in respect to the reference genome
* - :file:`deaminated/$\\{RG\\}.$\\{family\\}.$\\{species\\}_all_mpiled.tsv`
- The aligned and deduped bamfile, but in mpileup format
* - :file:`deaminated/$\\{RG\\}.$\\{family\\}.$\\{species\\}_1term_mpiled.tsv`
- The deaminated 1term bamfile, with masked terminal T bases - in mpileup format
* - :file:`deaminated/$\\{RG\\}.$\\{family\\}.$\\{species\\}_3term_mpiled.tsv`
- The deaminated 3term bamfile, with masked terminal T bases - in mpileup format
**stats/**
.. list-table::
:widths: 20 80
:header-rows: 1
* - File
- Description
* - :file:`$\\{RG\\}.report`
- The standard krakenuniq report
* - :file:`$\\{RG\\}.translate`
- The read-wise human readable kraken report in mpa-format
* - :file:`stats/splitcounts.tsv`
- | For each readgroup (RG), show the number of reads before (raw) and after the filterBam process, as
| well as the number of reads after the bam-lengthfilter process
::
RG ReadsRaw ReadsFiltered ReadsLengthfiltered
test1 235 235 230
test2 235 235 230
test3 235 235 230
* - :file:`$\\{RG\\}_00_extracted.tsv`
- Shows the number of reads extracted for each assigned taxon based on the kraken assignments
::
Taxon ReadsExtracted
Hominidae 235
* - :file:`$\\{RG\\}_01_mapped.tsv`
- | For each readgroup (RG) show the number of reads mapped against the reference genome, if the reference genome was fixed
| (see :code:`--fixed` flag) and the proportion of mapped reads (from the number of extracted reads for this family)
::
Order Family Species Reference ReadsMapped ProportionMapped
Primates Hominidae Homo_sapiens fixed 235 0.913
* - :file:`$\\{RG\\}_02_deduped.tsv`
- | For each readgroup (RG) report the number of unique (deduplicated) reads mapped against the reference genome, the duplication rate
| and the number of basepairs covered in the reference genome by the reads
::
Order Family Species Reference ReadsDeduped DuplicationRate CoveredBP
Primates Hominidae Homo_sapiens fixed 98 2.31 4216
* - :file:`$\\{RG\\}_03_bedfiltered.tsv`
- | For each readgroup (RG) show the number of reads remaining in the bam-file after bedfiltering as well as the number of covered basepairs
| in the reference genome
::
Order Family Species Reference ReadsBedfiltered PostBedCoveredBP
Primates Hominidae Homo_sapiens fixed 97 4177
* - :file:`$\\{RG\\}_04_deamination.tsv`
- For each readgroup (RG) show the deamination stats for the mapped bam-file after bedfiltering
::
Ancientness: ++ = more than 9.5% of the reads that show a terminal C in both the 5' and 3' position in the reference genome, carry a T
+ = more than 9.5% of the reads that show a terminal C in either the 5' or 3' position in the reference genome, carry a T
- = no signs for DNA deamination patterns
ReadsDeam(1term): The number of reads (after deduplication and bedfiltering) that show a deamination in the terminal base positions
ReadsDeam(3term): The number of reads (after deduplication and bedfiltering) that show a deamination in the three terminal base positions
Deam5(95ci): For the terminal 5' end, the percentage of C to T substitutions (and the 95% confidence interval)
Deam3(95ci): For the terminal 3' end, the percentage of C to T substitutions (and the 95% confidence interval)
Deam5Cond(95ci): Taken only 3' deaminated sequences, report the percentage of C to T substitutions (and the 95% confidence interval) at the 5' terminal base
Deam3Cond(95ic): Taken only 5' deaminated sequences, report the percentage of C to T substitutions (and the 95% confidence interval) at the 3' terminal base
**final report:**
.. list-table::
:widths: 20 80
:header-rows: 1
* - File
- Description
* - :file:`final_report.tsv`
- A summary of all the files in the :file:`stats` dir plus additional information gathered from processes during the pipeline run::
RG The analyzed readgroup
ReadsRaw The total number of reads in the split' file (by RG)
ReadsFiltered The total number of reads in the split' file after the bamfilter process - removing paired reads. (by RG)
ReadsLengthfiltered The total number of reads in the split' file after the filterLength process - removing reads <35bp length (by RG)
FamilyKmers The total number of kmers assigned to the family node by krakenuniq
KmerCoverage The proportion of family kmers assigned by the number of kmers present in the database for that family
KmerDupRate The average duplication rate of kmers used for the assignment of the given family
ExtractLVL The taxon level used for extraction of reads (f or o)
ReadsExtracted The number of reads extracted/assigned to the taxon by krakenuniq
Order The name of the Order assigned
Family The name of the Family assigned
Species The name of the reference species used for mapping assigned taxon reads against
Reference 'fixed' if reference genome is set for that family, 'best' if inferred from the kraken assignments
ReadsMapped The number of reads mapped against the Species genome
ProportionMapped The proportion of mapped/extracted reads
ReadsDeduped The number of deduplicated(unique) reads in the alignment file
DuplicationRate The duplication rate of mapped reads --> mapped/deduped reads
CoveredBP The number of basepairs in the reference genome covered by the aligned reads
ReadsBedfiltered The number of reads not overlapping low-complexity regions
PostBedCoveredBP The number of basepairs in the reference genome covered by the aligned reads - after bedfiltering
FamPercentage Taken all bedfiltered reads of the RG, report the percentage of bedfiltered reads for the given family
Ancientness One of ++,+ or - --> See above for an explanation of the symbols
ReadsDeam(1term) The number of reads showing a C to T substitution on either of the 5' or 3' ends in respect to the reference
ReadsDeam(3term) The number of reads showing a C to T substitution in the terminal 3 basepairs in respect to the reference
Deam5(95ci) For the given family, the percentage of C to T substitutions (and the 95% confidence interval) on the terminal 5' end
Deam3(95ci) For the given family, the percentage of C to T substitutions (and the 95% confidence interval) on the terminal 3' end
Deam5Cond(95ci) Taken only 3' deaminated sequences, report the percentage of C to T substitutions (and the 95% confidence interval) at the 5' terminal base
Deam3Cond(95ci) Taken only 5' deaminated sequences, report the percentage of C to T substitutions (and the 95% confidence interval) at the 3' terminal base
| The :file:`cc_estimates.tsv` files contains information about index-hopping and cross contamintaion
| The :file:`nextflow` directory contains nextflow specific information about the run
| the :file:`work` directory can be deleted after the run - it contains nextflow specific intermediate files