Installation

Use this section to set up the pipeline to work with real data. To test whether the pipeline works on your computer, see the Quickstart section.

Requirements

Make sure you have the following two components installed.

Nextflow: See more details here: Nextflow
Singularity or Docker: See more details here: Singularity or Docker

Nextflow is the pipeline framework, while Singularity/Docker are software tools for the containerization of code.

Tip

Check that the software is installed correctly by running:

nextflow -v
    >>> nextflow version 20.04.1.5335
singularity --version
    >>> singularity version 3.7.2-dirty
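
The same check can be scripted. The following is a minimal sketch, not part of quicksand; check_tool is a hypothetical helper name, and the script only tests whether the commands are on the PATH, not which versions are installed.

```shell
#!/usr/bin/env bash
# check_tool is a hypothetical helper (not part of quicksand).
# It reports whether a command is available on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: MISSING"
        return 1
    fi
}

# Nextflow is always required.
check_tool nextflow || echo "Install Nextflow before running the pipeline."

# Either Singularity or Docker is sufficient.
if ! check_tool singularity && ! check_tool docker; then
    echo "Install either Singularity or Docker before running the pipeline."
fi
```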

Working with test data

Test runability

To test whether the pipeline runs on your system, see the Quickstart section.

Container Cache

Quicksand uses container software (Docker, Singularity) to ensure the pipeline runs within a stable environment. For single-tool processes, container images are pulled from either the Galaxy image repository or the Quay.io biocontainer repository. For multi-tool processes and custom functions, self-built images are hosted on Docker Hub. The underlying Dockerfiles for those images can be found in the assets/docker directory of the repository.

To reuse downloaded images across multiple pipeline runs, specify a cache directory:

mkdir singularity
export NXF_SINGULARITY_CACHEDIR=$PWD/singularity

Note

Make sure that Singularity has permission to overlay and access your file system within the container. Otherwise the pipeline won't be able to read from or write to it. If you get a "file doesn't exist" error, create an additional nextflow.config config file.

Add the following content to the file:

singularity {
  runOptions = "--bind /directory/in/use"
}

Add this file to your nextflow run with the -c flag:

nextflow run ... -c singularity/nextflow.config
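
The config file can also be written from the shell. This is a minimal sketch that assumes the file should live inside the singularity cache directory created above, as the -c path suggests; BIND_DIR is a hypothetical variable name, and /directory/in/use is the placeholder from the example, to be replaced with the directory you actually need to bind.

```shell
#!/usr/bin/env bash
# BIND_DIR is a hypothetical variable; replace the placeholder path
# with the directory Singularity needs to access.
BIND_DIR="/directory/in/use"

# Create the directory (no error if it already exists) and write the config.
mkdir -p singularity
cat > singularity/nextflow.config <<EOF
singularity {
  runOptions = "--bind ${BIND_DIR}"
}
EOF

# Show the result for a quick visual check.
cat singularity/nextflow.config
```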

Attention

The -profile and -c flags take only a single dash!

Now the pipeline can be tested again by running:

nextflow run mpieva/quicksand -profile test,singularity -c singularity/nextflow.config

The meaning of the flags and the different ways of customizing the pipeline are described in the Usage section.

To use Docker instead of Singularity, pass -profile test,docker instead.

The singularity directory should now contain all the images used in the pipeline:

singularity
├── depot.galaxyproject.org-singularity-samtools-1.15.1--h1170115_0.img
├── merszym-biohazard_bamrmdup-v0.2.img
└── merszym-quicksand-1.2.img

Working with real data

Create datastructure

The datastructure required by the pipeline is described in detail in the quicksand-build section.

In short: you need a precompiled Kraken database, the respective reference genomes, and bedfiles indicating low-complexity regions. Run the supplementary pipeline quicksand-build once to download the taxonomy from NCBI/taxonomy and all mitochondrial genomes from NCBI/RefSeq, and to create the required databases and files.

For this session, create the datastructure for the Primate mtDNA from RefSeq:

nextflow run mpieva/quicksand-build --outdir refseq --include Primates

Attention

Building the database requires ~40 GB of RAM.

Be patient: downloading the taxonomy and creating the database might take ~1h.

This command creates a directory refseq that contains the files required to run quicksand:

refseq
├── kraken
│    └── Mito_db_kmer22
├── genomes
│    ├── {family}
│    │    └── {species}.fasta
│    └── taxid_map.tsv
└── masked
     └── {species}_masked.bed

With the datastructure created, the pipeline is ready to be used with the following flags:

--db         refseq/kraken/Mito_db_kmer22/
--genomes    refseq/genomes/
--bedfiles   refseq/masked/
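
Before starting a run, it can help to confirm that the expected files exist. The following is a minimal sketch that assumes the directory layout shown above; check_refdir is a hypothetical helper name and not part of quicksand or quicksand-build.

```shell
#!/usr/bin/env bash
# check_refdir is a hypothetical helper (not part of quicksand).
# It checks that a quicksand-build output directory contains the
# paths shown in the directory tree above.
check_refdir() {
    local base="$1"
    local status=0
    local path
    for path in kraken/Mito_db_kmer22 genomes genomes/taxid_map.tsv masked; do
        if [ -e "$base/$path" ]; then
            echo "ok       $base/$path"
        else
            echo "missing  $base/$path"
            status=1
        fi
    done
    return "$status"
}

check_refdir refseq || echo "The refseq directory is incomplete; rerun quicksand-build."
```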

Run the pipeline

As input for the pipeline, download the Hominin "Hohlenstein-Stadel" mtDNA [1] into a directory named split:

wget -P split http://ftp.eva.mpg.de/neandertal/Hohlenstein-Stadel/BAM/mtDNA/HST.raw_data.ALL.bam

Then run the quicksand pipeline:

nextflow run mpieva/quicksand \
    --db        refseq/kraken/Mito_db_kmer22/ \
    --genomes   refseq/genomes/ \
    --bedfiles  refseq/masked/ \
    --split     split \
    -profile    singularity

Please see the Usage section for an explanation of the flags and the input!

Please see the Output section for an explanation of the output files!

A summary of all the stats can be found in the final_report.tsv file.

Filter the Results

As can be seen in the final_report.tsv, not all sequences were assigned to Hominidae; some were assigned to a couple of other Primate families. The assignment of false-positive taxa is a well-known problem of kmer-based assignment methods, so additional filters need to be applied.

Based on simulated data, our recommended cutoffs are:

FamPercentage cutoff of 1% and/or
ProportionMapped cutoff of 0.5-0.7.

The kmer information is also indicative. If the FamilyKmers and KmerCoverage values are low and the KmerDupRate value is high, the assignment of the family is based on only a small number of kmers within the reads.
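
The recommended cutoffs can be applied with a short awk filter. This is a minimal sketch under stated assumptions: filter_report is a hypothetical helper name, the report is assumed to be tab-separated with a header row, FamPercentage is assumed to be given in percent (so the 1% cutoff is the value 1), and both cutoffs are applied together, with 0.5 as the default for ProportionMapped.

```shell
#!/usr/bin/env bash
# filter_report is a hypothetical helper (not part of quicksand).
# Usage: filter_report REPORT [MIN_FAMPERCENTAGE] [MIN_PROPORTIONMAPPED]
# Prints the header and every row that passes both cutoffs.
filter_report() {
    local report="$1"
    local min_fam="${2:-1}"      # assumed to be in percent
    local min_prop="${3:-0.5}"   # lower end of the recommended 0.5-0.7 range
    awk -F'\t' -v min_fam="$min_fam" -v min_prop="$min_prop" '
        NR == 1 {
            # Look up the columns by name so the column order does not matter.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("FamPercentage" in col) || !("ProportionMapped" in col)) {
                print "required columns not found in header" > "/dev/stderr"
                exit 1
            }
            f = col["FamPercentage"]
            p = col["ProportionMapped"]
            print
            next
        }
        # Non-numeric values evaluate to 0 and are therefore filtered out.
        (($f) + 0 >= min_fam + 0) && (($p) + 0 >= min_prop + 0)
    ' "$report"
}

# Only run the filter if a report is present in the current directory.
if [ -f final_report.tsv ]; then
    filter_report final_report.tsv > final_report.filtered.tsv
fi
```

To apply the stricter end of the recommended range, pass the cutoffs explicitly, for example filter_report final_report.tsv 1 0.7.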

1: http://ftp.eva.mpg.de/neandertal/Hohlenstein-Stadel/README