Installation

Use this section to set up the pipeline for working with real data. To test whether the pipeline works on your computer, see the Quickstart section.

Requirements

Make sure you have the following two components installed.

  • Nextflow (see more details here: Nextflow)

  • Singularity or Docker (see more details here: Singularity or Docker)

Nextflow is the pipeline framework, while Singularity and Docker are container engines that provide the software environments for the individual pipeline steps.
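
If Nextflow is not installed yet, it can usually be obtained with the official installer script (a sketch; see the Nextflow documentation for current instructions, and note that Nextflow requires a recent Java runtime):

curl -s https://get.nextflow.io | bash
mv nextflow ~/bin/    # assumes ~/bin exists and is in your PATH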

Tip

Check the successful installation of the software by running:

nextflow -v
    >>> nextflow version 20.04.1.5335
singularity --version
    >>> singularity version 3.7.2-dirty

Working with test data

Test runability

To test whether the pipeline runs on your system, see the Quickstart section.

Container Cache

Quicksand uses container software (Docker, Singularity) to ensure that the pipeline runs within a stable environment. For single-tool processes, container images are pulled from either the Galaxy project image repository or the Quay.io Biocontainers repository. For multi-tool processes and custom functions, self-built images are hosted on Docker Hub. The underlying Dockerfiles for these images can be found in the assets/docker directory of the repository.

To reuse downloaded images across multiple pipeline runs, specify a cache directory:

mkdir singularity
export NXF_SINGULARITY_CACHEDIR=$PWD/singularity
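
The export only lasts for the current shell session. To have the cache picked up in future sessions as well, one option is to append the line to your shell startup file (the path below assumes a bash setup):

# bake the absolute path into your ~/.bashrc (assumes a bash shell)
echo "export NXF_SINGULARITY_CACHEDIR=$PWD/singularity" >> ~/.bashrc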

Note

Make sure that Singularity has the permissions to overlay and access your file system within the container. Otherwise the pipeline won't be able to read from or write to it. If you run into a "file doesn't exist" error, create an additional nextflow.config file.

Add the following content to the file:

singularity {
  runOptions = "--bind /directory/in/use"
}
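
Singularity accepts several comma-separated paths in a single --bind option, so a config that exposes, for example, both the reference and the input directories could look like this (the paths are placeholders):

singularity {
  runOptions = "--bind /path/to/refseq,/path/to/split"
}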

Add this file to your nextflow run with the -c flag:

nextflow run ... -c singularity/nextflow.config

Attention

The -profile and -c flags take only one dash!

Now the pipeline can be tested again by running:

nextflow run mpieva/quicksand -profile test,singularity -c singularity/nextflow.config

The meaning of the flags and the different ways of customizing the pipeline are described in the Usage section.
If you choose Docker over Singularity, use -profile test,docker instead.
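
For instance, a Docker-based test run could look like this (the extra Singularity config file is not needed in that case):

nextflow run mpieva/quicksand -profile test,docker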

The singularity directory should now contain all the images used in the pipeline:

singularity
├── depot.galaxyproject.org-singularity-samtools-1.15.1--h1170115_0.img
├── merszym-biohazard_bamrmdup-v0.2.img
└── merszym-quicksand-1.2.img

Working with real data

Create datastructure

The underlying data structure required by the pipeline is described in detail in the quicksand-build section.

In short, you need a precompiled Kraken database, the corresponding reference genomes, and BED files that mark low-complexity regions. Run the supplementary pipeline quicksand-build (once) to download the taxonomy from NCBI/taxonomy and all mitochondrial genomes from NCBI/RefSeq and to create the required databases and files for you.

For this session, create the data structure for the primate mitochondrial genomes from RefSeq:

nextflow run mpieva/quicksand-build --outdir refseq --include Primates

Attention

Building the database requires ~40 GB of RAM.
Be patient: downloading the taxonomy and building the database can take ~1 h.

This command creates a directory refseq that contains the files required to run quicksand:

refseq
├── kraken
│    └── Mito_db_kmer22
├── genomes
│    ├── {family}
│    │    └── {species}.fasta
│    └── taxid_map.tsv
└── masked
     └── {species}_masked.bed

With the data structure created, the pipeline is ready to be used with the following flags:

--db         refseq/kraken/Mito_db_kmer22/
--genomes    refseq/genomes/
--bedfiles   refseq/masked/
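
Since each double-dash flag corresponds to a Nextflow parameter of the same name, these paths can also be kept in a small config file instead of being repeated on the command line (a sketch using the directory names from above; pass it to the run with the -c flag):

params {
  db       = "refseq/kraken/Mito_db_kmer22/"
  genomes  = "refseq/genomes/"
  bedfiles = "refseq/masked/"
}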

Run the pipeline

As input for the pipeline, download the Hohlenstein-Stadel hominin mtDNA [1] into a directory named split:

wget -P split http://ftp.eva.mpg.de/neandertal/Hohlenstein-Stadel/BAM/mtDNA/HST.raw_data.ALL.bam

And run the quicksand pipeline:

nextflow run mpieva/quicksand \
    --db        refseq/kraken/Mito_db_kmer22/ \
    --genomes   refseq/genomes/ \
    --bedfiles  refseq/masked/ \
    --split     split \
    -profile    singularity
Please see the Usage section for an explanation of the flags and the input!
Please see the Output section for an explanation of the output files!

A summary of all stats can be found in the final_report.tsv file.
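
To get a first look at the report on the command line, the tab-separated file can be pretty-printed with standard tools, for example:

column -t -s $'\t' final_report.tsv | less -S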

Filter the Results

As can be seen in the final_report.tsv, not all sequences were assigned to Hominidae; some were assigned to other primate families as well. The assignment of false-positive taxa is a well-known problem of kmer-based classification methods, and additional filters need to be applied.

Based on simulated data, our recommended cutoffs are:

  • FamPercentage cutoff of 1% and/or

  • ProportionMapped cutoff of 0.5-0.7.

The kmer information is also indicative: if the FamilyKmers and KmerCoverage values are low and the KmerDupRate value is high, the assignment of the family is based on only a small number of kmers within the reads.
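
As a sketch of how such a filter could be applied, the awk call below keeps only rows that pass both recommended cutoffs. It assumes that FamPercentage and ProportionMapped appear as named columns in the header of final_report.tsv and that their values are plain numbers; adjust the names if the report layout differs:

awk -F'\t' '
  NR==1 { for (i=1; i<=NF; i++) { if ($i=="FamPercentage") fp=i; if ($i=="ProportionMapped") pm=i }; print; next }
  $fp+0 >= 1 && $pm+0 >= 0.5
' final_report.tsv > final_report.filtered.tsv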

[1] http://ftp.eva.mpg.de/neandertal/Hohlenstein-Stadel/README