Installation
Use this section to set up the pipeline to work with real data. To test whether the pipeline works on your computer, see the Quickstart section.
Requirements
Make sure you have the following two components installed.
- Nextflow
See more details here: Nextflow
- Singularity or Docker
See more details here: Singularity or Docker
Nextflow is the pipeline framework, while Singularity/Docker are software tools for the containerization of code.
Tip
Check that the software was installed successfully by running:
nextflow -v
>>> nextflow version 20.04.1.5335
singularity --version
>>> singularity version 3.7.2-dirty
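The version checks above can be wrapped in a small script that reports which tools are missing. This is only a minimal sketch: check_tool is a hypothetical helper name, not part of Nextflow or quicksand, and it tests whether a command is on the PATH, not whether its version is recent enough.

```shell
# check_tool is a hypothetical helper, not part of quicksand or Nextflow.
# It prints whether the given command can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Nextflow is required in any case.
check_tool nextflow || true

# One container engine is enough: Singularity or Docker.
if ! check_tool singularity && ! check_tool docker; then
    echo "install Singularity or Docker before running the pipeline"
fi
```

If the script reports a missing tool, install it before continuing with the sections below.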
Working with test data
Test runability
To test the runability of the pipeline, see the Quickstart section.
Container Cache
Quicksand uses container software (Docker, Singularity) to ensure the pipeline runs within a stable environment.
For single-tool processes, container images are either pulled from the Galaxy image repository or the
Quay.io biocontainer repository. For multi-tool processes and
custom functions, self-built images are hosted on Dockerhub. The
underlying Dockerfiles for those images can be found within the assets/docker
directory of the repository.
To reuse downloaded images across multiple pipeline runs, specify a cache directory:
mkdir singularity
export NXF_SINGULARITY_CACHEDIR=$PWD/singularity
Note
Make sure that Singularity has permission to overlay and access your file system within the
container. Otherwise the pipeline won't be able to read from or write to it. If you get a
"file doesn't exist" error, create an additional nextflow.config config file
and add the following content to it:
singularity {
    runOptions = "--bind /directory/in/use"
}
Add this file to your nextflow run with the -c flag:
nextflow run ... -c singularity/nextflow.config
Attention
The -profile and -c flags take only one dash!
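The config file described above can also be written directly from the shell. The following is a sketch under assumptions: it places the file in the singularity directory (matching the -c singularity/nextflow.config path used in the commands), and /directory/in/use is the placeholder from the text, to be replaced with the directory that Singularity needs to access.

```shell
# Directory that holds the config file (same as the image cache directory).
config_dir="singularity"
# Placeholder path from the text; replace it with the directory in use.
bind_dir="/directory/in/use"

mkdir -p "$config_dir"

# Write the singularity scope with the bind option into nextflow.config.
cat > "$config_dir/nextflow.config" <<EOF
singularity {
    runOptions = "--bind $bind_dir"
}
EOF

# Print the result for a quick visual check.
cat "$config_dir/nextflow.config"
```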
Now the pipeline can be tested again by running:
nextflow run mpieva/quicksand -profile test,singularity -c singularity/nextflow.config
With Docker, use -profile test,docker in the command instead. The singularity
directory should now contain all the images used in the pipeline:
singularity
├── depot.galaxyproject.org-singularity-samtools-1.15.1--h1170115_0.img
├── merszym-biohazard_bamrmdup-v0.2.img
└── merszym-quicksand-1.2.img
Working with real data
Create datastructure
The datastructure required by the pipeline is described in detail in the quicksand-build section.
In short: You need a precompiled kraken database, the respective reference genomes and bedfiles indicating low-complexity regions.
Use the supplementary pipeline quicksand-build (once) to download the taxonomy from NCBI/taxonomy
and all mitochondrial genomes from NCBI/RefSeq, and to create the required databases and files for you.
For this session, create the datastructure for the Primate mtDNA from RefSeq:
nextflow run mpieva/quicksand-build --outdir refseq --include Primates
Attention
This command creates a directory refseq that contains the files required to run quicksand:
refseq
├── kraken
│   └── Mito_db_kmer22
├── genomes
│   ├── {family}
│   │   └── {species}.fasta
│   └── taxid_map.tsv
└── masked
    └── {species}_masked.bed
With the datastructure created, the pipeline is ready to be used with the following flags:
--db refseq/kraken/Mito_db_kmer22/
--genomes refseq/genomes/
--bedfiles refseq/masked/
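Before starting a run, it can help to confirm that the paths passed to these flags exist. The sketch below is not part of quicksand: check_datastructure is a hypothetical helper that only tests for the directories and the taxid_map.tsv file shown in the tree above. It does not validate their contents.

```shell
# check_datastructure is a hypothetical helper, not part of quicksand.
# It checks that the paths expected by --db, --genomes and --bedfiles exist.
check_datastructure() {
    root="$1"
    status=0
    for path in "$root/kraken/Mito_db_kmer22" "$root/genomes" "$root/masked"; do
        if [ -d "$path" ]; then
            echo "ok: $path"
        else
            echo "missing: $path"
            status=1
        fi
    done
    # The taxid map sits next to the family directories in genomes.
    if [ -f "$root/genomes/taxid_map.tsv" ]; then
        echo "ok: $root/genomes/taxid_map.tsv"
    else
        echo "missing: $root/genomes/taxid_map.tsv"
        status=1
    fi
    return "$status"
}

check_datastructure refseq || echo "datastructure incomplete, run quicksand-build first"
```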
Run the pipeline
As input for the pipeline, download the Hominin "Hohlenstein-Stadel" mtDNA [1] into a directory named split:
wget -P split http://ftp.eva.mpg.de/neandertal/Hohlenstein-Stadel/BAM/mtDNA/HST.raw_data.ALL.bam
And run the quicksand pipeline:
nextflow run mpieva/quicksand \
--db refseq/kraken/Mito_db_kmer22/ \
--genomes refseq/genomes/ \
--bedfiles refseq/masked/ \
--split split \
-profile singularity
A summary of all the stats can be found in the final_report.tsv file.
Filter the Results
As can be seen in the final_report.tsv, not all sequences were assigned to Hominidae; some were assigned to a couple of other Primate families too.
The assignment of false-positive taxa is a well-known problem of kmer-based assignment methods, so additional filters need to be applied.
Based on simulated data, our recommended cutoffs are:
FamPercentage cutoff of 1% and/or
ProportionMapped cutoff of 0.5-0.7.
The kmer information is also indicative. If the FamilyKmers and KmerCoverage values are low and the KmerDupRate value is high, the assignment of the family is based on only a small number of kmers within the reads.
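The recommended cutoffs can be applied to the report with a short awk filter. This is a sketch under stated assumptions rather than the authors' procedure: filter_report is a hypothetical helper, the report is assumed to be tab-separated with a header row containing the FamPercentage and ProportionMapped columns, FamPercentage is assumed to be given in percent, and both cutoffs are combined with "and" (the stricter reading of "and/or").

```shell
# filter_report is a hypothetical helper, not part of quicksand.
# Usage: filter_report FILE [MIN_FAM_PERCENTAGE] [MIN_PROPORTION_MAPPED]
# Assumes a tab-separated file whose header row names the columns.
filter_report() {
    awk -F'\t' -v min_pct="${2:-1}" -v min_prop="${3:-0.5}" '
        NR == 1 {
            # Map column names to positions so the column order does not matter.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("FamPercentage" in col) || !("ProportionMapped" in col)) {
                print "expected columns not found" > "/dev/stderr"
                exit 1
            }
            print
            next
        }
        # Keep rows that pass both cutoffs.
        ($(col["FamPercentage"]) + 0) >= (min_pct + 0) &&
        ($(col["ProportionMapped"]) + 0) >= (min_prop + 0)
    ' "$1"
}

# Example call (commented out because the report only exists after a run):
# filter_report final_report.tsv 1 0.5 > filtered_report.tsv
```

The second and third arguments default to 1 and 0.5; a stricter ProportionMapped cutoff such as 0.7 can be passed as the third argument.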