.. role:: bold
.. role:: heading1

.. _install-page:

Installation
============

Use this section to set up the pipeline to work with real data. To test whether the pipeline runs on your computer, see the :ref:`quickstart-page` section.

.. _requirements:

Requirements
------------

Make sure you have the following two components installed:

:Nextflow: See more details here: `Nextflow <https://www.nextflow.io/>`_
:Singularity or Docker: See more details here: `Singularity <https://sylabs.io/>`_ or `Docker <https://www.docker.com/>`_

Nextflow is the pipeline framework, while Singularity and Docker are the container engines used to run the code in a stable software environment.

.. tip::
    Check the successful installation of the software by running::

        nextflow -v
        >>> nextflow version 20.04.1.5335

        singularity --version
        >>> singularity version 3.7.2-dirty

:heading1:`Working with test data`

Test runability
---------------

To test the runability of the pipeline, see the :ref:`quickstart-page` section.

.. _container:

Container Cache
---------------

Quicksand uses container software (Docker, Singularity) to ensure that the pipeline runs within a stable environment.
For single-tool processes, container images are pulled from the `Galaxy image repository <https://depot.galaxyproject.org/singularity/>`_ or the `Quay.io biocontainer repository <https://quay.io/organization/biocontainers>`_.
For multi-tool processes and custom functions, self-built images are hosted on `Dockerhub <https://hub.docker.com/>`_.
The underlying Dockerfiles for these images can be found in the :file:`assets/docker` directory of the `repository <https://github.com/mpieva/quicksand>`_.

To reuse downloaded images across multiple pipeline runs, specify a cache directory::

    mkdir singularity
    export NXF_SINGULARITY_CACHEDIR=$PWD/singularity
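As an alternative to the environment variable, the image cache can also be pinned in a config file. Below is a minimal sketch using Nextflow's standard :code:`singularity.cacheDir` setting (a general Nextflow option, not something quicksand-specific); the path is a placeholder::

    singularity {
        // standard Nextflow setting: where pulled Singularity images are stored
        // (equivalent to exporting NXF_SINGULARITY_CACHEDIR); adjust the path
        cacheDir = '/path/to/singularity'
    }

Such a config file is passed to the run with the :code:`-c` flag, exactly as shown for the bind options in the note below.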
.. note::
    Make sure that Singularity has the permission to overlay and access your file system from within the container;
    otherwise the pipeline won't be able to read from or write to it.

    In case of a :bold:`"file doesn't exist" error`, create an additional :file:`nextflow.config` file
    and add the following content::

        singularity {
            runOptions = "--bind /directory/in/use"
        }

    Add this file to your nextflow run with the :code:`-c` flag::

        nextflow run ... -c singularity/nextflow.config

.. attention:: The :code:`-profile` and the :code:`-c` flags have only one dash!

Now, again, the pipeline can be tested by running::

    nextflow run mpieva/quicksand -profile test,singularity -c singularity/nextflow.config

| The meaning of the flags and the different ways of customizing the pipeline are described in the :ref:`usage-page` section.
| If you choose Docker over Singularity, use the :code:`-profile test,docker` flag instead.

The :file:`singularity` directory should now contain all the images used in the pipeline::

    singularity
    ├── depot.galaxyproject.org-singularity-samtools-1.15.1--h1170115_0.img
    ├── merszym-biohazard_bamrmdup-v0.2.img
    └── merszym-quicksand-1.2.img

:heading1:`Working with real data`

.. _setup:

Create datastructure
--------------------

The required underlying datastructure of the pipeline is described in detail in the :ref:`quicksand_build-page` section.

In short: you need a :bold:`precompiled kraken database`, the respective :bold:`reference genomes` and :bold:`bedfiles`
indicating low-complexity regions. Use the supplementary pipeline :code:`quicksand-build` (once) to download the taxonomy
from NCBI/taxonomy and all mitochondrial genomes from NCBI/RefSeq and to create the required databases and files for you.

For this session, create the datastructure for the :bold:`Primate mtDNA` from RefSeq::

    nextflow run mpieva/quicksand-build --outdir refseq --include Primates

.. attention::
    | Building the database requires ~40G of RAM.
    | Be patient, downloading the taxonomy plus the creation of the database might take :bold:`~1h`.

This command creates a directory :file:`refseq` that contains the files required to run quicksand::

    refseq
    ├── kraken
    │   └── Mito_db_kmer22
    ├── genomes
    │   ├── {family}
    │   │   └── {species}.fasta
    │   └── taxid_map.tsv
    └── masked
        └── {species}_masked.bed

With the datastructure created, the pipeline is ready to be used with the following flags::

    --db       refseq/kraken/Mito_db_kmer22/
    --genomes  refseq/genomes/
    --bedfiles refseq/masked/

Run the pipeline
----------------

As :bold:`input` for the pipeline, download the Hominin "Hohlenstein-Stadel" mtDNA [1]_ into a directory :bold:`split`::

    wget -P split http://ftp.eva.mpg.de/neandertal/Hohlenstein-Stadel/BAM/mtDNA/HST.raw_data.ALL.bam

And run the quicksand pipeline::

    nextflow run mpieva/quicksand \
        --db       refseq/kraken/Mito_db_kmer22/ \
        --genomes  refseq/genomes/ \
        --bedfiles refseq/masked/ \
        --split    split \
        -profile   singularity

| Please see the :ref:`usage-page` section for an explanation of the flags and the input!
| Please see the :ref:`output` section for an explanation of the output files!

A summary of all the stats can be found in the :file:`final_report.tsv` file.

.. [1] http://ftp.eva.mpg.de/neandertal/Hohlenstein-Stadel/README

Filter the Results
------------------

As can be seen in the :code:`final_report.tsv`, not all sequences were assigned to Hominidae; some were assigned to a couple of other Primate families, too.
The assignment of false-positive taxa is a well-known problem of kmer-based classification methods, so additional filters need to be applied.
Based on simulated data, our recommended cutoffs are:

- a :bold:`FamPercentage` cutoff of 1% and/or
- a :bold:`ProportionMapped` cutoff of 0.5-0.7.

The kmer information is also indicative: if the :bold:`FamilyKmers` and :bold:`KmerCoverage` values are low and the :bold:`KmerDupRate` value is high,
the assignment of the family is based on only a small number of kmers within the reads.
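To apply these cutoffs to the report, a small command-line filter is often enough. The following is a sketch that keeps only the rows passing both cutoffs; it assumes that :bold:`FamPercentage` and :bold:`ProportionMapped` appear as column names in the header of :file:`final_report.tsv`, so check the header and adjust the thresholds to your needs::

    awk -F'\t' '
        NR==1 { for (i=1; i<=NF; i++) col[$i]=i; print; next }              # map header names to column numbers
        $(col["FamPercentage"]) >= 1 && $(col["ProportionMapped"]) >= 0.5   # keep rows passing both cutoffs
    ' final_report.tsv > final_report.filtered.tsv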