quicksand-build

Description

To run quicksand, an underlying datastructure with databases and reference genomes is required. This datastructure consists of:

A preindexed Kraken-database
A directory containing the indexed fasta files of the mt-genomes from RefSeq that were used to create the kraken-database
In the same directory, a file taxid_map.tsv, linking the files in the folder to NCBI Taxonomy Ids
A directory containing a bed-file for each genome, indicating non-informative and low-complexity regions within the genome

Instead of creating these files manually, use the supplementary pipeline quicksand-build. Note that quicksand-build is a separate pipeline (mpieva/quicksand-build).

Workflow

Graphical overview of the processes of the quicksand pipeline

downloadTaxonomy / downloadGenomes

The first step of the pipeline downloads the GenBank files from the NCBI RefSeq mitochondrion FTP server (https://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/) and the NCBI taxonomy. The taxonomy is downloaded with the kraken-build --download-taxonomy command.

extractFamilies

A custom python script is used to extract the fasta files from the NCBI GenBank files. Use the --include flag to specify the taxa that should be extracted (default: all). All included taxa are used to build the kraken database and can thus be identified by the quicksand pipeline.

indexFasta

The extracted fasta files are indexed using the bwa index command.

writeBedfiles

For each fasta file, dustmasker is used to identify non-informative and low-complexity regions, which are written as coordinates into a separate bed file.

createKrakenDB

kraken-build is used to add the extracted fasta files to the database. A database is then built for each kmer size specified with the --kmers flag (default: 22).

Usage

Note

quicksand-build uses singularity by default!

Use the -profile docker flag to use docker instead.

To run the pipeline with default parameters, open the terminal and type:

nextflow run mpieva/quicksand-build --outdir <PATH> [ --kmers KMERS --include TAXA --exclude PATH ]
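
When the pipeline is started from a script rather than by hand, the same command line can be assembled programmatically. The following is a minimal sketch, not part of quicksand-build itself: the helper name build_command is hypothetical, and it uses only the flags documented on this page.

```python
import shlex


def build_command(outdir, kmers=None, include=None, exclude=None, docker=False):
    """Assemble the argument list for a quicksand-build run (hypothetical helper)."""
    cmd = ["nextflow", "run", "mpieva/quicksand-build", "--outdir", str(outdir)]
    if kmers:
        # --kmers expects comma-separated numbers, e.g. 22,23,24
        cmd += ["--kmers", ",".join(str(k) for k in kmers)]
    if include:
        # --include expects comma-separated NCBI taxonomy names
        cmd += ["--include", ",".join(include)]
    if exclude:
        cmd += ["--exclude", str(exclude)]
    if docker:
        # singularity is the default; switch to docker on request
        cmd += ["-profile", "docker"]
    return cmd


cmd = build_command("out", kmers=[22, 24], include=["Mammalia", "Aves"])
print(shlex.join(cmd))
```

Passing the returned list to subprocess.run(cmd, check=True) would start the run. Keeping the arguments in a list avoids shell quoting problems with paths that contain spaces.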

Flags

Flag	Input type	Description
--outdir	STRING	Specify the directory in which the datastructure is saved. Default: 'out'
--include	STRING	A string of comma-separated taxa (exact match to the NCBI taxonomy names) that should be included in the kraken database. Default: 'root' (all). Example: --include Mammalia,Aves
--kmers	STRING	A string of comma-separated numbers. For each number, a kraken database is built with the respective kmer size. Default: '22'. Example: --kmers 22,23,24 creates outdir/kraken/Mito_db_kmer22, outdir/kraken/Mito_db_kmer23 and outdir/kraken/Mito_db_kmer24
--exclude	PATH	A TSV file containing family and comma-separated species names. The listed species are excluded from the kraken database. Example: --exclude exclude.tsv, where exclude.tsv contains the rows 'Hominidae Homo_sapiens,Homo_neandertalensis' and 'Bovidae Capra_aegagrus'
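
To illustrate the layout expected for the --exclude file, the sketch below reads such a file into a mapping from family to excluded species. It assumes one tab-separated row per family with no header line; the function name parse_exclude is hypothetical and this is not the parser used by the pipeline.

```python
import io


def parse_exclude(handle):
    """Map each family to the set of species names listed for exclusion."""
    excluded = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        # Assumed layout: <Family><TAB><Species_1>,<Species_2>,...
        family, species = line.split("\t", 1)
        names = {name.strip() for name in species.split(",") if name.strip()}
        excluded.setdefault(family, set()).update(names)
    return excluded


# Same content as the exclude.tsv example in the table above
example = io.StringIO(
    "Hominidae\tHomo_sapiens,Homo_neandertalensis\n"
    "Bovidae\tCapra_aegagrus\n"
)
excluded = parse_exclude(example)
print(sorted(excluded["Hominidae"]))
```

A row without a tab raises a ValueError in this sketch, which makes formatting mistakes in the file visible early.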

Output

The output of quicksand-build is structured as follows:

outdir
├── kraken
│    ├── Mito_db_kmer22
│    │      ├── taxonomy
│    │      ├── ...
│    │      └── database.kdb
│    └── Mito_db_kmer24
│           ├── taxonomy
│           ├── ...
│           └── database.kdb
├── genomes
│    ├── ${Family}
│    │      ├── ${Species}.fasta
│    │      ├── ${Species}.fasta.fai
│    │      └── ...
│    └── taxid_map.tsv
├── masked
│    └── ${Species}.masked.bed
├── ncbi
│    └── raw gbff.gz files
└── work
     └── intermediate nextflow files
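
The layout above can be checked after a run with a few lines of Python. This is a minimal sketch under the assumption that the directory and file names are exactly those shown in the tree; the helper names expected_paths and missing_paths are hypothetical.

```python
from pathlib import Path


def expected_paths(outdir, kmers=(22,)):
    """List the paths that the tree above implies for the given kmer sizes."""
    outdir = Path(outdir)
    paths = [outdir / "genomes" / "taxid_map.tsv", outdir / "masked"]
    for k in kmers:
        db = outdir / "kraken" / f"Mito_db_kmer{k}"
        paths += [db / "taxonomy", db / "database.kdb"]
    return paths


def missing_paths(outdir, kmers=(22,)):
    """Return every expected path that does not exist on disk."""
    return [p for p in expected_paths(outdir, kmers) if not p.exists()]


for path in missing_paths("out", kmers=(22, 24)):
    print("missing:", path)
```

An empty result only shows that the top-level structure is present; it does not verify the content of the databases or genomes.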

The taxid_map.tsv file contains the following information:

1425170  Hominidae   Homo_heidelbergensis              Primates
   Hominidae   Homo_heidelbergensis              Primates
   Hominidae   Homo_heidelbergensis              Primates
   Hominidae   Homo_heidelbergensis              Primates
  Hominidae   Homo_sapiens_neanderthalensis     Primates
   Hominidae   Homo_sapiens_neanderthalensis     Primates
   Hominidae   Homo_sapiens_neanderthalensis     Primates
   Hominidae   Homo_sapiens_neanderthalensis     Primates
   Hominidae   Homo_sapiens                      Primates
   Hominidae   Homo_sapiens                      Primates
   Hominidae   Homo_sapiens                      Primates
   Hominidae   Homo_sapiens                      Primates
741158   Hominidae   Homo_sapiens_subsp._'Denisova'    Primates
   Hominidae   Homo_sapiens_subsp._'Denisova'    Primates
   Hominidae   Homo_sapiens_subsp._'Denisova'    Primates
   Hominidae   Homo_sapiens_subsp._'Denisova'    Primates

The columns are "NCBI TaxonID", "Family", "Species" and "Order".
Each species appears in the file multiple times, once for each taxonomic node it belongs to.
For example, TaxID '9605' (Homo) is linked to the following species:

Hominidae  Homo_heidelbergensis            Primates
Hominidae  Homo_sapiens_neanderthalensis   Primates
Hominidae  Homo_sapiens                    Primates
Hominidae  Homo_sapiens_subsp._'Denisova'  Primates

The quicksand process 'findBestNode' returns a taxon ID. The taxid_map.tsv file is then used to provide the 'mapBwa' process with all the species reference genomes linked to that taxon ID.
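
The lookup described here can be pictured as grouping the rows of taxid_map.tsv by taxon ID and deriving the reference paths from the Family and Species columns. The sketch below assumes four tab-separated columns and the genomes/${Family}/${Species}.fasta layout shown in the output tree. The function names are hypothetical, and how quicksand reads the file internally is not covered here.

```python
import io
from collections import defaultdict
from pathlib import Path


def load_taxid_map(handle):
    """Group the rows of taxid_map.tsv by taxon ID."""
    by_taxid = defaultdict(list)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        taxid, family, species, order = fields
        by_taxid[taxid].append((family, species, order))
    return by_taxid


def reference_fastas(by_taxid, taxid, genomes_dir):
    """Return the fasta path of every species linked to the given taxon ID."""
    return [
        Path(genomes_dir) / family / f"{species}.fasta"
        for family, species, _order in by_taxid.get(taxid, [])
    ]


# Two of the rows listed above for TaxID 9605 (Homo)
example = io.StringIO(
    "9605\tHominidae\tHomo_heidelbergensis\tPrimates\n"
    "9605\tHominidae\tHomo_sapiens\tPrimates\n"
)
taxid_map = load_taxid_map(example)
for fasta in reference_fastas(taxid_map, "9605", "out/genomes"):
    print(fasta)
```

Taxon IDs are kept as strings so that they compare equal to values read from other text files without conversion.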

Note

When editing or manually building the datastructure, be aware that the ${Species} in the taxid_map.tsv file must correspond to the ${Species} filenames in the genomes and masked directories. Otherwise, quicksand won't find the appropriate files.
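
After manual edits, this correspondence can be verified with a short script. The sketch below assumes a tab-separated taxid_map.tsv with four columns and the file names from the output tree (genomes/${Family}/${Species}.fasta and masked/${Species}.masked.bed). The function name find_missing_files is hypothetical.

```python
from pathlib import Path


def find_missing_files(outdir):
    """Return the fasta and bed files named in taxid_map.tsv that do not exist."""
    outdir = Path(outdir)
    missing = set()
    with open(outdir / "genomes" / "taxid_map.tsv") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 4:
                continue  # skip blank or malformed lines
            _taxid, family, species, _order = fields
            fasta = outdir / "genomes" / family / f"{species}.fasta"
            bed = outdir / "masked" / f"{species}.masked.bed"
            # A species occurs on several rows, so collect paths in a set
            missing.update(p for p in (fasta, bed) if not p.exists())
    return sorted(missing)
```

Calling find_missing_files("out") returns an empty list when every species listed in the map has both a fasta file and a bed file.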