Configuration guide¶
The configuration guide provides a basic overview of the general options and settings layout for the supported workflow managers. Please refer to the workflow documentation pages for more specifice instructions regarding a particular workflow.
Currently, the supported workflow managers are:
- snakemake: a workflow management system written in python
- nextflow: a fluent DSL for data-driven computational pipelines
Snakemake¶
Snakemake can be configured through a configuration file that is
passed either via the --configfile
command line option, or the
Snakefile configfile:
directive. Once loaded, the configuration
settings can be accessed through the global python object config
.
Internally, configuration objects are python dictionaries, where the keys correspond to configuration options. This has the unfortunate consequence that it is difficult to provide a documentation API to the options. This text tries to address this issue, albeit in an unsufficient manner. As a last resort, for now at least, one simply has to look at the source code to get an idea of what the options do. In most cases though, the key names themselves should give an idea of what behaviour they target.
It is important to keep in mind that no validation of user-supplied configuration files is done. Consequently, should the user supply a non-defined configuration key, it will passed unnoticed by Snakemake. This can be frustrating when debugging; you are sure that you have changed a configuration value, only to notice later that the configuration key was misspelled.
Implementation¶
The configuration is constructed as a hierarchy of at most three levels:[1]
section:
subsection:
option:
The section
level corresponds to an application, or a configuration
group of more general nature. The subsection
can either be a new
configuration grouping, or an option to be set. For applications, the
subsection
often corresponds to a given rule. Finally, at the
option
level, an option is set.
Configuration sections¶
For each workflow, there is a subdirectory named rules
. The
directory contains rules organized by directories and settings
files
that provide default configuration values. Every rule directory has
its own settings file. There are two top-level settings file located
directly in the rules
directory, namely main.settings
and
ngs.settings
.
settings¶
The settings
section defines configurations of a general nature.
settings:
sampleinfo: sampleinfo.csv
email: # email
java:
java_mem: 8g
java_tmpdir: /tmp
runfmt: "{SM}/{SM}_{PU}"
samplefmt: "{SM}/{SM}"
threads: 8
temporary_rules:
- picard_merge_sam
For all settings, see rules/main.settings
.
Importantly, many of these settings are inherited by the application
rules, so that changing threads
to 4 in settings
, will set the
number of threads for all configurations that inherit this option.
However, you can fine-tune the behaviour of the inheriting rules to
override the value in settings
; see Application settings.
Here, the most important option is sampleinfo
, which must be
set. The runfmt
and samplefmt
options describe how the data is
organized. They represent python miniformat strings, where the
entries correspond to columns in the sampleinfo file; hence, in this
case, the column SM and PU must be present in the sampleinfo
file. So, given the following sampleinfo file
SM,PU,DT,fastq
s1,AAABBB11XX,010101,s1_AAABBB11XX_010101_1.fastq.gz
s1,AAABBB11XX,010101,s1_AAABBB11XX_010101_2.fastq.gz
s1,AAABBB22XX,020202,s1_AAABBB22XX_020202_1.fastq.gz
s1,AAABBB22XX,020202,s1_AAABBB22XX_020202_2.fastq.gz
samplefmt will be formatted as s1/s1
and runfmt as
s1/s1_AAABBB11XX
or s1/s1_AAABBB22XX
, depending on the run. The
formatted strings are used in the workflows as prefixes to identify
targets. Rules that operate on the runfmt will be prefixed by
s1/s1_AAABBB11XX
or s1/s1_AAABBB22XX
, rules that operate on the
sample level (i.e. after merging) will be prefixed by s1/s1
.
Currently, the tests define three different sample organizations.
sample:
runfmt: "{SM}/{SM}_{PU}_{DT}"
samplefmt: "{SM}/{SM}"
sample_run:
runfmt: "{SM}/{PU}_{DT}/{SM}_{PU}_{DT"}
samplefmt: "{SM}/{SM}"
sample_project_run:
runfmt: "{SM}/{PID}/{PU}_{DT}/{PID}_{PU}_{DT"}
samplefmt: "{SM}/{SM}"
However, it is trivial to add more configurations, should that be deemed necessary.
ngs.settings¶
Warning
The ngs.settings section is slightly disorganized.
ngs.settings
affect settings related to ngs analyses:
ngs.settings:
annotation:
annot_label: ""
transcript_annot_gtf: ""
sources: []
db:
dbsnp: ""
ref: ref.fa
transcripts: []
build: ""
fastq_suffix: ".fastq.gz"
read1_label: "_1"
read2_label: "_2"
read1_suffix: ".fastq.gz"
read2_suffix: ".fastq.gz"
regions: []
sequence_capture:
bait_regions: []
target_regions: []
For all settings, see rules/ngs.settings
.
samples¶
The samples
section is one of the few top-level configuration keys
that are actually set, in this case to a list of sample names.
Application settings¶
Applications, i.e. bioinformatics software, are grouped in sections by their application name. Subsections correspond to rules, or subprograms. For instance, the entire bwa section looks as follows (with a slight abuse of notation as we here mix yaml with python objects):
bwa:
cmd: bwa
ref: config['ngs.settings']['db']['ref']
index: ""
index_ext: ['.amb', '.ann', '.bwt', '.pac', '.sa']
threads: config['settings']['threads']
mem:
options:
Setting option threads
would then override the value in settings
,
providing a means to fine-tune options on a per-application basis.
Workflow settings¶
Finally, the workflows comes with a configuration section called
workflow
.