Configuration

Note

The configuration key nomenclature has not been settled yet.

Note

The main lts_workflows documentation provides more information about general configuration settings.

Required configuration

The following options must be set in the configuration file:

settings:
  sampleinfo: sampleinfo.csv
  runfmt: "{SM}/{SM}_{PU}_{DT}"
  samplefmt: "{SM}/{SM}"
ngs.settings:
  db:
    ref: # Reference sequences
      - ref.fa
      - gfp.fa
      - ercc.fa
    transcripts:
      - ref-transcripts.fa
      - gfp.fa
      - ercc.fa
  annotation:
    sources:
      - ref-transcripts.gtf
      - gfp.genbank
      - ercc.gb
  # Optional; change these if read names and fastq file suffixes differ
  read1_label: "_1"
  read2_label: "_2"
  fastq_suffix: ".fastq.gz"

# list of sample identifiers corresponding to the sampleinfo 'SM'
# column
samples:
  - sample1
  - sample2

The configuration settings runfmt and samplefmt describe how your data is organized. They are Python format strings whose replacement fields correspond to columns in the sampleinfo file; in this case, the columns SM, PU, and DT must therefore be present in the sampleinfo file.

Note

Since runfmt and samplefmt can represent any format you wish, you can in principle use any field names as labels. The one exception is SM, which represents the sample name and must be present in the sampleinfo file. The two-letter labels above are convenient representations of metadata and correspond to samtools read group record types.
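The relationship between format fields and sampleinfo columns can be checked up front. The following is a minimal sketch, not part of the workflow itself: the helper names format_fields and missing_columns are hypothetical, and the only rules it assumes are the ones stated above (every field must be a column, and SM must always be present).

```python
from string import Formatter


def format_fields(fmt):
    """Return the replacement field names used in a Python format string."""
    return [name for _, name, _, _ in Formatter().parse(fmt) if name]


def missing_columns(formats, columns):
    """Return the required fields that have no matching sampleinfo column.

    SM is always required because it holds the sample name.
    (Hypothetical helper for illustration only.)
    """
    required = {"SM"}
    for fmt in formats:
        required.update(format_fields(fmt))
    return sorted(required - set(columns))


formats = ["{SM}/{SM}_{PU}_{DT}", "{SM}/{SM}"]
print(missing_columns(formats, ["SM", "PU", "DT", "fastq"]))  # []
print(missing_columns(formats, ["SM", "fastq"]))  # ['DT', 'PU']
```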

Example sampleinfo.csv

SM,PU,DT,fastq
s1,AAABBB11XX,010101,s1_AAABBB11XX_010101_1.fastq.gz
s1,AAABBB11XX,010101,s1_AAABBB11XX_010101_2.fastq.gz
s1,AAABBB22XX,020202,s1_AAABBB22XX_020202_1.fastq.gz
s1,AAABBB22XX,020202,s1_AAABBB22XX_020202_2.fastq.gz
s2,AAABBB11XX,010101,s2_AAABBB11XX_010101_1.fastq.gz
s2,AAABBB11XX,010101,s2_AAABBB11XX_010101_2.fastq.gz

The example sampleinfo file works with the required settings above. For sample s2, read 1, runfmt and samplefmt expand to the following values:

runfmt = s2/s2_AAABBB11XX_010101
samplefmt = s2/s2
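The expansion above can be reproduced with plain Python string formatting. This is an illustrative sketch only: it assumes each sampleinfo row is passed to str.format as keyword arguments, which may differ from how the workflow does it internally, and the function name expand is hypothetical.

```python
import csv
import io

# One row of the example sampleinfo.csv (sample s2, read 1).
SAMPLEINFO = """SM,PU,DT,fastq
s2,AAABBB11XX,010101,s2_AAABBB11XX_010101_1.fastq.gz
"""


def expand(fmt, row):
    """Fill a format string with the values of one sampleinfo row."""
    return fmt.format(**row)


# DictReader keeps every value as a string, so "010101" keeps its leading zero.
row = next(csv.DictReader(io.StringIO(SAMPLEINFO)))

print(expand("{SM}/{SM}_{PU}_{DT}", row))  # s2/s2_AAABBB11XX_010101
print(expand("{SM}/{SM}", row))            # s2/s2
```

Columns that a format string does not use, such as fastq, are simply ignored.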

Workflow-specific configuration

In addition to the required configuration, there are some configuration settings that affect the workflow itself. These settings are accessed and set via config['workflow'].

use_multimapped: (boolean) Use multimapped reads for quantification. Default false.
quantification: (list) List of quantification methods to use. Available options are rsem and rpkmforgenes.

Example workflow configuration section

workflow:
  use_multimapped: false
  quantification:
    - rsem
    - rpkmforgenes
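As a rough illustration of how these settings could be read from config['workflow'], the sketch below applies the documented default for use_multimapped and rejects unknown quantification methods. The function workflow_settings is hypothetical, and the empty-list default for quantification is an assumption; the documentation does not state a default for that key.

```python
AVAILABLE_QUANTIFICATION = {"rsem", "rpkmforgenes"}


def workflow_settings(config):
    """Read the workflow section of a config dict (hypothetical helper)."""
    workflow = config.get("workflow", {})
    # Documented default: use_multimapped is false.
    use_multimapped = bool(workflow.get("use_multimapped", False))
    # Assumption: no quantification methods are selected when the key is absent.
    quantification = list(workflow.get("quantification", []))
    unknown = sorted(set(quantification) - AVAILABLE_QUANTIFICATION)
    if unknown:
        raise ValueError(f"unknown quantification method(s): {unknown}")
    return {"use_multimapped": use_multimapped, "quantification": quantification}


config = {"workflow": {"use_multimapped": False,
                       "quantification": ["rsem", "rpkmforgenes"]}}
print(workflow_settings(config))
```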

Application-level configuration

Note

Unfortunately, there is no straightforward way to automatically list the available application configuration options. You therefore have to look in the rule files themselves for available options. In most cases, the default settings should work fine.

Note

Rules live in separate files whose names consist of the application name followed by the rule name. Rules are located in the package subdirectory rules, in which each application has its own directory.

Tip

There is an options configuration key for each rule. Most often, this is the setting one wants to modify.

Individual applications (e.g. star) are located at the top level, with sublevels corresponding to specific application rules. For instance, the following configuration would affect settings in star and rsem:

star:
  star_index:
    # The test genome is small; 2000000 bases. --genomeSAindexNbases
    # needs to be adjusted to (min(14, log2(GenomeLength)/2 - 1))
    options: --genomeSAindexNbases 10

rsem:
  index: ../ref/rsem_index
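The formula quoted in the comment above can be evaluated directly. The sketch below only reproduces that arithmetic; the function name genome_sa_index_nbases is hypothetical and is not part of STAR or of the workflow.

```python
import math


def genome_sa_index_nbases(genome_length):
    """Evaluate min(14, log2(GenomeLength)/2 - 1) for a genome length in bases."""
    return min(14.0, math.log2(genome_length) / 2 - 1)


# The test genome mentioned above has 2000000 bases.
print(round(genome_sa_index_nbases(2_000_000), 2))  # 9.47
```

For the 2000000-base test genome the formula gives roughly 9.5, which is close to the value 10 used in the example configuration.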

Additional advice

There are a couple of helper rules for generating spike-in input files and the transcript annotation file.

dbutils_make_transcript_annot_gtf: For QC statistics calculated by RSeQC, the gtf annotation file should reflect the content of the alignment index. The rule dbutils_make_transcript_annot_gtf automatically creates the file named in ['ngs.settings']['annotation']['transcript_annot_gtf'] from the list of files defined in ['ngs.settings']['annotation']['sources']. Both gtf and genbank input formats are accepted.
ercc_create_ref: The ERCC RNA Spike-In Mix is commonly used as a spike-in. The rule ercc_create_ref automates the download of the sequences in fasta and genbank formats.