Snakemake is a tool to build reproducible bioinformatic workflows. It has convenient support for cluster queuing systems, which makes it attractive for us.
Requirements
Snakemake is written in python 3. I installed it in the anaconda environment py3
that we have on the cluster.
You can activate this environment by
source activate py3
Basic idea
Similar to the classic GNU make
, you specify what you want how that is made.
These instructions are saved in a snakefile
.
Their website has this example
rule targets:
input:
"plots/dataset1.pdf",
"plots/dataset2.pdf"
rule plot:
input:
"raw/{dataset}.csv"
output:
"plots/{dataset}.pdf"
shell:
"somecommand {input} {output}"
The first rule specifies what you want to have as inputs.
Snakemake will then look at all other rules to see whether the required files match the output of another rule.
This matching can contain wild cards like {dataset}
in this case.
This is done recursively until snakemake arrives at a rule that has all required inputs on disk.
Example: Influenza viruses and segments
We commonly analyze different flu subtypes (A/H3N2
, A/H1N1
, B/YAM
and B/VIC
), each of which has eight segments.
To process sequence data for all of them, we could set up a snakefile as
segments = ['pb2', 'pb1', 'pa', 'ha', 'np', 'mp', 'na', 'ns']
viruses = ['vic', 'yam', 'h1n1pdm', 'h3n2']
# this makes a list of filenames for each segment and virus
alignments=expand("seasonal_flu/{v}/{seg}/results/alignment.fasta", seg=segments, v=viruses)
rule all:
input:
alignments
rule align:
input:
"seasonal_flu/{virus}/{segment}/results/sequences.fasta"
output:
"seasonal_flu/{virus}/{segment}/results/alignment.fasta"
shell:
mafft {input} > {output}
Adding parameters
Sometimes, your scripts needs additional parameters and structured input of files. This can be achieved very elegantly in the following way
segments = ['pb2', 'pb1', 'pa', 'ha', 'np', 'mp', 'na', 'ns']
viruses = ['vic', 'yam', 'h1n1pdm', 'h3n2']
# this makes a list of filenames for each segment and virus
alignments=expand("seasonal_flu/{v}/{seg}/results/alignment.fasta", seg=segments, v=viruses)
def ref_seq_name(v):
references = {'h3n2':"A/Beijing/32/1992",
'h1n1pdm':"A/California/07/2009",
'vic':"B/HongKong/02/1993",
'yam':"B/Singapore/11/1994"}
return references[v.virus]
rule all:
input:
alignments
rule align:
input:
sequences="seasonal_flu/{virus}/{segment}/results/sequence.fasta",
refseq="seasonal_flu/{virus}/metadata/{virus}_{segment}_outgroup.gb"
output:
"seasonal_flu/{virus}/{segment}/results/alignment.fasta"
params:
path="seasonal_flu/{virus}/{segment}",
ref = ref_seq_name
shell:
"python src/align.py --path {params.path} --reference {input.refseq} &&"
"python src/map_to_reference.py --path {params.path} --reference {params.ref}"
This does a number of additional things.
- input files now have names and can be addressed as
input.sequences
etc - a
params
field with variables that contain wild cards - it defines a function
ref_seq_name
which is called upon execution with wild cards as arguments.
These additional features allow for complex and powerful pipelines.
Running snakemake on the cluster
Snakemake can run all the jobs necessary in correct order on the cluster, but you'll have to tell it how to submit and what kind of resources it is allowed to use.
Furthermore, you have to run snakemake on a submit and keep that session alive, for example using tmux
or nohup
(very ugly but easy). This would work the following
nohup snakemake --snakefile seasonal_flu/flu_snake --jobs 32 --cluster "sbatch -t 05:59:00" 1>log2 &
This tells snakemake to use sbatch
to submit jobs with a time limit of 6h and run at most 32 jobs in parallel.
In many cases, it might be better to specify rule specific resources via a cluster configuration json or individual submit scripts for the rules.
Furthermore, if you scripts use modules that need to be loaded or need to modify you path, you might want to do something like
rule align:
...
shell:
"export PATH=/path_to_anaconda/anaconda2/bin/:$PATH && "
"ml MAFFT && "
"python src/align.py --path {params.path} --reference {input.refseq} &&"
"python src/map_to_reference.py --path {params.path} --reference {params.ref}"