Pipelines with snakemake

Snakemake is a tool to build reproducible bioinformatic workflows. It has convenient support for cluster queuing systems, which makes it attractive for us.

Requirements

Snakemake is written in python 3. I installed it in the anaconda environment py3 that we have on the cluster. You can activate this environment by

source activate py3

Basic idea

Similar to the classic GNU make, you specify what you want how that is made. These instructions are saved in a snakefile. Their website has this example

rule targets:
    input:
        "plots/dataset1.pdf",
        "plots/dataset2.pdf"

rule plot:
    input:
        "raw/{dataset}.csv"
    output:
        "plots/{dataset}.pdf"
    shell:
        "somecommand {input} {output}"

The first rule specifies what you want to have as inputs. Snakemake will then look at all other rules to see whether the required files match the output of another rule. This matching can contain wild cards like {dataset} in this case. This is done recursively until snakemake arrives at a rule that has all required inputs on disk.

Example: Influenza viruses and segments

We commonly analyze different flu subtypes (A/H3N2, A/H1N1, B/YAM and B/VIC), each of which has eight segments. To process sequence data for all of them, we could set up a snakefile as

segments = ['pb2', 'pb1', 'pa', 'ha', 'np', 'mp', 'na', 'ns']
viruses = ['vic', 'yam', 'h1n1pdm', 'h3n2']

# this makes a list of filenames for each segment and virus
alignments=expand("seasonal_flu/{v}/{seg}/results/alignment.fasta", seg=segments, v=viruses)

rule all:
    input:
        alignments

rule align:
    input:
        "seasonal_flu/{virus}/{segment}/results/sequences.fasta"
    output:
        "seasonal_flu/{virus}/{segment}/results/alignment.fasta"
    shell:
        mafft {input} > {output}

Adding parameters

Sometimes, your scripts needs additional parameters and structured input of files. This can be achieved very elegantly in the following way

segments = ['pb2', 'pb1', 'pa', 'ha', 'np', 'mp', 'na', 'ns']
viruses = ['vic', 'yam', 'h1n1pdm', 'h3n2']

# this makes a list of filenames for each segment and virus
alignments=expand("seasonal_flu/{v}/{seg}/results/alignment.fasta", seg=segments, v=viruses)

def ref_seq_name(v):
    references = {'h3n2':"A/Beijing/32/1992",
                  'h1n1pdm':"A/California/07/2009",
                  'vic':"B/HongKong/02/1993",
                  'yam':"B/Singapore/11/1994"}
    return references[v.virus]


rule all:
    input:
        alignments

rule align:
    input:
        sequences="seasonal_flu/{virus}/{segment}/results/sequence.fasta",
        refseq="seasonal_flu/{virus}/metadata/{virus}_{segment}_outgroup.gb"
    output:
        "seasonal_flu/{virus}/{segment}/results/alignment.fasta"
    params:
        path="seasonal_flu/{virus}/{segment}",
        ref = ref_seq_name
    shell:
        "python src/align.py --path  {params.path} --reference {input.refseq} &&"
        "python src/map_to_reference.py --path {params.path} --reference {params.ref}"

This does a number of additional things.

input files now have names and can be addressed as input.sequences etc
a params field with variables that contain wild cards
it defines a function ref_seq_name which is called upon execution with wild cards as arguments.

These additional features allow for complex and powerful pipelines.

Running snakemake on the cluster

Snakemake can run all the jobs necessary in correct order on the cluster, but you'll have to tell it how to submit and what kind of resources it is allowed to use. Furthermore, you have to run snakemake on a submit and keep that session alive, for example using tmux or nohup (very ugly but easy). This would work the following

nohup snakemake --snakefile seasonal_flu/flu_snake --jobs 32 --cluster "sbatch -t 05:59:00" 1>log2 &

This tells snakemake to use sbatch to submit jobs with a time limit of 6h and run at most 32 jobs in parallel. In many cases, it might be better to specify rule specific resources via a cluster configuration json or individual submit scripts for the rules. Furthermore, if you scripts use modules that need to be loaded or need to modify you path, you might want to do something like

rule align:
    ...
    shell:
        "export PATH=/path_to_anaconda/anaconda2/bin/:$PATH && "
        "ml MAFFT && "
        "python src/align.py --path  {params.path} --reference {input.refseq} &&"
        "python src/map_to_reference.py --path {params.path} --reference {params.ref}"

Pipelines with snakemake

Requirements

Basic idea

Example: Influenza viruses and segments

Adding parameters

Running snakemake on the cluster

Published

Category

Tags