Augur and Snakemake

The toolchain augur is the bioinformatics engine of nextstrain and produces the files that auspice visualizes in the web browser. Augur consists of a number of tools that let the user filter and align sequences, build trees, and integrate the phylogenetic analysis with metadata. The tools are meant to be composable: the output of one tool serves as the input of others. To compose them into an analysis pipeline, augur recommends the workflow management tool Snakemake.

Snakemake

Snakemake breaks a workflow into a set of rules that are specified in a file called Snakefile. Each rule takes a number of input files, specifies a few parameters, and produces output files. A simple rule would look like this:

rule align:
    input:
        seq="data/sequences.fasta"
    params:
        nthreads = 2
    output:
        aln="results/alignment.fasta"
    shell:
        '''
        augur align --sequences {input.seq} --nthreads {params.nthreads} --output {output.aln}
        '''

This rule produces results/alignment.fasta from the input file data/sequences.fasta using the augur align command. The params entry additionally tells augur to use 2 threads for this task.
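
To make the placeholder syntax concrete, here is a minimal Python sketch of how fields such as {input.seq}, {params.nthreads}, and {output.aln} in the shell block get filled in. It only mimics the substitution with Python's built-in str.format and is not Snakemake's actual implementation; the helper name render_command is hypothetical.

```python
from types import SimpleNamespace


def render_command(template, **sections):
    """Fill {section.name} placeholders, mimicking (not reproducing) Snakemake."""
    # Wrap each section (input, params, output) so its entries become attributes.
    namespaces = {name: SimpleNamespace(**values) for name, values in sections.items()}
    return template.format(**namespaces)


command = render_command(
    "augur align --sequences {input.seq} --nthreads {params.nthreads} --output {output.aln}",
    input={"seq": "data/sequences.fasta"},
    params={"nthreads": 2},
    output={"aln": "results/alignment.fasta"},
)
print(command)
```

Because file names live in the input and output sections, the command line itself never needs to change when the paths do.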

When executing a rule, Snakemake checks whether all necessary input files are present. If not, it determines which rules produce the missing files and executes those first. Say there was an additional rule to build a tree that depended on the alignment:

rule tree:
    input:
        aln="results/alignment.fasta"
    output:
        tree="results/tree.nwk"
    shell:
        '''
        augur tree --alignment {input.aln} --output {output.tree}
        '''

To build the tree, you would now call Snakemake as

>snakemake results/tree.nwk

and Snakemake would

1. find the rule that produces results/tree.nwk (rule tree in this case)
2. determine that it needs to run rule align to produce results/alignment.fasta
3. run rules align and tree
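
The three steps above can be sketched in a few lines of Python. This is a simplified illustration of dependency resolution, not Snakemake's actual algorithm: the RULES table and the plan function are hypothetical, and the set of files already present is passed in explicitly instead of being checked on disk.

```python
# Hypothetical rule table: each rule maps its input files to one output file.
RULES = {
    "align": {"input": ["data/sequences.fasta"], "output": "results/alignment.fasta"},
    "tree": {"input": ["results/alignment.fasta"], "output": "results/tree.nwk"},
}


def plan(target, existing_files, rules=RULES):
    """Return the names of the rules to run, in order, to produce `target`."""
    if target in existing_files:
        return []  # the file is already there, nothing to run
    producers = [name for name, rule in rules.items() if rule["output"] == target]
    if not producers:
        raise ValueError(f"no rule produces {target}")
    name = producers[0]
    order = []
    # Resolve every missing input first, then schedule the rule itself.
    for dependency in rules[name]["input"]:
        for step in plan(dependency, existing_files, rules):
            if step not in order:
                order.append(step)
    order.append(name)
    return order


print(plan("results/tree.nwk", existing_files={"data/sequences.fasta"}))
```

If results/alignment.fasta were already in existing_files, the same call would schedule only the tree rule. Real Snakemake also takes file modification times into account, which this sketch ignores.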

These simple examples only scratch the surface of what Snakemake can do but should give you the general idea.

Augur commands and Snakefiles

We will work from the Zika virus tutorial on the nextstrain website, but instead of the Zika virus data, we will analyze Yellow Fever sequences sampled in Brazil in 2016-2017 together with a number of older sequences. You can find these data in the nextstrain@krisp workshop repository.
