The augur toolchain is the bioinformatics engine of nextstrain; it produces the files that auspice visualizes in the web browser. Augur consists of a number of tools that let the user filter and align sequences, build trees, and integrate the phylogenetic analysis with metadata. The tools are meant to be composable: the output of one tool serves as the input of another. To compose these tools into an analysis pipeline, augur recommends the workflow management tool Snakemake.
Snakemake
Snakemake breaks a workflow into a set of rules that are specified in a file called Snakefile.
Each rule takes a number of input files, specifies a few parameters, and produces output files.
A simple rule would look like this:
rule align:
    input:
        seq="data/sequences.fasta"
    params:
        nthreads = 2
    output:
        aln="results/alignment.fasta"
    shell:
        '''
        augur align --sequences {input.seq} --nthreads {params.nthreads} --output {output.aln}
        '''
This rule would produce results/alignment.fasta from the input file data/sequences.fasta using the augur align command. It additionally specifies that augur should use 2 CPUs for this task.
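The placeholders in the shell command can be illustrated with a minimal Python sketch. It only mimics the substitution using Python's built-in str.format, and it assumes that the names in the rule's input, params, and output sections behave like attributes. The variables rule_input, rule_params, and rule_output are hypothetical stand-ins, not Snakemake objects, and Snakemake's own formatting does more than this.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the rule's input, params, and output sections.
rule_input = SimpleNamespace(seq="data/sequences.fasta")
rule_params = SimpleNamespace(nthreads=2)
rule_output = SimpleNamespace(aln="results/alignment.fasta")

# The shell template from the align rule, written as a single line.
template = (
    "augur align --sequences {input.seq} "
    "--nthreads {params.nthreads} --output {output.aln}"
)

# str.format resolves dotted fields such as {input.seq} by attribute lookup.
command = template.format(input=rule_input, params=rule_params, output=rule_output)
print(command)
```

The printed line is the command that would be handed to the shell: augur align --sequences data/sequences.fasta --nthreads 2 --output results/alignment.fasta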
When executing a rule, Snakemake checks whether all necessary input files are present. If they are not, it determines which rules produce the missing files and executes those first. Say there was an additional rule to build a tree that depended on the alignment:
rule tree:
    input:
        aln="results/alignment.fasta"
    output:
        tree="results/tree.nwk"
    shell:
        '''
        augur tree --alignment {input.aln} --output {output.tree}
        '''
To build the tree, you would now call Snakemake as
    snakemake results/tree.nwk
and Snakemake would
- find the rule that produces results/tree.nwk (rule tree in this case)
- determine that it needs to run rule align to produce results/alignment.fasta
- run rules align and tree
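The resolution steps above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Snakemake's implementation: the RULES table and the plan function are hypothetical names, each rule has exactly one output, and a file that already exists is treated as up to date (Snakemake itself also considers things like modification times).

```python
# Toy rule table mirroring the two rules in the text.
# Hypothetical structure for illustration; not Snakemake's data model.
RULES = {
    "align": {"inputs": ["data/sequences.fasta"], "output": "results/alignment.fasta"},
    "tree": {"inputs": ["results/alignment.fasta"], "output": "results/tree.nwk"},
}


def plan(target, existing_files, rules=RULES):
    """Return the rule names to run, in order, to produce `target`."""
    if target in existing_files:
        return []  # the file is already there, nothing to run
    # Find the rule whose output is the requested file.
    producer = next(
        (name for name, rule in rules.items() if rule["output"] == target), None
    )
    if producer is None:
        raise FileNotFoundError(f"no rule produces missing file {target!r}")
    order = []
    # Resolve every missing input first, then schedule the producing rule.
    for dependency in rules[producer]["inputs"]:
        for name in plan(dependency, existing_files, rules):
            if name not in order:
                order.append(name)
    order.append(producer)
    return order


print(plan("results/tree.nwk", existing_files={"data/sequences.fasta"}))
```

With only data/sequences.fasta present, the sketch prints ['align', 'tree'], matching the order described in the list. If results/alignment.fasta already exists, only tree is scheduled.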
These simple examples only scratch the surface of what Snakemake can do but should give you the general idea.
Augur commands and Snakefiles
We will work from the Zika virus tutorial on the nextstrain website, but instead of the Zika virus data, we will analyze yellow fever sequences sampled in Brazil in 2016-2017, plus a number of older sequences. You can find these data in the nextstrain@krisp workshop repository.