The augur toolchain is the bioinformatics engine of nextstrain and produces the files that can be visualized in the web browser using auspice. Augur consists of a number of tools that allow the user to filter and align sequences, build trees, and integrate the phylogenetic analysis with metadata. The tools are meant to be composable: the output of one tool serves as the input of others. To compose such tools into an analysis pipeline, augur advocates the workflow management tool Snakemake.
Snakemake
Snakemake breaks a workflow into a set of rules that are specified in a file called Snakefile.
Each rule takes a number of input files, specifies a few parameters, and produces output files.
A simple rule would look like this:
rule align:
    input:
        seq="data/sequences.fasta"
    params:
        nthreads=2
    output:
        aln="results/alignment.fasta"
    shell:
        '''
        augur align --sequences {input.seq} --nthreads {params.nthreads} --output {output.aln}
        '''
This rule would produce results/alignment.fasta from the input file data/sequences.fasta using the augur align command and additionally specifies that augur is supposed to use 2 CPUs for this task.
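To make the placeholder mechanism concrete, the following Python sketch mimics how fields such as {input.seq} in the shell string are filled in from the rule's sections. It relies on plain str.format and is not Snakemake's actual implementation; the helper name render_shell is hypothetical.

```python
from types import SimpleNamespace

def render_shell(template, **sections):
    """Fill {section.name} placeholders, e.g. {input.seq}, in a shell template.

    Hypothetical helper for illustration only; Snakemake's own
    formatting machinery is more elaborate.
    """
    # Wrap each section so its entries are reachable by attribute access.
    namespaces = {name: SimpleNamespace(**values) for name, values in sections.items()}
    return template.format(**namespaces)

command = render_shell(
    "augur align --sequences {input.seq} --nthreads {params.nthreads} --output {output.aln}",
    input={"seq": "data/sequences.fasta"},
    params={"nthreads": 2},
    output={"aln": "results/alignment.fasta"},
)
print(command)
# prints: augur align --sequences data/sequences.fasta --nthreads 2 --output results/alignment.fasta
```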
When executing a rule, Snakemake checks whether all necessary input files are present. If not, it determines which rules produce the missing files and executes those first. Say there was an additional rule to build a tree that depended on the alignment:
rule tree:
    input:
        aln="results/alignment.fasta"
    output:
        tree="results/tree.nwk"
    shell:
        '''
        augur tree --alignment {input.aln} --output {output.tree}
        '''
To build the tree, you would now call Snakemake as
>snakemake results/tree.nwk
and Snakemake would
- find the rule that produces results/tree.nwk (rule tree in this case)
- determine that it needs to run rule align to produce results/alignment.fasta
- run rules align and tree
These simple examples only scratch the surface of what Snakemake can do but should give you the general idea.
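The resolution of rule dependencies can be sketched in Python as a small recursive lookup. This is a toy model under simplifying assumptions, not how Snakemake is implemented: the RULES table and the plan function are hypothetical, and only the presence of files is checked, ignoring timestamps, wildcards, and everything else Snakemake tracks.

```python
# Toy rule table mirroring the align and tree rules: each rule lists
# its input and output files.
RULES = {
    "align": {"input": ["data/sequences.fasta"], "output": ["results/alignment.fasta"]},
    "tree": {"input": ["results/alignment.fasta"], "output": ["results/tree.nwk"]},
}

def plan(target, rules, existing_files):
    """Return the rule names to run, in order, to produce `target`."""
    if target in existing_files:
        return []  # nothing to do, the file is already present
    # Find the rule whose outputs include the requested file.
    producer = next(
        (name for name, rule in rules.items() if target in rule["output"]), None
    )
    if producer is None:
        raise FileNotFoundError(f"no rule produces {target}")
    order = []
    # Resolve each input first, so upstream rules run before this one.
    for needed in rules[producer]["input"]:
        for name in plan(needed, rules, existing_files):
            if name not in order:
                order.append(name)
    order.append(producer)
    return order

print(plan("results/tree.nwk", RULES, existing_files={"data/sequences.fasta"}))
# prints ['align', 'tree']
```

If results/alignment.fasta were already present, the same call would return only the tree rule.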
Augur commands and Snakefiles
We will work from the Zika virus tutorial on the nextstrain website and the GitHub repository nextstrain/zika-tutorial. Clone the tutorial using
git clone https://github.com/nextstrain/zika-tutorial.git
and open the Snakefile in your text editor.
Filter the Sequences
Filter the parsed sequences and metadata to exclude strains from subsequent analysis and subsample the remaining strains to a fixed number of samples per group.
augur filter \
  --sequences data/sequences.fasta \
  --metadata data/metadata.tsv \
  --exclude config/dropped_strains.txt \
  --output results/filtered.fasta \
  --group-by country year month \
  --sequences-per-group 20 \
  --min-date 2012
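The grouping logic of this step can be sketched in a few lines of Python. This is a simplified illustration, not augur's implementation: the function name subsample, the record layout, and the uniform random choice within each group are assumptions, and dates are assumed to be complete (YYYY-MM-DD).

```python
import random
from collections import defaultdict

def subsample(records, group_by, per_group, min_year, excluded, seed=0):
    """Keep at most `per_group` strains per group after basic filtering.

    `records` is a list of dicts with a "strain" key, a "date" key in
    YYYY-MM-DD form, and one key per metadata column (assumed layout).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        year, month = rec["date"].split("-")[:2]
        # Drop explicitly excluded strains and strains sampled too early.
        if rec["strain"] in excluded or int(year) < min_year:
            continue
        # "year" and "month" are derived from the date; other fields
        # are read directly from the metadata record.
        derived = {"year": year, "month": month}
        key = tuple(derived.get(field, rec.get(field)) for field in group_by)
        groups[key].append(rec)
    kept = []
    for members in groups.values():
        rng.shuffle(members)
        kept.extend(members[:per_group])
    return kept

# Made-up records for illustration.
records = [
    {"strain": "A", "date": "2015-06-01", "country": "brazil"},
    {"strain": "B", "date": "2015-06-15", "country": "brazil"},
    {"strain": "C", "date": "2015-06-20", "country": "brazil"},
    {"strain": "D", "date": "2016-01-05", "country": "colombia"},
    {"strain": "E", "date": "2010-03-02", "country": "thailand"},
]
kept = subsample(
    records, ["country", "year", "month"], per_group=2, min_year=2012, excluded={"D"}
)
print(sorted(rec["strain"] for rec in kept))
```

Here strain D is dropped by the exclusion list, strain E by the minimum date, and two of the three strains in the (brazil, 2015, 06) group are retained.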
Align the Sequences
Create a multiple alignment of the sequences using a custom reference.
After this alignment, columns with gaps in the reference are removed.
Additionally, the --fill-gaps flag fills gaps in non-reference sequences with "N" characters.
These modifications force all sequences into the same coordinate space as the reference sequence.
augur align \
  --sequences results/filtered.fasta \
  --reference-sequence config/zika_outgroup.gb \
  --output results/aligned.fasta \
  --fill-gaps
Now the pathogen sequences are ready for analysis.
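The two post-alignment modifications described above can be illustrated with a short Python sketch. It operates on toy aligned strings and only shows the idea of projecting sequences onto reference coordinates; the function name to_reference_coordinates and the example sequences are hypothetical, and augur's own code may differ in detail.

```python
def to_reference_coordinates(reference, sequences, fill_gaps=True):
    """Drop alignment columns where the reference has a gap.

    `reference` and every value of `sequences` are aligned strings of
    equal length. With `fill_gaps`, remaining "-" characters in the
    non-reference sequences become "N".
    """
    # Column indices at which the reference carries a real base.
    keep = [i for i, base in enumerate(reference) if base != "-"]
    result = {}
    for name, seq in sequences.items():
        stripped = "".join(seq[i] for i in keep)
        result[name] = stripped.replace("-", "N") if fill_gaps else stripped
    return result

aligned = to_reference_coordinates(
    reference="AC-GTA",
    sequences={"strain_1": "ACTG-A", "strain_2": "AC-GTA"},
)
print(aligned)
# prints {'strain_1': 'ACGNA', 'strain_2': 'ACGTA'}
```

Every output sequence has the length of the ungapped reference, so position k refers to the same site in all of them.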
Construct the Phylogeny
Infer a phylogenetic tree from the multiple sequence alignment.
augur tree \
  --alignment results/aligned.fasta \
  --output results/tree_raw.nwk
The resulting tree is stored in Newick format. Branch lengths in this tree measure nucleotide divergence.
Get a Time-Resolved Tree
Augur can also adjust branch lengths in this tree to position tips by their sample date and infer the most likely time of their ancestors, using TreeTime.
Run the refine command to apply TreeTime to the original phylogenetic tree and produce a "time tree".
augur refine \
  --tree results/tree_raw.nwk \
  --alignment results/aligned.fasta \
  --metadata data/metadata.tsv \
  --output-tree results/tree.nwk \
  --output-node-data results/branch_lengths.json \
  --timetree \
  --coalescent opt \
  --date-confidence \
  --date-inference marginal \
  --clock-filter-iqd 4
In addition to assigning times to internal nodes, the refine command filters tips that are likely outliers and assigns confidence intervals to inferred dates.
Branch lengths in the resulting Newick tree measure adjusted nucleotide divergence.
All other data inferred by TreeTime is stored by strain or internal node name in the corresponding JSON file.
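The outlier filtering controlled by --clock-filter-iqd can be pictured as follows: fit a linear molecular clock to divergence versus sampling date, then flag tips whose residual exceeds the given number of interquartile distances. The Python sketch below illustrates that idea on made-up numbers. It is not TreeTime's code; the function name clock_filter and the data are assumptions, and the real filter works on the tree's root-to-tip distances.

```python
import statistics

def clock_filter(dates, divergences, n_iqd=4):
    """Flag tips whose divergence deviates strongly from a linear clock.

    Fits divergence = rate * date + intercept by least squares, then
    marks tips whose absolute residual exceeds `n_iqd` interquartile
    distances. Returns a list of booleans (True means outlier).
    """
    mean_t = statistics.fmean(dates)
    mean_d = statistics.fmean(divergences)
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, divergences))
    var = sum((t - mean_t) ** 2 for t in dates)
    rate = cov / var
    intercept = mean_d - rate * mean_t
    residuals = [d - (rate * t + intercept) for t, d in zip(dates, divergences)]
    # Interquartile distance of the residuals.
    q1, _, q3 = statistics.quantiles(residuals, n=4)
    iqd = q3 - q1
    return [abs(r) > n_iqd * iqd for r in residuals]

# Hypothetical tips: numeric sampling dates and divergence from the root.
dates = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]
divergences = [0.012, 0.009, 0.014, 0.011, 0.064, 0.013, 0.018, 0.015, 0.020]
flags = clock_filter(dates, divergences, n_iqd=4)
print([date for date, is_outlier in zip(dates, flags) if is_outlier])
# prints [2014]
```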
Annotate the Phylogeny
Reconstruct Ancestral Traits
TreeTime can also infer ancestral traits from an existing phylogenetic tree and metadata annotating each tip of the tree.
The following command infers the region and country of all internal nodes from the time tree and original strain metadata.
As with the refine command, the resulting JSON output is indexed by strain or internal node name.
augur traits \
  --tree results/tree.nwk \
  --metadata data/metadata.tsv \
  --output results/traits.json \
  --columns region country \
  --confidence
Infer Ancestral Sequences
Next, infer the ancestral sequence of each internal node and identify any nucleotide mutations on the branches leading to any node in the tree.
augur ancestral \
  --tree results/tree.nwk \
  --alignment results/aligned.fasta \
  --output results/nt_muts.json \
  --inference joint
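Once a sequence is available for every node, the mutations on a branch follow from comparing the sequences at its two ends. The Python sketch below shows only this comparison step on toy sequences; the ancestral inference itself is performed by TreeTime, and the function name branch_mutations is hypothetical.

```python
def branch_mutations(parent_seq, child_seq):
    """List substitutions between a parent and a child sequence.

    Mutations are written as <parent base><1-based position><child base>,
    e.g. "G3A". Both sequences must share the same coordinates.
    """
    if len(parent_seq) != len(child_seq):
        raise ValueError("sequences must have equal length")
    return [
        f"{p}{pos}{c}"
        for pos, (p, c) in enumerate(zip(parent_seq, child_seq), start=1)
        if p != c
    ]

print(branch_mutations("ACGTAC", "ACATAT"))
# prints ['G3A', 'C6T']
```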
Identify Amino-Acid Mutations
Identify amino acid mutations from the nucleotide mutations and a reference sequence with gene coordinate annotations.
The resulting JSON file contains amino acid mutations indexed by strain or internal node name and by gene name.
To export a FASTA file with the complete amino acid translations for each gene from each node’s sequence, specify the --alignment-output parameter in the form of results/aligned_aa_%GENE.fasta.
augur translate \
  --tree results/tree.nwk \
  --ancestral-sequences results/nt_muts.json \
  --reference-sequence config/zika_outgroup.gb \
  --output results/aa_muts.json
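The translation step can be sketched in Python as follows: cut each gene out of the parent and child nucleotide sequences, translate both with the standard genetic code, and compare the resulting proteins. This is a simplified illustration rather than augur's implementation. The gene name geneA, its coordinates, and the sequences are made up, and only forward-strand genes without splicing are handled.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(nucleotides):
    """Translate a coding sequence; unknown codons (e.g. with N) become 'X'."""
    codons = (nucleotides[i:i + 3] for i in range(0, len(nucleotides) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def aa_mutations(parent_seq, child_seq, genes):
    """Map gene name -> amino acid substitutions between parent and child.

    `genes` maps a gene name to 0-based, end-exclusive (start, end)
    coordinates on the forward strand (an assumed, simplified layout).
    """
    result = {}
    for gene, (start, end) in genes.items():
        parent_aa = translate(parent_seq[start:end])
        child_aa = translate(child_seq[start:end])
        muts = [
            f"{p}{pos}{c}"
            for pos, (p, c) in enumerate(zip(parent_aa, child_aa), start=1)
            if p != c
        ]
        if muts:
            result[gene] = muts
    return result

genes = {"geneA": (2, 11)}  # hypothetical gene annotation
print(aa_mutations("GGATGGCTAAACC", "GGATGGTTAAACC", genes))
# prints {'geneA': ['A2V']}

# Expanding the %GENE placeholder of an --alignment-output template:
print("results/aligned_aa_%GENE.fasta".replace("%GENE", "geneA"))
# prints results/aligned_aa_geneA.fasta
```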
Export the Results
Finally, collect all node annotations and metadata and export it all in auspice’s JSON format.
This step refers to three config files: config/colors.tsv defines colors, config/lat_longs.tsv defines lat/long coordinates, and config/auspice_config.json defines the page title, maintainer, filters present, etc.
The resulting tree and metadata JSON files are the inputs to the auspice visualization tool.
augur export \
  --tree results/tree.nwk \
  --metadata data/metadata.tsv \
  --node-data results/branch_lengths.json \
              results/traits.json \
              results/nt_muts.json \
              results/aa_muts.json \
  --colors config/colors.tsv \
  --lat-longs config/lat_longs.tsv \
  --auspice-config config/auspice_config.json \
  --output-tree auspice/zika_tree.json \
  --output-meta auspice/zika_meta.json
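Because every node-data file is indexed by strain or internal node name, collecting the annotations amounts to merging dictionaries by node. The Python sketch below illustrates this under the assumption that each file holds a top-level "nodes" mapping; the function name merge_node_data and all values are hypothetical, and the export command itself does considerably more (colors, coordinates, page configuration).

```python
def merge_node_data(*node_data):
    """Combine several node-data dictionaries into one.

    Each argument is assumed to look like {"nodes": {name: {key: value}}}.
    Later inputs add to, and on key clashes override, earlier ones.
    """
    merged = {}
    for data in node_data:
        for name, annotations in data.get("nodes", {}).items():
            merged.setdefault(name, {}).update(annotations)
    return {"nodes": merged}

# Made-up stand-ins for the contents of the JSON files passed to --node-data.
branch_lengths = {
    "nodes": {"NODE_0000001": {"numdate": 2015.2}, "strain_1": {"numdate": 2016.1}}
}
traits = {
    "nodes": {"NODE_0000001": {"country": "brazil"}, "strain_1": {"country": "colombia"}}
}
nt_muts = {"nodes": {"strain_1": {"muts": ["G3A"]}}}

combined = merge_node_data(branch_lengths, traits, nt_muts)
print(combined["nodes"]["strain_1"])
# prints {'numdate': 2016.1, 'country': 'colombia', 'muts': ['G3A']}
```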