The augur toolchain is the bioinformatics engine of nextstrain and produces the files that auspice visualizes in the web browser. Augur consists of a number of tools that allow the user to filter and align sequences, build trees, and integrate the phylogenetic analysis with metadata. The tools are meant to be composable: the output of one tool serves as the input of another. To compose these tools into an analysis pipeline, augur advocates the workflow management tool Snakemake.

Snakemake

Snakemake breaks a workflow into a set of rules that are specified in a file called Snakefile. Each rule takes a number of input files, specifies a few parameters, and produces output files. A simple rule would look like this:

rule align:
    input:
        seq="data/sequences.fasta"
    params:
        nthreads = 2
    output:
        aln="results/alignment.fasta"
    shell:
        '''
        augur align --sequences {input.seq} --nthreads {params.nthreads} --output {output.aln}
        '''

This rule would produce results/alignment.fasta from the input file data/sequences.fasta using the augur align command and additionally specifies that augur is supposed to use 2 CPUs for this task.

When executing a rule, Snakemake checks whether all necessary input files are present. If they are not, it determines which rules produce the missing files and executes those first. Say there was an additional rule to build a tree that depended on the alignment:

rule tree:
    input:
        aln="results/alignment.fasta"
    output:
        tree="results/tree.nwk"
    shell:
        '''
        augur tree --alignment {input.aln} --output {output.tree}
        '''

To build the tree, you would now call Snakemake as

>snakemake results/tree.nwk

and Snakemake would

1. find the rule that produces results/tree.nwk (rule tree in this case)
2. determine that it needs to run rule align to produce results/alignment.fasta
3. run rules align and tree, in that order
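
The resolution steps above can be sketched in a few lines of Python. This is a toy model for illustration only, not Snakemake's actual code: the RULES dictionary and the plan function are hypothetical names, each rule is assumed to produce exactly one output, and a file is treated as up to date simply because it exists.

```python
# Toy model of target resolution; not Snakemake's real implementation.
# Each hypothetical rule maps one output file to the input files it needs.
RULES = {
    "align": {"input": ["data/sequences.fasta"], "output": "results/alignment.fasta"},
    "tree": {"input": ["results/alignment.fasta"], "output": "results/tree.nwk"},
}


def plan(target, rules, existing):
    """Return the rule names, in execution order, needed to build `target`."""
    if target in existing:
        return []  # the file is already there, nothing to run
    producers = [name for name, rule in rules.items() if rule["output"] == target]
    if not producers:
        raise FileNotFoundError(f"no rule produces {target}")
    name = producers[0]
    order = []
    # Resolve every missing input first, so dependencies run before the rule.
    for dependency in rules[name]["input"]:
        for step in plan(dependency, rules, existing):
            if step not in order:
                order.append(step)
    order.append(name)
    return order


print(plan("results/tree.nwk", RULES, existing={"data/sequences.fasta"}))
# prints ['align', 'tree']
```

If results/alignment.fasta were already present, the same call would return only the tree rule.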

These simple examples only scratch the surface of what Snakemake can do but should give you the general idea.

Augur commands and Snakefiles

We will work from the Zika virus tutorial on the nextstrain website and the GitHub repository nextstrain/zika-tutorial. Clone the tutorial using

git clone https://github.com/nextstrain/zika-tutorial.git

and open the Snakefile in your text editor.

Filter the Sequences

Filter the parsed sequences and metadata to exclude strains from subsequent analysis and subsample the remaining strains to a fixed number of samples per group.

augur filter \
  --sequences data/sequences.fasta \
  --metadata data/metadata.tsv \
  --exclude config/dropped_strains.txt \
  --output results/filtered.fasta \
  --group-by country year month \
  --sequences-per-group 20 \
  --min-date 2012
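
To make the grouping idea concrete, here is a minimal Python sketch of subsampling to a fixed number of records per group. It is an illustration under stated assumptions rather than augur's implementation: subsample is a hypothetical helper, the records are plain dictionaries, and the year and month values are assumed to be available as fields already.

```python
import random
from collections import defaultdict


def subsample(records, group_by, per_group, seed=0):
    """Keep at most `per_group` records from each group (illustrative only)."""
    groups = defaultdict(list)
    for record in records:
        # The group key is the tuple of values in the grouping columns.
        key = tuple(record[column] for column in group_by)
        groups[key].append(record)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    kept = []
    for key in sorted(groups):
        members = groups[key]
        if len(members) > per_group:
            members = rng.sample(members, per_group)
        kept.extend(members)
    return kept


# Hypothetical metadata: five strains in one group, one strain in another.
records = [
    {"strain": f"s{i}", "country": "brazil", "year": "2016", "month": "01"}
    for i in range(5)
]
records.append({"strain": "c1", "country": "colombia", "year": "2016", "month": "02"})

kept = subsample(records, ["country", "year", "month"], per_group=2)
print(len(kept))  # prints 3
```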

Align the Sequences

Create a multiple alignment of the sequences using a custom reference. After this alignment, columns with gaps in the reference are removed. Additionally, the --fill-gaps flag fills gaps in non-reference sequences with "N" characters. These modifications force all sequences into the same coordinate space as the reference sequence.

augur align \
  --sequences results/filtered.fasta \
  --reference-sequence config/zika_outgroup.gb \
  --output results/aligned.fasta \
  --fill-gaps

Now the pathogen sequences are ready for analysis.
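
The two post-alignment modifications described above can be sketched as follows. This is a simplified illustration, assuming the alignment is held as equal-length strings in memory; to_reference_coordinates is a hypothetical helper, not a function provided by augur.

```python
def to_reference_coordinates(reference, aligned, fill_gaps=True):
    """Drop columns that are gaps in the reference; optionally fill gaps with N."""
    # Alignment columns where the reference has a real base, not a gap.
    keep = [i for i, base in enumerate(reference) if base != "-"]
    result = {}
    for name, sequence in aligned.items():
        stripped = "".join(sequence[i] for i in keep)
        if fill_gaps:
            stripped = stripped.replace("-", "N")
        result[name] = stripped
    return result


reference = "AC-GT"  # the third column is an insertion relative to the reference
aligned = {"s1": "ACTGT", "s2": "A--GT"}
print(to_reference_coordinates(reference, aligned))
# prints {'s1': 'ACGT', 's2': 'ANGT'}
```

After this step every sequence has the same length as the ungapped reference, so position 3 means the same site in all of them.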

Construct the Phylogeny

Infer a phylogenetic tree from the multiple sequence alignment.

augur tree \
  --alignment results/aligned.fasta \
  --output results/tree_raw.nwk

The resulting tree is stored in Newick format. Branch lengths in this tree measure nucleotide divergence.

Get a Time-Resolved Tree

Augur can also adjust branch lengths in this tree to position tips by their sample date and infer the most likely time of their ancestors, using TreeTime. Run the refine command to apply TreeTime to the original phylogenetic tree and produce a "time tree".

augur refine \
  --tree results/tree_raw.nwk \
  --alignment results/aligned.fasta \
  --metadata data/metadata.tsv \
  --output-tree results/tree.nwk \
  --output-node-data results/branch_lengths.json \
  --timetree \
  --coalescent opt \
  --date-confidence \
  --date-inference marginal \
  --clock-filter-iqd 4

In addition to assigning times to internal nodes, the refine command filters tips that are likely outliers and assigns confidence intervals to inferred dates. Branch lengths in the resulting Newick tree measure adjusted nucleotide divergence. All other data inferred by TreeTime is stored by strain or internal node name in the corresponding JSON file.

Annotate the Phylogeny

Reconstruct Ancestral Traits

TreeTime can also infer ancestral traits from an existing phylogenetic tree and metadata annotating each tip of the tree. The following command infers the region and country of all internal nodes from the time tree and original strain metadata. As with the refine command, the resulting JSON output is indexed by strain or internal node name.

augur traits \
  --tree results/tree.nwk \
  --metadata data/metadata.tsv \
  --output results/traits.json \
  --columns region country \
  --confidence

Infer Ancestral Sequences

Next, infer the ancestral sequence of each internal node and identify any nucleotide mutations on the branches leading to any node in the tree.

augur ancestral \
  --tree results/tree.nwk \
  --alignment results/aligned.fasta \
  --output results/nt_muts.json \
  --inference joint

Identify Amino-Acid Mutations

Identify amino acid mutations from the nucleotide mutations and a reference sequence with gene coordinate annotations. The resulting JSON file contains amino acid mutations indexed by strain or internal node name and by gene name. To export a FASTA file with the complete amino acid translations for each gene from each node’s sequence, specify the --alignment-output parameter in the form of results/aligned_aa_%GENE.fasta.

augur translate \
  --tree results/tree.nwk \
  --ancestral-sequences results/nt_muts.json \
  --reference-sequence config/zika_outgroup.gb \
  --output results/aa_muts.json
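
As a rough illustration of how amino acid mutations follow from nucleotide changes, the sketch below translates a parent and a child coding sequence with the standard genetic code and reports the positions where they differ. It assumes both sequences are in frame and of equal length; translate and aa_mutations are hypothetical helpers, and augur's own implementation works from the annotated reference and the tree rather than from sequence pairs.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate an in-frame nucleotide sequence; unknown codons become X."""
    usable = len(sequence) - len(sequence) % 3
    return "".join(
        CODON_TABLE.get(sequence[i:i + 3], "X") for i in range(0, usable, 3)
    )


def aa_mutations(parent, child):
    """List amino acid changes such as 'A2V' between two coding sequences."""
    pairs = zip(translate(parent), translate(child))
    return [f"{a}{i + 1}{b}" for i, (a, b) in enumerate(pairs) if a != b]


print(aa_mutations("ATGGCCAAA", "ATGGTCAAA"))  # prints ['A2V']
```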

Export the Results

Finally, collect all node annotations and metadata and export them in auspice’s JSON format. The command uses three config files: config/colors.tsv defines colors, config/lat_longs.tsv defines latitude/longitude coordinates, and config/auspice_config.json defines the page title, maintainer, available filters, and similar settings. The resulting tree and metadata JSON files are the inputs to the auspice visualization tool.

augur export \
  --tree results/tree.nwk \
  --metadata data/metadata.tsv \
  --node-data results/branch_lengths.json \
              results/traits.json \
              results/nt_muts.json \
              results/aa_muts.json \
  --colors config/colors.tsv \
  --lat-longs config/lat_longs.tsv \
  --auspice-config config/auspice_config.json \
  --output-tree auspice/zika_tree.json \
  --output-meta auspice/zika_meta.json
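
Because each node-data file is indexed by strain or internal node name, combining them amounts to a per-node merge. The sketch below shows that idea on in-memory dictionaries. It assumes each file holds a top-level "nodes" object keyed by node name, which is an assumption about the file layout; merge_node_data, the node names, and the attribute values are hypothetical, and this is not augur's export code.

```python
def merge_node_data(*node_data):
    """Combine several node-data dictionaries into one mapping per node name."""
    merged = {}
    for data in node_data:
        # Assumed layout: {"nodes": {node_name: {attribute: value, ...}}}
        for name, attributes in data.get("nodes", {}).items():
            merged.setdefault(name, {}).update(attributes)
    return merged


# Hypothetical contents standing in for two of the JSON files above.
branch_lengths = {"nodes": {"NODE_1": {"branch_length": 0.001}}}
traits = {"nodes": {"NODE_1": {"country": "brazil"}, "strain_A": {"country": "colombia"}}}

print(merge_node_data(branch_lengths, traits))
```

To apply this to files on disk, each one could be read with json.load before being passed in.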

Augur and Snakemake