The tool-chain augur is the bioinformatics engine of Nextstrain and produces the files that can be visualized in the web browser using auspice. Augur consists of a number of tools that allow the user to filter and align sequences, build trees, and integrate the phylogenetic analysis with metadata. The different tools are meant to be composable: the output of one tool serves as the input of other tools. We will work off the Zika virus tutorial on the Nextstrain website and the GitHub repository nextstrain/zika-tutorial.
Move to the zika-tutorial folder and ensure your nextstrain environment is activated.
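If you followed a conda-based installation, this might look as follows (the directory and environment names are assumptions; adjust them to your own setup):
cd zika-tutorial
conda activate nextstrain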
Augur commands
As an example, we'll look at the filter command in greater detail.
This command allows you to select various subsets of your input data for different types of analysis.
A simple example use of this command would be
augur filter --sequences data/sequences.fasta --metadata data/metadata.tsv --min-date 2012 --output filtered.fasta
This command will select all sequences with a collection date in 2012 or later. The filter command has a large number of options that allow flexible filtering for many common situations. One such use case is the exclusion of sequences that are known to be outliers (e.g. because of sequencing errors, cell-culture adaptation, ...). These can be specified in a separate file:
BRA/2016/FC_DQ75D1
COL/FLR_00034/2015
...
To drop such strains, you can pass the name of this file to the augur filter command:
augur filter --sequences data/sequences.fasta \
--metadata data/metadata.tsv \
--min-date 2012 \
--exclude config/dropped_strains.txt \
--output filtered.fasta
(To improve legibility, we have wrapped the command across multiple lines.)
If you run this command (you should be able to copy-paste it into your terminal), you should see that one of the sequences in the data set was dropped, since its name was in the dropped_strains.txt file.
Another common filtering operation is subsetting the data to achieve a more even spatio-temporal distribution or to reduce the data set to a more manageable size. The filter command allows you to select a specific number of sequences from specific groups, for example one sequence per month from each country:
augur filter \
--sequences data/sequences.fasta \
--metadata data/metadata.tsv \
--min-date 2012 \
--exclude config/dropped_strains.txt \
--group-by country year month \
--sequences-per-group 1 \
--output filtered.fasta
This subsampling and filtering will reduce the number of sequences in this tutorial data set from 34 to 24.
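As a quick sanity check, you can count the FASTA headers in the input and output files yourself:
grep -c ">" data/sequences.fasta filtered.fasta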
Chaining augur commands with Snakemake
The output of one augur command serves as input to the next and hence these commands need to be executed in order. This is a common pattern in bioinformatics and multiple tools -- so-called workflow managers -- exist to facilitate this. Within Nextstrain, we have relied on Snakemake.
Snakemake breaks a workflow into a set of rules that are specified in a file called Snakefile.
Each rule takes a number of input files, specifies a few parameters, and produces output files.
We suggest building up the Snakefile one rule at a time, starting with the stub below that just defines a number of file names.
input_fasta = "data/sequences.fasta"
input_metadata = "data/metadata.tsv"
dropped_strains = "config/dropped_strains.txt"
reference = "config/zika_outgroup.gb"
colors = "config/colors.tsv"
lat_longs = "config/lat_longs.tsv"
auspice_config = "config/auspice_config.json"
Copy this stub and name the file Snakefile_Qlife.
A simple rule would look like this:
rule index_sequences:
message:
"""
Creating an index of sequence composition for filtering.
"""
input:
sequences = "data/sequences.fasta"
output:
sequence_index = "results/sequence_index.tsv"
shell:
"""
augur index \
--sequences {input.sequences} \
--output {output.sequence_index}
"""
Note that we explicitly specify what is an input and what is an output file.
Add this rule to the snakefile Snakefile_Qlife, save the file, and execute
snakemake -j 1 -s Snakefile_Qlife index_sequences
This will run the rule and produce the desired output files. Check and inspect the output. Then repeat this process with all the rules below.
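For example, the sequence index produced by the rule above is a plain tab-separated table of per-sequence composition statistics, so looking at its first lines is enough to confirm that the rule worked:
head results/sequence_index.tsv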
rule filter:
input:
sequences = "data/sequences.fasta",
metadata = "data/metadata.tsv"
output:
sequences = "results/filtered.fasta"
params:
min_date = 2012
shell:
"""
augur filter \
--sequences {input.sequences} \
--metadata {input.metadata} \
--output {output.sequences} \
--min-date {params.min_date}
"""
This rule would produce results/filtered.fasta from the input files data/sequences.fasta and data/metadata.tsv using the augur filter command.
To filter our data, we would now call snakemake as
snakemake -s Snakefile_Qlife --cores 1 results/filtered.fasta
and snakemake will run the same command as specified above. Note that you can either specify an explicit output file, or call the rule by name.
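For example, the following invocation requests the rule by name and produces the same output file:
snakemake -s Snakefile_Qlife --cores 1 filter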
So far, this is just a complicated reformulation of what we did above, but the benefit of workflow management becomes clear once we add more steps.
The next natural step in our phylogenetic pipeline is aligning the filtered sequences, for which we define a rule align.
rule align:
input:
sequences = rules.filter.output.sequences,
reference = "config/zika_outgroup.gb"
output:
alignment = "results/aligned.fasta"
shell:
"""
augur align \
--sequences {input.sequences} \
--reference-sequence {input.reference} \
--output {output.alignment}
"""
You can download a Snakefile containing these two rules here. If you now want to generate the alignment, you can type
snakemake -s Snakefile_Qlife results/aligned.fasta --cores 1
and snakemake will
- determine that results/aligned.fasta is an output of rule align
- check whether all required input files are in place, and run the necessary rules if not
- run rule align and check whether the file results/aligned.fasta appeared.
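To watch this dependency resolution without actually running anything, you can ask snakemake for a dry run with its -n (--dry-run) flag:
snakemake -s Snakefile_Qlife -n results/aligned.fasta
This prints the jobs that would be executed, which is a convenient way to check a new rule before running it.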
If you supply a reference sequence to augur align, augur will include that reference sequence in the alignment and strip all insertions relative to that reference. This will guarantee a consistent reference coordinate system and genome annotation later on.
Construct the Phylogeny
Infer a phylogenetic tree from the multiple sequence alignment.
rule tree:
input:
alignment = rules.align.output.alignment
output:
tree = "results/tree_raw.nwk"
shell:
"""
augur tree \
--alignment {input.alignment} \
--output {output.tree}
"""
The resulting tree is stored in Newick format. Branch lengths in this tree measure nucleotide divergence.
You can run the command by typing
snakemake -s Snakefile_Qlife results/tree_raw.nwk --cores 1
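Since Newick is a plain-text format, you can glance at the beginning of the resulting file directly (the strain names and branch lengths will of course depend on your run):
head -c 300 results/tree_raw.nwk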
Get a Time-Resolved Tree
Most phylogenies you see on Nextstrain are time-resolved, that is, the branch lengths of the tree correspond to calendar time rather than evolutionary distance.
The default method to infer time-resolved phylogenies in Nextstrain is TreeTime, and this analysis is done using the augur command refine (this step "refines" the existing tree...).
The corresponding rule in the Snakefile would look as follows:
rule refine:
message: "Refining the tree"
input:
tree = rules.tree.output.tree,
alignment = rules.align.output,
metadata = input_metadata
output:
tree = "results/tree.nwk",
node_data = "results/branch_lengths.json"
shell:
"""
augur refine \
--tree {input.tree} \
--alignment {input.alignment} \
--metadata {input.metadata} \
--timetree \
--output-tree {output.tree} \
--output-node-data {output.node_data}
"""
This command requires the output of the tree and align rules as input and will produce the files results/tree.nwk and results/branch_lengths.json.
The latter contains the inferred clock model and the inferred dates of all nodes, including internal nodes.
Each internal node has been given a unique name (e.g. NODE_0000001211) which we will use in the following steps to attach more information to the tree.
The refine command has many different options that allow you to specify clock rates, filter sequences that don't follow the molecular clock, calculate confidence intervals, etc.
We'll get into these details later.
You can run the command by typing
snakemake -s Snakefile_Qlife results/tree.nwk --cores 1
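As a standalone illustration of some of these options, a refine call with an explicit clock rate, clock-based outlier filtering, and confidence intervals for node dates might look like the sketch below; the numeric values are arbitrary examples, not recommendations:
augur refine \
--tree results/tree_raw.nwk \
--alignment results/aligned.fasta \
--metadata data/metadata.tsv \
--timetree \
--clock-rate 0.0008 \
--clock-filter-iqd 4 \
--date-confidence \
--output-tree results/tree.nwk \
--output-node-data results/branch_lengths.json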
Annotate the Phylogeny
Reconstruct Ancestral Traits
TreeTime can also infer ancestral traits from an existing phylogenetic tree and metadata annotating each tip of the tree.
The following command infers the region and country of all internal nodes from the time tree and original strain metadata.
As with the refine command, the resulting JSON output is indexed by strain or internal node name.
Specifying --confidence means that the confidence intervals for the reconstructed trait values will be estimated.
rule traits:
message: "Inferring ancestral traits"
input:
tree = rules.refine.output.tree,
metadata = "data/metadata.tsv"
output:
node_data = "results/traits.json",
params:
columns = "region country"
shell:
"""
augur traits \
--tree {input.tree} \
--metadata {input.metadata} \
--output {output.node_data} \
--columns {params.columns} \
--confidence
"""
You can run the command by typing
snakemake -s Snakefile_Qlife results/traits.json --cores 1
Infer Ancestral Sequences
Using ancestral, we can reconstruct what the ancestors of our samples' sequences must have looked like and record the mutations that occurred on each branch.
The command infers the ancestral sequence of each internal node and identifies the nucleotide mutations on the branches leading to any node in the tree.
rule ancestral:
message: "Reconstructing ancestral sequences and mutations"
input:
tree = rules.refine.output.tree,
alignment = rules.align.output
output:
node_data = "results/nt_muts.json"
shell:
"""
augur ancestral \
--tree {input.tree} \
--alignment {input.alignment} \
--output-node-data {output.node_data}
"""
You can run the command by typing
snakemake -s Snakefile_Qlife results/nt_muts.json --cores 1
Identify Amino-Acid Mutations
Identify amino acid mutations from the nucleotide mutations and a reference sequence with gene coordinate annotations.
The resulting JSON file contains amino acid mutations indexed by strain or internal node name and by gene name.
To export a FASTA file with the complete amino acid translations for each gene from each node’s sequence, specify the --alignment-output parameter in the form of results/aligned_aa_%GENE.fasta.
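Building on the files produced earlier in this tutorial, such a call might look like the following sketch (the output paths are illustrative):
augur translate \
--tree results/tree.nwk \
--ancestral-sequences results/nt_muts.json \
--reference-sequence config/zika_outgroup.gb \
--alignment-output results/aligned_aa_%GENE.fasta \
--output results/aa_muts.json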
The reference sequence needs to be in GenBank format (if working with FASTA input) or GFF format (if working with VCF input) - see the Prerequisites page for more information on how to find and download a reference sequence from GenBank.
Open the Zika tutorial reference sequence config/zika_outgroup.gb in a text editor.
For translation, the important parts are the source and CDS sections within the FEATURES block of the file:
[...]
FEATURES Location/Qualifiers
source 1..10769
/collection_date="25-Oct-2013"
/country="French Polynesia"
/db_xref="taxon:64320"
/host="Homo sapiens"
/isolation_source="serum"
/mol_type="genomic RNA"
/organism="Zika virus"
/strain="PF13/251013-18"
[...]
CDS 91..456
/product="capsid protein"
/gene="CA"
CDS 457..735
/product="propeptide"
/gene="PRO"
CDS 736..960
/product="membrane protein"
/gene="MP"
CDS 961..2472
/product="envelope protein"
/gene="ENV"
[...]
The source section tells augur how long the whole genome is.
The CDS sections describe each gene (using /gene=) that should be translated by giving its name and its start and end location in the genome.
In your downloaded file from GenBank, you might notice that it has only one CDS section or none at all, and that instead the sections you would like to translate as genes are described as mat_peptide, like below:
mat_peptide 91..465
/product="capsid protein"
mat_peptide 466..735
/product="propeptide"
mat_peptide 736..960
/product="membrane protein"
mat_peptide 961..2475
/product="envelope protein"
In this case, you will have to manually modify this file to work with augur.
Anything you wish to translate as a gene should have its section name changed from mat_peptide to CDS.
You will also need to add /gene= and a gene name to the sections you'd like to translate.
If you have one really long CDS section that covers most of the genome, we'd recommend not translating it (don't provide a /gene=) if you can include smaller genes instead, as such a long polyprotein is not very informative.
If you are working with VCF input, the GFF file contains the same information, but augur will only translate those items coded as gene.
rule translate:
input:
tree = rules.refine.output.tree,
node_data = rules.ancestral.output.node_data,
reference = "config/zika_outgroup.gb"
output:
node_data = "results/aa_muts.json"
shell:
"""
augur translate \
--tree {input.tree} \
--ancestral-sequences {input.node_data} \
--reference-sequence {input.reference} \
--output {output.node_data}
"""
Export the Results
Finally, collect all node annotations and metadata and export them in auspice’s JSON format.
This command pulls together the metadata file and all of the output files generated by the previous steps and, since we use augur export v2, combines them into a single dataset JSON ("auspice/zika.json") that contains both the annotated tree and the associated metadata.
There is one input file below that is neither metadata nor output from a previous step: the config file ("config/auspice_config.json").
This file is very important - it specifies the page title, maintainer(s), what filters are present, which traits are used to place samples on the map, and what the user is able to color the tree by ("color-bys").
Open up "config/auspice_config.json"
in a text editor.
The title
and maintiner
fields should be easy to spot.
You can modify them if you wish (and should, if you are using your own data!).
The filters section specifies what traits will be available for users to filter the data by.
For example, if "country" is a filter, users will be able to only show data from the countries they select.
The color_options section specifies what users will be allowed to color the data by. You will almost always have the first two sections, "gt" and "num_date", unless you are using trees without sequences or trees that aren't time-resolved.
After that, you can include any trait in your data.
If you have additional traits in your metadata, you will need to add them here to have them show up in auspice! You can copy the entry for "country" and rename it to match the appropriate column from your metadata file.
The geo section informs auspice which traits should be used to draw the samples onto the map. Obviously, they must be location-related, like "country" and "region".
You can use other locations if that's more relevant to your data. We won't cover this right now, but can explain it in detail when you come to this with your own data.
Don't worry about the defaults section for the moment.
The resulting dataset JSON file is the input to the auspice visualization tool.
rule export:
input:
tree = rules.refine.output.tree,
metadata = input_metadata,
branch_lengths = rules.refine.output.node_data,
traits = rules.traits.output.node_data,
nt_muts = rules.ancestral.output.node_data,
aa_muts = rules.translate.output.node_data,
auspice_config = "config/auspice_config.json"
output:
auspice_json = "auspice/zika.json",
shell:
"""
augur export v2 \
--tree {input.tree} \
--metadata {input.metadata} \
--node-data {input.branch_lengths} {input.traits} {input.nt_muts} {input.aa_muts} \
--auspice-config {input.auspice_config} \
--output {output.auspice_json}
"""
We have now reproduced, step by step, the original Snakefile, which in addition contains a rule clean that simply removes all outputs, and a rule all, which is the default rule at the beginning of the Snakefile.
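A minimal version of such a rule all, consistent with the workflow we built above, simply lists the final export output as its input, so that running snakemake without an explicit target builds the whole pipeline (a sketch, assuming the output name from our export rule):
rule all:
    input:
        auspice_json = "auspice/zika.json"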
The whole purpose of Snakemake is to run all these rules in the right order and manage all the steps. To get a glimpse of this, run
snakemake --forceall --dag | dot -Tpdf > dag.pdf
and open the dag.pdf depiction of the execution graph.
Here, we have omitted the -s argument; when it is not specified, snakemake defaults to the file named Snakefile, i.e. the original Snakefile of the tutorial.
Then type
snakemake -j1 clean
snakemake -j1
and snakemake will first remove the output of the previous runs and then run the default rule all to execute the entire workflow.