Sequence data
Your sequence data should
- consist of homologous sequences that can be aligned unambiguously
- contain sufficient diversity to allow reliable tree reconstruction
- be of similar length. Mixing short sequences (300bp) with much longer ones (10000bp) often yields unexpected results.
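A quick way to spot a mix of very short and very long sequences is to summarize the lengths before running the analysis. The following is a minimal sketch and not part of augur; the function names `read_fasta_lengths` and `flag_length_outliers` and the cutoff of half the median length are illustrative assumptions.

```python
import io
from statistics import median


def read_fasta_lengths(handle):
    """Return {sequence name: length} for an open FASTA file handle."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word after '>' as the sequence identifier
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def flag_length_outliers(lengths, min_fraction=0.5):
    """Return names of sequences shorter than min_fraction * median length."""
    cutoff = min_fraction * median(lengths.values())
    return sorted(name for name, n in lengths.items() if n < cutoff)


# Toy example with made-up sequences
example = io.StringIO(">seq_a\nACGTACGTAC\n>seq_b\nACGTACGTAA\n>seq_c\nACG\n")
lengths = read_fasta_lengths(example)
short = flag_length_outliers(lengths)
print(lengths)
print(short)
```

For a real data set, pass `open("sequences.fasta")` (a hypothetical file name) instead of the `io.StringIO` object and adjust `min_fraction` to your data.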
Reference sequence
A reference sequence is needed to obtain consistent alignment coordinates and sequence numbering. In addition, a reference sequence provides the annotation. For long genomes, annotation will typically be provided as a GFF file, while short annotated genomes are often stored in genbank format. Augur will not use all annotated features listed in a genbank file, but only those marked in a specific way:
To ensure augur interprets an annotation as a protein feature, it should be of type `CDS` and contain a tag `gene` or `locus_tag`, for example:
     CDS             961..2472
                     /product="envelope protein"
                     /gene="ENV"
You might have to manually adjust the genbank file to make that work.
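To see which features of a genbank file satisfy this rule before running augur, the feature table can be scanned with a few lines of Python. This is a simplified sketch, not augur's own parser: the function name `cds_features_with_names` is hypothetical, and the code assumes the standard flat-file layout in which feature keys start in column 6 and qualifiers are indented further.

```python
def cds_features_with_names(genbank_text):
    """List (location, name) for each CDS feature in a genbank flat file.

    name is the /gene value, else the /locus_tag value, else None.
    """
    features = []
    current = None
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if line and not line.startswith(" "):
            break  # ORIGIN or another top-level section ends the table
        if line[:5] == "     " and line[5:6].strip():
            # A new feature: key in column 6, followed by its location
            parts = line.split()
            current = None
            if parts[0] == "CDS":
                location = parts[1] if len(parts) > 1 else ""
                current = {"location": location, "gene": None, "locus_tag": None}
                features.append(current)
        elif current is not None:
            # Qualifier line belonging to the current CDS
            qualifier = line.strip()
            for tag in ("gene", "locus_tag"):
                prefix = "/" + tag + "="
                if qualifier.startswith(prefix):
                    current[tag] = qualifier[len(prefix):].strip('"')
    return [(f["location"], f["gene"] or f["locus_tag"]) for f in features]


# Made-up example: the second CDS has neither /gene nor /locus_tag
example = """LOCUS       EXAMPLE
FEATURES             Location/Qualifiers
     source          1..3000
     CDS             961..2472
                     /product="envelope protein"
                     /gene="ENV"
     CDS             2500..2900
                     /product="hypothetical protein"
ORIGIN
"""
result = cds_features_with_names(example)
print(result)
```

Entries whose name is `None` are the ones that would need a `gene` or `locus_tag` tag added by hand.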
Metadata
Metadata typically include collection dates, geographic location, symptoms of patients, host characteristics, etc. For Nextstrain to be able to parse and visualize them, they need to be formatted consistently. An example metadata file is shown here:
strain accession date region
1_0087_PF KX447509 2013-12-XX oceania
1_0181_PF KX447512 2013-12-XX oceania
1_0199_PF KX447519 2013-11-XX oceania
BRA/2016 KY785433 2016-04-08 south_america
BRA/2015 KY558989 2015-02-23 south_america
There needs to be at least one column named `strain` or `name`.
The values in this column need to match the identifiers of your sequences (in the FASTA or VCF file) exactly and must not contain characters such as spaces or `()[]{}|#><`.
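The naming rules above can be checked automatically before starting an analysis. The sketch below is not part of augur; the function name `check_strain_names` is hypothetical, and it assumes the metadata file is tab-separated.

```python
import csv
import io

# Characters listed above as not allowed in strain names
FORBIDDEN = set(" ()[]{}|#><")


def check_strain_names(metadata_handle, sequence_names):
    """Return a list of human-readable problems with the strain column."""
    reader = csv.DictReader(metadata_handle, delimiter="\t")
    columns = reader.fieldnames or []
    key = "strain" if "strain" in columns else "name" if "name" in columns else None
    if key is None:
        return ["no column named 'strain' or 'name'"]
    problems = []
    seen = set()
    for row in reader:
        strain = row[key]
        seen.add(strain)
        bad = sorted(set(strain) & FORBIDDEN)
        if bad:
            problems.append(f"{strain!r} contains forbidden characters {bad}")
        if strain not in sequence_names:
            problems.append(f"{strain!r} has no matching sequence")
    for name in sorted(set(sequence_names) - seen):
        problems.append(f"sequence {name!r} has no metadata row")
    return problems


# Two rows from the example above; the second name was deliberately
# broken (space instead of '/') to trigger the checks
metadata = io.StringIO(
    "strain\taccession\tdate\tregion\n"
    "BRA/2016\tKY785433\t2016-04-08\tsouth_america\n"
    "BRA 2015\tKY558989\t2015-02-23\tsouth_america\n"
)
problems = check_strain_names(metadata, {"BRA/2016", "BRA/2015"})
for problem in problems:
    print(problem)
```

The set of sequence names would normally be collected from the headers of your FASTA file.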
Dates should be formatted as `YYYY-MM-DD`.
You can specify an unknown day or month by replacing the respective values with `XX` (e.g. `2013-01-XX` or `2011-XX-XX`), and completely unknown dates can be given as `20XX-XX-XX` (which does not restrict the sequence to the 21st century; it could be earlier).
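A small check can catch dates that were silently reformatted (for example by a spreadsheet program). The following is a minimal sketch under the rules stated above, not augur's own date parser; the names `DATE_PATTERN` and `is_valid_date` are illustrative, and the check does not verify that a day exists in a given month.

```python
import re

# YYYY-MM-DD, where an unknown month or day is written as XX and a
# completely unknown date as 20XX-XX-XX
DATE_PATTERN = re.compile(r"(\d{4}|20XX)-(\d{2}|XX)-(\d{2}|XX)")


def is_valid_date(value):
    """Return True if value follows the date convention described above."""
    match = DATE_PATTERN.fullmatch(value)
    if not match:
        return False
    _, month, day = match.groups()
    if month != "XX" and not 1 <= int(month) <= 12:
        return False
    if day != "XX" and not 1 <= int(day) <= 31:
        return False
    return True


for value in ["2016-04-08", "2013-12-XX", "2011-XX-XX", "20XX-XX-XX",
              "08/04/2016", "2016-4-8", "2016-13-01"]:
    print(value, is_valid_date(value))
```

The first four values pass the check; the last three are rejected because of the wrong separator, missing zero padding, and an impossible month, respectively.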
Because Excel will automatically change the date formatting, we recommend not opening or preparing your metadata file in Excel. If the metadata is already in Excel, or you decide to prepare it in Excel, we recommend using another program to correct the dates afterwards (and don't open it in Excel again!).
Rooting of the tree
Augur (through treetime) will attempt to find a root for the inferred tree that is most compatible with the collection dates of your sequences. This doesn't always work, in particular if sampling through time is very uneven and there is little temporal signal in the data (e.g. your sample consists of several distant clades). In this case, you might need to provide an explicit outgroup to root the tree.
Auspice config file
The current version of auspice requires a file that specifies some aspects of how your analysis is visualized. The following shows an example JSON file with ADDED comments (following //). These comments NEED to be REMOVED for the file to be valid JSON and work!
{
  // This will appear as the title of your analysis
  "title": "Real-time tracking of Zika virus evolution",
  // The following will appear in the menu from which you choose colorings.
  // To be understood by auspice, they need to follow a specific format,
  // and the options `gt` and `num_date` should be left as is.
  // You can add additional color options corresponding to columns in your metadata.
  "color_options": {
    "gt": {
      "menuItem": "genotype",
      "legendTitle": "Genotype",
      "type": "discrete",
      "key": "genotype"
    },
    "num_date": {
      "menuItem": "date",
      "legendTitle": "Sampling date",
      "type": "continuous",
      "key": "num_date"
    },
    "authors": {
      "key": "authors",
      "legendTitle": "Authors",
      "menuItem": "authors",
      "type": "discrete"
    },
    "country": {
    },
    "region": {
    }
  },
  // Specifies how strains are aggregated on the map
  "geo": [
    "country",
    "region"
  ],
  // Which panels to show. Additional options are `entropy` and `frequency`
  "panels": [
    "tree",
    "map"
  ],
  "defaults": {
    "mapTriplicate": true
  },
  // Put your name and contact details...
  "maintainer": [
    "Trevor Bedford",
    "http://bedford.io/team/trevor-bedford/"
  ],
  // Discrete fields by which you can filter the data. These correspond to
  // columns in the metadata
  "filters": [
    "country",
    "region",
    "authors"
  ]
}
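If you keep a commented copy of the config for your own reference, the comments can be stripped automatically before use. The sketch below is one possible approach and not an auspice or augur tool; the function name `strip_line_comments` is hypothetical. It only removes `//` that occurs outside of string values, so URLs such as the maintainer link stay intact.

```python
import json


def strip_line_comments(text):
    """Remove // comments that are outside of JSON string literals."""
    cleaned = []
    for line in text.splitlines():
        in_string = False
        escaped = False
        cut = len(line)
        for i, ch in enumerate(line):
            if escaped:
                escaped = False  # character after a backslash is literal
            elif ch == "\\" and in_string:
                escaped = True
            elif ch == '"':
                in_string = not in_string
            elif ch == "/" and not in_string and line[i:i + 2] == "//":
                cut = i  # comment starts here; drop the rest of the line
                break
        cleaned.append(line[:cut])
    return "\n".join(cleaned)


# Shortened version of the commented example above
commented = """{
  // This will appear as the title of your analysis
  "title": "Real-time tracking of Zika virus evolution",
  "maintainer": [
    "Trevor Bedford",
    "http://bedford.io/team/trevor-bedford/"  // the URL is kept intact
  ]
}"""
config = json.loads(strip_line_comments(commented))
print(config["title"])
print(config["maintainer"][1])
```

Parsing the result with `json.loads` doubles as a check that the cleaned file is valid JSON.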