Important tips for your own analysis

Sequence data

Your sequence data should

consist of homologous sequences that can be aligned unambiguously
contain sufficient diversity to allow reliable tree reconstruction
be of similar length. Mixing short sequences (300 bp) with much longer ones (10,000 bp) often yields unexpected results.
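One way to spot a mix of very short and very long sequences before building a tree is to tabulate sequence lengths. The following is a minimal sketch using only the Python standard library; the function names and the 50% cutoff are illustrative assumptions, not part of augur.

```python
import io
from statistics import median


def read_fasta_lengths(handle):
    """Return {sequence name: number of residues} for an open FASTA file."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def flag_length_outliers(lengths, min_fraction=0.5):
    """Names of sequences shorter than min_fraction * median length.

    The 0.5 default is an arbitrary illustrative threshold.
    """
    cutoff = min_fraction * median(lengths.values())
    return sorted(name for name, n in lengths.items() if n < cutoff)


# Toy input; in practice pass open("your_sequences.fasta") instead.
example = io.StringIO(">A\nACGTACGTAC\n>B\nACGTACGTAA\n>C\nACG\n")
lengths = read_fasta_lengths(example)
print(flag_length_outliers(lengths))
```

Sequences flagged this way are not necessarily wrong, but they are worth a look before they go into the alignment.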

Reference sequence

A reference sequence is needed to obtain consistent alignment coordinates and sequence numbering. It also provides the annotation. For long genomes, the annotation is typically provided as a GFF file, while short annotated genomes are often stored in GenBank format. Augur does not use every annotated feature listed in a GenBank file, only those marked in a specific way.

To ensure augur interprets an annotation as a protein feature, the feature should be of type CDS and carry a gene or locus_tag qualifier, for example:

     CDS             961..2472
                     /product="envelope protein"
                     /gene="ENV"

You might have to edit the GenBank file manually to meet this requirement.
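To find the features that need adjusting, you can scan the feature table for CDS entries that lack both qualifiers. The sketch below is a simplified text scan using only the standard library, not a full GenBank parser: it assumes the usual layout (feature keys indented by five spaces, qualifiers starting with `/` on their own lines), and the function name is hypothetical. The example record is made up for illustration.

```python
import re

# A feature line looks like "     CDS             961..2472".
FEATURE_START = re.compile(r"^ {5}(\S+)\s+(\S+)")
# A qualifier line looks like "                     /gene="ENV"".
QUALIFIER = re.compile(r"^\s+/(\w+)")


def cds_without_gene_tag(genbank_text):
    """Return locations of CDS features with neither /gene nor /locus_tag."""
    features = []  # each entry: (feature type, location, set of qualifier names)
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if line and not line.startswith(" "):
            break  # reached the next top-level section, e.g. ORIGIN
        start = FEATURE_START.match(line)
        if start:
            features.append((start.group(1), start.group(2), set()))
            continue
        qualifier = QUALIFIER.match(line)
        if qualifier and features:
            features[-1][2].add(qualifier.group(1))
    return [
        location
        for kind, location, qualifiers in features
        if kind == "CDS" and not qualifiers & {"gene", "locus_tag"}
    ]


# Made-up record: the second CDS has no /gene or /locus_tag qualifier.
record = """\
FEATURES             Location/Qualifiers
     source          1..10000
                     /organism="Example virus"
     CDS             961..2472
                     /product="envelope protein"
                     /gene="ENV"
     CDS             2473..3000
                     /product="hypothetical protein"
ORIGIN
"""
print(cds_without_gene_tag(record))
```

Any location it reports is a CDS that would need a gene or locus_tag qualifier added by hand.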

Metadata

Metadata typically includes collection dates, geographic location, patient symptoms, host characteristics, etc. For Nextstrain to be able to parse and visualize these, they need to be formatted consistently. An example metadata file is shown here:

strain      accession   date        region
1_0087_PF   KX447509    2013-12-XX  oceania
1_0181_PF   KX447512    2013-12-XX  oceania
1_0199_PF   KX447519    2013-11-XX  oceania
BRA/2016    KY785433    2016-04-08  south_america
BRA/2015    KY558989    2015-02-23  south_america

There needs to be at least one column named strain or name. Its values need to match the identifiers of your sequences (in the Fasta or VCF file) exactly and must not contain characters such as spaces or ()[]{}|#><. Dates should be formatted as YYYY-MM-DD. You can specify an unknown day or month by replacing the respective values with XX (e.g. 2013-01-XX or 2011-XX-XX), and completely unknown dates can be written as 20XX-XX-XX (which does not restrict the sequence to being in the 21st century; it could be earlier).
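The rules above can be checked automatically before running an analysis. The following is a minimal sketch using only the Python standard library. It assumes the metadata file is tab-separated and has a date column; the function name and the exact date pattern are illustrative and not augur's own validation. Two rows of the example are deliberately broken to show the output.

```python
import csv
import io
import re

# Characters named in the text as not allowed in strain names.
FORBIDDEN = set(" ()[]{}|#><")
# YYYY-MM-DD, where month and day may be XX and the year may be 20XX.
DATE_PATTERN = re.compile(r"(\d{4}|20XX)-(\d{2}|XX)-(\d{2}|XX)")


def check_metadata(handle, sequence_names):
    """Return a list of human-readable problems found in a metadata table."""
    reader = csv.DictReader(handle, delimiter="\t")
    columns = reader.fieldnames or []
    id_column = next((c for c in ("strain", "name") if c in columns), None)
    if id_column is None:
        return ["no column named 'strain' or 'name'"]
    problems = []
    seen = set()
    for row in reader:
        strain = row[id_column]
        seen.add(strain)
        if FORBIDDEN & set(strain):
            problems.append(f"{strain!r}: forbidden character in name")
        date = row.get("date")
        if date is not None and not DATE_PATTERN.fullmatch(date):
            problems.append(f"{strain!r}: date {date!r} is not YYYY-MM-DD")
    # Every sequence identifier should have an exactly matching metadata row.
    for missing in sorted(set(sequence_names) - seen):
        problems.append(f"{missing!r}: sequence has no metadata row")
    return problems


# Rows 2 and 3 are intentionally malformed for demonstration.
example = io.StringIO(
    "strain\taccession\tdate\tregion\n"
    "1_0087_PF\tKX447509\t2013-12-XX\toceania\n"
    "BRA/2016\tKY785433\t08/04/2016\tsouth_america\n"
    "BRA (2015)\tKY558989\t2015-02-23\tsouth_america\n"
)
problems = check_metadata(example, ["1_0087_PF", "BRA/2016", "BRA/2015"])
for problem in problems:
    print(problem)
```

Running such a check after every edit of the metadata file also catches dates that a spreadsheet program has silently reformatted.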

Because Excel automatically changes the date formatting, we recommend not opening or preparing your metadata file in Excel. If the metadata is already in Excel, or you decide to prepare it there, we recommend using another program to correct the dates afterwards (and don't open the file in Excel again!).

Rooting of the tree

Augur (through treetime) will attempt to find a root for the inferred tree that is most compatible with the collection dates of your sequences. This doesn't always work, in particular if sampling through time is very uneven and there is little temporal signal in the data (e.g. your sample consists of several distant clades). In this case, you might need to provide an explicit outgroup to root the tree.

Auspice config file

The current version of auspice requires a file that specifies some aspects of how your analysis is visualized. The following shows an example JSON file with added comments (following //). These comments need to be removed for the file to be valid JSON and work!

{
  // This will appear as the title of your analysis
  "title": "Real-time tracking of Zika virus evolution",
  // the following will appear in the menu from which you choose colorings
  // to be understood by auspice, they need to follow a specific formatting
  // and the options `gt` and `num_date` should be left as is.
  // you can add additional color options corresponding to columns in your metadata
  "color_options": {
    "gt": {
      "menuItem": "genotype",
      "legendTitle": "Genotype",
      "type": "discrete",
      "key": "genotype"
    },
    "num_date": {
      "menuItem": "date",
      "legendTitle": "Sampling date",
      "type": "continuous",
      "key": "num_date"
    },
    "authors": {
      "key":"authors",
      "legendTitle":"Authors",
      "menuItem":"authors",
      "type":"discrete"
    },
    "country": {
    },
    "region": {
    }
  },
  // specifies how strains are aggregated on the map
  "geo": [
    "country",
    "region"
  ],
  // which panels to show. additional options are `entropy` and `frequency`
  "panels":[
     "tree",
     "map"
  ],
  "defaults": {
    "mapTriplicate": true
  },
  // put your name and contact details...
  "maintainer": [
    "Trevor Bedford",
    "http://bedford.io/team/trevor-bedford/"
  ],
  // discrete fields by which you can filter the data. corresponds to
  // columns in meta data
  "filters": [
    "country",
    "region",
    "authors"
  ]
}
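If you prefer to keep an annotated copy of the config, the comment lines can be stripped programmatically. The following is a minimal sketch using only the Python standard library; the function name is hypothetical. It removes only lines whose first non-blank characters are //, so a // inside a string value (such as the URL in the maintainer entry) is left untouched.

```python
import json


def load_commented_json(text):
    """Drop whole-line // comments and parse what remains as JSON."""
    kept = [
        line for line in text.splitlines()
        if not line.lstrip().startswith("//")
    ]
    return json.loads("\n".join(kept))


# Excerpt of the annotated config shown above.
commented = """
{
  // This will appear as the title of your analysis
  "title": "Real-time tracking of Zika virus evolution",
  // put your name and contact details...
  "maintainer": [
    "Trevor Bedford",
    "http://bedford.io/team/trevor-bedford/"
  ]
}
"""
config = load_commented_json(commented)
# Writing this back out gives a comment-free, valid JSON file.
print(json.dumps(config, indent=2))
```

Because `json.loads` raises an error on malformed input, this also serves as a quick validity check of the remaining file.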
