Mugration models and homoplasy analysis in TreeTime

We just pushed a new TreeTime release. In addition to all sorts of little fixes, we added two new command-line scripts that use TreeTime's discrete ancestral state reconstruction to gather statistics about homoplasies (recurrent parallel mutations) and to infer ancestral states for characters like geographic location, host, or similar. We hope they are useful, let us know if you have comments/questions.

In additition, the command-line scripts are now installed via the setup.py script and are available system wide.

homoplasy_scanner.py

Homoplasies are common if the same mutations are selected independently in many lineages, for example resistance mutations under treatment or cell-culture adaptation. Homoplasies are also observed if the sequences used to build a tree underwent recombination or if there was contamination during sample prep and sequencing.

After joint reconstruction of ancestral states, it is straight forward to gather statistics on how often a particular site mutates and how often it hits a particular state. The homoplasy_scanner.py does exactly that. You would call is as

:\>homoplasy_scanner.py --aln flu_alignment.fasta --tree flu_tree.nwk

(it will attempt to build a tree if none is provided, but this will require a working installation of fasttree, iqtree or raxml). When run on a sample of ~500 recent H3N2 sequences, we get the following output (input files: flu_alignment.fasta.gz(please unpack) and flu_tree.nwk).

How many mutations are observed once, twice, etc...

The TOTAL tree length is 1.438e+00, expecting 2465.1 mutations vs an observed 2396
Of these 2396 mutations,
   - 633 occur 1 times
   - 196 occur 2 times
   - 86 occur 3 times
     :

How many positions were hit once, twice, etc...

Of the 1736 positions in the genome,
   - 942 were hit 0 times (expected 436.66)
   - 329 were hit 1 times (expected 602.67)
   - 188 were hit 2 times (expected 415.90)
   - 80 were hit 3 times (expected 191.34)
   - 49 were hit 4 times (expected 66.02)
   - 34 were hit 5 times (expected 18.22)
   - 21 were hit 6 times (expected 4.19)
   - 24 were hit 7 times (expected 0.83)
     :
   - 1 were hit 22 times (expected 0.00)
   - 2 were hit 23 times (expected 0.00)
   - 1 were hit 25 times (expected 0.00)

This tells us that mutations are not randomly distributed across the genome (which is obvious) and that some sites are hit two dozen times. One reason for clustering of mutations is purifying selection, which result in higher diversity at less conserved regions, e.g. 3rd codon positions. The sites that are hit more than a handful of times, however, likely mutated due to egg/culture adaptation.

Multiplicity of specific mutations

The ten most homoplasic mutations are:
  mut multiplicity
  G1671A  20
  G1557A  19
  A55C  17
  C527A 15
     :

Additional arguments

The flag --detailed will trigger the output of additional stats on mutations that happen on terminal branches only. In addition the script will output the names of strains that contain many mutations on their terminal branch that also occur elsewhere on the tree.

If the aligment used only contains variable sites (as is often done for longer bacterial genomes), the user can specify the number of constant sites via --const #of_const_sites. The script will then use the sum of this number and the length of the alignment as total sequence length.

mugration.py

Several people have asked me to provide an interface to perform the type of discrete ancstral character inference that nextstrain does for geographic regions or hosts. This inference uses the machinery to reconstruct ancestral sequences and applies to discrete characters such as country. Due to its analogy to mutation, this type of inference of migration dynamics between geographic regions is often called "mugration". TreeTime can be repurposed to do mugration inference by coding discrete states as single letters in "sequences" of length one.

The mugration.py script is called as

>mugration.py --tree Zika_tree.newick --states Zika_metadata.csv --attribute country

which will use the tree Zika_tree.newick, the table Zika_metadata.csv with metadata, and reconstruct the ancestral state of the attribute country.

The script saves a tree in nexus format with the inferred state added as a comment and the inferred GTR model is written to file. When run with the flag --confidence, an additional table containing full posterior distributions for each node is saved.

If equilibrium probabilities across states are known, they can be passed to the script as follows

\:> mugration.py --tree Zika_tree.newick --states Zika_metadata.csv --attribute country --confidence --weights Zika_country_weights.csv

The table Zika_country_weights.csv contains the state (country) and the associated weight in csv format. These will be normalized and used as equilibrium probabilities of the GTR model.

Command line reference

usage: homoplasy_scanner.py [-h] --aln ALN [--tree TREE] [--const CONST]
                            [--rescale RESCALE] [--detailed] [--gtr GTR]
                            [--gtr_params GTR_PARAMS [GTR_PARAMS ...]]
                            [--prot] [--zero_based] [-n N] [--verbose VERBOSE]

Reconstructs ancestral sequences and maps mutations to the tree. The tree is
then scanned for homoplasies. An excess number of homoplasies might suggest
contamination, recombination, culture adaptation or similar.

optional arguments:
  -h, --help            show this help message and exit
  --aln ALN             fasta file with input nucleotide sequences
  --tree TREE           newick file with tree (optional if tree builders
                        installed)
  --const CONST         number of constant sites not included in alignment
  --rescale RESCALE     rescale branch lengths
  --detailed            generate a more detailed report
  --gtr GTR             GTR model to use. Type 'infer' to infer the model from
                        the data. Or, specify the model type. If the specified
                        model requires additional options, use '--gtr_args' to
                        specify those
  --gtr_params GTR_PARAMS [GTR_PARAMS ...]
                        GTR parameters for the model specified by the --gtr
                        argument. The parameters should be feed as 'key=value'
                        list of parameters. Example: '--gtr K80 --gtr_params
                        kappa=0.2 pis=0.25,0.25,0.25,0.25'. See the exact
                        definitions of the parameters in the GTR creation
                        methods in treetime/nuc_models.py. Only nucleotide
                        models supported at present
  --zero_based          zero based SNP indexing
  -n N                  number of mutations/nodes that are printed to screen
  --verbose VERBOSE     verbosity of output 0-6

\:> mugration.py -h
usage: mugration.py [-h] --tree TREE [--attribute ATTRIBUTE] --states STATES
                    [--weights WEIGHTS] [--confidence] [--pc PC]
                    [--verbose VERBOSE]

Reconstructs discrete ancestral states, for example geographic location, host,
or similar.

optional arguments:
  -h, --help            show this help message and exit
  --tree TREE           newick file with tree
  --attribute ATTRIBUTE
                        attribute to reconstruct, e.g. country
  --states STATES       csv or tsv file with discrete characters.
                          #name,country,continent
                          taxon1,micronesia,oceania
                          ...
  --weights WEIGHTS     csv or tsv file with probabilities of that a randomly
                        sampled sequence at equilibrium has a particular state. E.g.
                        population of different continents or countries. E.g.:
                          #country,weight
                          micronesia,0.1
                          ...
  --confidence          output confidence of mugration inference
  --pc PC               pseudo-counts higher numbers will results in 'flatter'
                        models
  --verbose VERBOSE     verbosity of output 0-6