We just pushed a new TreeTime release. In addition to all sorts of little fixes, we added two new command-line scripts that use TreeTime's discrete ancestral state reconstruction to gather statistics about homoplasies (recurrent parallel mutations) and to infer ancestral states for characters like geographic location, host, or similar. We hope they are useful, let us know if you have comments/questions.
In additition, the command-line scripts are now installed via the setup.py
script and are available system wide.
homoplasy_scanner.py
Homoplasies are common if the same mutations are selected independently in many lineages, for example resistance mutations under treatment or cell-culture adaptation. Homoplasies are also observed if the sequences used to build a tree underwent recombination or if there was contamination during sample prep and sequencing.
After joint reconstruction of ancestral states, it is straight forward to gather statistics on how often a particular site mutates and how often it hits a particular state.
The homoplasy_scanner.py
does exactly that.
You would call is as
:\>homoplasy_scanner.py --aln flu_alignment.fasta --tree flu_tree.nwk
(it will attempt to build a tree if none is provided, but this will require a working installation of fasttree
, iqtree
or raxml
).
When run on a sample of ~500 recent H3N2 sequences, we get the following output (input files: flu_alignment.fasta.gz
(please unpack) and flu_tree.nwk
).
How many mutations are observed once, twice, etc...
The TOTAL tree length is 1.438e+00, expecting 2465.1 mutations vs an observed 2396
Of these 2396 mutations,
- 633 occur 1 times
- 196 occur 2 times
- 86 occur 3 times
:
How many positions were hit once, twice, etc...
Of the 1736 positions in the genome,
- 942 were hit 0 times (expected 436.66)
- 329 were hit 1 times (expected 602.67)
- 188 were hit 2 times (expected 415.90)
- 80 were hit 3 times (expected 191.34)
- 49 were hit 4 times (expected 66.02)
- 34 were hit 5 times (expected 18.22)
- 21 were hit 6 times (expected 4.19)
- 24 were hit 7 times (expected 0.83)
:
- 1 were hit 22 times (expected 0.00)
- 2 were hit 23 times (expected 0.00)
- 1 were hit 25 times (expected 0.00)
This tells us that mutations are not randomly distributed across the genome (which is obvious) and that some sites are hit two dozen times. One reason for clustering of mutations is purifying selection, which result in higher diversity at less conserved regions, e.g. 3rd codon positions. The sites that are hit more than a handful of times, however, likely mutated due to egg/culture adaptation.
Multiplicity of specific mutations
The ten most homoplasic mutations are:
mut multiplicity
G1671A 20
G1557A 19
A55C 17
C527A 15
:
Additional arguments
The flag --detailed
will trigger the output of additional stats on mutations that happen on terminal branches only.
In addition the script will output the names of strains that contain many mutations on their terminal branch that also occur elsewhere on the tree.
If the aligment used only contains variable sites (as is often done for longer bacterial genomes), the user can specify the number of constant sites via --const #of_const_sites
.
The script will then use the sum of this number and the length of the alignment as total sequence length.
mugration.py
Several people have asked me to provide an interface to perform the type of discrete ancstral character inference that nextstrain does for geographic regions or hosts. This inference uses the machinery to reconstruct ancestral sequences and applies to discrete characters such as country. Due to its analogy to mutation, this type of inference of migration dynamics between geographic regions is often called "mugration". TreeTime can be repurposed to do mugration inference by coding discrete states as single letters in "sequences" of length one.
The mugration.py
script is called as
>mugration.py --tree Zika_tree.newick --states Zika_metadata.csv --attribute country
which will use the tree Zika_tree.newick, the table Zika_metadata.csv with metadata, and reconstruct the ancestral state of the attribute country
.
The script saves a tree in nexus format with the inferred state added as a comment and the inferred GTR model is written to file.
When run with the flag --confidence
, an additional table containing full posterior distributions for each node is saved.
If equilibrium probabilities across states are known, they can be passed to the script as follows
\:> mugration.py --tree Zika_tree.newick --states Zika_metadata.csv --attribute country --confidence --weights Zika_country_weights.csv
The table Zika_country_weights.csv contains the state (country) and the associated weight in csv format. These will be normalized and used as equilibrium probabilities of the GTR model.
Command line reference
usage: homoplasy_scanner.py [-h] --aln ALN [--tree TREE] [--const CONST]
[--rescale RESCALE] [--detailed] [--gtr GTR]
[--gtr_params GTR_PARAMS [GTR_PARAMS ...]]
[--prot] [--zero_based] [-n N] [--verbose VERBOSE]
Reconstructs ancestral sequences and maps mutations to the tree. The tree is
then scanned for homoplasies. An excess number of homoplasies might suggest
contamination, recombination, culture adaptation or similar.
optional arguments:
-h, --help show this help message and exit
--aln ALN fasta file with input nucleotide sequences
--tree TREE newick file with tree (optional if tree builders
installed)
--const CONST number of constant sites not included in alignment
--rescale RESCALE rescale branch lengths
--detailed generate a more detailed report
--gtr GTR GTR model to use. Type 'infer' to infer the model from
the data. Or, specify the model type. If the specified
model requires additional options, use '--gtr_args' to
specify those
--gtr_params GTR_PARAMS [GTR_PARAMS ...]
GTR parameters for the model specified by the --gtr
argument. The parameters should be feed as 'key=value'
list of parameters. Example: '--gtr K80 --gtr_params
kappa=0.2 pis=0.25,0.25,0.25,0.25'. See the exact
definitions of the parameters in the GTR creation
methods in treetime/nuc_models.py. Only nucleotide
models supported at present
--zero_based zero based SNP indexing
-n N number of mutations/nodes that are printed to screen
--verbose VERBOSE verbosity of output 0-6
\:> mugration.py -h
usage: mugration.py [-h] --tree TREE [--attribute ATTRIBUTE] --states STATES
[--weights WEIGHTS] [--confidence] [--pc PC]
[--verbose VERBOSE]
Reconstructs discrete ancestral states, for example geographic location, host,
or similar.
optional arguments:
-h, --help show this help message and exit
--tree TREE newick file with tree
--attribute ATTRIBUTE
attribute to reconstruct, e.g. country
--states STATES csv or tsv file with discrete characters.
#name,country,continent
taxon1,micronesia,oceania
...
--weights WEIGHTS csv or tsv file with probabilities of that a randomly
sampled sequence at equilibrium has a particular state. E.g.
population of different continents or countries. E.g.:
#country,weight
micronesia,0.1
...
--confidence output confidence of mugration inference
--pc PC pseudo-counts higher numbers will results in 'flatter'
models
--verbose VERBOSE verbosity of output 0-6