We just pushed a new TreeTime release. In addition to all sorts of little fixes, we added two new command-line scripts that use TreeTime's discrete ancestral state reconstruction to gather statistics about homoplasies (recurrent parallel mutations) and to infer ancestral states for characters like geographic location, host, or similar. We hope they are useful, let us know if you have comments/questions.
In additition, the command-line scripts are now installed via the
setup.py script and are available system wide.
Homoplasies are common if the same mutations are selected independently in many lineages, for example resistance mutations under treatment or cell-culture adaptation. Homoplasies are also observed if the sequences used to build a tree underwent recombination or if there was contamination during sample prep and sequencing.
After joint reconstruction of ancestral states, it is straight forward to gather statistics on how often a particular site mutates and how often it hits a particular state.
homoplasy_scanner.py does exactly that.
You would call is as
:\>homoplasy_scanner.py --aln flu_alignment.fasta --tree flu_tree.nwk
(it will attempt to build a tree if none is provided, but this will require a working installation of
When run on a sample of ~500 recent H3N2 sequences, we get the following output (input files:
flu_alignment.fasta.gz(please unpack) and
How many mutations are observed once, twice, etc...
The TOTAL tree length is 1.438e+00, expecting 2465.1 mutations vs an observed 2396 Of these 2396 mutations, - 633 occur 1 times - 196 occur 2 times - 86 occur 3 times :
How many positions were hit once, twice, etc...
Of the 1736 positions in the genome, - 942 were hit 0 times (expected 436.66) - 329 were hit 1 times (expected 602.67) - 188 were hit 2 times (expected 415.90) - 80 were hit 3 times (expected 191.34) - 49 were hit 4 times (expected 66.02) - 34 were hit 5 times (expected 18.22) - 21 were hit 6 times (expected 4.19) - 24 were hit 7 times (expected 0.83) : - 1 were hit 22 times (expected 0.00) - 2 were hit 23 times (expected 0.00) - 1 were hit 25 times (expected 0.00)
This tells us that mutations are not randomly distributed across the genome (which is obvious) and that some sites are hit two dozen times. One reason for clustering of mutations is purifying selection, which result in higher diversity at less conserved regions, e.g. 3rd codon positions. The sites that are hit more than a handful of times, however, likely mutated due to egg/culture adaptation.
Multiplicity of specific mutations
The ten most homoplasic mutations are: mut multiplicity G1671A 20 G1557A 19 A55C 17 C527A 15 :
--detailed will trigger the output of additional stats on mutations that happen on terminal branches only.
In addition the script will output the names of strains that contain many mutations on their terminal branch that also occur elsewhere on the tree.
If the aligment used only contains variable sites (as is often done for longer bacterial genomes), the user can specify the number of constant sites via
The script will then use the sum of this number and the length of the alignment as total sequence length.
Several people have asked me to provide an interface to perform the type of discrete ancstral character inference that nextstrain does for geographic regions or hosts. This inference uses the machinery to reconstruct ancestral sequences and applies to discrete characters such as country. Due to its analogy to mutation, this type of inference of migration dynamics between geographic regions is often called "mugration". TreeTime can be repurposed to do mugration inference by coding discrete states as single letters in "sequences" of length one.
mugration.py script is called as
>mugration.py --tree Zika_tree.newick --states Zika_metadata.csv --attribute country
The script saves a tree in nexus format with the inferred state added as a comment and the inferred GTR model is written to file.
When run with the flag
--confidence, an additional table containing full posterior distributions for each node is saved.
If equilibrium probabilities across states are known, they can be passed to the script as follows
\:> mugration.py --tree Zika_tree.newick --states Zika_metadata.csv --attribute country --confidence --weights Zika_country_weights.csv
The table Zika_country_weights.csv contains the state (country) and the associated weight in csv format. These will be normalized and used as equilibrium probabilities of the GTR model.
Command line reference
usage: homoplasy_scanner.py [-h] --aln ALN [--tree TREE] [--const CONST] [--rescale RESCALE] [--detailed] [--gtr GTR] [--gtr_params GTR_PARAMS [GTR_PARAMS ...]] [--prot] [--zero_based] [-n N] [--verbose VERBOSE] Reconstructs ancestral sequences and maps mutations to the tree. The tree is then scanned for homoplasies. An excess number of homoplasies might suggest contamination, recombination, culture adaptation or similar. optional arguments: -h, --help show this help message and exit --aln ALN fasta file with input nucleotide sequences --tree TREE newick file with tree (optional if tree builders installed) --const CONST number of constant sites not included in alignment --rescale RESCALE rescale branch lengths --detailed generate a more detailed report --gtr GTR GTR model to use. Type 'infer' to infer the model from the data. Or, specify the model type. If the specified model requires additional options, use '--gtr_args' to specify those --gtr_params GTR_PARAMS [GTR_PARAMS ...] GTR parameters for the model specified by the --gtr argument. The parameters should be feed as 'key=value' list of parameters. Example: '--gtr K80 --gtr_params kappa=0.2 pis=0.25,0.25,0.25,0.25'. See the exact definitions of the parameters in the GTR creation methods in treetime/nuc_models.py. Only nucleotide models supported at present --zero_based zero based SNP indexing -n N number of mutations/nodes that are printed to screen --verbose VERBOSE verbosity of output 0-6
\:> mugration.py -h usage: mugration.py [-h] --tree TREE [--attribute ATTRIBUTE] --states STATES [--weights WEIGHTS] [--confidence] [--pc PC] [--verbose VERBOSE] Reconstructs discrete ancestral states, for example geographic location, host, or similar. optional arguments: -h, --help show this help message and exit --tree TREE newick file with tree --attribute ATTRIBUTE attribute to reconstruct, e.g. country --states STATES csv or tsv file with discrete characters. #name,country,continent taxon1,micronesia,oceania ... --weights WEIGHTS csv or tsv file with probabilities of that a randomly sampled sequence at equilibrium has a particular state. E.g. population of different continents or countries. E.g.: #country,weight micronesia,0.1 ... --confidence output confidence of mugration inference --pc PC pseudo-counts higher numbers will results in 'flatter' models --verbose VERBOSE verbosity of output 0-6