NextStrain architecture
Using treetime to rapidly compute timetrees
TreeTime: maximum likelihood phylodynamic analysis
Phylogenetic trees record history:
- transmission
- divergence times
- population dynamics
- ancestral geographic distribution/migrations
Typical approach: Bayesian parameter estimation
- flexible
- probabilistic → confidence intervals etc.
- but: computationally expensive
TreeTime by Pavel Sagulenko
- probabilistic treatment of divergence times
- dates trees with thousands of sequences in a few minutes
- linear time complexity
- fixed tree topology
- github.com/neherlab/treetime
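To make the idea of dating a fixed tree concrete, here is a minimal sketch of root-to-tip regression, a simple relative of the molecular clock estimation listed above. It is not TreeTime's maximum likelihood procedure and does not use the TreeTime API; the sample dates and distances are made-up toy values, and the function name is hypothetical.

```python
# Toy data (assumed, for illustration only): sampling date in decimal years
# and root-to-tip distance in substitutions per site for each tip.
samples = [
    (2010.0, 0.010),
    (2011.0, 0.012),
    (2012.0, 0.014),
    (2013.0, 0.016),
    (2014.0, 0.018),
]


def root_to_tip_regression(points):
    """Least-squares fit of distance = rate * (date - t_root).

    Returns the estimated clock rate (substitutions per site per year)
    and the extrapolated date of the root.
    """
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_d = sum(d for _, d in points) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in points)
    var = sum((t - mean_t) ** 2 for t, _ in points)
    rate = cov / var                    # slope of the regression line
    intercept = mean_d - rate * mean_t  # distance extrapolated to date 0
    t_root = -intercept / rate          # date at which distance reaches 0
    return rate, t_root


rate, t_root = root_to_tip_regression(samples)
print("clock rate: %.4f subs/site/year, root date: %.1f" % (rate, t_root))
```

The regression is a single pass over the tips, so it scales linearly with the number of sequences; it gives only point estimates, whereas a probabilistic treatment also yields uncertainty on node dates.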
West African Ebola virus outbreak
Molecular clock phylogenies of ~2000 A/H3N2 HA sequences: computed in a few minutes
What about bacteria?
- vertical and horizontal transmission
- genome rearrangements
- much larger genomes
- variation of divergence along the genome
- NGS genomes tend to be fragmented
- annotations of variable quality
- pan-genome identification pipeline
- phylogenetic analysis of each orthologous cluster
- detect associations with phenotypes
- fast: analyze hundreds of genomes in a few hours
- github.com/neherlab/pan-genome-analysis
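Once genes are grouped into orthologous clusters, a basic pan-genome summary follows from a presence/absence table. The sketch below is an illustration under assumptions, not the pipeline's actual code: the cluster names, strain names, the `core_fraction` parameter, and the function name are all hypothetical.

```python
# Hypothetical input: gene cluster name -> set of strains carrying a member.
clusters = {
    "geneA": {"s1", "s2", "s3", "s4"},
    "geneB": {"s1", "s2", "s3", "s4"},
    "geneC": {"s1", "s3"},
    "geneD": {"s2"},
}
strains = {"s1", "s2", "s3", "s4"}


def pan_genome_summary(clusters, strains, core_fraction=1.0):
    """Split clusters into core, accessory, and singleton genes.

    A cluster counts as core if it is present in at least
    core_fraction of all strains (1.0 means strictly every strain).
    """
    summary = {"core": [], "accessory": [], "singletons": []}
    for name, members in sorted(clusters.items()):
        present = len(members & strains)
        if present == 0:
            continue  # cluster not found in any of the analyzed strains
        if present >= core_fraction * len(strains):
            summary["core"].append(name)
        elif present == 1:
            summary["singletons"].append(name)
        else:
            summary["accessory"].append(name)
    return summary


summary = pan_genome_summary(clusters, strains)
for category, genes in summary.items():
    print(category, len(genes), genes)
```

Lowering `core_fraction` below 1.0 is one way to tolerate fragmented assemblies or missed annotations, where a gene that is truly present may be absent from a few genomes.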
S. pneumoniae data set by Croucher et al.
Pan-genome statistics and filters
Species trees and gene trees
Links between species trees and gene trees
- Python 2.7, raxml/iqtree/fasttree, MAFFT, TreeTime
- run time of approx. one hour on 1-4 CPUs, limited by tree building
- Sequences: FASTA for viruses, VCF for bacteria
- Metadata: best as TSV or JSON
- Ideally we would want some hybrid between nextstrain and panX for bacteria
- Python 2.7, raxml/fasttree, DIAMOND, MCL, MAFFT, TreeTime
- run time of approx. six hours on 64 CPUs, limited by all-against-all comparison and tree building for every gene
- Sequences: GenBank files with annotation; GFF could be used as well
- Metadata: best as TSV
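As an illustration of the TSV metadata input, the sketch below reads a small table and converts sampling dates to decimal years, the numeric form a molecular clock analysis typically needs. The column names (`strain`, `date`, `country`), the example rows, and the helper names are assumptions made for this example, not a format required by either pipeline.

```python
import csv
import io
from datetime import date

# Hypothetical metadata table; the columns are assumed for illustration.
METADATA_TSV = (
    "strain\tdate\tcountry\n"
    "strainA\t2014-01-01\tGuinea\n"
    "strainB\t2015-07-02\tSierra Leone\n"
)


def decimal_year(iso_date):
    """Convert a YYYY-MM-DD string to a decimal year, e.g. 2014.0."""
    d = date.fromisoformat(iso_date)
    start = date(d.year, 1, 1)
    end = date(d.year + 1, 1, 1)
    return d.year + (d - start).days / (end - start).days


def read_metadata(handle):
    """Return {strain: row dict with an added numeric 'num_date' field}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["strain"]: dict(row, num_date=decimal_year(row["date"]))
        for row in reader
    }


metadata = read_metadata(io.StringIO(METADATA_TSV))
for strain, fields in metadata.items():
    print(strain, fields["country"], round(fields["num_date"], 3))
```

With a real file, `read_metadata(open(path, newline=""))` would take the place of the in-memory string; incomplete dates such as a bare year would need extra handling that this sketch omits.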
Visualization: nextstrain
nextstrain.org
- Trevor Bedford
- Colin Megill
- Sidney Bell
- James Hadfield
- All the scientists who share virus sequence data