Richard Neher

Biozentrum, University of Basel

slides at neherlab.org/201803_XLAB.html

tobacco mosaic virus
(Thomas Splettstoesser, wikipedia)

bacteriophage (adenosine, wikipedia)

influenza virus (wikipedia)

human immunodeficiency virus (wikipedia)

- rely on host to replicate
- little more than genome + capsid
- genomes typically 5-200k bases (+exceptions)
- many infectious diseases are caused by viruses
- very important function in microbial ecosystems
- most abundant organisms on earth: $\sim 10^{31}$

- thousands of bacterial isolates
- thousands of single cells
- populations of viruses, bacteria or flies
- diverse ecosystems

- Chimp → human transmission around 1900 gave rise to HIV-1 group M
- ~100 million infected people since
- subtypes differ at 10-20% of their genome
- HIV-1 evolves ~0.1% per year
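As a rough consistency check of the numbers above: two lineages that split some time ago each accumulate changes independently, so their pairwise divergence is about twice the rate times the time since the split. The sketch below assumes a constant rate and ignores saturation and rate variation along the genome; the split times of 50 and 100 years are illustrative assumptions, not values from the slides.

```python
def pairwise_divergence(rate_per_year, years_since_split):
    """Expected fraction of sites that differ between two lineages.

    Both lineages evolve independently after the split, hence the factor 2.
    Assumes a constant rate and no saturation (a simplification).
    """
    return 2 * rate_per_year * years_since_split


rate = 0.001  # ~0.1% per year, as quoted for HIV-1
for years in (50, 100):  # hypothetical split times
    print(years, pairwise_divergence(rate, years))
```

With these assumptions the result falls in the 10-20% range quoted for the differences between subtypes.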

- $10^8$ cells are infected every day
- the virus repeatedly escapes immune recognition
- integrates into T-cells as latent provirus

- Viruses carrying a beneficial mutation have more offspring: on average $1+s$ instead of $1$
- $s$ is called selection coefficient
- Fraction $x$ of viruses carrying the mutation changes as $$x(t+1) = \frac{(1+s)x(t)}{(1+s)x(t) + (1-x(t))}$$
- In continuous time → logistic differential equation: $$\frac{dx}{dt} = sx(1-x) \Rightarrow x(t) = \frac{e^{s(t-t_0)}}{1+ e^{s(t-t_0)}}$$
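The two descriptions of selection above can be compared numerically. The following is a minimal sketch: the values of $s$ and the initial frequency are illustrative assumptions, and $t_0$ is chosen so that the logistic curve starts at the same initial frequency as the discrete recursion.

```python
import math


def next_frequency(x, s):
    """One generation of selection: carriers leave 1+s offspring, others 1."""
    return (1 + s) * x / ((1 + s) * x + (1 - x))


def logistic(t, s, t0):
    """Continuous-time solution x(t) = e^{s(t-t0)} / (1 + e^{s(t-t0)})."""
    return 1.0 / (1.0 + math.exp(-s * (t - t0)))


s = 0.05    # assumed selection coefficient (illustrative)
x0 = 0.01   # assumed initial frequency (illustrative)
# Choose t0 such that logistic(0, s, t0) == x0.
t0 = math.log((1 - x0) / x0) / s

# Iterate the discrete recursion for 300 generations.
trajectory = [x0]
for _ in range(300):
    trajectory.append(next_frequency(trajectory[-1], s))

# Print discrete and continuous frequencies side by side.
for t in (0, 50, 100, 150, 200):
    print(t, round(trajectory[t], 4), round(logistic(t, s, t0), 4))
```

In the discrete model the odds $x/(1-x)$ grow by a factor $1+s$ per generation, while in the continuous model they grow as $e^{st}$. The two agree closely when $s \ll 1$, since $\ln(1+s) \approx s$.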

- diverge at 0.1-1% per year
- almost whole genome coverage in 10 patients
- full data set at hiv.tuebingen.mpg.de

- envelope changes fastest, enzymes slowest
- identical rate of synonymous evolution
- diversity saturates where evolution is fast
- synonymous mutations stay at low frequency

- mutation away from preferred state with rate $\mu$
- selection against non-preferred state with strength $s$
- variant frequency dynamics: $\frac{d x}{dt} = \mu -s x $
- equilibrium frequency: $\bar{x} = \mu/s $
- fitness cost: $s = \mu/\bar{x}$
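The mutation-selection balance above can be illustrated with a short numerical sketch. It integrates $\frac{dx}{dt} = \mu - sx$ (valid for $x \ll 1$) with a simple Euler scheme, compares the result to the analytic relaxation towards $\bar{x} = \mu/s$, and then inverts the equilibrium to recover $s$. The parameter values are illustrative assumptions, not estimates from the data.

```python
import math


def analytic_frequency(t, mu, s, x_init=0.0):
    """Solution of dx/dt = mu - s*x: exponential relaxation to mu/s at rate s."""
    x_eq = mu / s
    return x_eq + (x_init - x_eq) * math.exp(-s * t)


def euler_frequency(t_max, mu, s, x_init=0.0, dt=0.1):
    """Integrate dx/dt = mu - s*x with fixed-step forward Euler."""
    x = x_init
    for _ in range(int(round(t_max / dt))):
        x += dt * (mu - s * x)
    return x


def fitness_cost(mu, x_bar):
    """Invert the equilibrium x_bar = mu/s to estimate s."""
    return mu / x_bar


mu = 1e-5   # assumed mutation rate per site per unit time (illustrative)
s = 1e-2    # assumed selection strength (illustrative)
x_bar = mu / s

for t in (0, 100, 500, 2000):
    print(t, analytic_frequency(t, mu, s), euler_frequency(t, mu, s))
print("equilibrium:", x_bar, "recovered s:", fitness_cost(mu, x_bar))
```

The frequency approaches equilibrium on a time scale of $1/s$, so strongly selected sites equilibrate quickly while weakly selected sites take longer.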

- Influenza viruses evolve to avoid human immunity
- Vaccines need frequent updates

- can we predict fitness?
- pick the closest relative of the future?

$$P(\mathbf{x}|T) = \frac{1}{Z(T)} p_0(x_0) \prod_{i=0}^{n_{int}} g(x_{i_1}, t_{i_1}| x_i, t_i)g(x_{i_2}, t_{i_2}| x_i, t_i)$$

RN, Russell, Shraiman, eLife, 2014
- simulate evolution
- sample sequences
- reconstruct trees
- infer fitness
- predict ancestor of future
- compare to truth

- no influenza specific input
- how can the model be improved? (see model by Luksza & Laessig)
- in what other contexts might this apply?

- There are super-exponentially many trees
- for $n$ taxa, there are $N = (2n-5)!! = (2n-5)\times(2n-7)\times\cdots\times 3\times 1$ unrooted binary trees
- There are efficient heuristics to reconstruct trees, e.g. Neighbor Joining
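To get a feeling for how quickly the number of trees grows, the double factorial above can be evaluated directly. This is a small sketch; the function name is made up for illustration, and the count refers to unrooted, fully bifurcating trees with labeled tips.

```python
def num_unrooted_trees(n):
    """Number of unrooted binary trees on n labeled taxa: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product for n = 3).
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


for n in (3, 4, 5, 10, 20, 50):
    print(n, num_unrooted_trees(n))
```

Already for a few dozen taxa, exhaustive enumeration is out of reach, which is why heuristics are needed.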

- transmission
- divergence times
- population dynamics
- ancestral geographic distribution/migrations

- flexible
- probabilistic → confidence intervals etc
- but: computationally expensive

- probabilistic treatment of divergence times
- dates trees with thousands of sequences in a few minutes
- linear time complexity
- fixed tree topology
- github.com/neherlab/treetime

- vertical and horizontal transmission
- genome rearrangements
- much larger genomes
- variation of divergence along the genome
- NGS genomes tend to be fragmented
- annotations of variable quality

- pan-genome identification pipeline
- phylogenetic analysis of each orthologous cluster
- detect associations with phenotypes
- fast: analyze hundreds of genomes in a few hours
- github.com/neherlab/pan-genome-analysis

- Data sets are growing rapidly
  → tools for interpretation and exploration are crucial
- Breadth and depth
  → provide an overview, integrate, and go deep
- Actionable outputs require (near) real-time analysis
  → fast analysis pipelines are essential
- We are just scratching the surface...

- Boris Shraiman
- Colin Russell
- Trevor Bedford
- Oskar Hallatschek
- All the NICs and WHO CCs that provide influenza sequence data

- Trevor Bedford
- Colin Megill
- Sidney Bell
- James Hadfield
- All the scientists who share virus sequence data

webserver at treetime.ch

manuscript on bioRxiv

live site at pangenome.de

manuscript on bioRxiv