Real-time tracking and visualization of pathogen sequence data


Richard Neher
Biozentrum & SIB, University of Basel


slides at neherlab.org/201902_SIB_Biology19.html

Sequences record the spread of pathogens

Mutations accumulate at a rate of $10^{-5}$ per site per day!
images by Trevor Bedford

Influenza virus genome - 8 segments

Zika virus genome $\sim 10000$ bases

Ebola virus genome $\sim 20000$ bases

Many RNA viruses pick up one mutation every 2-4 weeks!
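As a rough worked example of the numbers above: the expected waiting time between mutations on one lineage is 1 / (rate × genome length). The genome lengths come from the slides; the second rate (3e-6) is an illustrative assumption, added because the quoted $10^{-5}$ is an order-of-magnitude figure. At exactly $10^{-5}$ the formula gives 5 to 10 days, while a few times $10^{-6}$ gives waiting times of a few weeks.

```python
def days_per_mutation(rate_per_site_per_day: float, genome_length: int) -> float:
    """Expected waiting time (in days) between mutations along one lineage."""
    return 1.0 / (rate_per_site_per_day * genome_length)

# Genome lengths are taken from the slides; the rates are illustrative assumptions.
for name, length in [("Zika", 10_000), ("Ebola", 20_000)]:
    for rate in (1e-5, 3e-6):
        days = days_per_mutation(rate, length)
        print(f"{name}: rate={rate:g}/site/day -> {days:.0f} days per mutation")
```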

Frequent mutations imply...

  • most viruses in an outbreak differ from each other
  • transmission chains can be inferred
  • transmission can be ruled out!
  • geographic spread can be reconstructed


  • Influenza viruses evolve to avoid human immunity
  • Vaccines need frequent updates

GISRS -- Influenza virus surveillance

  • comprehensive coverage of the world
  • timely sharing of data -- often within 2-3 weeks of sampling
  • hundreds of sequences per week (in peak months)
→ requires continuous analysis and easy dissemination
→ interpretable and intuitive visualization

nextstrain.org

joint project with Trevor Bedford & his lab

Phylodynamic analysis with nextstrain

  • input: metadata (csv table) + sequences (fasta or vcf)
  • snakemake pipeline
    • filtering
    • alignment
    • tree building (+ time-scaled trees)
    • ancestral state reconstruction and phylogeography
  • export to visualization
  • runs in minutes to about an hour
Hadfield et al., 2018
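To make the inputs and the filtering step concrete, here is a minimal sketch in plain Python. It is not the nextstrain pipeline itself: the column names `strain` and `date`, the thresholds, and the helper names `read_fasta` and `filter_sequences` are assumptions made for illustration.

```python
import csv
import io

def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def filter_sequences(fasta_handle, metadata_handle, min_length, min_date):
    """Keep sequences that have metadata, are long enough, and are recent enough."""
    # Assumed metadata columns: "strain" (matches the FASTA name) and "date".
    meta = {row["strain"]: row for row in csv.DictReader(metadata_handle)}
    kept = {}
    for name, seq in read_fasta(fasta_handle):
        row = meta.get(name)
        if row is None or len(seq) < min_length:
            continue
        if row["date"] < min_date:  # ISO dates (YYYY-MM-DD) compare correctly as strings
            continue
        kept[name] = seq
    return kept

# Toy inputs standing in for the real metadata table and sequence file.
fasta = io.StringIO(">A\nACGTACGT\n>B\nACG\n>C\nACGTACGA\n>D\nACGTTCGA\n")
metadata = io.StringIO(
    "strain,date,country\nA,2018-01-05,CH\nB,2018-02-01,US\nC,2015-06-30,BR\n"
)
kept = filter_sequences(fasta, metadata, min_length=5, min_date="2017-01-01")
print(sorted(kept))
```

In this toy run B is dropped for being too short, C for being too old, and D for lacking metadata, so only A remains.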

Web visualization with nextstrain

  • can be run locally (localhost)
    auspice view --datasetDir mydata
  • share your own builds through nextstrain/community
  • deploy nextstrain on your own servers
  • work in progress:
    • flexible branding
    • drag and drop features
    • (better docs...)
Hadfield et al., 2018

Links and Tutorials

Integration of different data types is key!

Hemagglutination Inhibition assays

Slide by Trevor Bedford

Antigenic distance tables

  • Long list of distances between sera and viruses
  • Structure of space is not immediately clear
  • MDS in 2 or 3 dimensions
Slide by Trevor Bedford
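The idea of turning a table of serum-virus distances into a low-dimensional map can be sketched as follows. This is a generic metric MDS fitted by plain gradient descent on the squared-error stress, not the method used to make the slides; the toy distance table, the step size, and the function names are assumptions.

```python
import math
import random

def stress(viruses, sera, table):
    """Sum of squared differences between map distances and table distances."""
    return sum((math.dist(viruses[i], sera[j]) - d) ** 2
               for (i, j), d in table.items())

def fit_map(table, n_viruses, n_sera, dim=2, steps=2000, lr=0.01, seed=0):
    """Place viruses and sera in `dim` dimensions by gradient descent on the stress."""
    rng = random.Random(seed)
    viruses = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_viruses)]
    sera = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_sera)]
    history = [stress(viruses, sera, table)]
    for _ in range(steps):
        grad_v = [[0.0] * dim for _ in range(n_viruses)]
        grad_s = [[0.0] * dim for _ in range(n_sera)]
        for (i, j), d in table.items():
            dist = max(math.dist(viruses[i], sera[j]), 1e-9)  # avoid division by zero
            for k in range(dim):
                g = 2 * (dist - d) * (viruses[i][k] - sera[j][k]) / dist
                grad_v[i][k] += g
                grad_s[j][k] -= g
        for pts, grads in ((viruses, grad_v), (sera, grad_s)):
            for p, gp in zip(pts, grads):
                for k in range(dim):
                    p[k] -= lr * gp[k]
        history.append(stress(viruses, sera, table))
    return viruses, sera, history

# Toy table: (virus index, serum index) -> distance. Values are made up.
table = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 2.2, (1, 1): 1.0, (2, 0): 4.0, (2, 1): 1.4}
viruses, sera, history = fit_map(table, n_viruses=3, n_sera=2)
print(f"stress: {history[0]:.2f} -> {history[-1]:.4f}")
```

Each virus and each serum gets a coordinate, and the fit moves the points until their distances on the map approximate the entries of the table.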

HI distances on the phylogenetic tree

Rapid analysis is crucial!

TreeTime: maximum likelihood phylodynamic analysis

Phylogenetic trees record history:
  • transmission
  • divergence times
  • population dynamics
  • ancestral geographic distribution/migrations
Typical approach: Bayesian parameter estimation
  • flexible
  • probabilistic → confidence intervals etc
  • but: computationally expensive
TreeTime by Pavel Sagulenko
  • probabilistic treatment of divergence times
  • dates trees with thousands of sequences in a few minutes
  • linear time complexity
  • fixed tree topology
  • github.com/neherlab/treetime
West African Ebola virus outbreak

TreeTime: nuts and bolts

Attach sequences and dates
Propagate temporal constraints via convolutions
Integrate up-stream and down-stream constraints
Fit phylodynamic model → iterate
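A toy illustration of the convolution step, on a discrete time grid: a child's date distribution is convolved with the distribution of the branch duration to give a constraint on the parent's date, and the constraints from several children are multiplied. This is only a sketch of the principle and not TreeTime's code; the grid (index = time steps before present), the branch distributions, and the function names are assumptions.

```python
def convolve(p, q):
    """Distribution of X + Y for independent X ~ p and Y ~ q on grids starting at 0."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def combine(messages):
    """Multiply messages element-wise (padding with zeros) and normalise."""
    n = max(len(m) for m in messages)
    product = [1.0] * n
    for m in messages:
        padded = list(m) + [0.0] * (n - len(m))
        product = [a * b for a, b in zip(product, padded)]
    total = sum(product)  # assumed non-zero, i.e. the constraints are compatible
    return [x / total for x in product]

# Child 1 was sampled 2 steps before present; its branch lasts 0, 1 or 2 steps.
child1 = convolve([0, 0, 1], [0.1, 0.6, 0.3])
# Child 2 was sampled 3 steps before present; its branch lasts 0 or 1 step.
child2 = convolve([0, 0, 0, 1], [0.5, 0.5])
# The parent must be compatible with both children.
parent = combine([child1, child2])
print([round(x, 3) for x in parent])
```

The parent ends up 3 or 4 steps before present, with the earlier alternative (4 steps) half as likely as the later one.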

Molecular clock phylogenies of ~2000 A/H3N2 HA sequences -- a few minutes

What about bacteria?

  • vertical and horizontal transmission
  • genome rearrangements
  • much larger genomes
  • variation of divergence along the genome
  • NGS genomes tend to be fragmented
  • annotations of variable quality
panX by Wei Ding
  • pan-genome identification pipeline
  • phylogenetic analysis of each orthologous cluster
  • detect associations with phenotypes
  • fast: analyze hundreds of genomes in a few hours
  • github.com/neherlab/pan-genome-analysis
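Once genes have been grouped into orthologous clusters, a basic pan-genome statistic is which clusters occur in every genome and which only in some. The sketch below shows that bookkeeping in plain Python; it is not panX code, and the strain names, cluster labels, and the function `classify_clusters` are hypothetical.

```python
def classify_clusters(presence):
    """Split gene clusters into those found in every strain and those found in some."""
    strains = list(presence)
    all_clusters = set().union(*presence.values())
    core = {c for c in all_clusters if all(c in presence[s] for s in strains)}
    return core, all_clusters - core

# Hypothetical presence table: strain -> set of gene clusters found in its genome.
presence = {
    "strain1": {"gyrA", "rpoB", "pbp2x"},
    "strain2": {"gyrA", "rpoB", "cps"},
    "strain3": {"gyrA", "rpoB", "pbp2x", "cps"},
}
core, accessory = classify_clusters(presence)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```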

panX @ pangenome.de

S. pneumoniae data set by Croucher et al.

Pan-genome statistics and filters

Species trees and gene trees

Links between species trees and gene trees

Acknowledgments

  • Trevor Bedford
  • Colin Megill
  • Pavel Sagulenko
  • Sidney Bell
  • James Hadfield
  • Wei Ding
  • Emma Hodcroft
  • Sanda Dejanic