Moving targets: Genomic epidemiology tools during a pandemic
Richard Neher
Biozentrum & SIB, University of Basel
slides at neherlab.org/202209_IMMEM.html
From 1'000s to 1'000'000s of samples
From a few labs to hundreds of contributing labs
Our favorite fancy tools and models couldn't handle it
- Sample size: could only use a tiny fraction of data
- Slow: Analyses that take weeks are incompatible with actionable insights.
- Complex: Link between inferences and data signatures too indirect.
Needed tools to make sense of an avalanche of data
- Fast: immediate results/summaries
- Robust: data quality varies, sampling is biased. Complex models go wrong in weird ways when assumptions are violated.
- Simple: anything too complex can't be interpreted reliably.
Many smart people made amazing tools with open data
Sequence analysis and interpretation are challenging
Nextstrain's focus: enable teams to make sense of their data
Workflows to analyze custom data + background
- Hierarchical sampling: global, country, division.
- Hosting via Nextstrain groups
- Aimed at completion within hours
- Data sharing restrictions made this more difficult than it should have been
- No experience necessary
- QC: avoid releasing bad data
- Clades, lineages, mutations
- Private
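The hierarchical sampling idea (global, country, division) can be illustrated with a minimal sketch. This is not the Nextstrain workflow itself: the function name, the record fields `country` and `division`, and the sample sizes are all hypothetical and chosen only for illustration. It keeps every sample from the focal division and adds a random background from the rest of the country and the rest of the world.

```python
import random


def hierarchical_sample(records, focal_country, focal_division,
                        n_country, n_global, seed=0):
    """Toy hierarchical subsampling (illustrative assumption, not Nextstrain code).

    Keeps all records from the focal division, then draws random background
    from the rest of the focal country and from all other countries.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    focal, country, world = [], [], []
    for rec in records:
        if rec["country"] == focal_country and rec["division"] == focal_division:
            focal.append(rec)
        elif rec["country"] == focal_country:
            country.append(rec)
        else:
            world.append(rec)
    # Never ask for more samples than are available in a tier.
    picked_country = rng.sample(country, min(n_country, len(country)))
    picked_world = rng.sample(world, min(n_global, len(world)))
    return focal + picked_country + picked_world
```

Keeping the focal tier complete and thinning only the background keeps the data set small enough to analyze within hours while still providing context.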
Nextclade Web and CLI
- Aligns, translates, and classifies a SARS-CoV-2 genome in 20 ms
- Now used for QC and filtering in many Nextstrain workflows
- QC and mutation calls are used in many downstream analyses
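To make "mutation calls" concrete, here is a minimal sketch of how substitutions could be read off a pairwise alignment. It is an assumption-laden toy, not Nextclade's algorithm: the function name `call_substitutions` is hypothetical, the two sequences are assumed to be already aligned to equal length, and gaps and `N` in the query are simply treated as missing data rather than reported.

```python
def call_substitutions(ref_aligned, query_aligned):
    """List substitutions such as 'C2T' from two aligned sequences.

    Positions are 1-based coordinates in the ungapped reference.
    Illustrative sketch only; real tools also report deletions,
    insertions, and amino-acid changes.
    """
    if len(ref_aligned) != len(query_aligned):
        raise ValueError("aligned sequences must have equal length")
    subs = []
    ref_pos = 0
    for r, q in zip(ref_aligned.upper(), query_aligned.upper()):
        if r == "-":
            continue  # insertion relative to the reference: no reference coordinate
        ref_pos += 1
        if q in "-N":
            continue  # deletion or missing data, not a substitution
        if q != r:
            subs.append(f"{r}{ref_pos}{q}")
    return subs
```

For example, `call_substitutions("ACGT-ACGT", "ATGTTACNA")` yields `["C2T", "T8A"]`: the inserted `T` and the `N` are skipped, and coordinates follow the reference.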
All open sequences annotated and aligned, updated daily:
Nextclade Web and CLI
- Prevalent QC problems changed over time:
  - Initially: assembly/mapping problems with clustered SNPs
  - Later: reference calls in uncovered regions
  - Later: contamination/cross-talk
- Feature additions:
  - Differentiating between biological (known) frame shifts/stop codons and artefacts
  - Detecting mixed lineages (recombinants, chimeras, contaminants)
  - Added Pango lineage calling
- Rewrites:
  - Summer 2020: "web-first" JavaScript application (plus Node.js CLI)
  - Spring 2021: rewrite in C++ with WebAssembly
  - Jun 2022: release of the Rust version
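The "clustered SNPs" signature mentioned above lends itself to a simple check: count how many private mutations fall within a short stretch of the genome. The sketch below is one possible way to do this with a sliding window. The function names, the 100-base window, and the cutoff of 6 are illustrative assumptions and should not be read as Nextclade's actual rule or configuration.

```python
def max_snps_in_window(positions, window=100):
    """Largest number of mutation positions found within any `window` bases."""
    pos = sorted(positions)
    best = 0
    left = 0
    for right, p in enumerate(pos):
        # Shrink the window from the left until it spans fewer than `window` bases.
        while p - pos[left] >= window:
            left += 1
        best = max(best, right - left + 1)
    return best


def has_snp_cluster(positions, window=100, cutoff=6):
    """Flag a sequence whose mutations are suspiciously clustered (toy rule)."""
    return max_snps_in_window(positions, window) >= cutoff
```

A sequence with six mutations between positions 5000 and 5050 would be flagged, while the same number of mutations spread evenly along the genome would not. Mutations that arise from mapping or assembly errors tend to pile up locally, which is why such a simple count can be informative.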
May 2022: Monkeypox virus
- SARS-CoV-2 is a big RNA virus (30 kb); the MPXV genome is 200 kb long.
- Repeats, low complexity regions, etc.
- Can we still align this in the browser?
Take home messages
- Simple tools often more helpful than sophisticated ones
- Client side computing in the browser has a lot of potential
- Upsides: no expensive backends, no data privacy issues
- Lineage annotations are immensely useful!
→ use sequence information without phylogenetics (done by humans).
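As an illustration of using lineage annotations without building a tree, the sketch below tabulates lineage frequencies per ISO week from `(collection_date, lineage)` pairs. The function name and input layout are hypothetical, and the sketch ignores sampling bias and reporting delays, which matter in practice.

```python
from collections import Counter, defaultdict
from datetime import date


def weekly_lineage_frequencies(samples):
    """Map (ISO year, ISO week) -> {lineage: fraction of that week's samples}.

    `samples` is an iterable of (datetime.date, lineage_name) pairs.
    Toy summary only: no smoothing, no correction for biased sampling.
    """
    counts = defaultdict(Counter)
    for collection_date, lineage in samples:
        iso = collection_date.isocalendar()
        counts[(iso[0], iso[1])][lineage] += 1
    freqs = {}
    for week, c in sorted(counts.items()):
        total = sum(c.values())
        freqs[week] = {lin: n / total for lin, n in c.items()}
    return freqs
```

With made-up input such as three `BA.1` samples and one `BA.2` sample in the first week of January 2022, the function reports frequencies of 0.75 and 0.25 for that week. Tracking these fractions over time summarizes variant dynamics from annotations alone.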