Using open data to track and predict infectious disease

Richard Neher
Biozentrum & SIB, University of Basel

slides at

Data sharing in public health emergencies

data "owner" ↔ public interest

... has been problematic in the past

  • 2002 Severe acute respiratory syndrome (SARS)
  • 2003 H5N1 influenza outbreak. Some countries stopped sharing any data
  • 2013-2015 Ebola virus outbreak in West Africa

  • 2014-2016 Zika virus outbreak: Controversies about attribution and reuse
  • 2014- H7N9 influenza outbreak: Controversies about attribution and reuse

Different disease -- different scientists and institutions.

Rapid identification of the virus, sequencing, and transparent sharing by Chinese scientists and authorities

  • >10 genomes shared on GISAID
  • all sequenced cases come from a single source
  • diagnostic recommendations shared through WHO
  • early days... things are moving fast

Sequences record the spread of pathogens

Mutations accumulate at a rate of $10^{-5}$ per site per day!
images by Trevor Bedford

Influenza virus genome - 8 segments

Zika virus genome $\sim 10000$ bases

Ebola virus genome $\sim 20000$ bases

Many RNA viruses pick up one mutation every 2-4 weeks!
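
As a rough back-of-the-envelope check, the per-site rate and the genome lengths quoted above can be combined into an expected waiting time between mutations anywhere in the genome. This is a minimal sketch: the function name `days_per_mutation` is made up for illustration, and the rate is the rounded order-of-magnitude figure from the slide, not a measured value for any particular virus.

```python
def days_per_mutation(rate_per_site_per_day: float, genome_length: int) -> float:
    """Expected number of days between mutations anywhere in the genome.

    Assumes every site mutates independently at the same constant rate,
    so the genome-wide rate is simply rate * length.
    """
    if rate_per_site_per_day <= 0 or genome_length <= 0:
        raise ValueError("rate and genome length must be positive")
    return 1.0 / (rate_per_site_per_day * genome_length)


# Rounded rate from the slide (order of magnitude only).
RATE = 1e-5  # mutations per site per day

# Approximate genome lengths quoted on the slides.
GENOMES = {"Zika virus": 10_000, "Ebola virus": 20_000}

for name, length in GENOMES.items():
    wait = days_per_mutation(RATE, length)
    print(f"{name}: roughly one mutation every {wait:.0f} days")
```

With the rounded rate this gives intervals of days to about a week and a half. An interval of 2-4 weeks for a 10000-base genome corresponds to a rate of a few times $10^{-6}$ per site per day, so both numbers should be read as order-of-magnitude statements.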

Frequent mutations imply...

  • most viruses in an outbreak/season differ from each other
  • transmission chains can be inferred
  • transmission can be ruled out!
  • geographic spread can be reconstructed
  • age of an outbreak can be estimated
  • drug resistance surveillance
  • specific mutations might mediate antigenic mismatch
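
Two of the points above, counting differences between genomes and estimating the age of an outbreak, can be illustrated with a toy calculation. This is a deliberately crude sketch and not how real analyses are done (those use phylogenetic trees and sampling dates). It assumes all samples were taken at the same time and descend from one common ancestor, so two genomes differ on average by $2 \mu L T$ mutations. The sequences, the rate, and the function names are hypothetical.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count differing positions between two aligned sequences.

    Positions where either sequence has an ambiguous base or a gap
    (anything other than A, C, G, T) are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return sum(x != y for x, y in zip(a, b) if x in valid and y in valid)


def crude_outbreak_age_days(seqs: list, rate_per_site_per_day: float) -> float:
    """Crude outbreak age: mean pairwise differences / (2 * rate * length).

    The factor 2 accounts for both lineages accumulating mutations
    independently since their common ancestor.
    """
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    pairs = list(combinations(seqs, 2))
    mean_diff = sum(hamming(a, b) for a, b in pairs) / len(pairs)
    return mean_diff / (2 * rate_per_site_per_day * len(seqs[0]))


# Made-up toy alignment; real genomes have thousands of sites.
toy_alignment = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC"]
toy_rate = 1e-3  # hypothetical per-site, per-day rate chosen for the toy example

print(hamming(toy_alignment[0], toy_alignment[1]))
print(crude_outbreak_age_days(toy_alignment, toy_rate))
```

The same difference count gives an intuition for ruling out transmission: two samples that differ by many more mutations than could plausibly accumulate in the time between them are unlikely to be linked directly.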

GISRS and GISAID -- Influenza virus surveillance

  • comprehensive coverage of the world
  • timely sharing of data through GISAID -- often within 2-3 weeks of sampling
  • hundreds of sequences per week (in peak months)
→ requires continuous analysis and easy dissemination
→ interpretable and intuitive visualization

joint project with Trevor Bedford & his lab

Barriers to data sharing: scientists

  • Privacy of study participants
  • Fear of being scooped; desire to ensure maximal return
  • Secondary analysis perceived as freeloading: "data parasites"
  • Don't want to be second-guessed
  • Release and curation is laborious
  • Sloppy records
Also see Smith et al, F1000 2016

Barriers to data sharing: organizations and governments

  • Economic consequences of outbreaks (tourism, agriculture)
  • Conflicts between high and low/middle income countries
  • Concerns about IP and commercial exploitation
  • Legislative/regulatory barriers
See also: van Panhuis et al, BMC Public Health 2014

Overcoming Barriers

Open vs restricted sharing

  • Alternatives to GenBank: GenBank is "public domain", no requirement to credit data producers
  • GISAID/EpiFlu: sign-up and agree to terms and conditions
  • Platform for sharing and discussing molecular epidemiology
  • Explicit data reuse terms
  • Outline planned projects in a white paper
  • Caveat: Very difficult to enforce...
See Bogner et al, 2006

Building Trust

  • Peter Bogner coordinated Influenza data sharing
  • Andrew Rambaut coordinated Ebola virus data sharing
  • During the Ebola virus outbreak, WHO and journals explicitly encouraged data sharing
See also Smith et al, F1000 2016

Make sharing easy and provide incentives!


  • Trevor Bedford
  • Pavel Sagulenko
  • James Hadfield
  • Emma Hodcroft
  • Tom Sibley
  • and others