SARS-CoV-2 phylogenetic analysis

This workshop provides a brief overview of the basics of phylogenetic analysis, the spread and evolution of SARS-CoV-2, the different nomenclature systems, and the use of web tools.

Slides and materials

Links and example data

nextclade: Clade assignment, sequence QC, phylogenetic placement
cov-lineages: Lineage assignment and phylogenetic placement
Example data small
Example data large

Question and answer session

Q: How does that rate of mutations compare to other viruses?

A: It’s relatively slow-evolving for an RNA virus (slower than flu, for instance). On average we see a new mutation being transmitted every 2 weeks.

A: Influenza viruses, for example, mutate about 3-4 times faster. But the SARS-CoV-2 genome is about twice the size.
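
As a rough illustration of the arithmetic behind these figures, the Python sketch below converts "one new mutation every 2 weeks" into a yearly rate per genome and per site. The genome length of 29,903 nucleotides (the Wuhan-Hu-1 reference) and the function names are assumptions added for this example, not figures given in the workshop.

```python
# Assumed value: length of the Wuhan-Hu-1 reference genome in nucleotides.
GENOME_LENGTH = 29_903


def substitutions_per_year(days_per_mutation):
    """Mutations accumulated per genome per year, given days between mutations."""
    return 365.25 / days_per_mutation


def per_site_rate(days_per_mutation, genome_length=GENOME_LENGTH):
    """Substitutions per site per year for a genome of the given length."""
    return substitutions_per_year(days_per_mutation) / genome_length


# "A new mutation every 2 weeks" from the answer above.
per_year = substitutions_per_year(14)
rate = per_site_rate(14)

print(f"{per_year:.1f} mutations per genome per year")
print(f"{rate:.2e} substitutions per site per year")
```

The same functions give the "2 mutations per month" figure quoted later in the session when the yearly value is divided by 12.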

Q: Is it known if any of these mutations has an impact on the virulence of the virus for example?

A: There is no evidence at the moment that any of these mutations affect virulence. There is some evidence that S:D614G increases transmission.

A: A large deletion in ORF8 might decrease severity, but it looks as if this variant has gone extinct.

Q: Could you please kindly explain the difference between strain, clade, and lineage? Thanks

A: No problem, all these terms can be very confusing. Different strains of viruses have different biological properties that are stable in natural conditions, for instance different antigenic properties. Clades are groupings of the SARS-CoV-2 phylogeny based on a minimum size and persistence threshold (Nextstrain has 5 clades). Lineages are finer-scale, epidemiologically informed groups in the SARS-CoV-2 phylogeny. There are >250 lineages at the moment.

Q: At what point does the diversity represent a new strain? Or what defines a different strain?

A: That's a great question. In virology we don't have a hard definition for 'strain' - it's used very loosely and there aren't really clear criteria for when something becomes a strain. However, as Aine showed in her talk, it's important to keep in mind that the diversity we see in SARS-CoV-2 is much, much less than in the viruses we commonly study, which have been circulating in humans much longer - and where we might more commonly refer to 'strains'. So, there's no hard-and-fast threshold, but we wouldn't expect to be saying 'strain' for SARS-CoV-2 very soon!

Q: How do you know the strain that you have sequenced is the original strain to cause an outbreak?

A: Based on the sequences we have and the alignment, we can tell that all sequences originated from a single spillover event late last year. The closest sequence other than the pandemic sequences has a branch with ~1,200 mutations on it.

A: This is mostly a question of terminology - compared to other viruses we would say there is only one 'strain' of SARS-CoV-2. However, we can see there is definitely diversity - which we might call clades, lineages, or variants, as we've covered. As it says above, though, all of these are much, much more closely related to each other than to their nearest neighbour - they're all part of the same outbreak. I hope that helps!

Q: How can we tell whether two viruses belong to the same strain? Thanks

A: If they fall into the SARS-CoV-2 diversity they are the same strain of SARS-CoV-2.

Q: Is there going to be an updated version of the Focal analysis for specific European countries in the near future? By that I mean: will other countries be added?

A: Hi Ioana, it's true that defining 'Europe' is not always so easy :) What country are you looking for? In some cases we unfortunately may not have enough samples for a separate build. But we are definitely open to adding more countries!

Q: I was curious about the status of Romania (being a Romanian myself). Maybe we aren't providing the complete information. Thank you for your reply and for being open to adding more countries, as well!

A: I've just had a look - it does seem like we've got enough sequences for Romania to do a build. Not sure why it was left out - may simply be that when we started this months ago, it didn't have many! But I can add this so we can start including a Romanian build :)

Q: Fantastic so far. On the same topic, do you have a build for Canada? Thanks!

A: Unfortunately we don't have a build for Canada, and as far as I'm aware no other groups are maintaining a Canadian build. You can see the North American build here. And you can see all the builds maintained by us (excluding the Europe ones - sorry!) and other groups here

Q: Sorry for the naive question but what would constitute enough for a build for a country?

A: There isn't a hard cutoff for the number of sequences needed for a build - we could technically do this even if there was just 1! It just won't be very useful as it'll mostly be sequences from elsewhere :) In that case, just looking at the (for example) Europe build would be just as informative. Each build takes time & computation to run, so we are likely to start maintaining a build for a country when it has a few hundred sequences at least.

Q: What is known about the mutation rate within a host?

A: Hi Karin! Research on this is still ongoing - as this is an acute infection, it is less common to have multiple samples per person. However, with an average rate of 2 mutations per month we don't necessarily expect to see large diversity within a host.

Q: Would we be able to apply these tools to SARS-CoV-2 samples from wastewater (representing multiple cases in one sample)?

A: Great question Hillary. Yes, you can apply these tools, but the interpretation is indeed harder. We have just recently been discussing this. The difficulty is how to interpret a sample which has an 'excess' of mutations due to containing multiple variants. So while the tools will work - the interpretation will be harder. We are working on a better way to show this in Nextstrain, but at the moment we don't have an automatic way to do this. You'll likely get a sample that shows up as being really divergent from the others.

Q: Hello! Emma, could you tell us more about the downsampling process you applied to the samples shown by Nextstrain?

A: Yes, of course. For the global builds, we sample 16 samples per country, per month, per year. With samples from over 112 countries, that adds up! For the regional builds (and the Europe & other smaller focal builds Nextstrain members maintain) we have two steps - a 'focal' sample which is taken from the region in question, and the 'global context' which is from the rest of the world. For regions, we sample 32 samples per division (like state or canton) per month, per year. Then, we take 4 samples per country per month per year from the rest of the world - but we preferentially take samples that link most closely to the 'focal' set - so that we capture important links and context! To be extra detailed, that is actually the sampling done for regions which aren't Europe. For Europe we do things slightly differently since it has so many countries & so many samples - we take 44 samples per country per month per year for Europe, then add 5 samples per country per month per year for regions outside Europe - again, those samples that link most closely to the European samples.
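
The grouping step described above can be sketched in a few lines of Python. This is only an illustration of "at most N samples per country, per month, per year", not the code Nextstrain actually runs: the record fields (strain, country, date), the subsample function and the example data are all hypothetical, and the preferential choice of context samples that link closely to the focal set is not modelled here.

```python
import random
from collections import defaultdict


def subsample(records, per_group, seed=0):
    """Keep at most `per_group` records for each (country, year, month) group."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    groups = defaultdict(list)
    for rec in records:
        # Dates are assumed to be ISO formatted: YYYY-MM-DD.
        year, month = rec["date"].split("-")[:2]
        groups[(rec["country"], year, month)].append(rec)

    kept = []
    for key in sorted(groups):
        members = groups[key]
        if len(members) > per_group:
            # Oversized groups are cut down by uniform random sampling.
            members = rng.sample(members, per_group)
        kept.extend(members)
    return kept


# Hypothetical metadata: 40 sequences from one country, 5 from another.
records = (
    [{"strain": f"CH-{i}", "country": "Switzerland", "date": "2020-03-15"} for i in range(40)]
    + [{"strain": f"RO-{i}", "country": "Romania", "date": "2020-03-20"} for i in range(5)]
)

# 16 per country per month per year, as quoted for the global builds.
kept = subsample(records, per_group=16)
print(len(kept), "of", len(records), "sequences kept")
```

Groups that are already smaller than the limit are kept whole, which is why sparsely sampled countries stay fully represented while heavily sampled ones are thinned.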

Q: Thank you very much. I see... it is very dynamic, depending on what you are looking for at each moment. Great!!

A: Yes! And for our more focused builds (ex: I maintain builds for Texas) we adjust this accordingly. It's very flexible! You can read a bit more about the subsampling options here (go to the Subsampling heading): https://nextstrain.github.io/ncov/customizing-analysis

Q: I'm about to start my PhD (next Monday) and the data would be absolutely helpful. My paper is about coronaviruses but, being a vet school alumnus, I will focus on the transmission aspects: whether there are similarities between the strains we find in the pets of infected owners and in the owners themselves.

A: There have been some interesting transmission patterns on mink farms in the Netherlands.

A: If you go to the Netherlands build, you can look for the box on the left called 'search strains' and start typing 'mink' - it'll show you all the mink sequences that have been included - and they pop up as larger circles on the tree!

Q: At this point and with the available information, which genome region would you select for an amplicon sequencing approach to study the viral diversity of SARS-CoV-2 in complex samples (e.g. sewage)?

A: The sequencing primers by the ARTIC consortium are updated frequently. These would probably be a good starting point.

Q: Considering the low number of mutations present at this point, all sequences would group within the same OTUs. Should we work only with unique sequences, avoiding the OTU approach?

A: These sequences are all at least 99.9% identical. I don't see a need to use OTUs. The clades and lineages are classification systems to address this need.
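
To make the 99.9% figure concrete, here is a minimal Python sketch that computes percent identity between two aligned sequences and the number of differences a given identity allows. The helper names and the 29,903-nucleotide genome length (the Wuhan-Hu-1 reference) are assumptions added for illustration; positions with gaps or ambiguous bases are simply skipped.

```python
def percent_identity(seq_a, seq_b):
    """Percent identity over positions where both aligned sequences have A/C/G/T."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:  # skip gaps and ambiguous bases such as N
            compared += 1
            matches += a == b
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


def max_differences(genome_length, min_identity_percent):
    """Largest number of differing sites compatible with the given identity."""
    return round(genome_length * (100.0 - min_identity_percent) / 100.0)


# Toy aligned sequences: one mismatch in ten positions.
print(percent_identity("ACGTACGTAC", "ACGTACGTAT"))

# Assumed genome length of 29,903 nt at 99.9% identity.
print(max_differences(29_903, 99.9))
```

At 99.9% identity two genomes differ at only about 30 sites, so any OTU clustering threshold looser than that would place every SARS-CoV-2 sequence in a single cluster.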

Q: How do we choose an Outgroup?

A: For making the SARS-CoV-2 tree generally we use early Wuhan sequences to root the tree - these aren't really an outgroup though as they're part of the same virus/outbreak. You could use the RaTG13 sequence as an 'outgroup' but it's quite divergent, as Aine showed - it'll be quite far from the SARS-CoV-2 sequences.

A: We typically use the reference sequence Wuhan-01 as a reference and outgroup.

A: For us, we’re rooting on Wuhan-04 which is an A-lineage sequence. But again, yes, an early sequence from Wuhan is used to root the tree.

Q: If a particular region in the same country (or large city) has a relatively high number of lineages, would that imply a greater risk of transmission? Would it be wise to pay close attention to those kinds of regions for public health interventions?

A: I don't think it implies a greater risk of transmission. But it does mean that multiple concurrent transmission clusters exist.

Q: Is the enterovirus D68 analysis tool really specific to the EV-D species, or does it cover other interesting EV types (e.g. EV-71)?

A: The pipeline could be used for other enteroviruses - you'd need to change out the reference genome (used for alignment & knowing where genes are) and of course the dataset. However, the base of the rest should work fairly well :)

FEEDBACK/nice words

Thank you for the very useful session
Thanks!
Thank you all!
thanks very much guys :)
Great, Thanks Professor!
Thanks !
That is innovative! Many thanks !!
Thanks for organizing this workshop, which has been extremely informative and practical for an ID epidemiologist to catch up with the latest developments in how genomic tools can be applied to understanding the current pandemic in real time!
That was helpful Sir, many thanks!
What a great Workshop it is. Thanks everyone!
Thank you for this webinar! It was really helpful.
Thank you for this webinar. It has really helped my understanding in general, and introduced a field properly that until now I know my knowledge was totally skating the surface!
This is even better than I have expected! Thank you so much!
hi everyone, thank you for taking time to teach us on these exciting methods.
Thank you
