Preparing for the Workshop
We look forward to working with you at the QLife workshop! Before you arrive, you should install Nextstrain on the computer you will bring. In the afternoon of the workshop, we aim to get your own data working on Nextstrain. To do this, you should follow the guide below to ensure your data are formatted and organized correctly, allowing you to get working on it straight away!
First, install Nextstrain.
If you have data you want to analyze, prepare your data. Otherwise, we will provide example data.
Installation
Nextstrain can be installed in a variety of ways on all major operating systems. We prefer the conda installation here over Docker. Feel free to deviate from this if you know what you are doing, but you'll be on your own. If you encounter problems following these steps, get in touch with Valentin during the first days of the workshop.
Because conda recently upgraded its default Python version to 3.10, we need to modify step 4 of the installation instructions slightly:
mamba create -n nextstrain -c bioconda nextstrain-cli --yes python=3.9
Testing your installation
To test the installation, we suggest you download our simplest tutorial data set via
git clone https://github.com/nextstrain/zika-tutorial.git
This will download the pipeline and example data. You can run the analysis with
cd zika-tutorial
conda activate nextstrain
snakemake --cores 1
To view the result, type
auspice view
This should print the following to the screen:
---------------------------------------------------
Auspice server now running at http://localhost:4000
Serving auspice version 1.38.0.
Looking for datasets in /home/richard/Projects/nextstrain/zika-tutorial/auspice
Looking for narratives in /home/richard/Projects/nextstrain/zika-tutorial
---------------------------------------------------
Opening your web browser and setting the URL to localhost:4000 should show your analysis (you might need to add an exception to your firewall).
In this case, Nextstrain has run successfully!
If something isn't working, please get in touch and we'll try to help you get it running before the workshop!
Prepare your data
We want to get you started analyzing your own data using Nextstrain during the workshop! To facilitate this, it would be extremely helpful if your data were formatted such that Nextstrain understands them. Otherwise you'll spend a lot of workshop time on simple and boring reformatting.
Metadata
Your Nextstrain analysis is going to be vastly more interesting if the sequences or samples you analyze have rich metadata. This metadata would typically include collection dates, geographic location, symptoms of patients, host characteristics, etc. But for Nextstrain to be able to parse and visualize these, they need to be formatted consistently. Your data may have metadata encoded in the sequence name (see example below). If not, a very transparent way is to provide the metadata as a separate table in a tab- or comma-separated file.
An example metadata file is shown here:
strain accession date region
1_0087_PF KX447509 2013-12-XX oceania
1_0181_PF KX447512 2013-12-XX oceania
1_0199_PF KX447519 2013-11-XX oceania
BRA/2016 KY785433 2016-04-08 south_america
BRA/2015 KY558989 2015-02-23 south_america
There needs to be at least one column named strain or name.
These need to match the identifiers of your sequences (in the Fasta or VCF file) exactly and must not contain spaces or characters such as ()[]{}|#><.
Dates should be formatted as YYYY-MM-DD.
You can specify unknown days or months by replacing the respective values with XX (ex: 2013-01-XX or 2011-XX-XX), and completely unknown dates can be shown with 20XX-XX-XX (which does not restrict the sequence to the 21st century; it could be earlier).
Because Excel will automatically change the date formatting, we recommend not opening or preparing your metadata file in Excel. If the metadata is already in Excel, or you decide to prepare it in Excel, we recommend using another program to correct the dates afterwards (and don't open it in Excel again!).
Geographic locations can be broken down as region, country, division or city, and for many of these Nextstrain already knows their coordinates.
It is important that these are spelled consistently, and we have adopted the convention that spaces are replaced by underscores (ex: New_Zealand).
To make the most of Nextstrain's features, we recommend including sampling date and at least one type of geographic information if at all possible. However, you can also include things like symptoms, host, clinical outcome - and more!
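The formatting rules above can be checked before the workshop with a few lines of Python. The following is a minimal sketch and not part of Nextstrain: the function name check_metadata is hypothetical, the set of forbidden characters is taken from the examples listed above, and a column called date is assumed to hold the sampling dates, as in the example table.

```python
import csv
import io
import re

# Characters the guide lists as not allowed in strain names (assumed list).
FORBIDDEN = set(" ()[]{}|#><")
# YYYY-MM-DD, where unknown parts are written as XX (e.g. 2013-12-XX, 20XX-XX-XX).
DATE_PATTERN = re.compile(r"^(\d{4}|\d{2}XX)-(\d{2}|XX)-(\d{2}|XX)$")


def check_metadata(handle, delimiter="\t"):
    """Return a list of human-readable problems found in a metadata table."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    fields = reader.fieldnames or []
    id_column = next((c for c in ("strain", "name") if c in fields), None)
    if id_column is None:
        return ["no 'strain' or 'name' column found"]
    problems = []
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(reader, start=2):
        strain = row[id_column] or ""
        bad = sorted(FORBIDDEN.intersection(strain))
        if bad:
            problems.append(f"line {line_no}: forbidden characters {bad} in {strain!r}")
        date = row.get("date")
        if date is not None and not DATE_PATTERN.match(date):
            problems.append(f"line {line_no}: date {date!r} is not YYYY-MM-DD")
    return problems


if __name__ == "__main__":
    example = (
        "strain\taccession\tdate\tregion\n"
        "1_0087_PF\tKX447509\t2013-12-XX\toceania\n"
        "BRA/2016\tKY785433\t2016-04-08\tsouth_america\n"
    )
    print(check_metadata(io.StringIO(example)))  # prints []
```

For a comma-separated file, pass delimiter="," instead. An empty list means that none of the checks in this sketch failed; it does not guarantee that Nextstrain will accept the file.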
Sequence as Fasta -- virus and other small genomes
Most Nextstrain analyses start from a set of sequences saved in Fasta format. It is crucial to ensure that the name of the sequence matches exactly the strain name in the metadata table.
>1_0087_PF
ACTCGCTGCATCG...
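Whether the names really match exactly is easy to test with a short script. This is a minimal sketch under stated assumptions: the helper names fasta_names and compare_names are hypothetical, and the whole header line after the > is treated as the sequence name.

```python
import io


def fasta_names(handle):
    """Collect sequence names: the text after '>' on each header line."""
    return [line[1:].strip() for line in handle if line.startswith(">")]


def compare_names(sequence_names, metadata_names):
    """Return (names only in the sequences, names only in the metadata), sorted."""
    seqs, meta = set(sequence_names), set(metadata_names)
    return sorted(seqs - meta), sorted(meta - seqs)


if __name__ == "__main__":
    fasta = io.StringIO(">1_0087_PF\nACTCGCTGCATCG\n>BRA/2016\nACTCGCTGCATCG\n")
    metadata_strains = ["1_0087_PF", "BRA/2015"]
    only_in_fasta, only_in_metadata = compare_names(fasta_names(fasta), metadata_strains)
    print("missing from metadata:", only_in_fasta)
    print("missing from sequences:", only_in_metadata)
```

Both lists should be empty before you start an analysis. The comparison is case-sensitive, so bra/2016 and BRA/2016 count as different names.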
Often metadata is encoded in the Fasta header like so:
>1_0087_PF | KX447509 | 2013-12-XX | oceania
ACTCGCTGCATCG...
The analysis part of Nextstrain, augur, can parse metadata from Fasta headers, but you have to make sure that every sequence has the exact same metadata fields and that they are consistently delimited with |.
Furthermore, none of the metadata fields can contain the character |.
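To see whether your headers are consistent, you can split them yourself before handing them to augur. The sketch below is plain Python for illustration only and is not augur's own parser: the function name parse_header is hypothetical, the field names come from the example metadata table, and stripping the spaces around each | is an assumption based on the example header.

```python
def parse_header(header, fields):
    """Split one '|'-delimited Fasta header into a dict keyed by field name."""
    values = [part.strip() for part in header.lstrip(">").split("|")]
    if len(values) != len(fields):
        # Too few or too many fields: a field is missing, or a value contains '|'.
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}: {header!r}")
    return dict(zip(fields, values))


if __name__ == "__main__":
    fields = ["strain", "accession", "date", "region"]
    record = parse_header(">1_0087_PF | KX447509 | 2013-12-XX | oceania", fields)
    print(record["strain"], record["date"])  # prints: 1_0087_PF 2013-12-XX
```

Running this over every header in your file and catching the ValueError points you to the sequences whose headers need fixing.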
Reference sequence
To provide consistent alignment and annotation, augur uses an annotated reference sequence in GenBank format.
Please bring an appropriate reference sequence along with your data set.
You should be able to download such a sequence from databases such as GenBank.
For example, in our Zika build we use the reference sequence PF13/251013-18. Since we have full-genome Zika samples, it was important that we chose a reference that is also full-genome in length. If you have partial-genome sequences, you should look for a reference that covers the same region you have.
You can download a reference sequence you find on GenBank by using the 'send to' link in the top-right corner and then selecting 'Complete Record', 'File', and 'GenBank' as the format.
Sequence in VCF -- bacteria and other larger genomes
If you would like to run Nextstrain on whole-genome bacterial data, you probably would like to start with VCF sequence data. (Note that if you have bacterial SNP data, this may be in the above Fasta format.)
All of your samples should be combined into one VCF file, where each sample has its own column.
As always, it's crucial that the sequence names in your VCF file (at the top of each column) match the sequence names in your metadata file.
The VCF file can be uncompressed (ending in .vcf) or compressed (ending in .vcf.gz).
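The sample names of a VCF file sit in its #CHROM header line, after the nine fixed columns (CHROM to FORMAT) defined by the VCF format. The following minimal sketch lists them so you can compare them with your metadata; the function names open_vcf and vcf_sample_names are hypothetical, and the file name in the usage note is a placeholder.

```python
import gzip


def open_vcf(path):
    """Open a .vcf or .vcf.gz file in text mode."""
    opener = gzip.open if str(path).endswith(".gz") else open
    return opener(path, "rt")


def vcf_sample_names(lines):
    """Return the sample names from the '#CHROM' header line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            # Columns 1-9 are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT.
            return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found")


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample_A\tsample_B\n",
        "NC_000962\t1977\t.\tA\tG\t.\t.\t.\tGT\t0\t1\n",
    ]
    print(vcf_sample_names(example))  # prints ['sample_A', 'sample_B']
```

For a real file, something like `with open_vcf("my_samples.vcf.gz") as handle: names = vcf_sample_names(handle)` should work, after which the names can be compared with the strain column of your metadata.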
You will also need the reference sequence to which your VCF file maps. This is a Fasta file that contains one full-length genome with positions that correspond to the positions in your VCF file.
Optional: If you have areas of your genome that you would like to exclude from analysis (perhaps because they are repetitive or recombinant), you can provide a file in BED format which lists regions to 'mask' (exclude).
An example of this format is below. It does not matter what is in the Chrom, locus tag or Comment fields.
Chrom ChromStart ChromEnd locus tag Comment
NC_000962 23182 23269 IG18_Rv0018c-Rv0019c
NC_000962 33582 33794 Rv0031 remnant of A transposase
NC_000962 80194 80623 IG71_Rv0071-Rv0072
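Since only the ChromStart and ChromEnd columns matter, a quick sanity check is to read just those two columns back. This is a minimal sketch, assuming a tab-separated file with an optional header row; the function name read_mask_regions is hypothetical. It deliberately does not interpret the coordinates (BED files are normally 0-based with an exclusive end), so check how your pipeline treats them.

```python
import io


def read_mask_regions(handle):
    """Return (start, end) pairs from the 2nd and 3rd columns of a BED file."""
    regions = []
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # blank or incomplete line
        try:
            start, end = int(parts[1]), int(parts[2])
        except ValueError:
            continue  # header row such as "Chrom ChromStart ChromEnd"
        if start > end:
            raise ValueError(f"start {start} is after end {end}")
        regions.append((start, end))
    return regions


if __name__ == "__main__":
    bed = io.StringIO(
        "Chrom\tChromStart\tChromEnd\tlocus tag\tComment\n"
        "NC_000962\t23182\t23269\tIG18_Rv0018c-Rv0019c\t\n"
        "NC_000962\t33582\t33794\tRv0031\tremnant of A transposase\n"
    )
    print(read_mask_regions(bed))  # prints [(23182, 23269), (33582, 33794)]
```

If the list is shorter than expected, the file is probably separated by spaces instead of tabs.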
Optional: Bacteria can have thousands of genes, and translating all of them can take quite a bit of time. If you don't want to translate all of them, you can provide a list of the genes you'd like to include in the analysis, and only these will be processed. For example, you might list the genes that are associated with drug resistance. To do this, simply create a file with one gene name per line.
Annotation reference sequence
To correctly translate and identify genes in your data, you will also need to provide an annotation file - the equivalent of the GenBank file needed for Fasta input, above. However, for VCF files, this should be in GFF format.
You can also find an appropriate GFF annotation reference on GenBank. Be sure to pick one that is very close to the strain you are using, if there might be variability in the genes present! If the positions in the GFF file do not match the positions in your VCF file, it will not work.
To download a file from GenBank, use the 'send to' button in the top-right corner, then select 'Complete Record', 'File', and 'GFF3' for the format.
You will have to modify the first column of the GFF so that it matches the CHROM (first column) of your VCF file.
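Editing the first column by hand is error-prone for a large annotation file, so a small script can help. The following is a minimal sketch, not a Nextstrain tool: the function name rename_gff_seqid is hypothetical, and the identifiers NC_000962.3 and NC_000962 are only illustrative. Comment and directive lines (starting with #) are left untouched, so a ##sequence-region line will still carry the old name.

```python
def rename_gff_seqid(lines, new_seqid):
    """Yield GFF lines with the first column of each feature line replaced."""
    for line in lines:
        first, sep, rest = line.partition("\t")
        if line.startswith("#") or not sep:
            # Directives, comments, blank lines and untabbed lines pass through.
            yield line
        else:
            yield f"{new_seqid}\t{rest}"


if __name__ == "__main__":
    gff = [
        "##gff-version 3\n",
        "NC_000962.3\tRefSeq\tgene\t1\t1524\t.\t+\t.\tID=gene0\n",
    ]
    for line in rename_gff_seqid(gff, "NC_000962"):
        print(line, end="")
```

To rewrite a file, read from one open file handle and write each yielded line to a second one, keeping the original download as a backup.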