Installation and data curation

Preparing for the Workshop

We look forward to working with you at the BC2/BaselLife workshop on Monday! Before you arrive, you should install Nextstrain on the computer you will bring. In the afternoon of the workshop, we aim to get your own data working on Nextstrain. To do this, you should follow the guide below to ensure your data are formatted and organized correctly, allowing you to get working on it straight away!

First, install Nextstrain.

Next, prepare your data.

Installation

Nextstrain can be installed in a variety of ways on all major operating systems. For this course, we detail our preferred way of installing Nextstrain below. Feel free to deviate from it if you know what you are doing, but you'll be on your own. If you encounter problems following these steps, get in touch with Emma and Richard with as much detail as possible.

Operating-system-specific steps -- Windows

Since Windows 10, you have the option to use native Linux software on Windows via the "Windows Subsystem for Linux" (WSL). Instructions on how to install and activate WSL can be found in the first two steps here. Do not follow the steps beyond 'Create a Link to Your Files'! We recommend you install the Ubuntu distribution, as this is the one we will be using during teaching.

After completing these steps, open a terminal (look for 'Bash on Ubuntu' or 'Ubuntu' in the Start Menu), and run the following commands:

sudo apt-get update
sudo apt-get install curl gcc

Operating-system-specific steps -- macOS

To run Nextstrain on macOS, you'll need the developer tools. Open a terminal and type gcc. If no developer tools are installed, you will be prompted to install them.

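Operating-system-specific steps -- Linux

On Debian- or Ubuntu-based distributions, open a terminal and run the following commands (other distributions provide curl and gcc through their own package manager):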

sudo apt-get update
sudo apt-get install curl gcc

Install miniconda

Our preferred way to install is via the conda package manager. Conda will install all required python dependencies and necessary bioinformatic tools. To get the miniconda installer for Linux and Windows Subsystem for Linux (WSL), type

curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o conda_installer.sh

On macOS, use

curl https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -o conda_installer.sh

Next, you need to run the installer:

bash conda_installer.sh

This installer will ask you to accept terms and conditions (yes), where the conda distribution is to be saved (the default is fine), and whether your shell should be configured (yes). After this step, you should open a new terminal so that the changes take effect.

Install Nextstrain

To install Nextstrain, use the following commands:

curl http://data.nextstrain.org/nextstrain.yml --compressed -o nextstrain.yml
conda env create -f nextstrain.yml

The first command fetches a list of requirements from the web; the second sets up an environment called 'nextstrain'.

As a last step, you need to install the visualization component, which can be done with the commands:

conda activate nextstrain
npm install --global auspice

Anytime you want to use Nextstrain, you will need to activate the nextstrain environment with

conda activate nextstrain

Testing your installation

To test the installation, we suggest you download our simplest tutorial data set via

git clone https://github.com/nextstrain/zika-tutorial.git

This will download the pipeline and example data. You can run the analysis with

cd zika-tutorial
conda activate nextstrain
snakemake

To view the result, type

auspice view

This should print something like the following to the screen (the version number and paths will differ on your machine):

---------------------------------------------------
Auspice server now running at http://localhost:4000
Serving auspice version 1.38.0.
Looking for datasets in /home/richard/Projects/nextstrain/zika-tutorial/auspice
Looking for narratives in /home/richard/Projects/nextstrain/zika-tutorial
---------------------------------------------------

Opening your web browser and setting the URL to localhost:4000 should show your analysis (you might need to add an exception to your firewall). If it does, Nextstrain has run successfully!

If something isn't working, please get in touch and we'll try to help you get it running before the workshop!
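Before getting in touch, it can help to check whether the commands used above are visible in your terminal. The following is a minimal Python sketch, not part of Nextstrain itself. It assumes that the activated 'nextstrain' environment should provide commands named augur, auspice and snakemake; run it after conda activate nextstrain.

```python
import shutil

# Command names assumed to be provided by the 'nextstrain' environment
# (augur and snakemake via conda, auspice via npm); adjust if yours differ.
REQUIRED_TOOLS = ("augur", "auspice", "snakemake")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All expected commands were found.")
```

If a command is reported as missing, including that information in your message will make it easier to help you.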

Prepare your data

We want to get you started analyzing your own data using Nextstrain during the workshop! To make this possible, it would be extremely helpful if your data were already formatted in a way that Nextstrain understands. Otherwise you'll spend a lot of workshop time on simple, tedious reformatting.

Metadata

Your Nextstrain analysis is going to be vastly more interesting if the sequences or samples you analyze have rich metadata. This metadata would typically include collection dates, geographic location, symptoms of patients, host characteristics, etc. But for Nextstrain to be able to parse and visualize these, they need to be formatted consistently. Your data may have metadata encoded in the sequence name (see example below). If not, a very transparent way is to provide the metadata as a separate table in a tab- or comma-separated file.

An example metadata file is shown here:

strain      accession   date        region
1_0087_PF   KX447509    2013-12-XX  oceania
1_0181_PF   KX447512    2013-12-XX  oceania
1_0199_PF   KX447519    2013-11-XX  oceania
BRA/2016    KY785433    2016-04-08  south_america
BRA/2015    KY558989    2015-02-23  south_america

There needs to be at least one column named strain or name. Its values need to match the identifiers of your sequences (in the Fasta or VCF file) exactly and must not contain characters such as spaces or ()[]{}|#><. Dates should be formatted as YYYY-MM-DD. You can specify an unknown day or month by replacing the respective values with XX (ex: 2013-01-XX or 2011-XX-XX), and completely unknown dates can be given as 20XX-XX-XX (which does not restrict the sequence to being in the 21st century - it could be earlier).
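These rules can be checked automatically before the workshop. The following is a minimal Python sketch, not the check that augur itself performs. The function name check_metadata is hypothetical, the list of forbidden characters is only the one given above, and the date pattern is deliberately permissive (any digit may be replaced by X).

```python
import csv
import io
import re

# Characters the guide lists as not allowed in strain names.
FORBIDDEN = set(" ()[]{}|#><")
# YYYY-MM-DD, where unknown digits may be written as X (e.g. 2013-12-XX).
DATE_PATTERN = re.compile(r"[0-9X]{4}-[0-9X]{2}-[0-9X]{2}")


def check_metadata(text, delimiter="\t"):
    """Return a list of human-readable problems found in a metadata table."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    fields = reader.fieldnames or []
    name_column = next((c for c in ("strain", "name") if c in fields), None)
    if name_column is None:
        return ["no column named 'strain' or 'name'"]
    problems = []
    for row_number, row in enumerate(reader, start=1):
        name = row.get(name_column) or ""
        bad = sorted(FORBIDDEN.intersection(name))
        if not name:
            problems.append(f"row {row_number}: empty name")
        elif bad:
            problems.append(f"row {row_number}: {name!r} contains {bad}")
        if "date" in fields:
            date = row.get("date") or ""
            if not DATE_PATTERN.fullmatch(date):
                problems.append(f"row {row_number}: bad date {date!r}")
    return problems


if __name__ == "__main__":
    example = "strain\tdate\nBRA/2016\t2016-04-08\nmy strain\t08.04.2016\n"
    for problem in check_metadata(example):
        print(problem)
```

For a comma-separated file, pass delimiter="," and read the file contents into text first.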

Because Excel will automatically change the date formatting, we recommend not opening or preparing your metadata file in Excel. If the metadata is already in Excel, or you decide to prepare it in Excel, we recommend using another program to correct the dates afterwards (and don't open it in Excel again!).

Geographic locations can be broken down as region, country, division or city and for many of these Nextstrain already knows their coordinates. It is important that these are spelled consistently and we have adopted the convention that spaces are replaced by underscores (ex: New_Zealand).
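The underscore convention and the consistency requirement can be sketched in a few lines of Python. This is an illustration only: both function names are hypothetical, and treating names that differ only in capitalization as the same place is an assumption made here, not a rule stated by Nextstrain.

```python
def normalize_location(value):
    """Trim surrounding whitespace and join the words with underscores."""
    return "_".join(value.split())


def find_inconsistent_spellings(values):
    """Return groups of names that differ only in capitalization or spacing."""
    groups = {}
    for value in values:
        normalized = normalize_location(value)
        groups.setdefault(normalized.lower(), set()).add(normalized)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    countries = ["New Zealand", "new_zealand", "Brazil"]
    print([normalize_location(c) for c in countries])
    print(find_inconsistent_spellings(countries))
```

Any group reported by find_inconsistent_spellings should be reduced to a single spelling in your metadata table.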

To make the most of Nextstrain's features, we recommend including sampling date and at least one type of geographic information if at all possible. However, you can also include things like symptoms, host, clinical outcome - and more!

Sequence as Fasta -- virus and other small genomes

Most Nextstrain analyses start from a set of sequences saved in Fasta format. It is crucial to ensure that the name of the sequence matches exactly the strain name in the metadata table.

>1_0087_PF
ACTCGCTGCATCG...
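A mismatch between sequence names and metadata is easy to detect ahead of time. The following Python sketch is one way to do it; the function names are hypothetical, and it assumes that the whole header line after '>' is the sequence name.

```python
def fasta_names(lines):
    """Return the sequence names: the text after '>' on each header line."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


def compare_names(fasta_lines, metadata_names):
    """Return (names only in the Fasta file, names only in the metadata)."""
    in_fasta = set(fasta_names(fasta_lines))
    in_metadata = set(metadata_names)
    return sorted(in_fasta - in_metadata), sorted(in_metadata - in_fasta)


if __name__ == "__main__":
    fasta = [">1_0087_PF\n", "ACTCGCTGCATCG\n", ">BRA/2016\n", "ACTG\n"]
    only_fasta, only_metadata = compare_names(fasta, ["1_0087_PF", "BRA/2015"])
    print("Missing from metadata:", only_fasta)
    print("Missing from Fasta:", only_metadata)
```

Because an open file iterates over its lines, an open Fasta file can be passed directly as fasta_lines. Both returned lists should be empty before you start an analysis.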

Often metadata is encoded in the Fasta header like so:

>1_0087_PF | KX447509 | 2013-12-XX | oceania
ACTCGCTGCATCG...

The analysis part of Nextstrain, augur, can parse metadata from Fasta headers, but you have to make sure that every sequence has the exact same metadata fields and that they are consistently delimited with |. Furthermore, none of the metadata fields can contain the character |.
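To see why a consistent number of fields matters, here is a small Python sketch of header parsing. It is not augur's implementation: parse_header is a hypothetical helper, the field names are taken from the example table above, and stripping the whitespace around each field is an assumption.

```python
def parse_header(header, fields, delimiter="|"):
    """Split one Fasta header into a dict mapping field names to values."""
    parts = [part.strip() for part in header.lstrip(">").split(delimiter)]
    if len(parts) != len(fields):
        raise ValueError(
            f"expected {len(fields)} fields but found {len(parts)} in {header!r}"
        )
    return dict(zip(fields, parts))


if __name__ == "__main__":
    fields = ["strain", "accession", "date", "region"]
    print(parse_header(">1_0087_PF | KX447509 | 2013-12-XX | oceania", fields))
```

A header with a missing field, or with an extra | inside a field, raises a ValueError instead of silently shifting values into the wrong column.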

Reference sequence

To provide consistent alignment and annotation, augur uses an annotated reference sequence in GenBank format. Please bring an appropriate reference sequence along with your data set. You should be able to download such a sequence from databases such as GenBank.

For example, in our Zika build we use the reference sequence PF13/251013-18. Since we have full-genome Zika samples, it was important that we chose a reference that is also full-genome in length. If you have partial-genome sequences, you should look for a reference that covers the same region you have.

You can download a reference sequence you find on GenBank by using the 'send to' link in the top-right corner and then selecting 'Complete Record', 'File', and 'GenBank' as the format.

Sequence in VCF -- bacteria and other larger genomes

If you would like to run Nextstrain on whole-genome bacterial data, you will probably want to start with VCF sequence data. (Note that if you have bacterial SNP data, this may be in the above Fasta format.)

All of your samples should be combined into one VCF file, where each sample has its own column. As always, it's crucial that the sequence names in your VCF file (at the top of each column) match the sequence names in your metadata file. The VCF file can be uncompressed (ending in .vcf) or compressed (ending in .vcf.gz).
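The sample names of a VCF file can be listed with a short Python sketch like the one below, so that you can compare them with your metadata. It relies on the standard VCF layout, in which the line starting with #CHROM holds nine fixed columns followed by one column per sample. The function names are hypothetical.

```python
import gzip


def open_vcf(path):
    """Open a .vcf or .vcf.gz file for reading as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path)


def vcf_sample_names(lines):
    """Return the sample names from the #CHROM header line of a VCF file."""
    for line in lines:
        if line.startswith("#CHROM"):
            # The first nine columns (CHROM ... FORMAT) are fixed; samples follow.
            return line.rstrip("\r\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found")


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB\n",
    ]
    print(vcf_sample_names(example))
```

With a real file, the two helpers combine as vcf_sample_names(open_vcf(path)), where path is the location of your own VCF file.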

You will also need the reference sequence to which your VCF file maps. This is a Fasta file that contains one full-length genome with positions that correspond to the positions in your VCF file.

Optional: If you have areas of your genome that you would like to exclude from analysis (perhaps because they are repetitive or recombinant), you can provide a file in BED format which lists regions to 'mask' (exclude). An example of this format is below. It does not matter what is in the Chrom, locus tag or Comment fields.

Chrom   ChromStart  ChromEnd    locus tag   Comment
NC_000962   23182   23269   IG18_Rv0018c-Rv0019c
NC_000962   33582   33794   Rv0031  remnant of A transposase
NC_000962   80194   80623   IG71_Rv0071-Rv0072
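A quick sanity check of such a mask file can be sketched in Python as follows. This is an illustration under assumptions: the function name is hypothetical, the file is assumed to be tab-separated, and any row whose second and third columns are not integers (such as the header line) is skipped rather than reported.

```python
def read_mask_regions(lines):
    """Return (chrom, start, end) tuples from tab-separated BED-style lines."""
    regions = []
    for line in lines:
        columns = line.rstrip("\r\n").split("\t")
        if len(columns) < 3:
            continue  # blank or incomplete line
        try:
            start, end = int(columns[1]), int(columns[2])
        except ValueError:
            continue  # header line or non-numeric coordinates
        if start >= end:
            raise ValueError(f"start is not before end in {line!r}")
        regions.append((columns[0], start, end))
    return regions


if __name__ == "__main__":
    example = [
        "Chrom\tChromStart\tChromEnd\tlocus tag\tComment\n",
        "NC_000962\t23182\t23269\tIG18_Rv0018c-Rv0019c\n",
        "NC_000962\t33582\t33794\tRv0031\tremnant of A transposase\n",
    ]
    for region in read_mask_regions(example):
        print(region)
```

Only the three coordinate columns are used, in line with the note that the remaining fields do not matter.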

Optional: Bacteria can have thousands of genes, and translating all of them can take quite a bit of time. If you don't want to translate all of them, you can provide a list of genes that you'd like to include in the analysis, and only these will be processed. For example, you might include a list of genes that are associated with drug resistance. To do this, simply create a file with a list of the genes you'd like to include, with one gene name per line.

Annotation reference sequence

To correctly translate and identify genes in your data, you will also need to provide an annotation file - the equivalent of the GenBank file needed for Fasta input, above. However, for VCF files, this should be in GFF format.

You can also find an appropriate GFF annotation reference on GenBank. Be sure to pick one that is very close to the strain you are using, if there might be variability in the genes present! If the positions in the GFF file do not match the positions in your VCF file, it will not work.

To download a file from GenBank, use the 'send to' button in the top-right corner, then select 'Complete Record', 'File', and 'GFF3' for the format.

You will have to modify the first column of the GFF so that it matches the CHROM (first column) of your VCF file.
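Editing the first column by hand is error-prone for large files, so here is a minimal Python sketch of the change. It assumes that the GFF file describes a single sequence, so that every feature line receives the same name. The function name is hypothetical, and comment or directive lines starting with # (including any ##sequence-region line) are left untouched.

```python
def rename_gff_seqid(lines, new_seqid):
    """Yield GFF lines with the first column of each feature line replaced."""
    for line in lines:
        columns = line.split("\t")
        # Feature lines have nine tab-separated columns; comments, directives
        # and any embedded Fasta lines do not, and are passed through as-is.
        if line.startswith("#") or len(columns) < 9:
            yield line
            continue
        columns[0] = new_seqid
        yield "\t".join(columns)


if __name__ == "__main__":
    example = [
        "##gff-version 3\n",
        "NC_000962.3\tRefSeq\tgene\t1\t1524\t.\t+\t.\tID=gene0\n",
    ]
    print("".join(rename_gff_seqid(example, "NC_000962")), end="")
```

Here NC_000962 stands for whatever appears in the CHROM column of your own VCF file. Write the result to a new file rather than overwriting the downloaded one.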


Published

Sep 9, 2019

Category

teaching
