A glimpse into sequence bioinformatics

Setting up your system

If you know what you are doing, install the necessary program (see yaml file below) via whatever method you like best. Otherwise, I suggest you install the programs and libraries we will need via the package manager conda. To install conda, follow the instructions relevant for your operating system. On linux, this could be achieved by these two commands:

wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh

To install the packages we need, create (or download) the following file that specifies the dependencies

name: xlab
channels:
- bioconda
- defaults

dependencies:
- python=2.7
- biopython>=1.66
- numpy>=1.10.4
- scipy>=0.16.1

- fasttree>=2.1.9
- iqtree>=1.5
- mafft>=7.305

Then install all these dependencies on a specific environment via

conda env create -f xlab.yml
source activate xlab

Influenza virus sequences

Download the file H3N2_Australia.fasta and open it in a text editor. This file contains influenza virus genomes (more precisely the segment HA of H3N2 viruses sampled in Australia). You should see something like this

>A/Brisbane/1/2017|01/02/2017|Australia|H3N2
GGATAATTCTATTAACCATGAAGACTATCATTGCTTTGAGCTGCATTCTATGTCTGGTTTTCGCTCAAAA
AATTCCTGGAAATGACAATAGCACGGCAACGCTGTGCCTTGGGCACCATGCAGTACCAAACGGAACGATA
...

This is the so-called fasta file format. A line that starts with > contains the name of the sequence (in this case a concatenation of name, data, country and subtype). All other lines contain the actual sequence. A more convenient way to explore such sequences is via a dedicated alignment viewer such as aliview. If you want, download and install it.

flu_alignment

The different nucleotides (A,C,G,T) in this sequence are colored differently. The isolated differences between the sequences correspond to mutations that happened somewhere in the history of the sample. The sequences in this alignment code for a protein and the alignment viewer has the functionality to translate the nucleotide sequences into amino acid sequences.

flu_alignment_translation

The genetic code maps nucleotide triplets (so called codons) to amino acids. In this case, codons start at the 2nd positions, i.e., the reading frame is +2.

Sequence alignment

The sequences in the file above are already aligned, that is corresponding nucleotides are arranged in columns in the alignment. This is not necessarily the case. Sequences could simply have different length (parts at the beginning or end are missing in some sequences), or during evolution some parts of the sequences might have been deleted or other inserted. In this case, we need to align the sequences (insert gaps - at the appropriate places) to make sure corresponding nucleotides are in the same column. In this context, corresponding positions are positions that share a common ancestor, and in general such positions are more likely to be in the same state.

ATCTCGTCG---ACTC
||...|X||...||X|
AT---GACGGCGACGC

To score and compare alignments, one typically assigns a positive score to a match |, a negative score to a mismatch X and gaps. The optimal alignment of a pair of sequences can be computed in polynomial time (square of the length of the sequence) via a dynamic programming scheme (transfer matrix type for the physicists), but alignments of multiple sequences are in general a hard problem. A fast but accurate mutliple alignment tool is mafft.

Phylogenetic tree reconstruction

Sequences of the same influenza virus segment descent from a common ancestor without recombination. Their history is hence described by a tree that links the common ancestor to all offspring. Tree reconstruction is in general hard since there are super exponentially many trees. In most cases, however, efficient heuristics give accurate results. A reliable but fast tree reconstruction package is iqtree. To construct a tree from the sequences above, type the following on the command line

iqtree -s H3N2_Australia.fasta

This will generate a number of files and the one ending on treefile will contain the tree. This tree is written in the so-called newick format that groups sequences like so

((A:0.1, B:0.12):0.05,C:0.13);

To look at the tree you just constructed, do the following in ipython:

from Bio import Phylo
T = Phylo.read('H3N2_Australia.fasta.treefile', 'newick')
T.ladderize()
Phylo.draw(T)

this should open a window with a graphical depiction of the tree.

A glimpse into sequence bioinformatics

Setting up your system

Influenza virus sequences

Sequence alignment

Phylogenetic tree reconstruction

Published

Category

Tags