neherlab@biozentrum
  • Home
  • Outreach
  • Publications
  • Software
  • Talks
  • Teaching
  • Team

Eco-Evolutionary dynamics in Microbial Communities

the command line

Many program used to process and analyze sequence data are called from the command line. These program don't usually have fancy graphical user interfaces but get the necessary input as parameters of the command by which they are called. Working with the command line can be much more efficient for many task, so its a worthwhile thing to learn. There are several good introductary courses out there. Give them a go, ask if you get stuck

  • Very basic interactive intro by Codeacademy
  • A tutorial by David Baumgold covers the basics
  • More serious (and laborious) tutorial LinuxCommand
  • Cheat-sheets like this can come in handy

Sequencing data

Sequencing machines read some type of signal (usually light or current) from which they infer the sequence. With nanopore, we have some window into this but most of the time you'll get a text files with base calls in fastq format. This looks like this:

@HISEQ:229:C6N64ACXX:4:1101:7886:2578 1:N:0:GTCCGC
GAAACATGGTTTGCGTACGGGCGCCGGGGCCGAACTGGCCAGTAGTGCGGCCCCAACCGGGTGGACCCATGCCAGGAAGAGTAGAGCAATGAAAGTCTGGG
+
CCCFFFFFHHHHHJJJHIIHGGJJJJIJJHHFDDDDDDDDDDDDDFDDDDDDDDDDDDDBDBBDDDDDDDCDDDDADDDDDACDDDDDDDDCDDDCCACCC

The different parts are:

@read_id
sequence
+
quality_string

The quality string specifies confidence of the base call, that a measure of the error probability of this base. Quality scores are encoded on a log-scale and represented as ASCII characters. The most commonly used scale is Q = char(33+q) where $$ q = -10\log_{10} p $$ and \(p\) is the probability of miscall. Hence a base that the sequencer thinks is 99.9% correct will have a \(q=30\) and be encoded as character ?. Depending on the sequencing technology, these quality score can be fairly reliable measures of sequencing quality, or only rough indicators.

Understanding fastq files

Let's look into a fastq file. I will do this in python, but if you are more familiar with another language, that is fine too. Open ipython and copy/paste

:::python
# import the relevant libraries
from Bio import SeqIO
import gzip

#this specifies one particular file of gzippped reads in fastq format
fname = '/home/qbiodata/morbidostat/PA83/v02/d04/PA83v02d04_AGTCAA_L002_R1_001.fastq.gz'

# open the file with the reads and call a routine that knows how to interpret the file
read_file = gzip.open(fname)
reads = SeqIO.parse(read_file, 'fastq')

read = reads.next()

The variable read now contains the first read in the file. Type

in  [] read.seq
out [] Seq('AGCAGGCTCATCGCCCGCCCTCGTCGCACCTGTTCCAGCAACGCCTCGAAGGTC...TGT', SingleLetterAlphabet())

in  [] read.letter_annotations['phred_quality']
out [] [.... long list of numbers... ]

To plot the phred scores along the read, do (you need to have logged in with ssh -X to be able to open graphs)

# import a plotting tool box
import matplotlib.pyplot as plt
plt.ion()

plt.plot(read.letter_annotations['phred_quality'])

A histogram of read quality

#!python
from Bio import SeqIO
import gzip
import numpy as np

fname = '/home/qbiodata/morbidostat/PA83/v02/d04/PA83v02d04_AGTCAA_L002_R1_001.fastq.gz'
read_length = 101
QMatrix = np.zeros((read_length, 45))
ind = np.arange(read_length)
max_reads=100000

read_file = gzip.open(fname)
for ri, read in enumerate(SeqIO.parse(read_file, 'fastq')):
  QMatrix[ind, read.letter_annotations['phred_quality']]+=1
  if ri>max_reads:
    break

read_file.close()

import matplotlib.pyplot as plt
plt.imshow(QMatrix.T, interpolation='nearest')

Read trimming

Sequencing reads have problems. They often don't have high quality along the entire length and it is often a good idea to clean away the low quality part. Illumina reads often have additional problems since there is the possibility of reading into the adapters used to attach the reads to the flow cell and start the sequencing reaction. There are software packages to deal with this problem. We have used trim_galore in the past. This is essentially a wrapper around cutadapt and FastQC. The following will run trim_galore in paired-end mode throwing out all reads that are shorter than 70 bases after trimming bases with a quality score below 30. The input files are read?_sample.fastq.

trim_galore --paired --length 70 --quality 30 read1_sample.fastq read2_sample.fastq

This is pretty drastic filtering and you might often want to relax the quality criterion. trim_galore outputs to following files:

-rw-r--r-- 1 qbiodata qbiokitp  63K Jul 26 10:43 read1_sample.fastq
-rw-r--r-- 1 qbiodata qbiokitp 1.6K Jul 26 11:06 read1_sample.fastq_trimming_report.txt
-rw-r--r-- 1 qbiodata qbiokitp  34K Jul 26 11:06 read1_sample_val_1.fq
-rw-r--r-- 1 qbiodata qbiokitp  63K Jul 26 10:51 read2_sample.fastq
-rw-r--r-- 1 qbiodata qbiokitp 1.8K Jul 26 11:06 read2_sample.fastq_trimming_report.txt
-rw-r--r-- 1 qbiodata qbiokitp  34K Jul 26 11:06 read2_sample_val_2.fq

You see that the *.val_?.fq are about half the size of the original files.

Mapping & Assembly

Most sequencers deliver only short DNA fragments and/or make mistakes. Hence the output of the sequencer is quite far from what you'd like to have in the end and substantial post-processing is necessary. The two main strategies to analyze these data further are (i) mapping to reference genomes or (ii) assembly of all the little reads into longer parts. Mapping is the method of choice if a close, reliable reference sequence exists and the emphasis is on single nucleotide changes or small indels. Assembly requires longer reads, more coverage, and more computing power but if successful give a genome where there was none before.

Mapping

In mapping, one piles up sequencing read on their homologous positions of a reference genome. Mapping programs typically generate an index of the reference genome first and then use this index to rapidly find aligment of the reads on the reference genome. The positions can be found in a time that logarithmic in the size of the genome and large data sets can be mapped against large genomes very rapidly. But many-fold repeated regions can slow the mapping process and mapping doesn't allow a large number of differences between the query and the reference. Program that can be used for mapping

  • BLAST (nobody would do that, but possible)
  • bwa (Burrow-Wheeler Aligner): fast, gap-aware, limited edit distance, but now has a nanopore mode.
  • bowtie: fast, doesn't handle gaps
  • LAST: Deals well with repeats
  • SOAP: GPU enabled and fast
  • stampy: slow, but works well with indels and high divergence

Mapping output

Mapping program typically produce output files in sam format, which is compact way to specify where and how each read aligned (mapped) to the reference. Samfiles typically start with a header the contains the names and lengths of the different pieces (chromosomes, contigs, etc) the reference came in. In our case, this is a bacterial genome in three contigs and a plasmid.

@SQ SN:gnl|DSMZ|PA77_HGAP3_2SMC_c1  LN:3694493
@SQ SN:gnl|DSMZ|PA77_HGAP3_2SMC_c2  LN:2296629
@SQ SN:gnl|DSMZ|PA77_HGAP3_2SMC_c3  LN:994613
@SQ SN:gnl|DSMZ|PA77_HGAP3_2SMC_plsm  LN:39882

The alignment of individual reads are then specified as

HISEQ:229:C6N64ACXX:4:1101:16569:2194 83  gnl|DSMZ|PA77_HGAP3_2SMC_c3 17249 0 66M1D35M  = 17014 -337  TTCTGCATGGCGTGCCGCAGATTCTGATGATGGACCCAGGTTCTGCCAACACCTCGGCCATGGCCCGAACCTTTGCCGTGCGCTGCGCATCCGCGTCATCG  ########?5<DB>3DCDDCDDDC>:CCCACBCADDBC?DC<DCCDB@>8-88DDDADCDDDBCDEDGHGHEAJJIFJJIJIJJJJIIGHGHHFFDDF@@C  NM:i:1  AS:i:94 XS:i:94
HISEQ:229:C6N64ACXX:4:1101:16569:2194 163 gnl|DSMZ|PA77_HGAP3_2SMC_c3 17014 0 84M17S  = 17249 337 CTCAAGCCCGGCGCCGATGAGCACGGTAACGGCCTGCGCGTNNTGGAGCATGACCAGTTCTACAAGAACAAGCCGAAGAACGTGNNNNNNNNNNCNTCTAA  C@CFFFFFHHGHGJJJJJJIIGIJIJFHIIJIJHHHHFFD?##,5<?@CDDDDDDDDCDDEEDDDDDDDDDDBDDBBDDDDDBB#################  NM:i:2  AS:i:80 XS:i:80

This is pretty messy, so lets clean this up a bit

read_id 83  chr3 17249 0 66M1D35M  = 17014 -337  TTCTGCATGGCGTGC...  ########?5<DB>3D... NM:i:1  AS:i:94 XS:i:94
read_id 163 chr3 17014 0 84M17S  = 17249 337 CTCAAGCCCGGCGC...  C@CFFFFFHHGHGJJJJJ...  NM:i:2  AS:i:80 XS:i:80

The two reads shown are actually a read pair, meaning they derive from the same DNA molecule. The first entry is the read ID, the second is a (complicated flag), the third is the mapping position (base 17249), the fifth is the so called CIGAR code that specifies the aligment to the reference sequence. In this case, it specifies a 66 base match (66M), a one base deletion (1D), and another 35 base match (35M). The number 17014 is the location of the paired read and the -337 is the insert size, that is the length of the piece of DNA that went into this read. The remainder is the sequence and the quality score. The full format specification can be found here Note that the samfile contains all the information from the fastq file and information on where the read mapped.

We won't go into the details of how the mapping and alignment is done algorithmically. Most programs initally generate an index of the reference genome (a structure that allows rapid look-up of the positions of kmers), find seeds in reads using the index, and do a local pairwise aligment.

Insert size distribution

The insert is the piece of DNA flanked by sequencing adapters on your flow cell. After mapping paired end reads, we can look at the size of inserts by the distance between the start of the read1 and 2 on the reference

#!python
import pysam
import numpy as np

max_reads=10000
sam_fname = "d00v00.sam"
isizes = []
with pysam.Samfile(sam_fname) as samfile:
    # Iterate over single reads
    for i, read in enumerate(samfile):
        if read.is_read1:
            isizes.append(read.isize)
        if i>max_reads:
            break


import matplotlib.pyplot as plt
plt.plot(sorted(np.abs(isizes)), np.linspace(0,1,len(isizes)))
plt.xlim(0,700)
plt.xlabel("insert size")
plt.ylabel("fraction shorter than X")

Custom pile-up

We often use a small little function to generate pile-ups in python. This is probably not the fastest nor the most standard way of doing it, but it works and lets you inspect how often you get ACGT-N at every position of the reference and how often you get small insertions. It will also split these counts by read1, read2, and reads that map to the forward and reverse strands.

#!python

# cp /home/qbio1/201707_pile_up.py .
def sam_to_pile_up(sam_fname, paired=False, qual_min=30, max_reads=-1, VERBOSE = 0):
    '''
    calculates a pile up for a set of mapped reads
    parameters:
    sam_fname   --   sam or bam file with mapped reads
    paired      --   differentiates between read one or two if True
                     otherwise the two reads are lumped together
    qual_min    --   Ignore bases with quality less than qmin
    '''
    import numpy as np
    import pysam
    from collections import defaultdict

    alpha =np.fromstring('ACGT-N', dtype='S1')
    def ac_array(length, paired):
        if paired:
            return np.zeros((2,2,6,length), dtype =int)
        else:
            return np.zeros((2,6,length), dtype = int)

    # Note: the data structure for inserts is a nested dict with:
    # position --> string --> read type --> count
    #  (dict)      (dict)       (list)      (int)
    def insertion_data_structure(paired):
        if paired:
            return defaultdict(lambda: defaultdict(lambda: np.zeros((2,2), int)))
        else:
            return defaultdict(lambda: defaultdict(lambda: np.zeros(2, int)))


    # Open BAM or SAM file
    with pysam.Samfile(sam_fname) as samfile:
        ac =  []
        for nref in xrange(samfile.nreferences):
            if VERBOSE: print "allocating for:", samfile.getrname(nref), "length:", samfile.lengths[nref]
            ac.append((samfile.getrname(nref), ac_array(samfile.lengths[nref], paired),
                        insertion_data_structure(paired)))

        # Iterate over single reads
        for i, read in enumerate(samfile):
            # Max number of reads
            if i == max_reads:
                if VERBOSE >= 2:
                    print 'Max reads reached:', max_reads
                break

            if read.is_unmapped: continue
            # Print output
            if (VERBOSE > 2) and (not ((i +1) % 10000)):
                print (i+1)

            # create short cuts to the nucleotide count and insertion data structures
            rev = int(read.is_reverse)
            if paired:
                r1 = int(read.is_read2)
                counts = ac[read.rname][1][r1, rev]
                insertion = ac[read.rname][2]
            else:
                counts = ac[read.rname][1][rev]
                insertion = ac[read.rname][2]

            seq = np.fromstring(read.seq, 'S1')
            qual = np.fromstring(read.qual, np.int8) - 33
            pos = read.pos
            # Iterate over blocks in the CIGAR codes
            for ic, (block_type, block_len) in enumerate(read.cigar):
                if block_type==4: # softclip
                    seq = seq[block_len:]
                    qual = qual[block_len:]
                    continue
                if block_type==5: # hard clip
                    continue

                # Match
                if block_type == 0:
                    seqb = seq[:block_len]
                    qualb = qual[:block_len]
                    # Increment counts for possible nucleotide
                    for j, a in enumerate(alpha):
                        posa = ((seqb == a) & (qualb >= qual_min)).nonzero()[0]
                        if len(posa):
                            counts[j,pos + posa] += 1

                    # Chop off this block, advance reference position
                    if ic != len(read.cigar) - 1:
                        seq = seq[block_len:]
                        qual = qual[block_len:]
                        pos += block_len

                # Deletion
                elif block_type == 2:
                    # Increment gap counts
                    counts[4, pos:pos + block_len] += 1
                    # advance reference, but not read
                    pos += block_len

                # Insertion
                # an insert @ pos 391 means that seq[:391] is BEFORE the insert,
                # THEN the insert, FINALLY comes seq[391:]
                elif block_type == 1:
                    seqb = seq[:block_len]
                    qualb = qual[:block_len]
                    # Accept only high-quality inserts
                    if (qualb >= qual_min).all():
                        if paired:
                            insertion[pos][seqb.tostring()][r1, rev] += 1
                        else:
                            insertion[pos][seqb.tostring()][rev] += 1

                    # Chop off seq, but not pos
                    if ic != len(read.cigar) - 1:
                        seq = seq[block_len:]
                        qual = qual[block_len:]

                # Other types of cigar?
                else:
                    if VERBOSE>2:
                        print "unrecognized CIGAR type:", read.cigarstring
                    #raise ValueError('CIGAR type '+str(block_type)+' not recognized')

    return ac


if __name__ == '__main__':
    fname = 'd00v00.sam'
    pile_up = sam_to_pile_up(fname, paired=False, qual_min=20, max_reads=100000)

Assembly

The processes of stitching together many small reads into larger contiguous sequences is called assembly. The main algorithmic strategies to assemble genomes are de Buijn graphs (where nodes are kmers) and overlap graphs (where nodes are reads).

Popular assembly programs include

  • canu: a fork of the celera assembler for long-read only assembly
  • spades: does hybrid assembly

UCSB computing enviroment

USCB generously allowed us to use their HPC environment for this course. It uses the PBS queing system -- you can find their quick start guide here.

To submit a job, you typically specify a small script that specifies the resources the job needs and starts the program that does the work.

#!/bin/bash
# specify number of nodes and cores per node
#PBS -l nodes=1:ppn=1
# specify the time you expect the job to run hh:mm:ss
#PBS -l walltime=2:00:00

# you executable
echo "hello world", $1

This would then be called as

qsub submit.sh -F random_output

where the additional argument "random_output" will be sustituted for the "$1" in the script. Multiple arguments need to be enclosed by quotation marks. To check the status of your jobs, type

qstat -u username

For the different tasks that we will run during this course, we have to generate analogous scripts that submit the appropriate jobs.


Published

Jul 24, 2017

Category

teaching

Tags

  • bioinformatics 39
  • command line 2
  • Imprint
  • Powered by Pelican. Theme based on: Elegant by Talha Mansoor