the command line
Many program used to process and analyze sequence data are called from the command line. These program don't usually have fancy graphical user interfaces but get the necessary input as parameters of the command by which they are called. Working with the command line can be much more efficient for many task, so its a worthwhile thing to learn. There are several good introductary courses out there. Give them a go, ask if you get stuck
- Very basic interactive intro by Codeacademy
- A tutorial by David Baumgold covers the basics
- More serious (and laborious) tutorial LinuxCommand
- Cheat-sheets like this can come in handy
Sequencing data
Sequencing machines read some type of signal (usually light or current) from which they infer the sequence. With nanopore, we have some window into this but most of the time you'll get a text files with base calls in fastq format. This looks like this:
@HISEQ:229:C6N64ACXX:4:1101:7886:2578 1:N:0:GTCCGC
GAAACATGGTTTGCGTACGGGCGCCGGGGCCGAACTGGCCAGTAGTGCGGCCCCAACCGGGTGGACCCATGCCAGGAAGAGTAGAGCAATGAAAGTCTGGG
+
CCCFFFFFHHHHHJJJHIIHGGJJJJIJJHHFDDDDDDDDDDDDDFDDDDDDDDDDDDDBDBBDDDDDDDCDDDDADDDDDACDDDDDDDDCDDDCCACCC
The different parts are:
@read_id
sequence
+
quality_string
The quality string specifies confidence
of the base call, that a measure of the error probability of this base. Quality scores are encoded on a log-scale and represented as ASCII characters.
The most commonly used scale is Q = char(33+q) where
$$
q = -10\log_{10} p
$$
and \(p\) is the probability of miscall.
Hence a base that the sequencer thinks is 99.9% correct will have a \(q=30\) and be encoded as character ?
.
Depending on the sequencing technology, these quality score can be fairly reliable measures of sequencing quality, or only rough indicators.
Understanding fastq files
Let's look into a fastq file. I will do this in python, but if you are more familiar with another language, that is fine too. Open ipython
and copy/paste
:::python
# import the relevant libraries
from Bio import SeqIO
import gzip
#this specifies one particular file of gzippped reads in fastq format
fname = '/home/qbiodata/morbidostat/PA83/v02/d04/PA83v02d04_AGTCAA_L002_R1_001.fastq.gz'
# open the file with the reads and call a routine that knows how to interpret the file
read_file = gzip.open(fname)
reads = SeqIO.parse(read_file, 'fastq')
read = reads.next()
The variable read
now contains the first read in the file.
Type
in [] read.seq
out [] Seq('AGCAGGCTCATCGCCCGCCCTCGTCGCACCTGTTCCAGCAACGCCTCGAAGGTC...TGT', SingleLetterAlphabet())
in [] read.letter_annotations['phred_quality']
out [] [.... long list of numbers... ]
To plot the phred scores along the read, do (you need to have logged in with ssh -X
to be able to open graphs)
# import a plotting tool box
import matplotlib.pyplot as plt
plt.ion()
plt.plot(read.letter_annotations['phred_quality'])
A histogram of read quality
#!python
from Bio import SeqIO
import gzip
import numpy as np
fname = '/home/qbiodata/morbidostat/PA83/v02/d04/PA83v02d04_AGTCAA_L002_R1_001.fastq.gz'
read_length = 101
QMatrix = np.zeros((read_length, 45))
ind = np.arange(read_length)
max_reads=100000
read_file = gzip.open(fname)
for ri, read in enumerate(SeqIO.parse(read_file, 'fastq')):
QMatrix[ind, read.letter_annotations['phred_quality']]+=1
if ri>max_reads:
break
read_file.close()
import matplotlib.pyplot as plt
plt.imshow(QMatrix.T, interpolation='nearest')
Read trimming
Sequencing reads have problems.
They often don't have high quality along the entire length and it is often a good idea to clean away the low quality part.
Illumina reads often have additional problems since there is the possibility of reading into the adapters used to attach the reads to the flow cell and start the sequencing reaction.
There are software packages to deal with this problem.
We have used trim_galore
in the past.
This is essentially a wrapper around cutadapt
and FastQC
.
The following will run trim_galore
in paired-end mode throwing out all reads that are shorter than 70 bases after trimming bases with a quality score below 30.
The input files are read?_sample.fastq
.
trim_galore --paired --length 70 --quality 30 read1_sample.fastq read2_sample.fastq
This is pretty drastic filtering and you might often want to relax the quality criterion.
trim_galore
outputs to following files:
-rw-r--r-- 1 qbiodata qbiokitp 63K Jul 26 10:43 read1_sample.fastq
-rw-r--r-- 1 qbiodata qbiokitp 1.6K Jul 26 11:06 read1_sample.fastq_trimming_report.txt
-rw-r--r-- 1 qbiodata qbiokitp 34K Jul 26 11:06 read1_sample_val_1.fq
-rw-r--r-- 1 qbiodata qbiokitp 63K Jul 26 10:51 read2_sample.fastq
-rw-r--r-- 1 qbiodata qbiokitp 1.8K Jul 26 11:06 read2_sample.fastq_trimming_report.txt
-rw-r--r-- 1 qbiodata qbiokitp 34K Jul 26 11:06 read2_sample_val_2.fq
You see that the *.val_?.fq
are about half the size of the original files.
Mapping & Assembly
Most sequencers deliver only short DNA fragments and/or make mistakes. Hence the output of the sequencer is quite far from what you'd like to have in the end and substantial post-processing is necessary. The two main strategies to analyze these data further are (i) mapping to reference genomes or (ii) assembly of all the little reads into longer parts. Mapping is the method of choice if a close, reliable reference sequence exists and the emphasis is on single nucleotide changes or small indels. Assembly requires longer reads, more coverage, and more computing power but if successful give a genome where there was none before.
Mapping
In mapping, one piles up sequencing read on their homologous positions of a reference genome. Mapping programs typically generate an index of the reference genome first and then use this index to rapidly find aligment of the reads on the reference genome. The positions can be found in a time that logarithmic in the size of the genome and large data sets can be mapped against large genomes very rapidly. But many-fold repeated regions can slow the mapping process and mapping doesn't allow a large number of differences between the query and the reference. Program that can be used for mapping
- BLAST (nobody would do that, but possible)
- bwa (Burrow-Wheeler Aligner): fast, gap-aware, limited edit distance, but now has a nanopore mode.
- bowtie: fast, doesn't handle gaps
- LAST: Deals well with repeats
- SOAP: GPU enabled and fast
- stampy: slow, but works well with indels and high divergence
Mapping output
Mapping program typically produce output files in sam
format, which is compact way to specify where and how each read aligned (mapped) to the reference.
Samfiles typically start with a header the contains the names and lengths of the different pieces (chromosomes, contigs, etc) the reference came in.
In our case, this is a bacterial genome in three contigs and a plasmid.
@SQ SN:gnl|DSMZ|PA77_HGAP3_2SMC_c1 LN:3694493
@SQ SN:gnl|DSMZ|PA77_HGAP3_2SMC_c2 LN:2296629
@SQ SN:gnl|DSMZ|PA77_HGAP3_2SMC_c3 LN:994613
@SQ SN:gnl|DSMZ|PA77_HGAP3_2SMC_plsm LN:39882
The alignment of individual reads are then specified as
HISEQ:229:C6N64ACXX:4:1101:16569:2194 83 gnl|DSMZ|PA77_HGAP3_2SMC_c3 17249 0 66M1D35M = 17014 -337 TTCTGCATGGCGTGCCGCAGATTCTGATGATGGACCCAGGTTCTGCCAACACCTCGGCCATGGCCCGAACCTTTGCCGTGCGCTGCGCATCCGCGTCATCG ########?5<DB>3DCDDCDDDC>:CCCACBCADDBC?DC<DCCDB@>8-88DDDADCDDDBCDEDGHGHEAJJIFJJIJIJJJJIIGHGHHFFDDF@@C NM:i:1 AS:i:94 XS:i:94
HISEQ:229:C6N64ACXX:4:1101:16569:2194 163 gnl|DSMZ|PA77_HGAP3_2SMC_c3 17014 0 84M17S = 17249 337 CTCAAGCCCGGCGCCGATGAGCACGGTAACGGCCTGCGCGTNNTGGAGCATGACCAGTTCTACAAGAACAAGCCGAAGAACGTGNNNNNNNNNNCNTCTAA C@CFFFFFHHGHGJJJJJJIIGIJIJFHIIJIJHHHHFFD?##,5<?@CDDDDDDDDCDDEEDDDDDDDDDDBDDBBDDDDDBB################# NM:i:2 AS:i:80 XS:i:80
This is pretty messy, so lets clean this up a bit
read_id 83 chr3 17249 0 66M1D35M = 17014 -337 TTCTGCATGGCGTGC... ########?5<DB>3D... NM:i:1 AS:i:94 XS:i:94
read_id 163 chr3 17014 0 84M17S = 17249 337 CTCAAGCCCGGCGC... C@CFFFFFHHGHGJJJJJ... NM:i:2 AS:i:80 XS:i:80
The two reads shown are actually a read pair, meaning they derive from the same DNA molecule. The first entry is the read ID, the second is a (complicated flag), the third is the mapping position (base 17249), the fifth is the so called CIGAR code that specifies the aligment to the reference sequence. In this case, it specifies a 66 base match (66M), a one base deletion (1D), and another 35 base match (35M). The number 17014 is the location of the paired read and the -337 is the insert size, that is the length of the piece of DNA that went into this read. The remainder is the sequence and the quality score. The full format specification can be found here Note that the samfile contains all the information from the fastq file and information on where the read mapped.
We won't go into the details of how the mapping and alignment is done algorithmically. Most programs initally generate an index of the reference genome (a structure that allows rapid look-up of the positions of kmers), find seeds in reads using the index, and do a local pairwise aligment.
Insert size distribution
The insert is the piece of DNA flanked by sequencing adapters on your flow cell. After mapping paired end reads, we can look at the size of inserts by the distance between the start of the read1 and 2 on the reference
#!python
import pysam
import numpy as np
max_reads=10000
sam_fname = "d00v00.sam"
isizes = []
with pysam.Samfile(sam_fname) as samfile:
# Iterate over single reads
for i, read in enumerate(samfile):
if read.is_read1:
isizes.append(read.isize)
if i>max_reads:
break
import matplotlib.pyplot as plt
plt.plot(sorted(np.abs(isizes)), np.linspace(0,1,len(isizes)))
plt.xlim(0,700)
plt.xlabel("insert size")
plt.ylabel("fraction shorter than X")
Custom pile-up
We often use a small little function to generate pile-ups in python.
This is probably not the fastest nor the most standard way of doing it, but it works and lets you inspect how often you get ACGT-N
at every position of the reference and how often you get small insertions.
It will also split these counts by read1
, read2
, and reads that map to the forward
and reverse
strands.
#!python
# cp /home/qbio1/201707_pile_up.py .
def sam_to_pile_up(sam_fname, paired=False, qual_min=30, max_reads=-1, VERBOSE = 0):
'''
calculates a pile up for a set of mapped reads
parameters:
sam_fname -- sam or bam file with mapped reads
paired -- differentiates between read one or two if True
otherwise the two reads are lumped together
qual_min -- Ignore bases with quality less than qmin
'''
import numpy as np
import pysam
from collections import defaultdict
alpha =np.fromstring('ACGT-N', dtype='S1')
def ac_array(length, paired):
if paired:
return np.zeros((2,2,6,length), dtype =int)
else:
return np.zeros((2,6,length), dtype = int)
# Note: the data structure for inserts is a nested dict with:
# position --> string --> read type --> count
# (dict) (dict) (list) (int)
def insertion_data_structure(paired):
if paired:
return defaultdict(lambda: defaultdict(lambda: np.zeros((2,2), int)))
else:
return defaultdict(lambda: defaultdict(lambda: np.zeros(2, int)))
# Open BAM or SAM file
with pysam.Samfile(sam_fname) as samfile:
ac = []
for nref in xrange(samfile.nreferences):
if VERBOSE: print "allocating for:", samfile.getrname(nref), "length:", samfile.lengths[nref]
ac.append((samfile.getrname(nref), ac_array(samfile.lengths[nref], paired),
insertion_data_structure(paired)))
# Iterate over single reads
for i, read in enumerate(samfile):
# Max number of reads
if i == max_reads:
if VERBOSE >= 2:
print 'Max reads reached:', max_reads
break
if read.is_unmapped: continue
# Print output
if (VERBOSE > 2) and (not ((i +1) % 10000)):
print (i+1)
# create short cuts to the nucleotide count and insertion data structures
rev = int(read.is_reverse)
if paired:
r1 = int(read.is_read2)
counts = ac[read.rname][1][r1, rev]
insertion = ac[read.rname][2]
else:
counts = ac[read.rname][1][rev]
insertion = ac[read.rname][2]
seq = np.fromstring(read.seq, 'S1')
qual = np.fromstring(read.qual, np.int8) - 33
pos = read.pos
# Iterate over blocks in the CIGAR codes
for ic, (block_type, block_len) in enumerate(read.cigar):
if block_type==4: # softclip
seq = seq[block_len:]
qual = qual[block_len:]
continue
if block_type==5: # hard clip
continue
# Match
if block_type == 0:
seqb = seq[:block_len]
qualb = qual[:block_len]
# Increment counts for possible nucleotide
for j, a in enumerate(alpha):
posa = ((seqb == a) & (qualb >= qual_min)).nonzero()[0]
if len(posa):
counts[j,pos + posa] += 1
# Chop off this block, advance reference position
if ic != len(read.cigar) - 1:
seq = seq[block_len:]
qual = qual[block_len:]
pos += block_len
# Deletion
elif block_type == 2:
# Increment gap counts
counts[4, pos:pos + block_len] += 1
# advance reference, but not read
pos += block_len
# Insertion
# an insert @ pos 391 means that seq[:391] is BEFORE the insert,
# THEN the insert, FINALLY comes seq[391:]
elif block_type == 1:
seqb = seq[:block_len]
qualb = qual[:block_len]
# Accept only high-quality inserts
if (qualb >= qual_min).all():
if paired:
insertion[pos][seqb.tostring()][r1, rev] += 1
else:
insertion[pos][seqb.tostring()][rev] += 1
# Chop off seq, but not pos
if ic != len(read.cigar) - 1:
seq = seq[block_len:]
qual = qual[block_len:]
# Other types of cigar?
else:
if VERBOSE>2:
print "unrecognized CIGAR type:", read.cigarstring
#raise ValueError('CIGAR type '+str(block_type)+' not recognized')
return ac
if __name__ == '__main__':
fname = 'd00v00.sam'
pile_up = sam_to_pile_up(fname, paired=False, qual_min=20, max_reads=100000)
Assembly
The processes of stitching together many small reads into larger contiguous sequences is called assembly. The main algorithmic strategies to assemble genomes are de Buijn graphs (where nodes are kmers) and overlap graphs (where nodes are reads).
Popular assembly programs include
UCSB computing enviroment
USCB generously allowed us to use their HPC environment for this course. It uses the PBS queing system -- you can find their quick start guide here.
To submit a job, you typically specify a small script that specifies the resources the job needs and starts the program that does the work.
#!/bin/bash
# specify number of nodes and cores per node
#PBS -l nodes=1:ppn=1
# specify the time you expect the job to run hh:mm:ss
#PBS -l walltime=2:00:00
# you executable
echo "hello world", $1
This would then be called as
qsub submit.sh -F random_output
where the additional argument "random_output" will be sustituted for the "$1" in the script. Multiple arguments need to be enclosed by quotation marks. To check the status of your jobs, type
qstat -u username
For the different tasks that we will run during this course, we have to generate analogous scripts that submit the appropriate jobs.