While augur was initially designed to process viral sequence data, we have extended it to work with bacterial data. The genome evolution and spread of bacteria differ from those of viruses in several important ways:
- genomes are much larger, but only a small fraction of sites are variable
- bacteria often gain or lose genes via horizontal transfer
- drug resistance determinants and virulence factors often reside on plasmids or other mobile elements.
The latter two complications are not an issue for Mycobacterium tuberculosis (MTb), so we will focus our analysis here on MTb. The biggest difference in the analysis workflow is that sequence data typically doesn't come as a collection of sequences in FASTA format, but as a list of differences relative to a reference sequence. These differences are typically stored in VCF format. The basic structure of a VCF file is the following:

    #CHROM  POS  ID  REF  ALT  QUAL  FILTER  sample1  sample2  sample3
    1       143  x   G    A    29    PASS    0        1        0
    1       173  y   T    A    3     PASS    0        0        1
    ...

Each row corresponds to a variable position and specifies the reference allele (REF) and the variant allele (ALT). Each sample corresponds to a column that indicates whether the sample has the reference allele (0) or the variant allele (1) at this position. (A real VCF file is considerably more complicated than this example: it starts with meta-information header lines and carries additional quality and filtering information.)
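To make the "list of differences" idea concrete, here is a minimal Python sketch that reads the simplified table shown above and collects, for each sample, the positions where it carries the variant allele. It is not a spec-compliant VCF parser and is not how augur reads VCF files. The function names `variants_per_sample` and `apply_variants` are hypothetical, and the sketch assumes only single-base substitutions and the simplified seven fixed columns of the example.

```python
# Simplified VCF-like table from the example (not a spec-compliant VCF).
EXAMPLE = """\
#CHROM POS ID REF ALT QUAL FILTER sample1 sample2 sample3
1 143 x G A 29 PASS 0 1 0
1 173 y T A 3 PASS 0 0 1
"""

# Assumption: the simplified layout has 7 fixed columns before the samples
# (CHROM, POS, ID, REF, ALT, QUAL, FILTER). Real VCF files have more.
N_FIXED = 7


def variants_per_sample(text):
    """Return {sample: {position: alt_allele}} for the simplified table."""
    samples = []
    calls = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "#CHROM":
            # Header row: everything after the fixed columns is a sample name.
            samples = fields[N_FIXED:]
            calls = {name: {} for name in samples}
            continue
        if fields[0].startswith("#"):
            continue  # skip any other comment line
        pos, alt = int(fields[1]), fields[4]
        for name, genotype in zip(samples, fields[N_FIXED:]):
            if genotype == "1":  # "1" means the sample has the ALT allele
                calls[name][pos] = alt
    return calls


def apply_variants(reference, variants):
    """Substitute single-base variants (1-based positions) into a reference."""
    seq = list(reference)
    for pos, alt in variants.items():
        seq[pos - 1] = alt
    return "".join(seq)


calls = variants_per_sample(EXAMPLE)
print(calls)
# {'sample1': {}, 'sample2': {143: 'A'}, 'sample3': {173: 'A'}}
```

Given a reference sequence as a string, `apply_variants(reference, calls["sample2"])` would return that sample's full sequence, which illustrates why storing only the differences is compact when few sites vary.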
Apart from the different input files, the basic augur workflow is largely the same for bacteria and viruses. A step-by-step tutorial for MTb data is available on the nextstrain website.