Nicholas Noll, Marco Molari and Richard Neher
bioRxiv, 2022.02.24.481757, 2022
The genomic diversity of microbes is commonly parameterized as population genetic polymorphisms relative to a reference genome of a well-characterized, but arbitrary, isolate. Reference genomes contain only a fraction of the microbial pangenome, the set of genes observed across all isolates of a given species, and are thus blind to both the dynamics of the accessory genome and variation in gene order and copy number. With the widespread use of long-read sequencing, the number of high-quality, complete genome assemblies has increased dramatically. Traditional computational approaches to whole-genome analysis either scale poorly or treat genomes as dissociated bags of genes, and are thus not suited to this new era. Here, we present PanGraph, a Julia-based library and command line interface for aligning whole genomes into a graph, wherein each genome is represented as an undirected path along vertices, which in turn encapsulate homologous multiple sequence alignments. The resultant data structure succinctly summarizes population-level nucleotide and structural polymorphisms and can be exported into several common formats for either downstream analysis or immediate visualization.
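
The graph representation described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not PanGraph's Julia API: the names `Block`, `Path`, `reconstruct` and `accessory_blocks` are hypothetical, each vertex stores a single consensus string as a stand-in for a full multiple sequence alignment, and the per-step strand flag is an assumption about how a block might be traversed in either orientation.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Block:
    """Hypothetical vertex: one homologous stretch shared by some genomes."""
    id: str
    consensus: str  # stand-in for the block's multiple sequence alignment


@dataclass
class Path:
    """Hypothetical genome: an ordered walk over blocks."""
    name: str
    steps: list  # (block_id, forward) tuples; forward=False means reverse strand


def reconstruct(path: Path, blocks: dict) -> str:
    """Spell out a genome by concatenating its blocks in path order."""
    parts = []
    for block_id, forward in path.steps:
        seq = blocks[block_id].consensus
        parts.append(seq if forward else reverse_complement(seq))
    return "".join(parts)


def accessory_blocks(paths: list) -> set:
    """Block ids present in at least one genome but not in all of them."""
    per_genome = [{block_id for block_id, _ in p.steps} for p in paths]
    return set.union(*per_genome) - set.intersection(*per_genome)


# Toy data: three blocks, two genomes. The second genome lacks block B
# and carries block C in the inverted orientation.
blocks = {b.id: b for b in (Block("A", "ATGC"), Block("B", "GGA"), Block("C", "TTAC"))}
genome_1 = Path("isolate_1", [("A", True), ("B", True), ("C", True)])
genome_2 = Path("isolate_2", [("A", True), ("C", False)])
```

In this toy example, the missing block B and the inverted block C are structural differences that are visible directly in the two paths, without comparing either genome to a reference.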