Probabilistic inference in population genetics
The many sequenced genomes that are now available allow inference of past population processes from genetic diversity. Population processes of interest include selection of novel variants that help populations to adapt to new enviroments, severe reductions in population size (so called population bottlenecks), and migration. The objectives of this lecture are to
- introduce basic concepts of population genetic inference
- set up probabilistic models of variant frequencies
- solve branching processes using generating functions and the method of characteristics
Evidently, all of this will have to remain quite superficial but I am hoping to give you some flavor of what population genetics is about and what kind of questions can be addressed with sequence data.
As genomes are passed on from parents to children, they are reshuffled by recombination and altered by mutation, and the composition of the gene pool changes since some variants are more successful than others -- either by chance or because of their intrinisic properties. Understanding these processes is essential in order to use genetic data for inference. Genetic data typically comes in the form of sequences of whole genomes or a particular gene from different individuals and these sequences differ at a few positions. These differences originate from mutations that happened at some point the past. When a group of individuals shares a particular mutation, this mutations could have either happened in the common ancestor of this group of individuals, or this mutation could have arisen multiple times. When mutated sites are a minority (humans differ at 1 in 1000 sites), a shared mutations typically arose in the ancestor of this group of individuals. Some type of mutations, for example drug resistance mutations often arise independently under treatment. We will assume for now that the great majority of mutations is not of that type. In this case, mutations -- or more generally the similarity between sequences -- can be used to reconstruct the a tree that describes the history of the sequences in the sample:
Population genetics is concerned with understanding the link between mutation, selection, etc and genetic diversity in the population.This link is essential for inference.
The neutral background model
In absence of any systematic differences between different variants, their relative frequencies will nevertheless change by chance. The stochastic element stems from the randomness in the number of offspring each individual has. This process is typically modeled in two complementary but equivalent ways -- either via a "forward in time" diffusion process of variants or a "backward in time" genealogical process called the coalescent. It is instructive to consider them both.
Coalescent models
The coalescent starts from a sample of individuals and asks "What is the probability that a given pair are siblings, i.e., have the same parent?". In a population of constant size (average number of offspring \(m=1\)), it is easy to show that this probability is \(\frac{\sigma^2}{N}\). This result makes intuitive sense: In a large population, the probability of sharing a parent by change is small. Furthermore, this probability is larger if there are a few large families, that is the offspring number distribution has large variance. If there are \(k\) individuals in the sample, there are \(k(k-1)/2\) possible pairs. Tracing now these \(k\) lineages backward in time, the first merger between a random pair will happen at rate $$ \lambda_k = \frac{k(k-1)\sigma^2}{2N} $$ After the first merger event, the number of lineages decreases from \(k\to k-1\) until eventually only one lineage is left, see the sketch of a tree above. This lineage is the most recent common ancestor (MRCA) of the sample and the expected time to get there is simply the sum of the expectation of the different merger events. $$ \langle T_{MRCA}\rangle = \frac{2N}{\sigma^2}\sum_{k=2}^{n} \frac{1}{k(k-1)} = \frac{2N}{\sigma^2}\left[\frac{1}{2} + \frac{1}{6}+\cdots\right] = \frac{2N(n-1)}{\sigma^2 n} $$ (easy to show by induction -- try it!).
The mutations observed in the sample have accumulated on the tree since the MRCA and genetic diversity is informative about population processes in the past. The averge pairwise difference between sequences, for example, is equal to \(N\mu\) where \(\mu\) is the per site mutation rate. Furthermore, time dependence of the population size can be inferred by fitting a trajectory \(N(t)\) that maximizes the likelihood of the sequence data.
Coalescent processes in population genetics were first introduced by Kingman in 1982.
Diffusion models
A completely equivalent approach to describe genetic diversity in the population is via a forward-in-time diffusion process. If the size of the population is approximately constant, the average number of offspring is evidently \(m=1\) , while variance \(\sigma^2\) depends on the mode of reproduction. Given mean and variance of the offspring number distribution, we can calculate the distribution of the frequency of a variant in the subsequent generation. In a population of size \(N\), the this frequency is going to be centered around its value \(p\) in the previous generation with a variance \(\frac{\sigma^2 p(1-p)}{N}\). In the continuous time lime, the distribution \(P(p,t)\) obeys the diffusion equation $$\frac{\partial}{\partial t} P(p,t) = \frac{\sigma^2}{2N}\frac{\partial^2}{\partial p^2}p(1-p)P(p,t)$$
It is easy to see that this can be solved by separation of variables \(P(p,t) = a(p)b(t)\) which results in
$$\frac{1}{b(t)}\frac{\partial}{\partial t} b(t) = -\lambda = \frac{1}{a(p)}\frac{\sigma^2}{2N}\frac{\partial^2}{\partial p^2}p(1-p)a(p)$$
suggesting that mode decay is exponential with \(b(t) = e^{-\lambda t}\). The equation for \(a(p)\) is related to the hypergeometric equation and its solution can be expressed in terms of Gegenbauer polynomials. Finite solution at \(p=0\) and \(p=1\) requires a discrete spectrum with such that admissable solutions are $$ a_k(p) = \frac{k(k+1)\sigma^2}{N}T_{k}(p) $$ Each of these solutions corresponds to an eigenmode that decays with rate \(\frac{k(k+1)\sigma^2}{N}\) and the general solution is $$ P(p,t) = \frac{\sigma^2}{N}\sum_{k=0} c_k k(k+1) T_{k}(p) e^{-\frac{k(k+1)\sigma^2}{N}} $$ This description of the frequencies of genetic variants in the population was developed by Motoo Kimura and is summarized in Kimura, 1964. Note that the decay rates of the eigenmodes are the same as the rates at which the tree goes from \(k\) to \(k-1\) lineages.
Branching processes and the spread of beneficial mutations
The models introduced above assume that all individuals in the population are completely equivalent, meaning that there are no heritable traits that affect the offspring distribution. Beneficial mutations, i.e., mutations that increase the change of survival, are of particular interest. These mutations typically arise in a single individual and are then passed on in a stochastic manner. It is hence quite uncertain if a beneficial mutation is going to survive. For sake of arguments, lets assume the offspring number distribution is Poisson. Individuals with a new good mutation have a mean number of offspring of \(m = 1+s\), those without the mutation simply 1. The probabibility of loss of the mutation obeys $$ P_0 = \sum_{i=0} p_i P_0^i = e^{-1-s}\sum_{i=0} \frac{((1+s)P_0)^i}{i!} = e^{-(1+s)(1-P_0)} $$ For small \(s\), \(P_0\) is close to 1 and we can expand in \(P_{fix} = 1-P_0\) $$ 1 - P_{fix} = 1 - (1+s)P_{fix} + (1+2s+s^2)P_{fix}^2/2 \Rightarrow sP_{fix}\approx P_{fix}^2/2 $$ and hence \(P_{fix}\approx 2s\). This illustrates that a mutation that confers an advantage of 10% is still lost in 80% of all cases -- evolution is very stochastic!
More generally, stochastic population processes can be modeled with branching processes and it is instructive to solve explicitly for the solution of a branching process. The goal is to solve for the probability distribution of the number of copies of a genetic variant \(P(n,t)\) at time \(t\) after the mutation arose. This quantity obeys a simple difference equation $$ P(n, t+dt) = (1-dt(b+d)n)P(n, t) + dt\left[ (n-1)b P(n-1,t) + (n+1)dP(n+1,t)\right] $$ where the first term corresponds to the case when nothing is happening in the time interval \(dt\), the second corresponds to division/birth with rate \(d\) and the third to the death of one individual with rate \(d\). Taking the limit \(dt\to 0\), we immediately get $$ \partial_t P(n, t) = -(b+d)nP(n, t) + (n-1)bP(n-1,t) + (n+1)dP(n+1,t) $$ To solve this equation, we define the generating function \(\phi(z,t) = \sum_n z^n P(n,t)\), multiply the equation by \(z^n\) and sum: $$ \partial_t \phi(z, t) = -(b+d) z \partial_z \sum_{n=0} z^{n} P(n, t) + z^2 b \partial_z \sum_{n=0} z^{n-1}P(n-1,t) + \partial_z \sum_{n=1} n z^{n+1}P(n+1,t) = \left[-(b+d)z + bz^2 + d\right]\partial_z \phi(z,t) $$ We will consider here the case where \(P(n,0)=\delta_{n0}\), i.e., there is exactly one individual at time \(t=0\), and \(\phi(z,t) = z\) To solve this partial differential equation, we can use the methods of characteristics. The method of characteristics determines families of curves in the space of \((z,t)\) along which the function is constant. Parameterizing the characteristic with \(t\), we have \(\phi(z(t),t)\) and consequently. $$ 0 = \frac{d\phi(z(t), t)}{dt} = \partial_z \phi(z,t) \frac{dz}{dt} + \partial_t \phi(z,t) = \partial_z \phi(z,t) \left[\frac{dz}{dt} -(b+d)z + b z^2 + d \right] $$ Substituting \(z(t) = 1- \chi(t)\), this equation simplifies to $$ \frac{d \chi(t)}{dt} = - s \chi(t) + b\chi(t)^2 $$ with \(s=b-d\). The solution then is $$ \chi(t) = \frac{e^{-st}}{C-b\int_0^t e^{-st'} dt'} = \frac{e^{-st}}{C+\frac{b}{s}(e^{-st}-1)} $$ with initial condition \(\chi(0) = C^{-1}\). By construction, the generating function \(\phi(z,t)\) is constant along the characteristic curve such that \(\phi(z(t),t) = \phi(z(0),0)) = z(0) = 1-C^{-1}\). The problem is hence reduced to finding the integration constant \(C\) such that the characteristic passes through the point \((z,t)\). Solving $$ z = 1 - \frac{e^{-st}}{C(z,t)+\frac{b}{s}(e^{-st}-1)} $$ for \(C(z,t)\) yields $$ %(1 - z)(C(z,t)+\frac{b}{s}(e^{-st}-1)) = e^{-st} %(1 - z)C(z,t) = e^{-st} - (1-z)\frac{b}{s}(e^{-st}-1) %\frac{1}{C(z,t)} = \frac{(1 - z)}{e^{-st} - (1-z)\frac{b}{s}(e^{-st}-1)} \phi(z,t) = 1 - \frac{1}{C(z,t)} = 1-\frac{(1 - z)e^{st}}{1 - (1-z)\frac{b}{s}(1 - e^{st})} $$ As a sanity check, we can evaluate the first few moments $$ 1 = \sum_n P(n,t) = \phi(1,t) = 1 $$ $$ \langle n\rangle = \sum_n n P(n,t) = \partial_z \phi(z,t)|_{z=1} = e^{st} $$ and the survival probability $$ 1-P(0,t) = 1-\phi(0,t) = \frac{e^{st}}{1+\frac{(1+s)}{s}(e^{st}-1)} \to \cases{ \frac{s}{1+s} & s>0\cr \frac{1}{1+t} & s=0\cr se^{st} & s<0 } $$ The figure below illustrates the behavior of the fixation probability for different growth rates \(s\). For times \(t<s\), all curves behave roughly neutrally. Only at longer times do systematic differences in growth rate matter and beneficial mutations fix for good and deleterious mutations are lost.
The full probability distribution \(P(n,t)\) can be readily obtained by expanding \(\phi(z,t)\) into a geometric series.
Handwritten notes
References
- Population Genetics: A concise guide, John Gillespie, John Hopkins University Press, 2004
- Diffusion Models in Population Genetics, Motoo Kimura, Journal of Applied Probability, Vol. 1, No. 2 (Dec., 1964), pp. 177-232
- The coalescent, John Kingman, Stochastic Processes and their Applications Volume 13, Issue 3, September 1982, Pages 235-248