Tracking substitution rate through simulation in a Wright-Fisher model

By Alis Benkert and Alex Ghinea under the supervision of Thibault Latrille

Module Modelling and Bioinformatics Spring 2024

Link to powerpoint presentation

Abstract

The Wright-Fisher model was used to track substitutions and fixations in simulated populations. Mutation rates were introduced, multiple loci were tracked, and substitutions were calculated as the number of substitutions per generation. Then, selection was applied to the populations while observing one locus, and the effects of selection on the substitution rate was observed.

Introduction

The goal of our bioinformatics project was to determine the substitution rate across loci in a given population. To do this, we wanted to simulate a population with mutations and be able to track when the mutations cause substitutions in loci. Substitutions in our case are substitution mutations that have a potential to fixate in the population, as in they could become the new ancestral allele at that locus. We want to know the number of substitutions per genome per generation and observe the spread of substitutions across the population to find when it fixes.

We then wanted to be able to adjust the script to change the population size, number of loci, and mutation rate.

Part 1: Tracking Substitutions

We needed to generate a population of individuals and be able to introduce mutations, and then track the mutations when they occur. We used a population that was fixed to make it easier to control, used non-overlapping generations, diploid individuals, and the individuals were hermaphroditic without any occurrence of self-fertilization.

Modelling the population

The population was represented by 3-dimensional numpy arrays

Dimension 1: Individuals

Dimension 2: Loci

Dimension 3: Alleles

To generate the arrays, we need alleles to choose from with their own frequencies, a population size "N", and a number of loci.

Creating the new generation

The simplest way to create the new generation (recall the generations were non-overlapping), is to take N offsprings and have them "choose" their parents randomly from the previous generation. We do this because it is more logical to backtrack from the offspring to the parent than try to predict what the offspring will be using the parents.

First, we generated an empty 3D array for the offspring, then used 2 numpy arrays to store random indexes while ensuring no offspring has two of the same individual as a parent (no self-fertilization). Then, create 2D numpy arrays storing randomly chosen allele indexes (0 or 1) for each locus and each offspring (since we are using diploid organisms).

Implement mutations

Every time a new offspring population is created, each allele may or may not mutate into another one depending on the mutation rate. In order to implement a mutation rate of, for example 0.1 (where 10% of alleles mutate), first we generated a numpy array of the same dimensions as before, then an offspring population randomly filled with a number between 0 and 1. Then, we identified where the random number was between 0 and 0.1 (so 10%), where 0-0.1 is true and 0.11-1 is false. Then, the mask is transferred over to the population, and the true condition causes the allele to randomly be mutated to another allele.

Iterate

We began with a totally monomorphic population, then iterated it over and over using offspring as the next generations' parents each time. We continued tracking which alleles were present at each locus, and when a locus became monomorphic in the population for a novel allele, it was a substitution (fixation)!

Results

Theoretical

In simple models, substitution rate K is almost always independent from population size N. Therefore, we would expect that for neutral mutations (where the selection coefficient is 0), the substitution rate K is equal to the mutation rate πœ‡.

K=πœ‡

When we look at loci, the relationship is then changed to the substitution rate being equal to the mutation rate times the number of loci.

K=πœ‡ x L


K=πœ‡ is made up of more elements than just the substitution rate and mutation rate. The full equation is as follows:

K = ( 2π‘πœ‡ βˆ— 〖𝑃〗_𝑓𝑖π‘₯ = 2π‘πœ‡/2N ) = πœ‡

In this equation, 2Nπœ‡ represents the mutations per generation. We then multiply the number of mutations per generation by the probability that they will fix in the population,γ€– 𝑃〗_𝑓𝑖π‘₯.

〖𝑃〗_𝑓𝑖π‘₯ = 1/2N describes the probability that any given allele will fix by considering all of the alleles in the population.

By inputtingγ€– 𝑃〗_𝑓𝑖π‘₯ = 1/2N into the first equation, we see it simplifies to πœ‡. This is our theoretical relationship.

However, the complication occurs within the equation for the probability of fixation. 〖𝑃〗_𝑓𝑖π‘₯ = 1/2N is only true when the mutation rate πœ‡ multiplied by the current population Ne is much less than one:

πœ‡ * Ne << 1

When this condition is not met, the substitution rate no longer equals the mutation rate. This puts a restriction on the population size/mutation rate relationship if we want to maintain the theoretical substitution rate expression.

In a population of 1000, a mutation rate of ΞΌ= 1 x 10^-3 would be too large (πœ‡ * Ne = 1). The problem with this example is the mutation rate is so high that any substitutions that may occur never have an opportunity to fix in a population, since mutations are happening so frequently.

In a population of 1000, a mutation rate of ΞΌ= 1 x 10^-5 would be good (πœ‡ * Ne = 1). Mutations will occur rarely enough that they have an opportunity to spread and fix in the population, though it may take a while, especially since in our case we wanted to simulate real-life populations, which can be quite large. Over many generations and observing many loci, this can prove difficult to run as a code.

Practical

In real life, the πœ‡ * Ne expression usually results in a value between 0.01 and 0.0001. We found that as long as we maintained a relationship between πœ‡ and N, we could use a much smaller population and higher mutation rate to produce the same results in a fraction of the time. This is a concept called scaling.

This was supported by us running a simulation with a population of 10, 50, and 500, where we observed that all populations had almost the same number of substitutions per generation as one another. The smaller populations had a bit more stochasticity in their results, but were nonetheless closely aligned with our trendline of K=πœ‡. The delay seen for the population size 500 is caused by the lag in fixation due to the size of the population, a phenomenon described by the equation 4*Ne. If the graph were to be trimmed from the first generation to the 2000th, then realigned, it would closely follow the trendline on the graph. The other populations are small enough that there is no observable effect.

As a result, we used this concept of scaling to help us in the next section of our project, where we introduced selection.

Part 2: Selection

What selection?

Up until now, we had used random mating in the population to obtain our offspring and run our simulation. Now, we would like to introduce selection. There are two main types of selection, those being survival selection and sexual selection. We decided to focus on sexual selection since it is easier for us to work with in the setup we have. This will mean that the genotype of our individuals will influence their chance of becoming a parent and contributing to the next generation.

The fitness landscape

To illustrate the idea of the implementation of selection in the simulation, we are using fitness landscapes. This displays the determination of an individuals fitness based on the alleles it has. In our example, the X axis is the allele from parent 1, the Y axis is the allele from parent 2, and the Z axis is the fitness associated with the combinations. A/A has a strong positive fitness, C/G and C/C seem to have a positive fitness, and T/T is deleterious.

The fitness associated with alleles are called fitness effects. Selection coefficients are determined from subtracting the fitness effect of the ancestral allele from the allele of the individual. If the fitness effect of the current allele is better than the ancestral, the selection coefficient will be positive, whereas if it is worse, the selection coefficient will be negative. The selection coefficients have a direct effect on the individuals' probability to mate.

Implementing selection in a simulation

First, we chose a fitness landscape. We maintained our 3D numpy array of populations, but focused on one locus only and applied fitness proportionate selection. We maintained our dimensions the same as in the previous section, then we assigned fitness values to each allele. We then summed them for each individual to assign individuals a fitness value, which is possible since with codominant selection, both alleles are weighed the same. The probability of mating of each individual is then calculated by taking the individuals' fitness using both alleles and dividing it by the sum of the fitness of all the individuals in the population.

〖𝑃(π‘šπ‘Žπ‘‘π‘–π‘›π‘”)〗𝑖 =〖𝐹𝑖𝑑𝑛𝑒𝑠𝑠〗𝑖 /(βˆ‘π‘ j=1〖𝐹𝑖𝑑𝑛𝑒𝑠𝑠〗𝑗 )

We then selected parents the same as before, but this time accounting for their specific mating probabilities. Then we proceed with recombination and mutation exactly as before.


We began with a fully monomorphic population of only A individuals (assuming a past fitness landscape where A was the fittest state) and ran it for a different fitness landscape.

Results

We ran simulations with population sizes of 10, mutation rates of 1x10^-5, and 100,000 generations with varying selection coefficients. We only changed the fitness effect of one allele in our simulations, we chose C, and we only increased the effect. The graphs display generations on the X axis, allele frequency on the Y axis, and the legends show the colours assigned to the alleles as well as the generation at which a substitution occurred.

First, we began by not increasing any fitness effects at all (so there was no preference for any allele). We saw a couple of substitutions, for example T --> C which took 120 generations to fix in the population. Substitutions in this scenario are completely at random.

Next, we changed the fitness effect so that the likelihood of an individual with a C allele being chosen to mate increases by 1%. This did not change anything about the graphic however, since the increase in fitness is not enough to make any significant difference.

We then tried an increase of 10%, which showed us a substitution to C in both attempts, which occurred in the span of about 30 generations, so quite rapidly. So at 10%, we see that C is able to dominate at this locus quite quickly and remains there.

Finally, we wanted to see what would happen if there was a 100% increase? The resulting graphic shows an almost immediate fixation to C; the second a mutation to C appeared in the population, it spread and fixed within about 20 generations.

Fitness landscape

To explain this phenomenon, we looked back at fitness landscapes. If the fitness effect is high enough in C compared to the other alleles, then a mutation to the C allele drives the whole population to ascend to peak fitness, at the top of the fitness landscape. From there, any mutation back to any other allele is counter-selected, which prevents the population from ever falling back down. Any mutations between A, G, and T are not valuable since they have a similar fitness effect.

When the selection coefficient is around 0 for all, and the fitness effects are very low, the alleles are easily interchanged since there is little difference in having one over the other in terms of selection. There is little drive for a climb up the landscape, and shifting on the landscape is still possible as a result of drift, where the effect of randomness is stronger than that of selection.

Conclusion

We were successfully able to simulate real conditions of real populations, and able to observe possible effects of mutations, selection, and drift on them.

Our project could be expanded on by looking at:

- Changing the fitness effects in various ways; we had only increased that of C, but more could have been changed and they could also be changed negatively or positively. We also could have changed parameters to look at different types of dominance, such as dominant and recessive alleles and see how our populations respond differently.

- The red queen hypothesis - a constant arms race. Fitness landscapes of two organisms that impact one another due to a competition of survival would be interesting to observe with this sort of project. It could also be interesting to observe a relationship like this between a host and a virus, where any mutation may be beneficial for competition, but might be detrimental for the individual, etc. etc.

- Linkage between loci, where we could observe the loci carried by individuals selected for one locus in particular, and then see how linked loci have an effect on future generations. Additionally, it would be interesting to look at how deleterious other linked loci could be in an individual that is being positively selected for at a particular locus.

- Interactions between loci selected sexually or for survival. They could be observed as they change, if they overlap, if one locus selected for sexually is linked to a locus that is deleterious for survival, etc. etc.


Remainder

Bonus graphics with the same parameters as before, but in a population of 100 individuals

Answers to questions:

- Why fixed population sizes?

- Why non-overlapping generations?