Load packages

### TIPS ###
# We first load all the libraries we will need for this practical. 
# Make sure you have those installed 
# If not then please visit https://www2.unil.ch/popgen/teaching/SISG22/
library(hierfstat)
library(gaston)
library(SNPRelate)
library(JGTeach)

Drift

We will start by doing in silico Buri’s drosophila cages experiment. We will generate using the JGTeach::drift function the allelic frequency trajectories for a number of replicates, corresponding to the different cages. Use \(100\) individuals, and initial frequency of \(0.5\) and \(1000\) replicates

Code

dum<-drift(nind=100,p0=0.5,nrep=1000,ngen=100,PlotIt = TRUE)

Describe the figure produced.

Answers

We see allele frequencies fluctuating as time goes, with the first replicates fixing one or the other allele after \(\approx 35\) generations. After a 100 generations, many replicates have fixed one or the other allele, but still many have the two alleles segregating.

set the number of individuals to 10; to 1000. How does population size affects the drifting process?

Code

dum<-drift(nind=10,p0=0.5,nrep=1000,ngen=100,PlotIt = TRUE)

dum<-drift(nind=1000,p0=0.5,nrep=1000,ngen=100,PlotIt = TRUE)

Answers

With 10 individuals only, allele frequencies fluctuations are much larger than with \(100\), and fixation of one or the other allele occurs very fast. After a 100 generation, only a handful of replicates retains polymorhism.

With \(1000\) individuals, we see the reverse pattern, fluctuations of allele frequencies are much smaller, and all replicates remain polymorphic after 100 generations, with allele frequencies remaining within the range \([0.2-0.8]\).

Set the number of individuals to 100 and the number of generation to 1000. Take time slices (generations 1, 2, 3, 5, 10, 50, 100,1000) to look at the distribution of the replicates allele frequencies over time.

Code

x<-drift(nind=100,ngen=1000)
gens<-c(1,2,3,5,10,20,50,100,1000)

par(mfrow=c(3,3))
for (i in gens)
hist(x[i,],breaks=0:50/50,main=paste("generation:", i),xlab="Freq",ylab="")

par(mfrow=c(1,1))

Describe the resulting figure and compare it to Buri’s results.

Answers

The figures show allele frequency distribution spreading from a pic at 0.5 for the initial generation, to something belle shaped around 0.5 until the 20th generation but with increasing spread. At generation 50, the flattening is more pronounced, and at generation 100, the distribution is U -shaped with several replicates having fixed one or the other allele, but still many replicates maintain a polymorphic state. At generation 1000, all replicates have fixed one or the other allele, in equal proportion

What are the overall expected genotypic frequencies after 10, 20, 50 and 100 generations, assuming all replicates started with a frequency of 0.5 in the initial generation?

Answer

Several ways to answer this question.

The first is “brute” force, using the average of the genotypic proportions assuming HW from each replicate at the relevant generation from the initial experiment, results from which were stored in objectx:

g<-c(10,20,50,100)

hom1<-x[g,]^2
het<-2*x[g,]*(1-x[g,])
hom2<-(1-x[g,])^2
round(rbind(rowMeans(hom1),rowMeans(het),rowMeans(hom2)),digits=3)

##       [,1]  [,2]  [,3]  [,4]
## [1,] 0.264 0.274 0.311 0.367
## [2,] 0.478 0.452 0.394 0.301
## [3,] 0.259 0.274 0.295 0.332

The second is via estimating \(F_{ST}\) from these generations, as \(var(p)/(\bar{p} \times (1-\bar{p}))\):

Vp<-apply(x[g,],1,var)
barp<-rowMeans(x[g,])
(Fsts<-Vp/barp/(1-barp)) #alternatively Fsts<-Vp/0.5^2

## [1] 0.04484019 0.09558799 0.21156659 0.39819122

round(Fsts,digits=3)

## [1] 0.045 0.096 0.212 0.398

round(rbind(barp^2+barp*(1-barp)*Fsts,2*barp*(1-barp)*(1-Fsts),(1-barp)^2+barp*(1-barp)*Fsts),digits=3)

##       [,1]  [,2]  [,3]  [,4]
## [1,] 0.264 0.274 0.311 0.367
## [2,] 0.478 0.452 0.394 0.301
## [3,] 0.259 0.274 0.295 0.332

Last, we could calculate the expected \(F_{ST}\)s for these generations, assuming no mutation and an infinite number of replicates (thus \(\bar{p}=p_0=0.5\)) using the relation \(\theta^t=1/2N+(1-1/2N)\theta^{t-1}\):

thetas<-numeric(100)
p0<-.5
N<-100
thetas[1]<-1/2/N
theta0<-0; for (i in 2:100) thetas[i]<-1/2/N+(1-1/2/N)*thetas[i-1]
round(thetas[g],digits=3)

## [1] 0.049 0.095 0.222 0.394

round(rbind(p0^2+p0^2*thetas[g],2*p0*(1-p0)*(1-thetas[g]),(1-p0)^2+p0*(1-p0)*thetas[g]),digits=3)

##       [,1]  [,2]  [,3]  [,4]
## [1,] 0.262 0.274 0.305 0.349
## [2,] 0.476 0.452 0.389 0.303
## [3,] 0.262 0.274 0.305 0.349

We see the estimated \(F_{ST}\) are close to their expectations for all generations, as are the proportion of the different genotypes.

[optional]. We will use the JGTeach::drift function to explore some properties of drift, namely the probability and time to fixation of an allele.

Theory tells us that the probability of fixation of an allele is its frequency, and that time to fixation is

\[ \frac{4N(1-p) \ln(1-p)}{p} \]

which, for a frequency of \(0.5\), corresponds to \(2.7N\) generations.

[optional] simulate \(10,000\) replicates of a population of \(50\) diploid individuals starting with a frequency \(p_0=0.5\). The expected probability of fixation is \(0.5\), and it should take on average \(135\) generations to achieve fixation.

Code

x1<-drift(nind=50,ngen=1000,nrep=10000)
par(mfrow=c(3,3))
for (i in gens)
  hist(x1[i,],breaks=seq(0,1,0.01),main=paste("generation:", i),xlab="Freq",ylab="")

par(mfrow=c(1,1))

[optional] extract the time to fixation of the different replicates, from which you will extract both the probability and mean time to fixation.

Code

 # distribution of fixation times: take the first occurrence of 1 for each replicate
dist.tfix<-apply(x1,2,function(y) which(y>=1.0)[1])
# fixation probability
(pfix<-sum(!is.na(dist.tfix))/10000)

## [1] 0.4982

# mean time to fixation
(tfix<-mean(dist.tfix,na.rm=T))

## [1] 137.3695

Both the probability of fixation 0.5 and the mean time to fixation 137 are close to their expected values

[optional] Make a histogram of the time to fixation and describe it.

Answer

hist(dist.tfix, main="Distribution of time to fixation, p0=0.50, N=50",xlab="Generations",ylab="")
abline(v=tfix, col="red",lwd=2)

The distribution of fixation time as a long right tail, and a mode to the left of the mean time to fixation: many replicates will fix before the mean time to fixation, but a few will take a very long time to reach fixation.

Individual populations \(F_{ST}\)s

Simulate using hierfstat::sim.genot.t 3 populations evolving independently one from another, in order to check the approximation

\[ F_{ST_i} \approx \frac{t}{2N_{i}} \]

The 3 populations have diploid size \(N_1=100,N_2=1000\) and \(N_3=10000\) respectively. after 50 generations, we expect \(\beta_{1,2,3}\) to be approximately 0.25, 0.025, 0.0025. Is it the case ?

Answer

# 50 loci with 20 alleles each, 
dat<-sim.genot.t(size=100,nbal=20,N=c(100,1000,10000),
                 nbloc=50,mig=0.0,t=50)

# the betas function estimates population specific Fsts from a
# fstat format data frame assuming random mating

(beta.ci<-betas(dat,nboot=100)$ci)

##                                            boot.bW
## 2.5%  0.1843812 0.02237260 -0.004061785 0.07162905
## 97.5% 0.2366347 0.03460005  0.007309045 0.09018959

The confidence intervals of \(F_{ST_i}\)s overlap with the expectation for population 2 and 3 (\(1000\) and \(10000\) individuals respectively), but not for population 1, where the upper limit of the confidence interval of \(F_{ST_1}\), 0.2366347 is just below the approximation 0.25.

Check whether the approximation holds after \(200\) generations.

Answer

# 50 loci with 20 alleles each, 
dat<-sim.genot.t(size=100,nbal=20,N=c(100,1000,10000),
                 nbloc=50,mig=0.0,t=200)

# the betas functiom estimates population specific Fst from a
# fstat format data frame assuming random mating
betas(dat,nboot=100)$ci

##                                         boot.bW
## 2.5%  0.5904312 0.0900153 0.005164215 0.2368829
## 97.5% 0.6912153 0.1276246 0.028368175 0.2725287

#expectation
200/c(200,2000,20000)

## [1] 1.00 0.10 0.01

The pattern we saw after 50 generations is exemplified: \(F_{ST_1}\) confidence interval does not include the approximation of its expectation (\(1\)). For the other two, the approximation is still within the confidence interval, but barely so for the second. A rule of thumb for the approximation to work is \(t \leq 0.2N\)

[optional] only time allowing, come back to this if you finished everything else.

Imagine a way to do this simulation with ms.

Answer

# assumes mu=1e-8, r=1e-8, N0=1e+5, nbp=1e+5, 
# simulate 10 replicates, 100 diploid individuals from each pop
#  
# pop 4 will be the ancestral population with N4=N0=1e+5
# Only pops 1-3 are sampled, and there is no migration:
# -I 4 200 200 200 0 0 
# pop1 is a 1000th of anc pop, pop2 a 100th and pop 3 a 10th:
# -n 1 0.001 -n 2 0.01 -n 3 0.1
# split occured (backward) 50 generation ago
# in unit of 4* Anc pop size ->50/4/100000=0.000125:
# -ej 0.000125 1 4 -ej 0.000125 2 4 -ej 0.000125 3 4

ms 600 10 -t 400 -r 400 100000 -I 4 200 200 200 0 0 -n 1 0.001 -n 2 0.01 -n 3 0.1 -ej 0.000125 1 4 -ej 0.000125 2 4 -ej 0.000125 3 4 > 3popsdrift.txt

bed<-ms2bed("../3popsdrift.txt")
# expectation is (50/2/c(100,1000,10000)), 
# but works well for t<0.2N only
fst.dosage(bed,pop=rep(1:3,each=50))

##             1             2             3           All 
##  3.020749e-01  2.661630e-02 -9.885836e-05  1.095308e-01

50/c(200,2000,20000)

## [1] 0.2500 0.0250 0.0025

Population specific Fsts from the 1000 genomes

We will now explore the first 20 megabases of chromosome 22 from samples of the 1000 genome project and will find that negative population specific \(F_{ST}^i\) are not uncommon.

Load chromosome 22 fragment VCF file from the 1000 genomes into R (see practical 1) and the samples description.

Code

ch22<-read.VCF("../chr22_Mb0_20.recode.vcf.gz")

## ped stats and snps stats have been set. 
## 'p' has been set. 
## 'mu' and 'sigma' have been set.

samp.desc.url<-"https://www2.unil.ch/popgen/teaching/SISG18/integrated_call_samples_v3.20130502.ALL.panel"
samp.desc<-read.table(samp.desc.url,header=TRUE,stringsAsFactors = TRUE)

Describe object samp.desc: Print a table of the number of samples per super_pop and pop.

Code

with(samp.desc,table(pop,super_pop))

##      super_pop
## pop   AFR AMR EAS EUR SAS
##   ACB  96   0   0   0   0
##   ASW  61   0   0   0   0
##   BEB   0   0   0   0  86
##   CDX   0   0  93   0   0
##   CEU   0   0   0  99   0
##   CHB   0   0 103   0   0
##   CHS   0   0 105   0   0
##   CLM   0  94   0   0   0
##   ESN  99   0   0   0   0
##   FIN   0   0   0  99   0
##   GBR   0   0   0  91   0
##   GIH   0   0   0   0 103
##   GWD 113   0   0   0   0
##   IBS   0   0   0 107   0
##   ITU   0   0   0   0 102
##   JPT   0   0 104   0   0
##   KHV   0   0  99   0   0
##   LWK  99   0   0   0   0
##   MSL  85   0   0   0   0
##   MXL   0  64   0   0   0
##   PEL   0  85   0   0   0
##   PJL   0   0   0   0  96
##   PUR   0 104   0   0   0
##   STU   0   0   0   0 102
##   TSI   0   0   0 107   0
##   YRI 108   0   0   0   0

Verify that samples are the same and in the same order in ch22 and samp.desc.

Code

all.equal(ch22@ped$id,as.character(samp.desc$sample),check.attributes = FALSE)

## [1] TRUE

# if not (in different order and or some missing) :
# use match(ch22@ped$id,samp.desc$sample)

Estimate population and continent \(F_{ST}\)s from this dataset (it takes some time). You will have to download the Allele sharing (matching) file first.

Code+Answers

# From dosage data:
# fst.ch22.pop<-fst.dosage(ch22,pop=samp.desc$pop)
# fst.ch22.cont<-fst.dosage(ch22,pop=samp.desc$super_pop)

# More efficiently from matching data
# Matching.ch22<-matching(ch22)
# takes some time, so saved in matching.ch22.RDS available from # website

Matching.ch22<-readRDS("../matching.ch22.RDS")
fs.ch22.pop<-fs.dosage(Matching.ch22,pop=samp.desc$pop,matching=TRUE)
fs.ch22.cont<-fs.dosage(Matching.ch22,pop=samp.desc$super_pop,matching=TRUE)
fs.ch22.pop$Fs[2,]

##         ACB         ASW         BEB         CDX         CEU         CHB         CHS 
## -0.09355343 -0.08075169  0.10151305  0.17074313  0.14365325  0.17838039  0.17711221 
##         CLM         ESN         FIN         GBR         GIH         GWD         IBS 
##  0.08806118 -0.08344716  0.16672832  0.14865051  0.07875782 -0.07935443  0.14356831 
##         ITU         JPT         KHV         LWK         MSL         MXL         PEL 
##  0.10161283  0.16150007  0.15666623 -0.08451502 -0.10001199  0.13114569  0.18725503 
##         PJL         PUR         STU         TSI         YRI         All 
##  0.07933950  0.08566280  0.09944818  0.14146696 -0.08124044  0.07455351

fs.ch22.cont$Fs[2,]

##         AFR         AMR         EAS         EUR         SAS         All 
## -0.09971103  0.09566666  0.15940128  0.13837331  0.08328910  0.07540386

Population specific \(F_{ST}\)s vary between \(-0.10\) for Mende from Sierra Leone (MSL) and \(0.18\) for Han Chinese in Bejing, China (CHB) 1000 genomes population codes with an overall \(F_{ST}=0.075\).

Continent specific \(F_{ST}\)s vary between \(-0.10\) for Africa (AFR) and \(0.16\) for Esat Asia (EAS), with the same overall value.

There is a plot function associated to fs.dosage. Produce the corresponding plot after having ordeered the population by continent and discuss the results in the light of what you know about human demographic history: the top left panel show individual inbreeding coefficients per population, relative to the mean kinship in their population. The right column panels show \(F_{ST}^{XY}\) (or \(\beta^{XY}\)) on top and population specific \(F_{ST}\) at the bottom. Last, the bottom left panel shows pairwise \(F_{ST}\). Do you see a similarity between \(F_{ST}^{XY}\) and the heatmap of \(K_{AS}\) kinship?

Answers

#extract popnames in the correct order
pnames<-unlist(lapply(strsplit(with(samp.desc,
      levels(factor(paste(super_pop,pop,sep=":")))),":"),function(x) x[[2]]))
#forces factor to take  this order                            
pop<-with(samp.desc,factor(pop,levels=pnames))
fs.ch22.pop<-fs.dosage(Matching.ch22,pop=pop,matching=TRUE)
coul<-rep(c("orange","gold","red","green","blue","purple"),c(2,5,4,5,5,5))
plot(fs.ch22.pop,las=2,cex.axis=0.5,col=coul)

The plot function of hierfstat::fs.dosage produces four panels:

The top left panel shows individual inbreeding coefficients, estimated with the reference being the population to which the individual belongs. Samples are arranged according to their continent. Values of individual inbreeding cefficients are centered around 0, with Admixed Americans (AMR, red) samples showing more variation and South Asian (SAS, purple) samples hinting at slightly positive values.

The top right panel shows a heatmap of the FsM matrix containing population specific FSTs on the diagonal. The off diagonal elements contains the average of the kinships for pairs of individuals, one from each population, relative to the mean kinship for pairs of individuals between populations. This basically boils down to the average kinship per population or pair of we saw in the previous practical). African samples (AFR) show lower values than non african samples, with EAS and Europe (EUR) showing the largest within continent values.

The bottom left panel shows the Fst2x2 matrix, the classical pairwise \(F_{ST}\)s. All values within continents are low (with a small exception for AMR), while values between AFR and non AFR samples are high. We also see sightly higher values in the non AFR bloc for pairwise \(F_{ST}\)s between EAS and EUR.

The bottom right panel shows Population specific \(F_{ST}\)s as barplot. It’s immediately clear that all AFR values are negative, while all other populations show positive population specific \(F_{ST}\)s, with EAS population being largest, followed by EUR, and AMR showing the largest variance.

PCA

In this part, we will learn how to conduct Principal Component Analyses (PCA) on a genomic data set, [optionally] look at the different flavors of PCA, and get a feel for how to interpret the results.

Start by simulating the genotypes of 250 individuals, 50 from each of 5 populations, at \(1000\) bi-allelic loci, assuming an island model at equilibrium between migration, mutation and drift. Assume each island is made of \(1000\) individuals, with \(m=0.003\) between them. For this, use the sim.genot function of the hierfstatpackage.

Code

np<-5
dat5<-sim.genot(nbal=2,nbloc=1000,nbpop=np,mig=0.003)

Convert the data into dosage format using the biall2dos function.

Code

dos<-biall2dos(dat5[,-1])

Transform the dosage matrix into matrix \(X\), as done in Patterson et al. (2006). You may find the scale function useful for this.

Code

ps<-colMeans(dos)/2
sdp<-(ps*(1-ps))^(0.5)
#first, filter out the fixed loci
nfixed<-which(ps>0 & ps <1)
dosnf<-dos[,nfixed]

X<-scale(dosnf)

# or
# X<-sweep(dosnf,2,ps[nfixed],FUN="-")
# X<-sweep(X,2,sdp[nfixed],FUN="/")

Calculate \(XX'\) using function tcrossprod.

Code

XXT<-tcrossprod(X)

Obtain the eigen value decomposition of \(XX'\) using the eigen function.

Code

eigx<-eigen(XXT)

Obtain the individual’s coordinates \(UV^{1/2}\).

Code

ind.coord<-sweep(eigx$vectors,2,eigx$values^0.5,FUN="*")

Compare the results you obtain to what is obtained using the prcomp function on matrix \(X\).

Code

prx<-prcomp(X)
#check equality of the absolute values of the  PCs

all.equal(abs(matrix(prx$x)),abs(matrix(ind.coord)))

## [1] TRUE

In a four panel graphic windows, produce the following 4 plots:
- A scree plot of the first 20 eigen values, expressed as proportion of the sum of all eigenvalues
- A plot of PC2 against PC1, using a different color for each population
- A plot of PC4 against PC3
- A plot of PC6 against PC5

Code

par(mfrow=c(2,2))
plot((eigx$values/sum(eigx$values))[1:20],type="h",
     main="screeplot",xlab="Eigen values",ylab="")
plot(prx$x[,1:2],col=rep(1:np,each=50),pch=16)
plot(prx$x[,3:4],col=rep(1:np,each=50),pch=16)
plot(prx$x[,5:6],col=rep(1:np,each=50),pch=16)

par(mfrow=c(1,1))

Describe what you see: How many eigen values stand out? are samples grouped according to their population of origin? If so along which axes? How many PCs are necessary to describe the structure of this dataset?

Answer

The top left panel (scree plot) shows 4 eigen values standing out. Samples are indeed grouped according to their population of origin along axes 1 to 4. The bottom right panel (axes 5 and 6) shows no more structure remains after the first 4 axes, with individuals from all groups overlapping.

Produce a parallel coordinate plot, where the x-axis correspond to the first 5 PCA axes, and the y axis shows the coordinates of each individuals along the 5 axes, drawn as a line with color corresponding to the population in which the individual has been sampled.

Answer

colpca<-rep(1:np,each=50)
ns<-5
plot(1:ns,prx$x[1,1:ns],col=colpca[1],type="l",ylim=range(prx$x[,1:ns]),
     xlab="Axis",ylab="coord.",main="parallel coordinate plot")
for (i in 2:nrow(prx$x)) lines(1:ns,prx$x[i,1:ns],col=colpca[i])

The parallel coordinate plot representation is an easy way to visualize more than 2 axes at the same time: we see from the figure that individuals from the same population share similar coordinates along axes 1 to 4, the red population standing out on axis 1, the green on axis 2, the turquois on axis 3 and well separated from the black population, and finally the blue on axis 4.

[optional]. Redo this analysis on the matrix of dosages centered but not scale to \(sd=1\). Conclusions?

Answer

X<-scale(dosnf,scale=FALSE)
XXT<-tcrossprod(X)
eigx<-eigen(XXT)
ind.coord<-sweep(eigx$vectors,2,eigx$values^0.5,FUN="*")

prx<-prcomp(X)

all.equal(abs(matrix(prx$x)),abs(matrix(ind.coord)))

## [1] "'is.NA' value mismatch: 250 in current 0 in target"

par(mfrow=c(2,2))
plot((eigx$values/sum(eigx$values))[1:20],type="h",
     main="screeplot",xlab="Eigen values",ylab="")
plot(prx$x[,1:2],col=rep(1:np,each=50),pch=16)
plot(prx$x[,3:4],col=rep(1:np,each=50),pch=16)
plot(prx$x[,5:6],col=rep(1:np,each=50),pch=16)

par(mfrow=c(1,1))

The pattern we observed is very similar to the one obtained with the scaled matrix. With SNPs data, with many rare alleles, scaling gives the same weight to rare and common alleles, which might not be what ones want. If interested on this topic, there is a vast literature on using PCA in population genetics, see e.g. Novembre and Stephens, Nat. gent (2008).

[optional]. Instead of using \(XX'\) for eigenvalue decomposition, use the kinship matrix \(K_{AS}\) obtained from the allele sharing matrix, you might want to use the hierfstat::beta.dosage function for this. Conclusions?

Code

Kas<-beta.dosage(dos)
eigKas<-eigen(Kas)

ind.coord<-sweep(eigKas$vectors,2,eigKas$values^0.5,FUN="*")

par(mfrow=c(2,2))
plot((eigKas$values/sum(eigKas$values))[1:20],type="h",
     main="screeplot",xlab="Eigen values",ylab="")
plot(ind.coord[,1:2],col=rep(1:np,each=50),pch=16)
plot(ind.coord[,3:4],col=rep(1:np,each=50),pch=16)
plot(ind.coord[,5:6],col=rep(1:np,each=50),pch=16)

par(mfrow=c(1,1))

Again, we obtain a picture separating well the 5 population along 4 axes.

[optional]. Simulate a new genotype data set, this time with only three populations. Rerun step of question 7 above. Conclusions?

Answer

np<-3
dat3<-sim.genot(nbal=2,nbloc=1000,nbpop=np,mig=0.003)
dos<-biall2dos(dat3[,-1])
ps<-colMeans(dos)/2
sdp<-(ps*(1-ps))^(0.5)
#first, filter out the fixed loci
nfixed<-which(ps>0 & ps <1)
dosnf<-dos[,nfixed]

X<-scale(dosnf)
prx<-prcomp(X)
par(mfrow=c(2,2))
plot((prx$sdev^2/sum(prx$sdev^2))[1:20],type="h",
     main="screeplot",xlab="Eigen values",ylab="")
plot(prx$x[,1:2],col=rep(1:np,each=50),pch=16)
plot(prx$x[,3:4],col=rep(1:np,each=50),pch=16)
plot(prx$x[,5:6],col=rep(1:np,each=50),pch=16)

par(mfrow=c(1,1))

With only 3 populations, 2 axes are sufficient to describe the structure in the data set

[optional]. You might want to experiment with higher migration rate in the simulated data set, and observe the effect it has. Conclusions?

Answer

np<-5
#migration 3 times larger

dat5<-sim.genot(nbal=2,nbloc=1000,nbpop=np,mig=0.01)
dos<-biall2dos(dat5[,-1])
ps<-colMeans(dos)/2
sdp<-(ps*(1-ps))^(0.5)
#first, filter out the fixed loci
nfixed<-which(ps>0 & ps <1)
dosnf<-dos[,nfixed]

X<-scale(dosnf)
prx<-prcomp(X)
par(mfrow=c(2,2))
plot((prx$sdev^2/sum(prx$sdev^2))[1:20],type="h",
     main="screeplot",xlab="Eigen values",ylab="")
plot(prx$x[,1:2],col=rep(1:np,each=50),pch=16)
plot(prx$x[,3:4],col=rep(1:np,each=50),pch=16)
plot(prx$x[,5:6],col=rep(1:np,each=50),pch=16)

par(mfrow=c(1,1))

colpca<-rep(1:np,each=50)
ns<-5
plot(1:ns,prx$x[1,1:ns],col=colpca[1],type="l",
     ylim=range(prx$x[,1:ns]),xlab="Axis",
     ylab="coord.",main="par. coord. plot")
for (i in 2:nrow(prx$x)) lines(1:ns,prx$x[i,1:ns],col=colpca[i])

With more migration, we retain the 4 significant axes, but individuals are not as tightly clustered as previously.

[optional]. Rather than an island model of population structure, you might want to simulate a stepping stone in one dimension, using the hierfstat::sim.genot.metapop.t function. Assume 8 demes, connected by migration with nearest neighbors only, and with 5% migration between neighbors. Redo the PCA on the data set. Interpret the figure.

Answer

np<-8
nl<-1000
t<-1000
M<-matrix(0,ncol=np,nrow=np)
diag(M[-1,-np])<-0.05
diag(M[-np,-1])<-0.05
diag(M)<-1-rowSums(M,na.rm=TRUE)

ss1d<-sim.genot.metapop.t(nbal=2,nbpop=np,nbloc=nl,mig=M,t=t)
dos<-biall2dos(ss1d[,-1])

ps<-colMeans(dos)/2
sdp<-(ps*(1-ps))^(0.5)
#first, filter out the fixed loci
nfixed<-which(ps>0 & ps <1)
dosnf<-dos[,nfixed]

X<-scale(dosnf)
prx<-prcomp(X)
par(mfrow=c(2,2))
plot((prx$sdev^2/sum(prx$sdev^2))[1:20],type="h",
     main="screeplot",xlab="Eigen values",ylab="")
plot(prx$x[,1:2],col=rep(1:np,each=50),pch=16)
plot(prx$x[,3:4],col=rep(1:np,each=50),pch=16)
plot(prx$x[,5:6],col=rep(1:np,each=50),pch=16)

par(mfrow=c(1,1))

In a 1D-stepping stone, only the first axis of the PCA is significant, and points of different colors overlap, but do so in a geographical fashion: black dots to the left and grey dots to the right along the 1st axis, with other colors coming in turn (black, then red, then green, blue, turquoise, purple, orange and finally grey)

2 populations system

If enough time only Here, we are reproducing figs 1. and 3 of Weir and Goudet (2017). The following two questions will help you understand the material in this paper, but are not essential.

The two populations system presented during the lecture and given in eq 5 of the paper has been implemented in JGTeach::thet.bet.2pops.

[optional]. Have a look at the help page, and run the examples (fig 1 of the paper).

Answer

 # reproduces fig 1 in Weir and Goudet (2017)
 #top row
 first<-thet.bet.2pops(mu=0.0,m2=0.0,m1=0.0,n1=10000,n2=100,ngen=10000)

 #middle row
 second<-thet.bet.2pops(mu=0.001,m2=0.0,m1=0.0,n1=10000,n2=100,ngen=10000)

 #bottom row
 third<-thet.bet.2pops(mu=0.001,m2=0.0,m1=0.01,n1=10000,n2=100,ngen=10000)

[optional]. Next you will show that the estimates matches the expected values. For this, we will make use of the hierfstat function sim.genot.metapop.t to generate genetic data according to this model.

This may take some time, so a data file fig3.RData containing the results is available from the website, you should download it.

Answer

gens<-1:10*100
ngen<-gens[length(gens)]
#generate data
mig.mat2<-matrix(c(.99,0.01,0,1),nrow=2,byrow=TRUE)
mig.mat2

##      [,1] [,2]
## [1,] 0.99 0.01
## [2,] 0.00 1.00

mutrate<-10^{-3} #10^{-6} initial value

# As sim.genot.metapop.t only outputs genotypes from the 
# last generation, necessary to run the function
# with several end time points. This is the essence of 
# using lapply to a here

#  x<-lapply(gens,function(y){ 
#  sim.genot.metapop.t(t=y,nbal=20,nbpop=2,N=c(10000,100),
#                      mig=mig.mat2,nbloc=1000,mut=mutrate)})

# estimate betas for data sets
# beta.x<-lapply(x,betas,nboot=1000)


load("../fig3.RData")
betas.ci<-lapply(beta.x,function(x) x$ci)
#expected values

etb1<-thet.bet.2pops(mu=mutrate,n1=10000,n2=100,m1=0.01,m2=0,ngen=ngen,plotit=FALSE)

gens<-1:10*100
ngen<-gens[length(gens)]

#create the plot
plot(1:ngen,etb1$Be[,1],type="l",ylim=range(etb1$Be,betas.ci),col="red",lwd=2,
     xlab="Generations",ylab=expression(beta))
lines(1:ngen,etb1$Be[,2],col="blue",lwd=2)
lines(1:ngen,etb1$Be[,3],lwd=2)
abline(h=0)
segments(gens,unlist(lapply(betas.ci,function(x) x[1,2])),gens,
         unlist(lapply(betas.ci,function(x) x[2,2])),lwd=2,col="blue")
segments(gens,unlist(lapply(betas.ci,function(x) x[1,1])),gens,
         unlist(lapply(betas.ci,function(x) x[2,1])),lwd=2,col="red")
points(gens,unlist(lapply(beta.x,function(x) x$betaiovl[1])),
       col="red",cex=1.5,pch=16)
points(gens,unlist(lapply(beta.x,function(x) x$betaiovl[2])),
       col="blue",cex=1.5,pch=16)
#beta_w added too
points(gens,unlist(lapply(beta.x,function(x) mean(x$betaiovl))),
       col="black",cex=1.5,pch=16)
title("2 populations with migration model. \n        N1=10'000; N2=100; m1=0.01; m2=0.0")

END OF PRACTICAL 4.

practical 4: Population structure

Jerome Goudet and Bruce Weir

2023-07-17

Load packages

Drift

Individual populations \(F_{ST}\)s

Population specific Fsts from the 1000 genomes

PCA

2 populations system