### TIPS ###
# We first load all the libraries we will need for this practical.
# Make sure you have those installed
# If not then please visit the companion web site https://www2.unil.ch/popgen/teaching/SISG23/
library(hierfstat)
library(gaston)
library(SNPRelate)
library(JGTeach)
library(wesanderson)
library(knitr)
download the first
20 megabytes of chromosome 22 (this is a subset of chromosome 22,
data from all chromosomes can be downloaded from the
1000 genome download site. From a unix command prompt, enter
wget url
; from a browser, just download the file.
Import this file in R using package hierfstat
(check
the packages vignettes, import
) and its function read.VCF
using the following
commands:
ch22 <- read.VCF("../chr22_Mb0_20.recode.vcf.gz")
### TIPS ###
# Make sure to specify the correct path to the file!
# e.g. if your file is in the directory of your script then ditch the '../'
How many individuals in the data set? How many SNPs? Explore the
structure of the ch22
object with str(ch22)
.
Describe the different components
for answers, see section 2.3 of gaston vignette
#ch22 matrix with nrows inds and ncol snps
dim(ch22)
str(ch22)
samp.desc.fname <- "integrated_call_samples_v3.20130502.ALL.panel" # file name
ftp.path <- "https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/" # web link
samp.desc.url <- paste0(ftp.path,samp.desc.fname)
samp.desc <- read.table(samp.desc.url,header=TRUE,stringsAsFactors = TRUE) # read.table can also read web links
# checks that the order of samples in bed file and description file are the same
all.equal(as.character(ch22@ped$id),as.character(samp.desc$sample))
## [1] TRUE
# per coninent (super_pop)
boxplot(ch22@ped$hz ~ samp.desc$super_pop, ylab='Heterozygosity', xlab='continent', pch=20,
col=wes_palette('Rushmore1'))
# and per population
boxplot(ch22@ped$hz ~ samp.desc$pop,las=2, ylab='Heterozygosity', xlab='population', pch=20)
# same, sorted by continent
boxplot(ch22@ped$hz ~ with(samp.desc,factor(super_pop:pop)), las=2 ,ylab='Heterozygosity', xlab='', pch=20,
col=wes_palette('Rushmore1')[ rep(c(1:5), times=c(7,4,5,5,5)) ])
ch22@snps$hz
in the ch22
object. How does it differ from ch22@ped$hz
?The largest individual heterozygosities are in Africa (AFR).
The lowest are in East Asia (EAS).
The highest variance in individual heterozygosity is in the Americas (AMR)
ch22@snaps$hz
contains individual loci heterozygosities,
the proportion of heterozygous individuals at each locus.
ch22@ped$hz
contains the proportion of heterozygous loci
per individual. Heterozygosity is often not qualified (i.e. individual
or locus heterozygosity), and can thus be confusing. In population
genetics, we are usually interested in locus heterozygosity (proportion
of individuals heterozygous at a given locus). In population genomics,
both individual and locus heterozygosities are relevant, but should be
qualified.
hierfstat
and its
function sim.genot
(read its help page!), generate a data
set with the genotypes of 50 sampled individuals at 100 biallelic loci
from each of 4 populations, where the populations are made of 1000
individuals and exchange migrants at a rate of 0.001. Leave mutation
rate and f to their default. From the R
prompt:#precede the function name with ? to get help, e.g. ?sim.genot
dat <- sim.genot(nbpop=4,nbloc=100,nbal=2,size=50,N=1000,mig=0.001)
# dim to get the number of rows and columns
dim(dat)
# column names
head(names(dat))
# structure of the data set
str(dat)
The data set contains 200 rows (number of individuals) and 101 columns (number of loci plus one).
The first column contains the population identifier, the second and thord columns the genotypes of the individuals at the first and second loci.
hierfstat::biall2dos
to convert the
dat
object to a dos
object in dosage
format:dos <- biall2dos(dat[,-1])
str(dos)
[,-1]
one in the previous command?The [,-1]
removes the first column of the
dat
object, which contains the population identifier rather
than genotypes.
fstat
format dataset with more than 2 alleles
be converted to dosage format? To answer this, browse
hierfstat
helpThe function fstat2dos
allows converting loci with any
number of alleles to dosage format. If the locus carries 2 alleles, 2
“SNPs” will be created, one for the first allele and the other for the
second (hence it is redundant). For a locus with k alleles, k SNPs will
be created.
hierfstat
help and identify other
functions allowing to simulate genotypic data. Describe them
brieflyHave a look at the hierfstat new features vignette, section “Simulating genetic data”
ms
or mspms
(outside
R
, read the manual), simulate a similar dataset (you might
want more loci), and import it into R using the function
hierfstat::ms2bed
. To simulate the data, issue at the
command prompt (outside R
) the following command, after
having downloaded and isntalled the software (see the programs section
of the web
site:ms 400 2 -s 50 -r 40 100000 -I 4 100 100 100 100 4 > islM1.txt
### TIPS ###
# Running ms requires some basic terminal skills! Skip if uncomfortable
# To install ms you need to download the folder specified in the course website
# then through a linux terminal (ex WSL) you can $ tar -xvf ms.tar.gz && cd msdir/ && gcc -o ms ms.c streec.c rand1.c -lm
# then run ms as ./ms
islm1 <- ms2bed("https://www2.unil.ch/popgen/teaching/SISGData/t4r4p1.txt")
islm1
object. How many
individuals? How many loci?#str(islm1)
dim(islm1)
bed
file containing all biallelic
SNPs on this chromosome, excluding alleles other than
A,T,G,C,a,t,g,c
. For this, assuming your file is names
chr22_Mb0_20.recode.vcf.gz
type the following
commands:plink2 is essential for the --max-alleles option to work
plink2 --vcf ../chr22.vcf.gz --make-bed --snps-only just-acgt
--max-alleles 2 --out chr22.1kg
### TIPS ###
# You can also run plink with R using the system() command.
# example if your plink.exe and the vcf.gz is in your working directory (getwd() to see where you are - setwd() to change it)
# you can run > system("plink2.exe --vcf ../chr22_Mb0_20.recode.vcf.gz --make-bed --snps-only just-acgt --max-alleles 2 --out chr22.1kg")
# from your R console to execute the software
You can explore PLINK website to discover some of its capabilities.
R
and gaston
,
using the following commands from the R
prompt:chr22 <- read.bed.matrix("chr22.1kg") # reads the output of plink
SNPRelate
, convert the chromosome 22
VCF file into a GDS file, and open this file. You can find more about
SNPRelate
here.SNPRelate::snpgdsVCF2GDS("./chr22_Mb0_20.recode.vcf.gz","chr22.gds") # makes GDS file
f <- SNPRelate::snpgdsOpen("chr22.gds") # reads GDS file
ms
or the file pan.txt
provided, produce a histogram of allele frequencies# adjust for the location of file pan.txt (the "../") as necessary
pan <- ms2bed("../pan.txt")
hist(pan@p,breaks=0:100/100, xlab='allelic frequency', main='Histogram of allelic frequencies in a panmictic population')
ms
simulations; and to the
distribution of allelic frequencies generated with
sim.genot
. What can you say about how realistic these
distributions are?ch22 <- read.VCF("../chr22_Mb0_20.recode.vcf.gz") # reload chr22 vcf
## ped stats and snps stats have been set.
## 'p' has been set.
## 'mu' and 'sigma' have been set.
samp.desc.file <- "https://www2.unil.ch/popgen/teaching/SISGData/integrated_call_samples_v3.20130502.ALL.panel"
samp.desc <- read.table(samp.desc.file,header=TRUE) # reload sample descriptions
# subset samples
AFR <- which(samp.desc$super_pop=="AFR") # which samples are from AFR
EAS <- which(samp.desc$super_pop=="EAS") # which samples are from EAS
par(mfrow=c(2,2)) # plot indexing (this will make the plotting area 2x2)
#AFR hist
hist(ch22[AFR,]@p,breaks=101,main="AFR",xlab="Allele Freq.")
#EAS hist
hist(ch22[EAS,]@p,breaks=101,main="EAS",xlab="Allele Freq.")
#PAN ms hist
hist(pan@p,breaks=101,main="panmictic pop with ms",xlab="Allele Freq.")
#simulate data from a panmictic pop with sim.genot
dat<-sim.genot(nbal=2,nbpop=1,size=100,nbloc=10000)
#convert to dosage
dos<-biall2dos(dat[,-1])
# colMeans(dos) will give twice the frequencies so we /2
hist(colMeans(dos)/2, breaks=101, main="panmictic pop with sim.genot", xlab="Allele Freq.")
par(mfrow=c(1,1)) # back to single plots
The allelic frequency distribution obtained from ms
is
more similar to the distribution observed from real human data (a
L-shaped distribution, with many rare and few very common alleles). This
indicates that the mutation model used in ms
, the Infinite
Site Model (ISM), where new mutations occur at unmutated places in the
genome, is probably closer to what is happening in reality than the K
allele model (KAM) used in hierfstat::sim.genot
, where
mutations occur recurrently at the same site, and the initial
distribution of frequencies is drawn from a beta
distribution with parameter \(4N\mu\).
This latter mutation model leads to a U-shaped distribution. We will see
the \(\beta\) distribution in more
details later in this course
JGTeach::comp.ci.binom
, compare the accuracy of
the Wald, binomial, and bootstrap confidence intervals:Focusing on the Wald and exact confidence intervals, infer a rule for when it is not appropriate to use the Wald confidence interval.
A “rule” is difficult to infer, but the product of the population size times the frequency \(N \times p\) is important. Wald intervals often differ from the exact confidence interval when \(Np \leq 5\), and also include negative values, which is not possible for a proportion.
R
designates the count of reference alleles
and A
the count of alternate alleles. When is the allelic
frequency variance the same as binomial variance? Discuss the possible
shortcomings of using the binomial variance instead of the true
variance.RR | RA | AA |
---|---|---|
100 | 200 | 100 |
150 | 100 | 150 |
200 | 0 | 200 |
4 | 72 | 324 |
22 | 36 | 342 |
40 | 0 | 360 |
tab <- matrix(c(100,200,100,150,100,150,200,
0,200,4,72,324,22,36,342,40,0,360),ncol=3,byrow=TRUE)
tab <- data.frame(tab)
names(tab) <- c("RR","RA","AA")
tab$PR <- (tab[,1]+tab[,2]/2)/400 # PR for frequency of reference
tab$f <- 1-tab[,2]/400/(2*tab$PR*(1-tab$PR)) # f for inbreeding coefficient
tab$Vb<- tab$PR*(1-tab$PR)/2/400 #Vb for binomial variance
tab$V <- tab$PR*(1-tab$PR)*(1+tab$f)/2/400 # V for variance from genotype counts
kable(round(tab,digits=5))
RR | RA | AA | PR | f | Vb | V |
---|---|---|---|---|---|---|
100 | 200 | 100 | 0.5 | 0.0 | 0.00031 | 0.00031 |
150 | 100 | 150 | 0.5 | 0.5 | 0.00031 | 0.00047 |
200 | 0 | 200 | 0.5 | 1.0 | 0.00031 | 0.00062 |
4 | 72 | 324 | 0.1 | 0.0 | 0.00011 | 0.00011 |
22 | 36 | 342 | 0.1 | 0.5 | 0.00011 | 0.00017 |
40 | 0 | 360 | 0.1 | 1.0 | 0.00011 | 0.00023 |
The allelic frequency variance \(V\) is the same as the binomial variance \(V_b=p(1-p)/(2n)\) only when \(f=0\). If \(f>0\) we will underestimate allele frequency variance when using the binomial variance, and overestimate it when \(f<0\). Many research papers report binomial variance rather than allele frequency variance.
ms
or the file pan.txt
provided, plot the SNPs heterozygosity against the frequency of the
alternate allele, and add the line of expected heterozygosity.#read pan.txt into a bed object
pan <- ms2bed("https://www2.unil.ch/popgen/teaching/SISGData/pan.txt")
#plot snps hz against p
plot(pan@p, pan@snps$hz, col="black", pch=16, cex=0.4, ylab='SNP heterozygosity (Het)',
xlab='allelic frequency (p)')
#add expected prop of heterozygotes given the frequency
p <- 0:1000/1000 #same as seq(0,1,by=0.001)
lines(x=p,y=2*p*(1-p),col="orange")
Describe what you see.
We see observed loci heterozygosities (black dots) fitting quite tightly to the expected heterozygosities (the orange line).
Expected heterozygosity is maximum at \(0.5\) when both allele are equifrequent, and decreases in parabolic fashion to \(0\) as allele frequencies goes to 0 or 1.
What would the figure look like if there was only 50 individuals in the sample?
pan50 <- pan[1:50,] # get only 50 individuals
plot(x=pan50@p, y=pan50@snps$hz,col="black",pch=20,cex=0.8, ylab='SNP heterozygosity',
xlab='allelic frequency')
lines(p,2*p*(1-p),col="orange")
The original dataset contains 500 individuals. With only 50 individuals, the fit between observed loci heterozygosities (the black dots) and expected heterozygosities (the orange line) is not as tight. This is expected because as sample size decreases, variance increases as we saw before.
END OF PRACTICAL 1.