download the first 20 megabytes of chromosome 22 (this is a subset of chromosome 22, data from all chromosomes can be downloaded from the 1000 genome download site. From a unix command prompt, enter wget url
; from a browser, just download the file.
Import this file in R using package hierfstat
(check the packages vignettes, import ) and its function read.VCF
using the following commands.:
library(hierfstat)
ch22<-read.VCF("chr22_Mb0_20.recode.vcf.gz")
How many individuals in the data set? How many SNPs? Explore the structure of the ch22
object with str(ch22)
. Describe the different components
for answers, see section 2.3 of gaston vignette
#ch22 matrix with nrows inds and ncol snps
dim(ch22)
str(ch22)
bed
file containing all biallelic SNPs on this chromosome, excluding alleles other than A,T,G,C,a,t,g,c
. For this, assuming your file is names chr22.vcf.gz
type the following commands:%%plink2 is essential for the --max-alleles option to work
plink2 --vcf chr22.vcf.gz --make-bed --snps-only just-acgt
--max-alleles 2 --out chr22.1kg
You can explore PLINK website to discover some of its capabilities.
R
and gaston
, using the following commands from the R
promptlibrary(gaston)
chr22<-read.bed.matrix("chr22.1kg")
SNPRelate
, convert the chromosome 22 VCF file into a GDS file, and open this file.library(SNPRelate)
snpgdsVCF2GDS("chr22.vcf.gz","chr22.gds")
f<-snpgdsOpen("chr22.gds")
samp.desc.fname<-"integrated_call_samples_v3.20130502.ALL.panel"
ftp.path<-"ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/"
samp.desc.url<-paste0(ftp.path,samp.desc.fname)
onekg.desc<-read.table(samp.desc.url,header=TRUE,stringsAsFactors = TRUE)
#checks that the order of samples in bed file and description file are the same
all.equal(as.character(ch22@ped$id),as.character(onekg.desc$sample))
boxplot(ch22@ped$hz~onekg.desc$super_pop)
#and per population
boxplot(ch22@ped$hz~onekg.desc$pop,las=2)
#same, sorted by continent
boxplot(ch22@ped$hz~with(onekg.desc,factor(super_pop:pop)),las=2)
ch22@snps$hz
in the ch22
object. How does it differ from ch22@ped$hz
?hierfstat
and its function sim.genot
(read its help page), generate a data set with the genotypes of \(50\) sampled individuals at \(100\) biallelic loci from each of \(4\) populations, where the populations are made of \(1000\) individuals and exchange migrants at a rate of \(0.001\). Leave mutation rate and \(f\) to their default. From the R
prompt:library(hierfstat)
#precede the function name with ? to get help, e.g. ?sim.genot
dat<-sim.genot(nbpop=4,nbloc=100,nbal=2,size=50,N=1000,mig=0.001)
#dim to get the number of rows and columns
dim(dat)
#column names
head(names(dat))
#structure of the data set
str(dat)
hierfstat::biall2dos
to convert dat
to dosage format:dos<-biall2dos(dat[,-1])
str(dos)
Why the [,-1]
one in the previous command?
Could fstat
format dataset with more than 2 alleles be converted to dosage format? To answer this, browse hierfstat
help
[optional] explore hierfstat
help and identify other functions allowing to simulate genotypic data. Describe them briefly
ms
or mspms
(outside R
, read the manual), simulate a similar dataset (you might want more loci), and import it into R using the function hierfstat::ms2bed
. To simulate the data, issue at the command prompt (outside R
) the following command:ms 400 2 -s 50 -r 40 100000 -I 4 100 100 100 100 4 > islM1.txt
islm1<-ms2bed("islM1.txt")
islm1
object. How many individuals? How many loci?#str(islm1)
dim(islm1)
library(gaston)
library(hierfstat)
library(JGTeach)
ms
or the file pan.txt provided, produce a histogram of allele frequenciespan<-ms2bed("pan.txt")
hist(pan@p,breaks=101)
ms
simulations we did during practical 1; and to the distribution of allelic frequencies generated with sim.genot
. What can you say about how realistic these distributions are?ch22<-read.VCF("chr22_Mb0_20.recode.vcf.gz")
samp.desc.file<-"https://www2.unil.ch/popgen/teaching/SISG18/integrated_call_samples_v3.20130502.ALL.panel"
samp.desc<-read.table(samp.desc.file,header=TRUE)
names(samp.desc)
str(samp.desc)
AFR<-which(samp.desc$super_pop=="AFR")
EAS<-which(samp.desc$super_pop=="EAS")
par(mfrow=c(2,2))
#AFR hist
hist(ch22[AFR,]@p,breaks=101,main="AFR",xlab="Allele Freq.")
#EAS hist
hist(ch22[EAS,]@p,breaks=101,main="EAS",xlab="Allele Freq.")
#PAN ms hist
hist(pan@p,breaks=101,main="panmictic pop with ms",xlab="Allele Freq.")
#simulate data from panmictic pop with sim.genot
dat<-sim.genot(nbal=2,nbpop=1,size=100,nbloc=10000)
#convert to dosage
dos<-biall2dos(dat[,-1])
#colMeans will give twice the frequencies
#->colMean(dos)/2 gives the frequencies
hist(colMeans(dos)/2,breaks=101,main="panmictic pop with sim.genot",xlab="Allele Freq.")
par(mfrow=c(1,1))
JGTeach::comp.ci.binom
, compare the accuracy of the Wald, binomial, and bootstrap confidence intervals:Focusing on the Wald and exact confidence intervals, infer a rule for when it is not appropriate to use the Wald confidence interval.
R
designates the count of reference alleles and A
the count of alternate alleles. When is the allelic frequency variance the same as binomial variance? Discuss the possible shortcomings of using the binomial variance instead of the true variance.\[ \begin{array}{ccccc} \hline RR & RA & AA & p_R & V \\ 100 & 200 & 100 && \\ 150 & 100 & 150 && \\ 200 & 0 & 200 && \\ 4 & 72 & 324 && \\ 22 & 36 & 342 && \\ 40 & 0 & 360 && \\ \hline \end{array} \]
tab<-matrix(c(100,200,100,150,100,150,200,
0,200,4,72,324,22,36,342,40,0,360),ncol=3,byrow=TRUE)
pr<-(tab[,1]+tab[,2]/2)/400 #freq of reference
f<-1-tab[,2]/400/(2*pr*(1-pr)) # inbreeding coefficient f
v<-pr*(1-pr)*(1+f)/2/400 # variance from genotype counts
round(cbind(tab,pr,f,v),digits=5)
ms
or the file pan.txt provided, plot the SNPs heterozygosity against the frequency of the alternate allele, and add the line of expected heterozygosity.#read pan.txt into a bed object
pan<-ms2bed("https://www2.unil.ch/popgen/teaching/SISGData/pan.txt")
#plot snps hz against p
plot(pan@p,pan@snps$hz,col="red",pch=16,cex=0.3)
#add expected prop of heterozygotes given the frequency
x<-0:1000/1000
lines(x,2*x*(1-x),col="blue")
Describe what you see.
What would the figure look like if there was only 50 individuals in the sample?
pan50<-pan[1:50,]
plot(pan50@p,pan50@snps$hz,col="red",pch=16,cex=0.3)
lines(x,2*x*(1-x),col="blue")