Importing data into R

  1. download the first 20 megabytes of chromosome 22 (this is a subset of chromosome 22, data from all chromosomes can be downloaded from the 1000 genome download site. From a unix command prompt, enter wget url; from a browser, just download the file.

  2. Import this file in R using package hierfstat (check the packages vignettes, import ) and its function read.VCF using the following commands.:

library(hierfstat)
ch22<-read.VCF("chr22_Mb0_20.recode.vcf.gz")

How many individuals in the data set? How many SNPs? Explore the structure of the ch22 object with str(ch22). Describe the different components

for answers, see section 2.3 of gaston vignette

#ch22 matrix with nrows inds and ncol snps
dim(ch22)
str(ch22)
  1. [optional] Other ways of importing VCF files into R
%%plink2 is essential for the --max-alleles option to work

plink2 --vcf chr22.vcf.gz --make-bed  --snps-only just-acgt
--max-alleles 2 --out chr22.1kg

You can explore PLINK website to discover some of its capabilities.

library(gaston)
chr22<-read.bed.matrix("chr22.1kg")
library(SNPRelate)
snpgdsVCF2GDS("chr22.vcf.gz","chr22.gds")
f<-snpgdsOpen("chr22.gds")
  1. Show boxplots of individual heterozygosities as a function of their continent of origin (you will have to load the sample description for this, and make sure that the order of samples is the same in the bed file and the description file). This can be done with the following commands:
samp.desc.fname<-"integrated_call_samples_v3.20130502.ALL.panel"
ftp.path<-"ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/"
samp.desc.url<-paste0(ftp.path,samp.desc.fname)
onekg.desc<-read.table(samp.desc.url,header=TRUE,stringsAsFactors = TRUE)
#checks that the order of samples in bed file and description file are the same
all.equal(as.character(ch22@ped$id),as.character(onekg.desc$sample))
boxplot(ch22@ped$hz~onekg.desc$super_pop)
#and per population
boxplot(ch22@ped$hz~onekg.desc$pop,las=2)
#same, sorted by continent
boxplot(ch22@ped$hz~with(onekg.desc,factor(super_pop:pop)),las=2)

Simulating Genetic/Genomic data

  1. Simulating genetic data: using hierfstat and its function sim.genot (read its help page), generate a data set with the genotypes of \(50\) sampled individuals at \(100\) biallelic loci from each of \(4\) populations, where the populations are made of \(1000\) individuals and exchange migrants at a rate of \(0.001\). Leave mutation rate and \(f\) to their default. From the R prompt:
library(hierfstat)
#precede the function name with ? to get help, e.g. ?sim.genot

dat<-sim.genot(nbpop=4,nbloc=100,nbal=2,size=50,N=1000,mig=0.001)
#dim to get the number of rows and columns
dim(dat)

#column names
head(names(dat))

#structure of the data set

str(dat)
dos<-biall2dos(dat[,-1])
str(dos)
  1. [optional] Using ms or mspms (outside R, read the manual), simulate a similar dataset (you might want more loci), and import it into R using the function hierfstat::ms2bed. To simulate the data, issue at the command prompt (outside R) the following command:
ms 400 2 -s 50 -r 40 100000 -I 4 100 100 100 100 4 > islM1.txt 
  1. using the file you have just generated or a similar one from the course website (in which case, you’ll have to download it first) load it into R:
islm1<-ms2bed("islM1.txt")
#str(islm1)
dim(islm1)

Allelic frequencies

library(gaston)
library(hierfstat)
library(JGTeach)
  1. Using either data from a panmictic population you have simulated with ms or the file pan.txt provided, produce a histogram of allele frequencies
pan<-ms2bed("pan.txt")
hist(pan@p,breaks=101)
  1. Produce a histogram of SNP allele frequencies of the first 20 megabases of chromosome 22 we imported yesterday, from the African samples, and then from the East Asian samples. Describe the shape of the distribution. How can this be explained? Compare this distribution to the distribution of allele frequencies from the ms simulations we did during practical 1; and to the distribution of allelic frequencies generated with sim.genot. What can you say about how realistic these distributions are?
ch22<-read.VCF("chr22_Mb0_20.recode.vcf.gz")
samp.desc.file<-"https://www2.unil.ch/popgen/teaching/SISG18/integrated_call_samples_v3.20130502.ALL.panel"
samp.desc<-read.table(samp.desc.file,header=TRUE)
names(samp.desc)
str(samp.desc)

AFR<-which(samp.desc$super_pop=="AFR")
EAS<-which(samp.desc$super_pop=="EAS")
par(mfrow=c(2,2))
#AFR hist
hist(ch22[AFR,]@p,breaks=101,main="AFR",xlab="Allele Freq.")
#EAS hist
hist(ch22[EAS,]@p,breaks=101,main="EAS",xlab="Allele Freq.")
#PAN ms hist
hist(pan@p,breaks=101,main="panmictic pop with ms",xlab="Allele Freq.")
#simulate data from panmictic pop with sim.genot
dat<-sim.genot(nbal=2,nbpop=1,size=100,nbloc=10000)
#convert to dosage
dos<-biall2dos(dat[,-1])
#colMeans will give twice the frequencies
#->colMean(dos)/2 gives the frequencies
hist(colMeans(dos)/2,breaks=101,main="panmictic pop with sim.genot",xlab="Allele Freq.")
par(mfrow=c(1,1))
  1. using the shiny app BinomNorm (based on function JGTeach::comp.ci.binom, compare the accuracy of the Wald, binomial, and bootstrap confidence intervals:

Focusing on the Wald and exact confidence intervals, infer a rule for when it is not appropriate to use the Wald confidence interval.

  1. Estimate allelic frequency variance from the following genotype counts, where R designates the count of reference alleles and A the count of alternate alleles. When is the allelic frequency variance the same as binomial variance? Discuss the possible shortcomings of using the binomial variance instead of the true variance.

\[ \begin{array}{ccccc} \hline RR & RA & AA & p_R & V \\ 100 & 200 & 100 && \\ 150 & 100 & 150 && \\ 200 & 0 & 200 && \\ 4 & 72 & 324 && \\ 22 & 36 & 342 && \\ 40 & 0 & 360 && \\ \hline \end{array} \]

tab<-matrix(c(100,200,100,150,100,150,200,
      0,200,4,72,324,22,36,342,40,0,360),ncol=3,byrow=TRUE)
pr<-(tab[,1]+tab[,2]/2)/400 #freq of reference
f<-1-tab[,2]/400/(2*pr*(1-pr)) # inbreeding coefficient f
v<-pr*(1-pr)*(1+f)/2/400 # variance from genotype counts
round(cbind(tab,pr,f,v),digits=5)
  1. Using either data from a panmictic population you have simulated with ms or the file pan.txt provided, plot the SNPs heterozygosity against the frequency of the alternate allele, and add the line of expected heterozygosity.
#read pan.txt into a bed object
pan<-ms2bed("https://www2.unil.ch/popgen/teaching/SISGData/pan.txt")
#plot snps hz against p
plot(pan@p,pan@snps$hz,col="red",pch=16,cex=0.3)

#add expected prop of heterozygotes given the frequency
x<-0:1000/1000
lines(x,2*x*(1-x),col="blue")

Describe what you see.

What would the figure look like if there was only 50 individuals in the sample?

pan50<-pan[1:50,]
plot(pan50@p,pan50@snps$hz,col="red",pch=16,cex=0.3)
lines(x,2*x*(1-x),col="blue")