### TIPS ###
# We first load all the libraries we will need for this practical. 
# Make sure you have those installed 
# If not then please visit https://www2.unil.ch/popgen/teaching/SISG22/
library(gaston)
## Warning: le package 'gaston' a été compilé avec la version R
## 4.2.3
## Warning: le package 'Rcpp' a été compilé avec la version R
## 4.2.3
## Warning: le package 'RcppParallel' a été compilé avec la
## version R 4.2.3
library(hierfstat)
library(JGTeach)

For this practical we will use both simulated pedigrees, simulated genetic data on these pedigrees and the small chunck of chromosome 22 from the 1000 genomes.

Kinship and inbreeding from pedigrees

  1. load a pedigree from a monogamous population using read.table. Explore the pedigree: what are the first, second and third column? How many founders in this pedigree? Find at least two families with sibships of four (just inspect the pedigree).
Code
ped <- read.table("https://www2.unil.ch/popgen/teaching/SISGData/pedmono.txt",header=TRUE)
str(ped)
dim(ped)
#first column is the individual identifier
#second column is the female id 
#third column is the male id 


# number of founders:
# founders are individuals with missing sires and dams

sum(is.na(ped$sire) & is.na(ped$dam))
Answers

The first column is the individual identifier. The second column is the female id (dam is the term used in animal breeding for mother), while the third column is the male id (sire is the term used in animal breeding for father).

The number of founders is obtained from identifying how many members in the pedigree have unidentified parents. We find 20 founders.

individuals with IDs 28 to 32 share the same parents, and thus belong to the same family. Same for individuals 59 to 62.:

ped[28:32,]
##    ind dam sire
## 28  28  14   10
## 29  29  14   10
## 30  30  14   10
## 31  31  14   10
## 32  32  14   10
ped[59:62,]
##    ind dam sire
## 59  59  24   35
## 60  60  24   35
## 61  61  24   35
## 62  62  24   35
#first inds of gen 1:5 are   c(21,47,77,117,159) 
Code
# calculate relatedness matrix
ARM <- JGTeach::pedARM(ped$sire,ped$dam)

# a sibship 
ARM[28:32,28:32] 
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  1.0  0.5  0.5  0.5  0.5
## [2,]  0.5  1.0  0.5  0.5  0.5
## [3,]  0.5  0.5  1.0  0.5  0.5
## [4,]  0.5  0.5  0.5  1.0  0.5
## [5,]  0.5  0.5  0.5  0.5  1.0
ped[28:32,]
##    ind dam sire
## 28  28  14   10
## 29  29  14   10
## 30  30  14   10
## 31  31  14   10
## 32  32  14   10
# relatedness with parents
ARM[28:32,c(ped[28,2],ped[28,3])]
##      [,1] [,2]
## [1,]  0.5  0.5
## [2,]  0.5  0.5
## [3,]  0.5  0.5
## [4,]  0.5  0.5
## [5,]  0.5  0.5
# another example
ARM[59:62,59:62]
##      [,1] [,2] [,3] [,4]
## [1,]  1.0  0.5  0.5  0.5
## [2,]  0.5  1.0  0.5  0.5
## [3,]  0.5  0.5  1.0  0.5
## [4,]  0.5  0.5  0.5  1.0
ARM[59:62,c(ped[59,2],ped[59,3])]
##      [,1] [,2]
## [1,]  0.5  0.5
## [2,]  0.5  0.5
## [3,]  0.5  0.5
## [4,]  0.5  0.5
Answers

The pedigree relatedness among these sibship is 0.5, as is the relatedness of the sibships with their parents.

Sibship 92 to 95 is special in that the relatedness among sibs is larger than 0.5:

#relatedness higher than 0.5 with their siblings
ARM[92:95,92:95]
##         [,1]    [,2]    [,3]    [,4]
## [1,] 1.03125 0.56250 0.56250 0.56250
## [2,] 0.56250 1.03125 0.56250 0.56250
## [3,] 0.56250 0.56250 1.03125 0.56250
## [4,] 0.56250 0.56250 0.56250 1.03125
#and their parents
ARM[92:95,c(ped[92,2],ped[92,3])]
##         [,1]    [,2]
## [1,] 0.59375 0.53125
## [2,] 0.59375 0.53125
## [3,] 0.59375 0.53125
## [4,] 0.59375 0.53125

This is because their parents are related to each other:

#their parents are related
ARM[c(ped[92,2],ped[92,3]),c(ped[92,2],ped[92,3])]
##        [,1]   [,2]
## [1,] 1.1250 0.0625
## [2,] 0.0625 1.0000
Code
hist(mat2vec(ARM), xlab='relatedness')

ped.kin <- hierfstat::grm2kinship(ARM)
# plot kinship with generation delimiters 
image(ped$ind,ped$ind,ped.kin)
genlim<-c(21,47,77,117,159)
#add vertical and horizontal lines between generations
abline(h=genlim-.5,v=genlim-.5)

### TIP
# a different way of plotting it using the corrplot function:
# library(corrplot) ; corrplot(ARM/2, method = 'shade', tl.pos='n', is.corr=F,diag=F)
# notice that the ARM/2 will translate relatedness to kinship but only for the off-diagonal elements
Answers

The histogram of relatedness shows a (small) majority of individuals are unrelated, and most have a relatedness coefficient \(\leq 0.25\). The heatmap shows relatedness to increase from left to right and bottom to top, that is, as pedigree depth increases. From the 1st generation following founders, we see redish blocks along the main diagonal corresponding to sibships, and, as we move right and top, these blocks become reder, meaning that relatedness among sibs increase due to the relatedness of their parents. We see that four generations after founding of the pedigree, most individuals are related. This is due to the small number of founders of this pedigree, and may not happen in larger ones, but it is a genaral observation that as the depth of pedigree increases, individuals of later generations are more related.

Kinship and inbreeding from markers, compared to pedigrees

  1. We have generated genotypes dosage from these individuals. First download, then load this file using readRDS, and compute
Code
genoped <- readRDS("../geno_mono.RDS")
# calculate allele sharing kinship
genoped.Kas <- hierfstat::beta.dosage(genoped,inb=FALSE,Mb=TRUE)
# calculate ratio of averages
genoped.M <- with(genoped.Kas,betas*(1-MB)+MB)
genoped.Kc0 <- JGTeach::Kc0(genoped.M,matching=TRUE)
# calculate average of ratios
bed.genoped <- gaston::as.bed.matrix(genoped)
genoped.std.GRM <- gaston::GRM(bed.genoped,autosome.only=FALSE)

# Plotting
par(mfrow=c(1,3))
plot(mat2vec(ARM)/2,mat2vec(genoped.Kas$betas),
     cex=0.5,col="red",main=expression(K[AS]),
     xlab="ped kinship",ylab="est. kinship", pch=20)
abline(c(0,1))
plot(mat2vec(ARM)/2,mat2vec(genoped.Kc0),
     cex=0.5,col="red",main=expression(K[c0]),
     xlab="ped kinship",ylab="est. kinship", pch=20)
abline(c(0,1))

# gaston::GRM reports relatedness rather than kinship, hence the /2
plot(mat2vec(ARM)/2,mat2vec(genoped.std.GRM)/2,
     cex=0.5,col="red",main="standard kinship",
     xlab="ped kinship",ylab="est. kinship", pch=20)
abline(c(0,1))

par(mfrow=c(1,1))
Answers We first notice that most genomic based estimates are lower than the corresponding pedigree values, as all points are below the \(y=x\) line. We see that genomic estimates are often negative, while all pedigree based one are \(\geq 0\). We also observe that the allele sharing (\(K_{AS}\), left) estimates are less variable than the standard estimates based on ratio of averages (\(K_{c0}\), middle) or average of ratios (standard, right), this later one also showing a lower slope than the other twos.
Code
#we divide by two because we want kinships rather than relatedness
mARM<-mean(mat2vec(ARM)/2)
# (kinship - mean) / (1 - mean)
ARMc<-(ARM/2-mARM)/(1-mARM)

# Plotting 
par(mfrow=c(1,3))
plot(mat2vec(ARMc),mat2vec(genoped.Kas$betas),
     cex=0.5,col="red",main=expression(K[AS]),
     xlab="ped kinship",ylab="est. kinship",pch=20)
abline(c(0,1))
plot(mat2vec(ARMc),mat2vec(genoped.Kc0),
     cex=0.5,col="red",main=expression(K[c0]),
     xlab="ped kinship",ylab="est. kinship",pch=20)
abline(c(0,1))
plot(mat2vec(ARMc),mat2vec(genoped.std.GRM)/2,
     cex=0.5,col="red",main="standard kinship",
     xlab="ped kinship",ylab="est. kinship",pch=20)
abline(c(0,1))

par(mfrow=c(1,1))
Answers

Once rescaled, we observe that the allele sharing estimates \(K_{AS}\) are all very close to the \(y=x\) line, meaning a good correspondence between pedigree and genomic inbreeding. The discrepancies between \(K_{c0}\) and pedigree kinship is reduced, but not completely eliminated, and large variation for specific pedigree values exist. Last, the standard estimate is worse than the two others, and showing a huge variance.

Note that genomic estimates and pedigree measures of kinship will not always closely match, as pedigree measures are an expectation, whereas genomic values are realization of the process, which depend on the vagaries of Mendelian sampling which will be exemplified in genomes with low recombination.

For a further discussion of theses issues, see Goudet etal., 2018 Molecular Ecology and Weir and Goudet, 2017 Genetics, Fig. 5 and followings

kinship and inbreeding in the 1000 genomes

  1. We will now compare these three estimators of genomic kinship using the subset of Chromosome 22 we used in previous practical. Start by downloading this file and computing the three estimators.
Code
ch22 <- read.VCF("../chr22_Mb0_20.recode.vcf.gz")
ch22.M<-readRDS("../matching.ch22.RDS")
# for scaling 
Mb <- mean(mat2vec(ch22.M))

ch22.Kas<-(ch22.M-Mb)/(1-Mb)
# takes a while to run - hang in there 
ch22.Kc0<-JGTeach::Kc0(ch22.M,matching=TRUE) 
ch22.std.GRM<-gaston::GRM(ch22)
Code
#extract self kinship and convert it to inbreeding
ch22.Kas.inb<-diag(ch22.Kas)*2-1/2
diag(ch22.Kas)<-NA
ch22.Kc0.inb<-diag(ch22.Kc0)*2-1/2
diag(ch22.Kc0)<-NA
ch22.std.Inb<-diag(ch22.std.GRM)-1
diag(ch22.std.GRM)<-NA

# sample description file
samp.desc<-read.table("https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel",header=TRUE,stringsAsFactors = TRUE)

#to order samples 
ospp<-with(samp.desc,order(super_pop,pop))
#to plot superpop lines
lsp<-cumsum(table(samp.desc$super_pop))

#png("1kg_ch22chunck_Kas.png")
image(1:2504,1:2504,ch22.Kas[ospp,ospp], main = 'Kas')
abline(h=lsp,v=lsp)

#dev.off()
#png("1kg_ch22chunck_Kc0.png")
image(1:2504,1:2504,ch22.Kc0[ospp,ospp], main = 'Kc0')
abline(h=lsp,v=lsp)

#dev.off()
#png("1kg_ch22chunck_Kstdk.png")
image(1:2504,1:2504,ch22.std.GRM[ospp,ospp]/2, main = 'standard')
abline(h=lsp,v=lsp)

#dev.off()
Answer

The different blocks correspond to continent, organized in order Africa (AFR), America (AMR), East Asia (EAS), Europe (EUR) and South Asia (SAS) from left to right and bottom to top. lighter colors mean lower kinship, reder means higher kinship.

The first heatmap corresponds to \(K_{AS}\) kinship, and shows AFR samples to be lest related among themselves and with all others than samples from the other continents, who show higher kinship, in particular within the EAS, but also the EUR continent. Noteworthy is the high kinship shown for samples belonging to different, non African continents (the large red block at the top right)

The second heatmap corresponds to \(K_{c0}\) and shows a completely different picture: This time, African samples are the most related of all samples, with EAS and EUR showing also high kinship, but lower than among AFR samples. Samples from different continent but AFR do not show the high relatedness shown with \(K_{AS}\).

The third heatmap corresponds to the standard estimates of kinship and does not show any variation.

We will discuss these important results and their implications at the end of the practical session

Code
# index snps with maf > .05 
maf05.indx <- which(ch22@snps$maf > .05)
# make a new GRM with the subseted loci
ch22.std.maf05 <- GRM( ch22[, maf05.indx] )
# diagonal muted
diag(ch22.std.maf05) <- NA
# plot
image(ch22.std.maf05[ospp,ospp]/2, main = 'standard maf05')
abline(h=lsp,v=lsp)

Answer Once filtered at maf \(\geq 0.05\), the standard GRM shows a similar pattern to \(K_{c0}\), demonstrating that the low MAF alleles are indeed affecting the estimates of kinship and should be removed if one wants to use this standard GRM. We advise not to use the standard GRM.

END OF PRACTICAL 3.