Difference between revisions of "Genome Wide Association Studies"

 
(18 intermediate revisions by 3 users not shown)
Line 1: Line 1:
Genome Wide Association Studies (GWAS) search for correlations between genetic markers (usually Single Nucleotide Polymorphisms, or SNPs for short) and any measurable trait in a population of individuals. The motivation is that such associations could provide new candidates for causal variants in genes (or their regulatory elements) that play a role in the phenotype of interest. In the clinical context this may eventually lead to a better understanding of the genetic components of diseases and their risk factors.
 
  
Our current focus is on the Cohorte Lausanne (CoLaus), a population-based sample of more than 6'000 individuals from the Lausanne area. The CoLaus phenotypic dataset includes a large range of measurements, including extensive blood chemistry, anatomic and physiological measures, as well as parameters related to lifestyle and history. Genotypes have been measured for ~500'000 SNPs using Affymetrix 500k SNP arrays. Regressing the various phenotypes onto these SNPs has already revealed a number of highly significant associations (see http://serverdgm.unil.ch/bergmann/CBG_publications.html for our publications).
+
[[Category:Bulletins]]
  
Current GWAS usually include the following steps:
+
<newstitle> First GWAS on Drosophila height published </newstitle>   
* genotype calling from the raw chip-data and basic quality control
+
<teaser>
* principal component analysis (PCA) to detect and possibly correct for population stratification
+
We recently collaborated with the Hafen group in Zurich on a project to identify natural variants impacting size in Drosophila. We found an association in the kek1 locus, a well-characterized growth regulator. Additionally 33 novel loci were validated. The paper is available in
* genotype imputation (using linkage disequilibrium information from HapMap)
+
<a href=http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005616>  PLOS Genetics </a>.
* testing for association between a single SNP and continuous or categorical phenotypes
+
<date>11 Jan 2016 </date>
* global significance analysis and correction for multiple testing
+
</teaser>
* data presentation (e.g. using quantile-quantile and Manhattan plots)
 
* cross-replication and meta-analysis for integration of association data from multiple studies
 
  
From the many GWAS that were performed in recent years it became apparent that even well-powered (meta-)studies with many thousands (and even tens of thousands) of samples could at best identify a few (dozen) candidate loci with highly significant associations. While many of these associations have been replicated in independent studies, each locus explains but a tiny (<1%) fraction of the genetic variance of the phenotype (as predicted from twin-studies). Remarkably, models that pool all significant loci into a single predictive scheme still miss out by at least one order of magnitude in explained variance. Thus, while GWAS already today provide new candidates for disease-associated genes and potential drug targets, very few of the currently identified (sets of) genotypic markers are of any practical use for assessing risk for predisposition to any of the complex diseases that have been studied.
+
== Introduction ==
  
Various solutions to this apparent enigma have been proposed: First, it is important to realize that the expected heritabilities usually have been estimated from twin-studies, often several decades ago. It has been argued that these estimates entail problems of their own, such as: independently raised twins shared a common prenatal environment; they may have undergone intrauterine competition; the mother may be more physically stressed (fewer nutrients); and twins reared apart are difficult to find, and may reflect certain types of environments. Indeed it is important to remember that heritability estimates are always relative to the genetic and environmental factors in the population, and are not absolute measurements of the contribution of genetic and environmental factors to a phenotype. Heritability estimates reflect the amount of variation in genotypic effects compared to variation in environmental effects.
+
Genome-wide association studies (GWAS) aim to associate one or several phenotypes with a large panel of genotypes measured in the same population. The most commonly investigated genotypes are single nucleotide polymorphisms (SNPs), which are common genetic variants (typically with a minor allele frequency of at least 1% in a given population). The standard approach for testing for an association is to use the genotype, coded in terms of the dosage (0, 1 or 2) of the less frequent allele (the so-called “minor allele”), as the feature and the phenotype as the response variable within a regression model. Continuously distributed phenotypes are often “qq-normalised”, i.e. transformed into a normally distributed variable. Discrete phenotypes, such as disease states, are tested for association with the genotype using logistic regression. For both linear and logistic regression it is common to include covariates when estimating the effects. Typical covariates for GWAS of human phenotypes are age, sex and the principal components of the entire genotypic profile, which serve as a proxy for population stratification. The regression estimates the SNP-wise effect β and its standard error (ste). The ratio β/ste is t-distributed under the null hypothesis. Since the standard error shrinks in proportion to one over the square root of the sample size, it can always be made smaller by increasing the sample size, leading to larger t-statistics if the effect is non-zero. Thus with ever-growing cohorts, some of which have sample sizes getting close to one million, it is in principle possible to detect very small effects.
  
Second, the genotypic information is still incomplete. Most analyses used microarrays probing only around half a million SNPs, which is almost one order of magnitude less than the current estimates of about 4 million common variants from the HapMap CEU panel (ref). While many of these SNPs can be imputed accurately using information on linkage disequilibrium, there still remains a significant fraction of SNPs which are poorly tagged by the measured SNPs. Furthermore, rare variants with a Minor Allele Frequency (MAF) of less than 1% are not assessed at all with SNP-chips, but may nevertheless be the causal agents for many phenotypes [ref]. Finally, other genetic variants like Copy Number Variations (CNVs) (or even epigenetics) may also play an important role.
+
The genetic component of a complex trait is due to the combination of a large number of small effects, some of which may be additive, while others combine in a non-linear manner, known as epistasis. The proportion of a trait's overall variability (including the environmental part) that is due to genetic variability is known as heritability. The additive heritability of a trait can be estimated from its GWAS summary statistics (i.e. SNP-wise effect sizes and their standard errors) using a method known as LD score regression [ref]. A sizable heritability of any phenotype is a sign of it having a genetic and therefore biological underpinning.
  
Third, it is important to realize that current analyses usually only employ additive models considering one SNP at a time with few, if any, co-variables, like sex, age and principal components reflecting population substructures. This obviously only covers a small set of all possible interactions between genetic variants and the environment. Even more challenging is taking into account purely genetic interactions, since already the number of all possible pair-wise interactions scales as the number of genetic markers squared.
+
Statistical power is essential for GWAS for two reasons. First, individual SNP-wise effects of complex traits are expected to be small, in particular if the effect influences fitness (even slightly) negatively, since any allele with a sizable detrimental effect would have been purged from the population by natural selection. Second, genome-wide scans today typically test about one million measured SNPs. As a consequence of the large number of tests, nominally significant associations can occur just by chance. For example, when making one million tests, under the null hypothesis of there being no real associations, the nominal p-values from these tests are uniformly distributed, and the smallest p-value is expected to be of the order of 10⁻⁶, i.e. one over the number of tests. The most common way to control false positives when testing multiple hypotheses is to apply a so-called “Bonferroni correction”, where only associations with p-values smaller than the nominal significance cutoff (usually 0.05) divided by the number of tests are considered to be significant. Thus a Bonferroni significance threshold of 5·10⁻⁸ is widely accepted within the GWAS community for declaring genuine associations.
  
== Further reading ==
+
A significant challenge of GWAS is to interpret the SNP-wise associations. These associations can be seen as pointers to individual nucleotides in the DNA that are candidates for modulating the trait of interest. Yet, there are several difficulties when analysing trait-associated SNPs. First, proximal SNPs are usually not independent, a phenomenon known as “linkage disequilibrium” (LD). As a consequence one usually finds sizable regions that can contain hundreds of SNPs, which are all significantly associated with the trait. The differences between the respective p-values are often too small to decide which of the many SNPs is the “lead SNP”, the one with the highest chance of driving the association signal. Moreover, GWAS usually do not include rare genetic variants, which may be the actual causal nucleotides. Some of the rare variants can be imputed from the SNPs, and state-of-the-art GWAS now consider about 10 million imputed genotypes on top of the one million that are measured directly (most commonly using microarrays). As sequencing is becoming less and less expensive, we can expect that eventually the complete human sequence, including extremely rare or even individual variants, will be available for GWAS.
 
 
For an introduction to GWAS, with an emphasis on human studies, you could start with a nice tutorial article <cite>BaldingTutorial</cite>, and a review of more recent issues <cite>McCarthyReview</cite>. There is also a nice review about approaches for rodent studies <cite>FlintReview</cite>.
 
 
 
== More Advanced Statistical Methodology ==
 
 
 
An important and widely used approach to dealing with cryptic population structure is described in <cite>PricePC</cite>; key references on genotype imputation are <cite>ServinImputation</cite><cite>MarchiniImputation</cite>.
 
 
 
A powerful approach to deal with strain structure or relatedness between individuals is described in <cite>KangEMMA</cite>.
 
 
 
== Software ==
 
 
 
[http://pngu.mgh.harvard.edu/~purcell/plink PLINK] is an excellent data handling tool, and  
 
implements many useful statistical methods.  It's the Swiss Army Knife for GWAS.
 
 
 
[http://genepath.med.harvard.edu/~reich/Software.htm EIGENSOFT] is widely used for population structure analysis and correction.
 
 
 
[http://www.stats.ox.ac.uk/%7Emarchini/software/gwas/gwas.html IMPUTE and SNPTEST],
 
or
 
[http://www.sph.umich.edu/csg/abecasis/mach MACH] and
 
[http://mga.bionet.nsc.ru/%7Eyurii/ABEL ProbABEL], or [http://stephenslab.uchicago.edu/software.html BimBam], can all be used to perform more sophisticated model-based genotype imputation and association testing.
 
 
 
[http://toby.freeshell.org/software/quicktest.shtml QUICKTEST] is
 
our own software for association testing using uncertain genotypes.  For quantitative trait analysis, we think it is faster and better than SNPTEST.
 
 
 
== References ==
 
 
 
<biblio>
 
# BaldingTutorial pmid=16983374
 
# McCarthyReview pmid=18398418
 
# FlintReview pmid=15803197
 
# PricePC pmid=16862161
 
# ServinImputation pmid=17676998
 
# MarchiniImputation pmid=17572673
 
# KangEMMA pmid=18385116
 
</biblio>
 

Latest revision as of 16:57, 22 May 2021



Introduction

Genome-wide association studies (GWAS) aim to associate one or several phenotypes with a large panel of genotypes measured in the same population. The most commonly investigated genotypes are single nucleotide polymorphisms (SNPs), which are common genetic variants (typically with a minor allele frequency of at least 1% in a given population). The standard approach for testing for an association is to use the genotype, coded in terms of the dosage (0, 1 or 2) of the less frequent allele (the so-called “minor allele”), as the feature and the phenotype as the response variable within a regression model. Continuously distributed phenotypes are often “qq-normalised”, i.e. transformed into a normally distributed variable. Discrete phenotypes, such as disease states, are tested for association with the genotype using logistic regression. For both linear and logistic regression it is common to include covariates when estimating the effects. Typical covariates for GWAS of human phenotypes are age, sex and the principal components of the entire genotypic profile, which serve as a proxy for population stratification. The regression estimates the SNP-wise effect β and its standard error (ste). The ratio β/ste is t-distributed under the null hypothesis. Since the standard error shrinks in proportion to one over the square root of the sample size, it can always be made smaller by increasing the sample size, leading to larger t-statistics if the effect is non-zero. Thus with ever-growing cohorts, some of which have sample sizes getting close to one million, it is in principle possible to detect very small effects.
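
The single-SNP test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of any particular GWAS tool: the function name `snp_association`, the simulated cohort, the allele frequency and the effect sizes are all hypothetical, and ordinary least squares is used for a continuous phenotype.

```python
import numpy as np

def snp_association(dosage, phenotype, covariates=None):
    """Regress a continuous phenotype on minor-allele dosage (0, 1 or 2)
    plus optional covariates using ordinary least squares.
    Returns (beta, se, t) for the SNP term."""
    y = np.asarray(phenotype, dtype=float)
    g = np.asarray(dosage, dtype=float)
    n = y.shape[0]
    cols = [np.ones(n), g]                      # intercept and SNP dosage
    if covariates is not None:
        cov = np.asarray(covariates, dtype=float).reshape(n, -1)
        cols.extend(cov.T)                      # one column per covariate
    X = np.column_stack(cols)
    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = n - X.shape[1]                        # residual degrees of freedom
    sigma2 = resid @ resid / dof                # residual variance estimate
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)  # covariance of the estimates
    beta = coef[1]
    se = np.sqrt(cov_beta[1, 1])
    return beta, se, beta / se

# Hypothetical simulated cohort (all numbers are assumptions for illustration).
rng = np.random.default_rng(0)
n = 5000
dosage = rng.binomial(2, 0.3, size=n)           # minor allele frequency 0.3
age = rng.normal(50, 10, size=n)
sex = rng.integers(0, 2, size=n)
phenotype = 0.1 * dosage + 0.02 * age + 0.3 * sex + rng.normal(size=n)

beta, se, t = snp_association(dosage, phenotype, np.column_stack([age, sex]))
print(f"beta = {beta:.3f}, se = {se:.3f}, t = {t:.2f}")
```

Rerunning the simulation with a larger `n` shrinks `se` roughly as one over the square root of `n`, which is why very large cohorts can detect very small effects.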

The genetic component of a complex trait is due to the combination of a large number of small effects, some of which may be additive, while others combine in a non-linear manner, known as epistasis. The proportion of a trait's overall variability (including the environmental part) that is due to genetic variability is known as heritability. The additive heritability of a trait can be estimated from its GWAS summary statistics (i.e. SNP-wise effect sizes and their standard errors) using a method known as LD score regression [ref]. A sizable heritability of any phenotype is a sign of it having a genetic and therefore biological underpinning.
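
In symbols, the definition of heritability given above can be written as follows. The notation is an assumption introduced here for illustration (it does not appear in the original text), and the decomposition presumes that genetic and environmental contributions are uncorrelated:

```latex
% sigma_G^2: genetic variance, sigma_E^2: environmental variance,
% sigma_A^2: additive part of the genetic variance (assumed notation)
\sigma_P^2 = \sigma_G^2 + \sigma_E^2, \qquad
H^2 = \frac{\sigma_G^2}{\sigma_P^2}, \qquad
h^2 = \frac{\sigma_A^2}{\sigma_P^2} \le H^2
```

Here H² is the overall (broad-sense) heritability and h² the additive (narrow-sense) heritability, the quantity that additive summary-statistic methods aim to estimate.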

Statistical power is essential for GWAS for two reasons. First, individual SNP-wise effects of complex traits are expected to be small, in particular if the effect influences fitness (even slightly) negatively, since any allele with a sizable detrimental effect would have been purged from the population by natural selection. Second, genome-wide scans today typically test about one million measured SNPs. As a consequence of the large number of tests, nominally significant associations can occur just by chance. For example, when making one million tests, under the null hypothesis of there being no real associations, the nominal p-values from these tests are uniformly distributed, and the smallest p-value is expected to be of the order of 10⁻⁶, i.e. one over the number of tests. The most common way to control false positives when testing multiple hypotheses is to apply a so-called “Bonferroni correction”, where only associations with p-values smaller than the nominal significance cutoff (usually 0.05) divided by the number of tests are considered to be significant. Thus a Bonferroni significance threshold of 5·10⁻⁸ is widely accepted within the GWAS community for declaring genuine associations.
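
The multiple-testing argument above can be checked with a small simulation. This is a sketch under the assumption that the tests are independent and that no true association exists; the helper name `bonferroni_threshold` and the random seed are arbitrary choices made for this example.

```python
import numpy as np

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

n_tests = 1_000_000
threshold = bonferroni_threshold(0.05, n_tests)   # 0.05 / 10^6 = 5e-8

# Under the null hypothesis p-values are uniformly distributed on [0, 1].
rng = np.random.default_rng(1)
null_pvalues = rng.uniform(size=n_tests)

smallest = null_pvalues.min()                     # expected order: 1 / n_tests
n_nominal = int((null_pvalues < 0.05).sum())      # chance hits at nominal 0.05
n_bonferroni = int((null_pvalues < threshold).sum())

print(f"Bonferroni threshold: {threshold:.1e}")
print(f"smallest null p-value: {smallest:.1e}")
print(f"hits at p < 0.05: {n_nominal}, hits at Bonferroni threshold: {n_bonferroni}")
```

At the nominal cutoff of 0.05 roughly 5% of the null tests, i.e. tens of thousands, pass by chance alone, whereas the corrected threshold lets through a false positive in only about one in twenty such null scans.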

A significant challenge of GWAS is to interpret the SNP-wise associations. These associations can be seen as pointers to individual nucleotides in the DNA that are candidates for modulating the trait of interest. Yet, there are several difficulties when analysing trait-associated SNPs. First, proximal SNPs are usually not independent, a phenomenon known as “linkage disequilibrium” (LD). As a consequence one usually finds sizable regions that can contain hundreds of SNPs, which are all significantly associated with the trait. The differences between the respective p-values are often too small to decide which of the many SNPs is the “lead SNP”, the one with the highest chance of driving the association signal. Moreover, GWAS usually do not include rare genetic variants, which may be the actual causal nucleotides. Some of the rare variants can be imputed from the SNPs, and state-of-the-art GWAS now consider about 10 million imputed genotypes on top of the one million that are measured directly (most commonly using microarrays). As sequencing is becoming less and less expensive, we can expect that eventually the complete human sequence, including extremely rare or even individual variants, will be available for GWAS.
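
To make the notion of linkage disequilibrium concrete, the following sketch measures LD between two SNPs as the squared Pearson correlation of their dosages. This is a common approximation to the haplotype-based r², used here as an assumption. The function name `ld_r2` and the simulated haplotypes are hypothetical: SNP B copies the allele of SNP A on 90% of haplotypes, while SNP C is drawn independently.

```python
import numpy as np

def ld_r2(dosage_a, dosage_b):
    """Squared Pearson correlation between two dosage vectors,
    used here as an approximate measure of LD (r^2)."""
    a = np.asarray(dosage_a, dtype=float)
    b = np.asarray(dosage_b, dtype=float)
    r = np.corrcoef(a, b)[0, 1]
    return r * r

# Hypothetical simulation: two haplotypes per individual.
rng = np.random.default_rng(2)
n = 2000
hap_a = rng.binomial(1, 0.3, size=(n, 2))
copied = rng.random((n, 2)) < 0.9               # B copies A on 90% of haplotypes
hap_b = np.where(copied, hap_a, rng.binomial(1, 0.3, size=(n, 2)))
hap_c = rng.binomial(1, 0.3, size=(n, 2))       # C is independent of A

dos_a = hap_a.sum(axis=1)                       # dosages in {0, 1, 2}
dos_b = hap_b.sum(axis=1)
dos_c = hap_c.sum(axis=1)

r2_ab = ld_r2(dos_a, dos_b)
r2_ac = ld_r2(dos_a, dos_c)
print(f"r2(A, B) = {r2_ab:.2f}, r2(A, C) = {r2_ac:.3f}")
```

If SNP A were causal, SNP B would show a nearly equally strong association with the trait, which is why p-values alone often cannot single out the lead SNP within an associated region.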