Genome-Wide Association Studies
Genome-wide association studies (GWAS) aim to associate one or several phenotypes with a large panel of genotypes measured in the same population. The most commonly investigated genotypes are single nucleotide polymorphisms (SNPs), which are common genetic variants (typically with a minor allele frequency of at least 1% in a given population). The standard approach for testing for an association is to use the genotype, coded as the dosage (0, 1 or 2) of the less frequent allele (the so-called “minor allele”), as the feature and the phenotype as the response variable in a regression model. Continuously distributed phenotypes are tested using linear regression and are often “qq-normalised” beforehand, i.e. transformed into a normally distributed variable. Discrete phenotypes, such as disease states, are tested for association with the genotype using logistic regression. For both linear and logistic regression it is common to include covariates when estimating the effects. Typical covariates for GWAS of human phenotypes are age, sex and the principal components of the entire genotypic profile, which serve as a proxy for population stratification. The regression estimates the SNP-wise effect β and its standard error (ste). Under the null hypothesis of no association, the ratio β/ste is t-distributed (and approximately normally distributed for large samples). Since the standard error is proportional to the residual standard deviation divided by the square root of the sample size, it can always be made smaller by increasing the sample size, leading to larger t-statistics if the effect is non-zero. Thus, with ever-growing cohorts, some of which have sample sizes approaching one million, it is in principle possible to detect very small effects.
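The single-SNP test described above can be illustrated with a minimal sketch in Python, using only the standard library. It is not the implementation of any particular GWAS tool. The function names (`inverse_normal_transform`, `snp_association`, `simulate`), the simulated data and the chosen effect size are all assumptions made for illustration, and covariates such as age, sex and principal components are left out to keep the example short.

```python
import random
from statistics import NormalDist, mean


def inverse_normal_transform(values):
    """Rank-based ("qq") normalisation: map ranks to standard-normal quantiles."""
    n = len(values)
    std_normal = NormalDist()
    order = sorted(range(n), key=lambda i: values[i])
    transformed = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # (rank - 0.5) / n keeps the quantile strictly inside (0, 1);
        # ties are broken by position in the input
        transformed[idx] = std_normal.inv_cdf((rank - 0.5) / n)
    return transformed


def snp_association(dosage, phenotype):
    """Simple linear regression of phenotype on dosage: returns (beta, ste, t)."""
    n = len(dosage)
    x_bar, y_bar = mean(dosage), mean(phenotype)
    sxx = sum((x - x_bar) ** 2 for x in dosage)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dosage, phenotype))
    beta = sxy / sxx
    intercept = y_bar - beta * x_bar
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(dosage, phenotype))
    ste = (rss / (n - 2) / sxx) ** 0.5  # standard error of the slope
    return beta, ste, beta / ste


def simulate(n, maf=0.3, effect=0.1, seed=1):
    """Hypothetical cohort: dosages drawn at the given minor allele frequency."""
    rng = random.Random(seed)
    dosage = [sum(rng.random() < maf for _ in range(2)) for _ in range(n)]
    phenotype = [effect * g + rng.gauss(0, 1) for g in dosage]
    return dosage, inverse_normal_transform(phenotype)


for n in (500, 8000):
    beta, ste, t = snp_association(*simulate(n))
    print(f"n={n}: beta={beta:.3f} ste={ste:.3f} t={t:.2f}")
```

With the same simulated effect, the larger cohort yields a smaller standard error, which is the mechanism by which growing sample sizes make small effects detectable.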
The genetic component of a complex trait is due to the combination of a large number of small effects, some of which may be additive, while others combine in a non-linear manner, known as epistasis. The proportion of a trait's overall variability (including the environmental part) that is attributable to genetic variability is known as heritability. The additive heritability of a trait can be estimated from its GWAS summary statistics (i.e. SNP-wise effect sizes and their standard errors) using a method known as LD score regression [ref]. A sizable heritability of a phenotype indicates that it has a genetic, and therefore biological, underpinning.
Statistical power is essential for GWAS for two reasons. First, individual SNP-wise effects on complex traits are expected to be small, in particular if the effect influences fitness negatively (even slightly), since natural selection would have purged from the population any allele with a sizable detrimental effect. Second, genome-wide scans today typically test about one million measured SNPs. As a consequence of the large number of tests, significant associations can occur just by chance. For example, when making one million tests under the null hypothesis of there being no real associations, the nominal p-values from these tests are uniformly distributed, and the smallest p-value is expected to be of the order of 10⁻⁶, i.e. one over the number of tests. The most common way to control false positives when testing multiple hypotheses is to apply a so-called “Bonferroni correction”, where only associations with p-values smaller than the nominal significance cutoff (usually 0.05) divided by the number of tests are considered significant. For one million tests this gives a threshold of 0.05/10⁶ = 5·10⁻⁸, which is widely accepted within the GWAS community as the significance threshold for genuine associations.
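The arithmetic behind the threshold, and the claim that the smallest null p-value is of the order of one over the number of tests, can be checked with a small simulation. The sketch below is illustrative only: the function names are hypothetical, the tests are assumed to be independent, and a scan of 1,000 tests is used instead of one million to keep the runtime short.

```python
import random


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def mean_smallest_p(n_tests, n_repeats, seed=0):
    """Average smallest p-value over repeated scans with no true associations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_repeats):
        # under the null hypothesis, p-values are uniform on [0, 1)
        total += min(rng.random() for _ in range(n_tests))
    return total / n_repeats


n_tests = 1000
print("threshold for one million tests:", bonferroni_threshold(0.05, 1_000_000))
print("simulated mean of smallest p-value:", mean_smallest_p(n_tests, 200))
print("expected value 1 / (n_tests + 1):", 1 / (n_tests + 1))
```

For independent uniform p-values the expected minimum is exactly 1/(n + 1), so the simulated average should lie close to the last printed value.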
A significant challenge of GWAS is to interpret the SNP-wise associations. These associations can be seen as pointers to individual nucleotides in the DNA that are candidates for modulating the trait of interest. Yet, there are several difficulties when analysing trait-associated SNPs. First, proximal SNPs are usually not independent, a phenomenon known as “linkage disequilibrium” (LD). As a consequence, one usually finds sizable regions that can contain hundreds of SNPs, all of which are significantly associated with the trait. The differences between the respective p-values are often too small to decide which of the many SNPs is the “lead SNP”, the one with the highest chance of driving the association signal. Moreover, GWAS usually do not include rare genetic variants, which may be the actual causal nucleotides. Some of the rare variants can be imputed from the SNPs, and state-of-the-art GWAS now consider about 10 million imputed genotypes on top of the one million that are measured directly (most commonly using microarrays). As sequencing becomes less and less expensive, we can expect that eventually the complete human genome sequence, including extremely rare or even individual variants, will be available for GWAS.
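The dependence between proximal SNPs is commonly summarised by the squared correlation r² between their genotypes. The following sketch computes r² directly from dosage vectors. It is a simplification: LD is formally defined on haplotypes, so the dosage-based correlation is used here as an assumed approximation. The function name `ld_r2` and the three toy SNPs are hypothetical.

```python
from statistics import mean


def ld_r2(dosage_a, dosage_b):
    """Squared Pearson correlation between two dosage vectors (0, 1 or 2)."""
    a_bar, b_bar = mean(dosage_a), mean(dosage_b)
    cov = sum((a - a_bar) * (b - b_bar) for a, b in zip(dosage_a, dosage_b))
    var_a = sum((a - a_bar) ** 2 for a in dosage_a)
    var_b = sum((b - b_bar) ** 2 for b in dosage_b)
    return cov * cov / (var_a * var_b)


# Toy dosages for eight individuals (illustrative values, not real data)
snp1 = [0, 1, 2, 1, 0, 2, 1, 0]
snp2 = [0, 1, 2, 1, 0, 2, 1, 1]  # differs from snp1 in one individual
snp3 = [1, 0, 1, 2, 0, 0, 2, 1]  # largely unrelated to snp1

print("r2(snp1, snp2):", round(ld_r2(snp1, snp2), 3))
print("r2(snp1, snp3):", round(ld_r2(snp1, snp3), 3))
```

Two SNPs in high r², such as snp1 and snp2 here, carry nearly the same information about the trait, which is why their association p-values are hard to tell apart.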