Science


Large-scale data analyses

Classical GWAS studies

The CBG has participated in a large number of genome-wide association studies (GWAS). Most of our studies use data from the Cohorte Lausannoise (CoLaus) focusing on cardiovascular health. We also participated in a few other large clinical studies (e.g. on HCV and narcolepsy), as well as GWAS using data from collections of inbred mice and flies. We have a number of efficient tools to perform standard GWAS and combine the results from others into meta-analyses. A major focus of the lab is the development of new methods for the analysis of GWAS-related data, including:

Methods to improve standard GWAS

We have worked on a number of projects aimed at enhancing the standard GWAS approach. Specifically, we have published

  • on population stratification (“Genes mirror geography within Europe”)
  • on “Methods for Testing Association Between Uncertain Genotypes and Quantitative Traits”
  • on “Identification and validation of copy number variants using SNP genotyping arrays from a large clinical cohort”
  • on detecting pairwise interactions through “FastEpistasis: A high performance computing solution for quantitative trait epistasis”
  • on evaluating their prevalence by “Comparison of strategies to detect epistasis from eQTL data”
  • on a “Novel Method to Estimate the Phenotypic Variation Explained by Genome Wide Association Studies Reveals Large Fraction of the Missing Heritability” (theoretical work investigating possible explanations for the missing variance problem)

Methods processing GWAS results in the context of external data

A major focus of our recent research has been on methods that make use of the vast number of published associations between variants and traits, with the aim of revealing the underlying biology. Specifically, our Pathway scoring algorithm (Pascal), published under “Fast and rigorous computation of gene and pathway scores from SNP-based summary statistics”, computes a score for each gene based on the trait association p-values of all SNPs within a window around the coding region of this gene. Importantly, the score integrates the SNP-wise support for association in a statistically rigorous manner, accounting for the local correlation structure (known as linkage disequilibrium, LD), such that independent support from SNPs or groups of SNPs is weighted more strongly. Pascal avoids the computational burden of permutation tests, allowing it to compute gene scores about 100 times faster and much more accurately. This advance was crucial for the primary goal of Pascal, namely to further integrate these gene scores into pathway scores that reflect the association of a trait with preselected sets of genes. A key advantage of taking into account the signal of gene sets is that such sets may turn out to be significant even if none of the contributing SNP- or gene-wise summary statistics are nominally significant. The ability to compute gene scores extremely fast allowed us to process large collections of annotated gene sets (such as GO, KEGG and other pathway collections) for many traits. We showed that our method is well calibrated in terms of its false discovery rate and typically more sensitive than competing approaches. In particular, Pascal accounts for gene sets containing neighboring genes whose SNP-wise signals may be correlated due to LD, an effect that is usually ignored but can lead to inflated pathway scores.
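To illustrate why accounting for LD matters when combining SNP-wise signals into a gene score, here is a minimal sketch of a sum-of-chi-squares statistic. It is not the computation Pascal performs: a simple moment-matching approximation is swapped in for illustration, and the function name and the toy inputs are hypothetical. The sketch rescales the statistic so that it can be compared to a chi-square distribution with an "effective" number of independent SNPs.

```python
def sum_chi2_gene_score(z_scores, ld):
    """Combine SNP z-scores into one gene-level statistic.

    z_scores: per-SNP association z-scores (hypothetical input format).
    ld:       symmetric SNP-by-SNP correlation (LD) matrix as nested lists.

    Returns (scaled_statistic, effective_df). Under the null hypothesis the
    scaled statistic is approximately chi-square distributed with
    effective_df degrees of freedom (moment-matching approximation).
    """
    m = len(z_scores)
    if len(ld) != m or any(len(row) != m for row in ld):
        raise ValueError("LD matrix must be m x m")

    stat = sum(z * z for z in z_scores)                 # T = sum of z_i^2
    trace_r = sum(ld[i][i] for i in range(m))           # E[T] = tr(R)
    trace_r2 = sum(r * r for row in ld for r in row)    # Var[T] = 2 tr(R^2)

    scale = trace_r2 / trace_r          # matches the variance of T
    effective_df = trace_r ** 2 / trace_r2  # matches the mean of T
    return stat / scale, effective_df


# Two independent SNPs count as two pieces of evidence ...
independent = sum_chi2_gene_score([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]])
# ... whereas two SNPs in perfect LD count as a single one.
redundant = sum_chi2_gene_score([2.0, 2.0], [[1.0, 1.0], [1.0, 1.0]])
```

A p-value would then follow from the chi-square survival function of a statistics library, evaluated at the scaled statistic with the effective degrees of freedom. The approximation is exact in the two toy cases above but only approximate for general LD patterns.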

Methods processing external data in the context of GWAS results

The rapid computation of gene and pathway scores with Pascal also enabled us to leverage the large body of GWAS data for shedding light on other genomic data. In a recent study, “Tissue-specific regulatory circuits reveal variable modular perturbations across complex diseases”, we derived close to 400 cell type- and tissue-specific regulatory networks from FANTOM5 data and asked in which of these networks the genes with high scores for a given disease phenotype had higher connectivity than expected by chance. This mapping of tissues to diseases was consistent with common knowledge, but also implicated some previously unexpected tissues as disease relevant. In particular, we were able to formulate new hypotheses about the relevance of some specific brain tissues for psychiatric disorders.
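The question "is the connectivity higher than expected by chance" can be illustrated with a simple resampling test. The sketch below is an assumption-laden toy, not the analysis of the published study: the null model (uniformly drawn gene sets of the same size), the edge representation, and all function names are hypothetical choices made for this example.

```python
import random
from itertools import combinations


def mean_connectivity(edges, genes):
    """Average edge weight over all gene pairs (a missing edge counts as 0).

    edges: dict mapping a sorted gene pair (a, b), a < b, to a weight.
    """
    pairs = list(combinations(sorted(genes), 2))
    if not pairs:
        return 0.0
    return sum(edges.get(pair, 0.0) for pair in pairs) / len(pairs)


def connectivity_p_value(edges, all_genes, top_genes, n_draws=200, seed=0):
    """Empirical p-value for the connectivity among top-scoring genes.

    Compares the observed connectivity with that of random gene sets of the
    same size, drawn uniformly from all_genes (a simplifying assumption).
    """
    rng = random.Random(seed)
    observed = mean_connectivity(edges, top_genes)
    hits = 0
    for _ in range(n_draws):
        drawn = rng.sample(all_genes, len(top_genes))
        if mean_connectivity(edges, drawn) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_draws + 1)


# Toy network: 20 genes, of which the first four form a fully connected group.
all_genes = [f"g{i:02d}" for i in range(20)]
top_genes = all_genes[:4]
edges = {pair: 1.0 for pair in combinations(top_genes, 2)}
p_value = connectivity_p_value(edges, all_genes, top_genes)
```

In this toy network only the four top-scoring genes are connected, so almost no random gene set reaches the observed connectivity and the empirical p-value is small.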

Furthermore, the capability to rapidly compute whether a given gene set is significantly enriched in GWAS signals across a large panel of traits inspired and enabled us to launch the Disease Module Identification DREAM Challenge (https://synapse.org/modulechallenge). The goal of this challenge was to facilitate a community effort to build and evaluate unsupervised molecular network modularization algorithms (see our preprint “Open Community Challenge Reveals Molecular Network Modules with Key Roles in Diseases”, which has been accepted for publication in Nature Methods). Participants had to dissect a panel of six gene and protein networks into modular components (i.e. gene sets). Our challenge received a lot of interest from the community, with over 400 registered participants and 380 posts in the discussion forum. After multiple training rounds, 43 international teams made valid final submissions including the predicted network modules, detailed method descriptions and code. Our idea was to use a completely novel evaluation strategy, where each submitted module was tested against a large panel of 180 GWAS using the Pascal tool. Thus, rather than testing modules for enrichment in already annotated pathways, we argued that the information contained in the GWAS results would provide an alternative, potentially more comprehensive evaluation of modules. This is because any GWAS is by definition genome-wide, thereby providing information for all genes, while (human) gene/pathway annotations are necessarily biased and include only a subset of all genes. Moreover, gene annotations are often based on data similar to those underlying the networks (e.g., gene expression data or protein interaction data) and therefore do not provide an independent means of validation, while the GWAS data are completely orthogonal to the networks. The predictions submitted by the community revealed 1,632 modules significantly associated with at least one of the phenotypes in our GWAS panel. Further study of these modules showed that not all genes in the gene sets are necessarily associated with the underlying phenotype at genome-wide significance. We make the three top methods for network modularization available in a single toolbox.

Methods integrating molecular phenotypes with GWAS data

Our main advance in this domain is the development of methods for integrating metabolomics with genotypic data. We have access to proton NMR spectra from both urine and serum for about 850 genotyped CoLaus participants. These data were, however, not quantified into metabolite concentrations. This presents a limitation, as previous mGWAS had typically only used metabolite concentrations. In order to avoid the often imperfect identification and quantification of metabolites, we designed a different type of mGWAS, which we called untargeted mGWAS. In this approach, we directly test metabolome features, obtained by simple alignment and normalization of raw spectral data, for association with genetic variants. The key advantage of this design is that it makes full use of all experimental data, because it does not discard data that may have eluded identification. In addition, the approach contains an inherent method of metabolite identification: genetic spiking. The effect of a genetic variant on the concentration of a metabolite tends to translate, in an untargeted mGWAS, to associations with the features that correspond to the NMR spectrum of the metabolite. In certain cases, genetic association can therefore allow for identification of the underlying metabolite. We developed metabomatching to automate metabolite identification by genetic spiking. We used this design for the first CoLaus urine NMR mGWAS, published under “Genome-Wide Association Study of Metabolic Traits Reveals Novel Gene-Metabolite-Disease Links”. Subsequently, we applied metabomatching in a second mGWAS, and formalized the method as a software package (“Metabomatching: Using genetic association to identify metabolites in proton NMR spectroscopy.”). Our software is available both as a standalone package and within PhenoMeNal, a large-scale computing e-infrastructure project for metabolomics of which we are a partner.
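The two steps described above (feature-wise association, then matching the association profile to reference spectra) can be sketched as follows. This is a minimal illustration under stated assumptions, not the metabomatching implementation: the function names, the n times r-squared association score, the 0.03 ppm matching window, the summed-score ranking, and the toy data are all hypothetical choices made for this example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equally long lists (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def pseudospectrum(dosages, features):
    """Association score of one variant with every spectral feature.

    dosages:  allele dosage (0, 1, 2) per individual.
    features: dict mapping chemical shift (ppm) to intensities per individual.
    Returns a dict ppm -> n * r^2 (a simple association score).
    """
    n = len(dosages)
    return {ppm: n * pearson_r(dosages, values) ** 2
            for ppm, values in features.items()}


def rank_metabolites(pseudo, reference_peaks, window=0.03):
    """Rank candidates by summed association near their reference peaks."""
    scores = {}
    for name, peaks in reference_peaks.items():
        scores[name] = sum(
            score for ppm, score in pseudo.items()
            if any(abs(ppm - peak) <= window for peak in peaks)
        )
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: the feature at 1.20 ppm tracks the genotype, the one at 3.50 does not.
dosages = [0, 1, 2, 0, 1, 2]
features = {
    1.20: [1.0, 2.0, 3.1, 0.9, 2.1, 3.0],
    3.50: [1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
}
reference_peaks = {"metabolite_A": [1.21], "metabolite_B": [3.48]}

pseudo = pseudospectrum(dosages, features)
ranking = rank_metabolites(pseudo, reference_peaks)
```

In this toy example the hypothetical "metabolite_A" ranks first, because its reference peak lies next to the only feature associated with the variant. A plain sum of scores favors metabolites with many peaks, so a practical scoring scheme would need to correct for that.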

We also work on the integration of gene expression profiles from 555 lymphoblastoid cell lines (LCLs) derived from CoLaus subjects, measured using RNAseq technology. Our data show very good concordance of cis-eQTLs with those identified in previous studies. Searching for associations between gene expression and metabotype profiles revealed several gene-metabotype pairs with highly significant associations. Our current research focuses on better understanding these associations and on developing tools to dissect genotype-gene-metabotype triangles to map potential pathways and chains of causality: Does a change in gene expression lead to an altered metabolite concentration or vice versa?

Modular analyses

Whenever we face a large number of individual elements that have heterogeneous properties, grouping elements with similar properties together can help to obtain a better understanding of the entire ensemble. For example, we may attribute human individuals of a large cohort to different groups based on their sex, age, profession, etc., in order to obtain an overview of the cohort and its structure. Similarly, individual genes can be categorized according to their properties to obtain a global picture of their organization in the genome. Evidently, in both cases the assignment of the elements to groups (or modules) depends on which of their properties are considered and on how these properties are processed in order to associate different elements with the same module. A major advantage of studying properties of modules, rather than individual elements, relies on a basic principle of statistics: the variance of an average decreases like 1/N with the number N of (statistical) variables used to compute its value, because fluctuations in these variables tend to cancel each other out. Thus mean values over the elements of a module, or between the elements of different modules, are more robust measures than the measurements of each single element alone. This is particularly relevant for the noisy data produced by chip-based high-throughput technologies.
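The 1/N scaling can be written out in one line. The derivation below assumes, for illustration, that the N measurements x_1, ..., x_N of a module are uncorrelated and share a common variance σ²; these symbols are introduced here and do not appear elsewhere in the text.

```latex
\operatorname{Var}\!\left(\bar{x}\right)
  = \operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} x_i\right)
  = \frac{1}{N^{2}}\sum_{i=1}^{N}\operatorname{Var}(x_i)
  = \frac{\sigma^{2}}{N}
```

If the measurements are positively correlated, the covariance terms no longer vanish and the variance of the mean decreases more slowly than 1/N.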

In order to identify such modules the Iterative Signature Algorithm (ISA) was originally conceived in the Barkai group, and then further optimized by the CBG. This algorithm was designed to overcome the well-known limitations of standard clustering algorithms (as well as those of other tools relying on correlation matrices, like principal component analysis).

Data integration and the Ping-pong algorithm

With the advent of high-throughput data covering different aspects of gene regulation (e.g. post-transcriptional modifications or protein expression), as well as other properties of the samples (e.g. drug response), it is increasingly important to integrate multiple datasets. The information from other datasets is commonly integrated a posteriori. That is, groups of genes that have been assigned to a cluster are tested for “significant enrichment” with genes of predefined groups (e.g. those having the same functional annotation or belonging to a cluster of a different dataset). While this procedure is useful for automatic annotation of gene groups, it is not really an integrative analysis, which would aim to produce coherent groups of genes by co-analyzing several datasets at the same time rather than sequentially. To this end we devised our Ping-pong algorithm (PPA), which is a powerful tool for uncovering co-occurring modular units in noisy or complex paired datasets. This algorithm generates co-modules through an iterative scheme that refines groups of coherent patterns by alternating between the two datasets. We published it under “A modular approach for integrative analysis of large-scale gene-expression and drug-response data”.
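The alternating scheme can be sketched on a toy example with two matrices that share their samples: gene expression (genes by samples) and drug response (drugs by samples). This is a simplified illustration, not the published algorithm: the fraction-of-maximum selection rule, the absence of any normalization, and all names below are assumptions made for this sketch.

```python
def column_scores(matrix, row_idx):
    """Average each column over the selected rows."""
    n_cols = len(matrix[0])
    if not row_idx:
        return [0.0] * n_cols
    return [sum(matrix[i][j] for i in row_idx) / len(row_idx)
            for j in range(n_cols)]


def row_scores(matrix, col_idx):
    """Average each row over the selected columns."""
    if not col_idx:
        return [0.0] * len(matrix)
    return [sum(row[j] for j in col_idx) / len(col_idx) for row in matrix]


def keep_top(scores, frac):
    """Indices whose score reaches a fraction of the maximum (toy threshold)."""
    top = max(scores)
    if top <= 0:
        return []
    return [i for i, s in enumerate(scores) if s >= frac * top]


def ping_pong(expr, resp, seed_genes, frac=0.5, max_iter=50):
    """Alternate between two datasets until the gene and drug sets are stable."""
    genes, drugs, samples = sorted(seed_genes), [], []
    for _ in range(max_iter):
        # "Ping": genes -> samples (expression) -> drugs (response).
        samples = keep_top(column_scores(expr, genes), frac)
        new_drugs = keep_top(row_scores(resp, samples), frac)
        # "Pong": drugs -> samples (response) -> genes (expression).
        samples = keep_top(column_scores(resp, new_drugs), frac)
        new_genes = keep_top(row_scores(expr, samples), frac)
        if new_genes == genes and new_drugs == drugs:
            break
        genes, drugs = new_genes, new_drugs
    return genes, drugs, samples


# Toy data: genes 0-1 and drug 0 share samples 0-1; genes 2-3 and drug 2 share samples 3-4.
expr = [[3, 3, 0, 0, 0], [3, 3, 0, 0, 0], [0, 0, 0, 3, 3], [0, 0, 1, 3, 3]]
resp = [[2, 2, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 2, 2]]

first_module = ping_pong(expr, resp, seed_genes=[0])
second_module = ping_pong(expr, resp, seed_genes=[3])
```

Starting from a single seed gene, the iteration converges to a co-module: a set of genes and a set of drugs that are coherent over the same samples. Different seeds can converge to different co-modules, as the two calls above show.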

Study of small genetic networks

A complementary direction of research pertains to relatively small genetic networks whose components are well known. Our research is focused on such networks in two different organisms. In the fruit fly, we study the systems responsible for patterning along the anterior-posterior axis during early embryonic development (governed by the morphogen Bicoid, see Robustness in Drosophila embryo patterning for details) and for structuring the precursor of the wing (known as the imaginal wing disc). We are also interested in phototropism in plants, based on the model organism Arabidopsis thaliana, a small flowering plant also known as "thale cress".

Developmental Patterning

We are intrigued by the fundamental question of how a developing tissue can tell head from tail. In other words: how do individual cells "know" their respective differentiation program? One solution nature has come up with to read out positional information is to use graded profiles of so-called "morphogens". Within the SystemsX.ch project WingX together with the Affolter and Basler labs we study how the morphogen gradients are formed and how the information is processed to give rise to robust patterning.

Light response

Plants need light for their basic metabolism. Sunflowers literally follow the sun, but even little seedlings already bend their first leaves towards the light to maximize their energy intake. When in competition for light with other plants one can observe that many plants become much taller than when they grow alone. Within the SystemsX.ch project on Plant Growth we collaborate with the Fankhauser lab with the aim to unravel the molecular mechanisms that give rise to these fascinating phenomena.