Difference between revisions of "Science"

 
 
[[Category:Homepage]]
 
  
We are interested in various fields of computational biology, which can be divided into two main directions. The major part of the group is involved in developing and applying methods for the '''integrative analysis of large-scale biological and clinical data'''. We also take a keen interest in the '''study of small genetic networks''', whose components are well known and which can be modeled quantitatively. More information on our research and its background is provided below.
  
== Integrative analysis of large-scale biological and clinical data ==
Our ability to measure the properties and behavior of biological systems is advancing at a rapid pace. Whole-genome sequencing not only provides an inventory of genes, including their regulatory regions, but has also paved the way for high-throughput technologies that elucidate their genetic variability across populations and their transcriptional response to different genetic and environmental conditions. In particular, DNA microarrays allow for cost-efficient measurements of genome-wide SNP and expression profiles.
  
=== Genome-wide association studies ===
[[Genome Wide Association Studies]] (GWAS) have employed this technology to genotype large cohorts whose individuals have been phenotyped for various clinical parameters. Such studies search for correlations between genetic markers (usually single nucleotide polymorphisms, or SNPs) and any measurable trait in a population of individuals. The motivation is that such associations could provide new candidates for causal variants in genes (or their regulatory elements) that play a role in the phenotype of interest. In the clinical context this may eventually lead to a better understanding of the genetic components of diseases and their risk factors, and potentially to new therapeutic avenues.
  
From the many GWAS performed in recent years it has become apparent that even well-powered meta-studies with many thousands (or even tens of thousands) of samples could at best identify a few (dozen) candidate loci with highly significant associations. While many of these associations have been replicated in independent studies, each locus explains only a tiny (<1%) fraction of the total genetic variance of the phenotype. Remarkably, models that pool all significant loci into a single predictive scheme still fall short by at least one order of magnitude in explained variance. Thus, while GWAS already provide new candidates for disease-associated genes and potential drug targets, very few of the currently identified (sets of) genotypic markers are of any practical use for assessing an individual's predisposition to any of the complex diseases that have been studied.
  
=== Current challenges and limitations of GWAS ===
Various solutions to this apparent enigma have been proposed. First, the expected heritabilities have usually been estimated from twin studies, sometimes several decades ago, and it has been argued that these estimates may be problematic. Second, the genotypic information is still incomplete. Most analyses used microarrays probing only around half a million SNPs, which is almost one order of magnitude fewer than the current estimate of about 4 million common variants in populations of European descent. While many of these SNPs can be imputed accurately using information on linkage disequilibrium, there remains a significant fraction of variants that are poorly tagged by the measured SNPs. Furthermore, rare variants with a Minor Allele Frequency (MAF) of less than 1% are not assessed at all by SNP chips, but may nevertheless be the causal agents for many phenotypes. Moreover, other genetic variants like Copy Number Variations (CNVs) may also play an important role. Third, current analyses usually employ only additive models that consider one SNP at a time with few, if any, covariates, such as sex, age and principal components reflecting population substructure. This obviously covers only a small set of all possible interactions between genetic variants and the environment. Even more challenging is taking into account purely genetic interactions, since the number of all possible pairwise interactions already scales as the square of the number of genetic markers.
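To make the scaling argument concrete, the following minimal sketch counts the distinct marker pairs for the two marker numbers quoted above (about half a million genotyped SNPs and about 4 million common variants). The function name is purely illustrative.

```python
from math import comb


def n_pairwise_tests(n_markers: int) -> int:
    """Number of distinct unordered marker pairs, n * (n - 1) / 2."""
    return comb(n_markers, 2)


for n in (500_000, 4_000_000):
    print(f"{n:>9,} markers -> {n_pairwise_tests(n):,} pairwise tests")
```

Going from half a million to four million markers is an eightfold increase, but the number of pairs to test grows roughly 64-fold.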
  
=== Integrating molecular phenotypes ===
There is a long path from a genetic variant to an “organismal” phenotype (i.e. one observed at the level of the organism). A variant nucleotide can have many effects: exonic variants may disrupt proper translation by generating a premature stop codon, or alter an amino acid that is crucial for protein function, while intronic variants may affect splicing. Variants outside the transcribed region can also modify the level of expression by altering regulatory sites for chromatin state, as well as transcriptional and post-transcriptional regulation.
 
  
It is important to realize that regulatory networks have evolved to function robustly under external and internal perturbations. Any effect of a genetic variant on the organismal phenotype is propagated through these networks. This propagation, in particular if it involves crucial cellular functions, is likely to involve compensatory effects mediated by regulatory circuits like feedback loops. Moreover, robust functions are often achieved by “backup systems”, alternative pathways that can at least partially compensate each other. Thus, for the vast majority of variants segregating in a population the resulting macroscopic phenotypic variation is expected to be small, since variants giving rise to dramatic effects reducing individual fitness will quickly be purged from the population. Indeed, rare mono- or poly-genetic diseases mainly arise from such variants that alter gene products (or their expression) in a way that cannot be compensated for. In contrast, propensities to common diseases are likely to be governed by a large number of variants, each of which has a small, if any, effect, and only many “weak links” can lead to a systemic breakdown of homeostasis.
Hence, it is not surprising that the effects of genetic variability are more pronounced “up-stream” at the molecular level than “down-stream” at the macroscopic level of the organism. Thus, an alternative to the forward genetics approach is the construction of molecular networks defining the molecular states of a system that underlie a particular phenotype or disease. In order to construct these networks from molecular data, large cohorts have to be phenotyped at both the molecular and the macroscopic level.
  
We also work on the integration of gene expression profiles, obtained with RNA-seq, from 555 lymphoblastoid cell lines (LCLs) derived from CoLaus subjects. Our data show very good concordance of cis-eQTLs with those identified in previous studies. Searching for associations between gene expression and metabotype profiles revealed several gene-metabotype pairs with highly significant associations. Our current research focuses on better understanding these associations and on developing tools to dissect genotype-gene-metabotype triangles in order to map potential pathways and chains of causality: does a change in gene expression lead to an altered metabolite concentration, or vice versa?

=== The need for reduction of complexity ===
Molecular phenotypes, like transcript and metabolite concentrations, provide much more immediate information on the impact of genotypic variation than the resulting organismal phenotypes. Yet, in general the number of molecular observables (e.g. the number of genes or metabolites) is much larger. Moreover, their measurements are often noisy. Thus assigning genes or metabolites into groups and considering the group average has the following advantages:
 
  
1. It reduces the complexity of such data, since the number of groups is typically much smaller than the number of individual elements.
2. It reduces the noise in the data, since fluctuations in the individual (redundant) variables tend to cancel each other out.
3. It may provide biological focus if the individual elements share common features (e.g. genes belonging to the same metabolic pathway).
4. It may provide insights into the structure of the underlying regulatory network (e.g. groups of genes being organized in a hierarchical manner).

These advantages have been well recognized for large-scale gene-expression data, and a multitude of methods have been developed to identify groups (or “modules”) from such data.

=== Modular analyses ===
 
 
 
=== Transcription Modules and the Iterative Signature Algorithm ===
 
 
Whenever we face a large number of individual elements that have heterogeneous properties, grouping elements with similar properties together can help to obtain a better understanding of the entire ensemble. For example, we may assign the individuals of a large cohort to different groups based on their sex, age, profession, etc., in order to obtain an overview of the cohort and its structure. Similarly, individual genes can be categorized according to their properties to obtain a global picture of their organization in the genome. Evidently, in both cases the assignment of the elements to groups (or modules) depends on which of their properties are considered and on how these properties are processed in order to associate different elements with the same module. A major advantage of studying properties of modules, rather than individual elements, relies on a basic principle of statistics: the variance of an average of N independent (statistical) variables decreases as 1/N, because fluctuations in these variables tend to cancel each other out. Thus, mean values over the elements of a module, or between the elements of different modules, are more robust measures than the measurements of each single element alone. This is particularly relevant for the noisy data produced by chip-based high-throughput technologies.
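The 1/N behaviour can be checked exactly on a toy case. The sketch below is an illustration only, not part of any of our tools: it enumerates all outcomes of N independent fair coin flips (values 0 and 1, single-draw variance 1/4) and computes the variance of their mean with exact fractions.

```python
from fractions import Fraction
from itertools import product


def variance_of_mean(n, values=(0, 1)):
    """Exact variance of the mean of n independent draws, each uniform over `values`."""
    # Every outcome of n draws is equally likely, so we can enumerate them all.
    means = [Fraction(sum(outcome), n) for outcome in product(values, repeat=n)]
    grand_mean = sum(means) / len(means)
    return sum((m - grand_mean) ** 2 for m in means) / len(means)


for n in (1, 2, 4, 8):
    print(n, variance_of_mean(n))
```

Each doubling of N halves the variance of the average, which is why module-level means are more robust than single-element measurements.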
 
  
In order to identify such modules, the Iterative Signature Algorithm ([[ISA]]) was originally conceived in the Barkai group and then further optimized by the CBG. This algorithm was designed to overcome the well-known limitations of standard clustering algorithms (as well as those of other tools relying on correlation matrices, like principal component analysis).
  
 
=== Data integration and the Ping-pong algorithm ===
 
With the advent of high-throughput data covering different aspects of gene regulation (e.g. post-transcriptional modifications or protein expression), as well as other properties of the samples (e.g. drug response), it is increasingly important to integrate multiple datasets. The information from other datasets is commonly integrated a posteriori. That is, groups of genes that have been assigned to a cluster are tested for “significant enrichment” with genes of predefined groups (e.g. those having the same functional annotation or belonging to a cluster of a different dataset). While this procedure is useful for the automatic annotation of gene groups, it is not really an integrative analysis, which would aim to produce coherent groups of genes by co-analyzing several datasets at the same time rather than sequentially. To this end we devised our Ping-pong algorithm (PPA), which is a powerful tool for uncovering co-occurring modular units in noisy or complex paired datasets. This algorithm generates co-modules through an iterative scheme that refines groups of coherent patterns by alternating between the two datasets. We published it under “A modular approach for integrative analysis of large-scale gene-expression and drug-response data”.
  
 
== Study of small genetic networks ==
 
A complementary direction of research pertains to relatively small genetic networks whose components are well known. Our research is focused on such networks in two different organisms. In the fruit fly we study the systems responsible for patterning along the anterior-posterior axis in early embryonic development (governed by the morphogen Bicoid, see [[Robustness in Drosophila embryo patterning]] for details) and for structuring the precursor of the wing (known as the imaginal wing disc). We are also interested in [[Phototropism in Arabidopsis|phototropism in plants]] based on the model organism ''Arabidopsis thaliana'', a small flowering plant otherwise known as "thale cress".
  
 
=== Developmental Patterning ===
 
 
We are intrigued by the fundamental question of how a developing tissue can tell head from tail. In other words: how do individual cells "know" their respective differentiation program? One solution nature has come up with to read out positional information is to use graded profiles of so-called "morphogens". Within the SystemsX.ch project [http://www.systemsx.ch/index.php?id=149 WingX], together with the [http://www.biozentrum.unibas.ch/affolter/index.html Affolter] and [http://www.imls.uzh.ch/research/basler.html Basler] labs, we study how the morphogen gradients are formed and how the information is processed to give rise to robust patterning.
 
  
 
=== Light response ===
 
 
Plants need light for their basic metabolism. Sunflowers literally follow the sun, but even little seedlings already bend their first leaves towards the light to maximize their energy intake. When in competition for light with other plants, many plants grow much taller than when they grow alone. Within the SystemsX.ch project on [http://www.systemsx.ch/projects/systemsxch-projects/research-technology-and-development-projects-rtd/plant-growth/ Plant Growth] we collaborate with the [http://www.unil.ch/cig/page8391.html Fankhauser lab] with the aim of unraveling the molecular mechanisms that give rise to these fascinating phenomena.
 

Latest revision as of 14:26, 12 October 2020


== Large-scale data analyses ==

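=== Classical GWAS studies ===

The CBG has participated in a large number of genome-wide association studies (GWAS). Most of our studies use data from the [http://www.colaus.ch Cohorte Lausannoise (CoLaus)], which focuses on cardiovascular health. We have also participated in a few other large clinical studies (e.g. on [https://www.sciencedirect.com/science/article/pii/S0016508510000089?via%3Dihub HCV] and [https://www.nature.com/articles/ng.647 narcolepsy]), as well as in GWAS using data from collections of [https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0041032 inbred mice] and [https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005616 flies]. We have a number of efficient tools to perform standard GWAS and to combine results from other studies into meta-analyses. A major focus of the lab is the development of new methods for the analysis of GWAS-related data, as outlined in the following sections.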

=== Methods to improve standard GWAS ===

We have worked on a number of projects aimed at enhancing the standard GWAS approach. Specifically, we have published:

* on population stratification (“Genes mirror geography within Europe”)
* on “Methods for Testing Association Between Uncertain Genotypes and Quantitative Traits”
* on “Identification and validation of copy number variants using SNP genotyping arrays from a large clinical cohort”
* on detecting pairwise interactions through “FastEpistasis: A high performance computing solution for quantitative trait epistasis”
* on evaluating the prevalence of such interactions in “Comparison of strategies to detect epistasis from eQTL data”
* on a “Novel Method to Estimate the Phenotypic Variation Explained by Genome Wide Association Studies Reveals Large Fraction of the Missing Heritability” (theoretical work investigating possible explanations for the missing variance problem)

=== Methods processing GWAS results in the context of external data ===

A major focus of our recent research has been on methods that make use of the vast number of (published) associations between variants and traits with the aim of revealing the underlying biology. Specifically, our Pathway scoring algorithm (Pascal), published under “Fast and rigorous computation of gene and pathway scores from SNP-based summary statistics”, computes a score for each gene based on all the trait association p-values of SNPs within a window around the coding region of this gene. Importantly, the score integrates the SNP-wise support for association in a statistically rigorous manner, accounting for the local correlation structure (known as linkage disequilibrium, LD), such that independent support of (groups of) SNPs is weighted more strongly. Pascal avoids the computational burden of permutation tests, allowing it to compute gene scores about 100 times faster and much more accurately. This advance was crucial for the primary goal of Pascal, namely to further integrate these gene scores into pathway scores, reflecting the association of a trait with (preselected) sets of genes. A key advantage of taking into account the signal of gene sets is that such sets may turn out to be significant even if none of the contributing SNP- or gene-wise summary statistics are nominally significant. The ability to compute gene scores extremely fast allowed us to process large collections of annotated gene sets (such as GO, KEGG and other pathway collections) for many traits. We showed that our method is well calibrated in terms of its false discovery rate and typically more sensitive than competing approaches. In particular, Pascal accounts for gene sets containing neighboring genes whose SNP-wise signals may be correlated due to LD; this correlation is usually ignored, but it can lead to inflated pathway scores.
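The text above does not spell out how Pascal computes its scores, so the following is not Pascal's implementation. It is a minimal sketch, under our own illustrative assumptions, of why LD must be accounted for: if a gene statistic is taken to be the sum of squared SNP z-scores, its null mean and variance depend on the SNP-by-SNP correlation matrix, and matching these two moments to a scaled chi-squared distribution yields an effective number of independent SNPs. The function name and the toy matrices are hypothetical.

```python
def ld_moment_match(ld):
    """Match the null mean and variance of T = sum(z_i^2) to c * chi2(d).

    `ld` is a symmetric SNP-by-SNP correlation matrix R (list of lists).
    For z ~ N(0, R): E[T] = trace(R) and Var[T] = 2 * trace(R @ R).
    Returns (c, d): the scale factor and the effective degrees of freedom.
    """
    n = len(ld)
    trace_r = sum(ld[i][i] for i in range(n))
    # For a symmetric matrix, trace(R @ R) is the sum of all squared entries.
    trace_r2 = sum(ld[i][j] ** 2 for i in range(n) for j in range(n))
    c = trace_r2 / trace_r
    d = trace_r ** 2 / trace_r2
    return c, d


# Toy LD matrices (assumed for illustration): four independent SNPs versus
# four SNPs in perfect LD, i.e. four copies of the same signal.
independent = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
redundant = [[1.0] * 4 for _ in range(4)]

print(ld_moment_match(independent))
print(ld_moment_match(redundant))
```

With independent SNPs the effective number of signals equals the number of SNPs, whereas SNPs in perfect LD count as a single signal, so redundant support is not counted several times.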

=== Methods processing external data in the context of GWAS results ===

The rapid computation of gene and pathway scores with Pascal also enabled us to leverage the large body of GWAS data to shed light on other genomic data. In a recent study, “Tissue-specific regulatory circuits reveal variable modular perturbations across complex diseases”, we derived close to 400 cell type- and tissue-specific regulatory networks from FANTOM5 data and asked in which of these networks the genes with high scores for a given disease phenotype had higher connectivity than expected by chance. This mapping of tissues to diseases was consistent with common knowledge, but also implicated some so far unexpected tissues as disease relevant. In particular, we were able to make new hypotheses about the relevance of some specific brain tissues for psychiatric disorders.

Furthermore, the capability to rapidly compute whether a given gene set has a significant enrichment in GWAS signals across a large panel of traits inspired and enabled us to launch the Disease Module Identification DREAM Challenge (https://synapse.org/modulechallenge). The goal of this challenge was to facilitate a community effort to build and evaluate unsupervised molecular network modularization algorithms (see our preprint “Open Community Challenge Reveals Molecular Network Modules with Key Roles in Diseases”, which has been accepted for publication in Nature Methods). Participants had to dissect a panel of six gene and protein networks into modular components (i.e. gene sets). Our challenge received a lot of interest from the community, with over 400 registered participants and 380 posts in the discussion forum. After multiple training rounds, 43 international teams made valid final submissions including the predicted network modules, detailed method descriptions and code. Our idea was to use a completely novel evaluation strategy, where each submitted module was tested against a large panel of 180 GWAS using the Pascal tool. Thus, rather than testing modules for enrichment in already annotated pathways, we argued that the information contained in the GWAS results would provide an alternative, potentially more comprehensive evaluation of modules. This is because any GWAS is by definition genome-wide, thereby providing information for all genes, while (human) gene and pathway annotations are necessarily biased and include only a subset of all genes. Moreover, gene annotations are often based on data similar to those underlying the networks (e.g. gene expression data or protein interaction data) and therefore do not provide an independent means of validation, while the GWAS data are completely orthogonal to the networks. The predictions submitted by the community revealed 1,632 modules with significant association to at least one of the phenotypes in our GWAS panel. Further study of these modules showed that not all genes in the gene sets are necessarily associated with the underlying phenotype at genome-wide significance. We make the three top methods for network modularization available in a single toolbox.

Methods integrating molecular phenotypes with GWAS data

Our main advance in this domain is the development of methods for integrating metabolomics with genotypic data. We have access to proton NMR spectra from both urine and serum for about 850 genotyped CoLaus participants. These data were, however, not quantified into metabolite concentrations. This presents a limitation, as previous mGWAS had typically used only metabolite concentrations. In order to avoid the often imperfect identification and quantification of metabolites, we designed a different type of mGWAS, which we call untargeted mGWAS. In this approach, we directly test metabolome features, obtained by simple alignment and normalization of raw spectral data, for association with genetic variants. The key advantage of this design is that it makes full use of all experimental data, because it does not discard data that may have eluded identification. In addition, the approach contains an inherent method of metabolite identification: genetic spiking. The effect of a genetic variant on the concentration of a metabolite tends to translate, in an untargeted mGWAS, to associations with the features that correspond to the NMR spectrum of the metabolite. In certain cases, genetic association can therefore allow for identification of the underlying metabolite. We developed metabomatching to automate metabolite identification by genetic spiking. We used this design for the first CoLaus urine NMR mGWAS, published under “Genome-Wide Association Study of Metabolic Traits Reveals Novel Gene-Metabolite-Disease Links”. Subsequently, we applied metabomatching in a second mGWAS, and formalized the method as a software package (“Metabomatching: Using genetic association to identify metabolites in proton NMR spectroscopy.”). Our software is available both as a standalone package and within PhenoMeNal, a large-scale computing e-infrastructure project for metabolomics of which we are a partner.
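The core idea of the untargeted design, testing every spectral feature for association with a variant, can be sketched as follows. This is an illustration on simulated data, not the published pipeline: it uses a plain Pearson correlation where a real analysis would fit a regression with covariates, and the sample size, effect size and variable names are assumptions made for the example.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def association_profile(genotypes, features):
    """Association of one variant with every spectral feature."""
    return [pearson(genotypes, feature) for feature in features]

# Simulated data (hypothetical): 200 individuals, 6 spectral features.
rng = random.Random(1)
n_individuals, n_features, planted = 200, 6, 3
# Allele counts 0, 1 or 2 for an assumed allele frequency of 0.3.
genotypes = [(rng.random() < 0.3) + (rng.random() < 0.3)
             for _ in range(n_individuals)]
features = [[rng.gauss(0, 1) for _ in range(n_individuals)]
            for _ in range(n_features)]
# Let the variant shift the intensity of one feature only.
features[planted] = [v + 2.0 * g for v, g in zip(features[planted], genotypes)]

profile = association_profile(genotypes, features)
best = max(range(n_features), key=lambda k: abs(profile[k]))
print(best, round(profile[best], 2))
```

The resulting profile across features is the kind of signal that genetic spiking relies on: a metabolite affected by the variant should show up as associations at the feature positions of its NMR spectrum.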

We also work on the integration of gene expression profiles, obtained with RNA-seq technology from 555 lymphoblastoid cell lines (LCLs) derived from CoLaus subjects. Our data show very good concordance of cis-eQTLs with those identified in previous studies. Searching for associations between gene expression and metabotype profiles revealed several gene-metabotype pairs with highly significant associations. Our current research focuses on better understanding these associations and on developing tools to dissect genotype-gene-metabotype triangles in order to map potential pathways and chains of causality: does a change in gene expression lead to an altered metabolite concentration, or vice versa?

Modular analyses

Whenever we face a large number of individual elements that have heterogeneous properties, grouping elements with similar properties together can help to obtain a better understanding of the entire ensemble. For example, we may attribute human individuals of a large cohort to different groups based on their sex, age, profession, etc., in order to obtain an overview of the cohort and its structure. Similarly, individual genes can be categorized according to their properties to obtain a global picture of their organization in the genome. Evidently, in both cases, the assignment of the elements to groups – or modules – depends on which of their properties are considered and on how these properties are processed in order to associate different elements with the same module. A major advantage of studying properties of modules, rather than individual elements, rests on a basic principle of statistics: the variance of an average decreases as 1/N with the number N of independent (statistical) variables used to compute it, because fluctuations in these variables tend to cancel each other out. Thus, mean values over the elements of a module, or between the elements of different modules, are more robust measures than the measurement of any single element alone. This is particularly relevant for the noisy data produced by chip-based high-throughput technologies.
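The 1/N scaling can be checked with a short simulation. The sketch below assumes independent Gaussian measurements with unit variance; the function name and parameters are chosen for illustration only.

```python
import random
import statistics

def variance_of_mean(n, trials=2000, sigma=1.0, seed=0):
    """Empirical variance of the mean of n independent Gaussian measurements."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.pvariance(means)

# Compare the simulated variance with the theoretical value sigma**2 / n.
for n in (1, 4, 16, 64):
    print(n, round(variance_of_mean(n), 4), "theory:", 1.0 / n)
```

If the measurements within a module are correlated, the variance of the mean decreases more slowly than 1/N, so the gain in robustness is smaller than in this idealized case.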

The Iterative Signature Algorithm (ISA), which identifies such modules, was originally conceived in the Barkai group and then further optimized by the CBG. This algorithm was designed to overcome the well-known limitations of standard clustering algorithms (as well as those of other tools relying on correlation matrices, like principal component analysis).

Data integration and the Ping-pong algorithm

With the advent of high-throughput data covering different aspects of gene regulation (e.g. post-transcriptional modifications or protein expression), as well as other properties of the samples (e.g. drug response), it is increasingly important to integrate multiple datasets. The information from other datasets is commonly integrated a posteriori. That is, groups of genes that have been assigned to a cluster are tested for “significant enrichment” with genes of predefined groups (e.g. those having the same functional annotation or belonging to a cluster of a different dataset). While this procedure is useful for automatic annotation of gene groups, it is not truly an integrative analysis, which would aim to produce coherent groups of genes by co-analyzing several datasets at the same time rather than sequentially. To this end we devised our Ping-pong algorithm (PPA), which is a powerful tool for uncovering co-occurring modular units in noisy or complex paired datasets. This algorithm generates co-modules through an iterative scheme that refines groups of coherent patterns by alternating between the two datasets. We published it under “A modular approach for integrative analysis of large-scale gene-expression and drug-response data”.
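The alternating scheme can be illustrated with a strongly simplified sketch. It is not the published PPA implementation: it assumes that the two datasets share their samples (a gene-by-sample matrix E and a drug-by-sample matrix R), it uses unweighted sets instead of score vectors, and it selects elements with a single z-score threshold t. All names and the toy data are hypothetical.

```python
from statistics import fmean, pstdev

def select(scores, t):
    """Indices whose z-score within the score list exceeds the threshold t."""
    mu, sd = fmean(scores), pstdev(scores)
    if sd == 0:
        return set()
    return {i for i, s in enumerate(scores) if (s - mu) / sd > t}

def row_scores(matrix, cols):
    """Average every row over the selected columns."""
    return [fmean(row[j] for j in cols) for row in matrix]

def col_scores(matrix, rows):
    """Average every column over the selected rows."""
    return [fmean(matrix[i][j] for i in rows) for j in range(len(matrix[0]))]

def ping_pong(E, R, seed_genes, t=0.5, max_iter=50):
    """Alternate between E and R until the gene set stops changing."""
    genes = set(seed_genes)
    for _ in range(max_iter):
        samples = select(col_scores(E, genes), t)      # genes -> samples (E)
        if not samples:
            return None
        drugs = select(row_scores(R, samples), t)      # samples -> drugs (R)
        if not drugs:
            return None
        samples = select(col_scores(R, drugs), t)      # drugs -> samples (R)
        if not samples:
            return None
        new_genes = select(row_scores(E, samples), t)  # samples -> genes (E)
        if not new_genes:
            return None
        if new_genes == genes:
            return genes, samples, drugs               # fixed point: co-module
        genes = new_genes
    return None

# Toy data: genes 0-2 and drugs 0-1 respond in samples 0-3.
E = [[1 if g < 3 and s < 4 else 0 for s in range(8)] for g in range(6)]
R = [[1 if d < 2 and s < 4 else 0 for s in range(8)] for d in range(5)]
E[0][5] = 1  # spurious signal present in the expression data only

print(ping_pong(E, R, seed_genes={0}))
```

In this toy example the spurious signal of the seed gene in sample 5 is not supported by the drug-response data, so the sample drops out of the co-module once the iteration has passed through R.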

Study of small genetic networks

A complementary direction of research pertains to relatively small genetic networks whose components are well known. Our research focuses on such networks in two different organisms. In the fruit fly, we study the systems responsible for patterning along the anterior-posterior axis during early embryonic development (governed by the morphogen Bicoid, see Robustness in Drosophila embryo patterning for details) and for structuring the precursor of the wing (known as the wing imaginal disc). We are also interested in phototropism in plants, using the model organism Arabidopsis thaliana, a small flowering plant also known as "thale cress".

Developmental Patterning

We are intrigued by the fundamental question of how a developing tissue can tell head from tail. In other words: how do individual cells "know" their respective differentiation program? One solution nature has come up with to read out positional information is to use graded profiles of so-called "morphogens". Within the SystemsX.ch project WingX, together with the Affolter and Basler labs, we study how the morphogen gradients are formed and how the information is processed to give rise to robust patterning.

Light response

Plants need light for their basic metabolism. Sunflowers literally follow the sun, but even small seedlings already bend their first leaves towards the light to maximize their energy intake. When competing for light with other plants, many plants grow much taller than they do when growing alone. Within the SystemsX.ch project on Plant Growth, we collaborate with the Fankhauser lab with the aim of unraveling the molecular mechanisms that give rise to these fascinating phenomena.