[[Category:Homepage]]
 
We are interested in various fields related to Computational Biology, which can be divided into two main directions: The major part of the group is involved in developing and applying methods for the '''integrative analysis of large-scale biological and clinical data'''. Yet, we also take a keen interest in the '''study of small genetic networks''', whose components are well known and which can be modeled quantitatively. More information describing our research and its background is provided below.

== Background ==
  
DNA microarrays have firmly established themselves as a standard tool in biological and biomedical research. Together with the rapid advancement of genome sequencing projects, microarrays and related high-throughput technologies have been key factors in the study of the more global aspects of cellular systems biology. While the genomic sequence provides an inventory of parts, a proper organization and eventual understanding of these parts and their functions also requires comprehensive views of the regulatory relations between them. Genome-wide expression data offer such a global view by providing a simultaneous read-out of the mRNA levels of all (or many) genes of the genome.
== Integrative analysis of large-scale biological and clinical data ==
The possibilities for measuring the properties and behavior of biological systems are advancing at a rapid pace. Whole-genome sequencing not only provides an inventory of genes, including their regulatory regions, but has also paved the way for high-throughput technologies that elucidate their genetic variability across populations and their transcriptional response under different genetic and environmental conditions. In particular, DNA microarrays allow for cost-efficient measurements of genome-wide SNP and expression profiles.
  
Most microarray experiments are conducted to address specific biological questions. Beyond the questions probed in such individual focused experiments, it is widely recognized that a wealth of additional information can be retrieved from a large and heterogeneous dataset describing the transcriptional response to a variety of different conditions. Moreover, the relatively high level of noise in these data can be dealt with most effectively by combining many arrays probing similar conditions.
=== Genome-wide association studies ===
[[Genome Wide Association Studies]] (GWAS) have employed this technology to genotype large cohorts whose individuals have been phenotyped for various clinical parameters. Such studies search for correlations between genetic markers (usually Single Nucleotide Polymorphisms, or SNPs for short) and any measurable trait in a population of individuals. The motivation is that such associations could provide new candidates for causal variants in genes (or their regulatory elements) that play a role in the phenotype of interest. In the clinical context this may eventually lead to a better understanding of the genetic components of diseases and their risk factors, and potentially open new therapeutic avenues.
  
From the many GWAS performed in recent years it has become apparent that even well-powered meta-studies with many thousands (or even tens of thousands) of samples could at best identify a few (dozen) candidate loci with highly significant associations. While many of these associations have been replicated in independent studies, each locus explains but a tiny (<1%) fraction of the total genetic variance of the phenotype (as predicted from twin studies). Remarkably, models that pool all significant loci into a single predictive scheme still fall short by at least one order of magnitude in explained variance. Thus, while GWAS already provide new candidates for disease-associated genes and potential drug targets, very few of the currently identified (sets of) genotypic markers are of any practical use for assessing the risk of predisposition to any of the complex diseases that have been studied.
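
One common form of such a pooled predictive scheme is an additive polygenic score, in which the risk alleles at the significant loci are counted and weighted by their estimated effect sizes. The following minimal sketch uses made-up effect sizes and genotypes, purely for illustration; it is not the scoring used in any particular study:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical per-locus effect sizes (as estimated by a GWAS) and allele
# dosages (0/1/2) for a handful of significant loci; real scores pool
# dozens of such loci, yet still explain only a small part of the variance.
effect_sizes = np.array([0.12, -0.08, 0.05, 0.20])   # effect per risk allele
dosages = np.array([[2, 1, 0, 1],                    # one row per individual
                    [0, 2, 1, 1],
                    [1, 0, 2, 2]])

# The pooled predictor is simply the effect-weighted allele count per individual.
polygenic_score = dosages @ effect_sizes
print(polygenic_score)   # higher score = higher predicted genetic liability
</syntaxhighlight>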
  
=== Current challenges and limitations of GWAS ===
Various solutions to this apparent enigma have been proposed: First, it is important to realize that the expected heritabilities have usually been estimated from twin studies, sometimes several decades ago, and it has been argued that these estimates may be problematic. Second, the genotypic information is still incomplete. Most analyses used microarrays probing only around half a million SNPs, which is almost one order of magnitude less than the current estimates of about 4 million common variants in populations of European descent. While many of these SNPs can be imputed accurately using information on linkage disequilibrium, there remains a significant fraction that is poorly tagged by the measured SNPs. Furthermore, rare variants with a Minor Allele Frequency (MAF) of less than 1% are not accessed at all with SNP-chips, but may nevertheless be the causal agents for many phenotypes. Moreover, other genetic variants like Copy Number Variations (CNVs) may also play an important role. Third, current analyses usually only employ additive models considering one SNP at a time, with few, if any, covariates, like sex, age and principal components reflecting population substructure. This obviously covers only a small set of all possible interactions between genetic variants and the environment. Even more challenging is taking into account purely genetic interactions, since already the number of all possible pair-wise interactions scales like the square of the number of genetic markers.
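
To make the standard analysis concrete, the following sketch implements a single-SNP additive test as a linear regression of the phenotype on the allele dosage plus a few covariates. The data are simulated and the function and variable names are hypothetical; only numpy and scipy are assumed:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

def additive_snp_test(phenotype, genotype, covariates):
    """Single-SNP additive model: regress the phenotype on the allele
    dosage (0/1/2) plus covariates and return the dosage effect and
    its two-sided p-value."""
    n = len(phenotype)
    # Design matrix: intercept, SNP dosage, covariates (e.g. sex, age, PCs)
    X = np.column_stack([np.ones(n), genotype, covariates])
    beta, res_ss, rank, _ = np.linalg.lstsq(X, phenotype, rcond=None)
    df = n - X.shape[1]
    sigma2 = res_ss[0] / df                      # residual variance
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)   # covariance of the estimates
    t_stat = beta[1] / np.sqrt(cov_beta[1, 1])   # t-statistic for the SNP term
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    return beta[1], p_value

# Toy example with simulated data (illustration only). A genome-wide scan
# repeats this test for every marker; testing all pair-wise interactions on
# top of that would scale with the square of the number of markers.
rng = np.random.default_rng(0)
n = 1000
genotype = rng.integers(0, 3, size=n)          # allele dosages 0/1/2
covariates = rng.normal(size=(n, 2))           # e.g. age and one principal component
phenotype = 0.1 * genotype + covariates @ np.array([0.5, -0.2]) + rng.normal(size=n)
print(additive_snp_test(phenotype, genotype, covariates))
</syntaxhighlight>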
  
=== Integrating molecular phenotypes ===
There is a long path from a genetic variant to an “organismal” phenotype (i.e. one that is observed at the level of the organism). A variant nucleotide can have many effects: Exonic variants may generate a premature stop-codon, or alter an amino-acid that is crucial for protein function, while intronic variants may affect splicing. Variants outside the transcribed region can also modify the level of expression by altering regulatory sites that affect chromatin state, as well as transcriptional and post-transcriptional regulation.
  
It is important to realize that regulatory networks have evolved to function robustly under external and internal perturbations. Any effect of a genetic variant on the organismal phenotype is propagated through these networks. This propagation, in particular if it involves crucial cellular functions, is likely to involve compensatory effects mediated by regulatory circuits like feedback loops. Moreover, robust functions are often achieved by “backup systems”, alternative pathways that can at least partially compensate for each other. Thus, for the vast majority of variants segregating in a population the resulting macroscopic phenotypic variation is expected to be small, since variants giving rise to dramatic effects reducing individual fitness will quickly be purged from the population. Indeed, rare mono- or polygenic diseases mainly arise from such variants that alter gene products (or their expression) in a way that cannot be compensated for. In contrast, propensity to common diseases is likely to be governed by a large number of variants, each of which has a small, if any, effect, so that only the combination of many such “weak links” can lead to a systemic breakdown of homeostasis. Hence, it is not surprising that the effects of genetic variability are more pronounced “up-stream” at the molecular level than “down-stream” at the macroscopic level of the organism. Thus an alternative to the forward-genetics approach is the construction of molecular networks defining the molecular states of a system that underlie a particular phenotype or disease. In order to construct these networks from molecular data, large cohorts have to be phenotyped both at the molecular and at the macroscopic level.
  
=== The need for reduction of complexity ===
Molecular phenotypes, like transcript and metabolite concentrations, provide much more immediate information on the impact of genotypic variation than the resulting organismal phenotypes. Yet, in general the number of molecular observables (e.g. the number of genes or metabolites) is much larger. Moreover, their measurements are often noisy. Thus assigning genes or metabolites to groups and considering the group average has the following advantages:
# It reduces the complexity of such data, since the number of groups is typically much smaller than the number of individual elements.
# It reduces the noise in the data, since fluctuations in the individual (redundant) variables tend to cancel each other out.
# It may provide biological focus if the individual elements share common features (e.g. genes belonging to the same metabolic pathway).
# It may provide insights into the structure of the underlying regulatory network (e.g. groups of genes being organized in a hierarchical manner).
  
These advantages have been well-recognized for large-scale gene-expression data and a multitude of methods has been developed to identify groups (or “modules”) from such data.
  
=== Transcription Modules and the Iterative Signature Algorithm ===
Whenever we face a large number of individual elements that have heterogeneous properties, grouping elements with similar properties together can help to obtain a better understanding of the entire ensemble. For example, we may attribute human individuals of a large cohort to different groups based on their sex, age, profession, etc., in order to obtain an overview of the cohort and its structure. Similarly, individual genes can be categorized according to their properties to obtain a global picture of their organization in the genome. Evidently, in both cases alike, the assignment of the elements to groups – or modules – depends on which of their properties are considered and on how these properties are processed in order to associate different elements with the same module. A major advantage of studying properties of modules, rather than individual elements, rests on a basic principle of statistics: The variance of an average decreases like 1/N with the number N of (statistical) variables used to compute it, because fluctuations in these variables tend to cancel each other out. Thus mean values over the elements of a module, or between the elements of different modules, are more robust measures than the measurements of each single element alone. This is particularly relevant for the noisy data produced by chip-based high-throughput technologies.
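
For N independent measurements <math>x_1,\dots,x_N</math> with comparable variance <math>\sigma^2</math>, this principle reads

:<math>\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} x_i\right) \;=\; \frac{1}{N^2}\sum_{i=1}^{N}\operatorname{Var}(x_i) \;=\; \frac{\sigma^2}{N},</math>

so that averaging over the (roughly redundant) elements of a module reduces the noise variance by a factor of N, i.e. its standard deviation by <math>\sqrt{N}</math>.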
  
The central problem in the analysis of large and diverse collections of expression profiles lies in the context-dependent nature of co-regulation. Usually genes are coordinately regulated only in specific experimental contexts, corresponding to a subset of the conditions in the dataset. Most standard analysis methods classify genes based on their similarity in expression across all available conditions. The underlying assumption of uniform regulation is reasonable for the analysis of small datasets, but limits the utility of these tools for the analysis of large heterogeneous datasets for the following reasons: First, conditions irrelevant for the analysis of a particular regulatory context contribute noise, hampering the identification of correlated behavior over small subsets of conditions. Second, genes may participate in more than one function, resulting in one regulation pattern in one context and a different pattern in another. This is particularly relevant for splice isoforms, which are not distinguished by the probes on the array but may differ in their physiological function or localization. Thus, combinatorial regulation necessitates the assignment of genes to several context-specific and potentially overlapping modules. In contrast, most commonly used clustering techniques yield disjoint partitions, assigning each gene to a single cluster.

To take these considerations into account, expression patterns must be analyzed with respect to specific subsets of the data; genes and conditions should be co-classified. The resulting ‘transcription modules’ (another common term is ‘biclusters’) consist of sets of co-expressed genes together with the conditions over which this co-expression is observed. The naïve approach of evaluating expression coherence of all possible subsets of genes over all possible subsets of conditions is computationally infeasible, and most analysis methods for large datasets seek to limit the search space in an appropriate way. We (and others) have therefore devised new tools to extract modules from large-scale data: the Signature Algorithm and its iterative extension, the Iterative Signature Algorithm ([[ISA]]), were originally conceived in the group of Prof. Naama Barkai at the Weizmann Institute, together with Dr. Jan Ihmels, and have since been further optimized by the CBG. These algorithms were designed to overcome the well-known limitations of standard clustering algorithms (as well as those of other tools relying on correlation matrices, like principal component analysis). They have been shown to compete well with other methods in terms of efficiency and accuracy; moreover, because they do not compute correlations, their computation time scales extremely well with the size of the data.
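
The ISA is based on a simple alternating refinement: starting from a set of genes, it scores all conditions against this set, keeps only the conditions scoring above a threshold, re-scores all genes over the selected conditions, and iterates until a self-consistent module is reached. The following is a minimal sketch of this scheme with illustrative normalization and thresholds; it is not the published implementation:

<syntaxhighlight lang="python">
import numpy as np

def isa_module(expr, seed_genes, t_g=2.0, t_c=2.0, max_iter=100):
    """Simplified sketch of the ISA iteration loop: alternate between scoring
    conditions against the current gene set and scoring genes against the
    selected conditions, thresholding each score at a given number of
    standard deviations, until the gene set no longer changes."""
    # Two normalized copies of the expression matrix (genes x conditions):
    Eg = (expr - expr.mean(axis=0)) / expr.std(axis=0)                          # columns (conditions) z-scored across genes
    Ec = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)  # rows (genes) z-scored across conditions
    genes = np.zeros(expr.shape[0], dtype=bool)
    genes[seed_genes] = True
    conds = np.zeros(expr.shape[1], dtype=bool)
    for _ in range(max_iter):
        # Condition scores: average (per-gene normalized) expression of the current genes
        cond_scores = Ec[genes].mean(axis=0)
        conds = np.abs(cond_scores - cond_scores.mean()) > t_c * cond_scores.std()
        if not conds.any():
            break
        # Gene scores: (per-condition normalized) expression over the selected
        # conditions, weighted by the condition scores
        gene_scores = Eg[:, conds] @ cond_scores[conds] / conds.sum()
        new_genes = np.abs(gene_scores - gene_scores.mean()) > t_g * gene_scores.std()
        if np.array_equal(new_genes, genes):   # converged to a fixed point
            break
        genes = new_genes
    return genes, conds

# Toy usage on random data (illustration only)
rng = np.random.default_rng(1)
expr = rng.normal(size=(500, 60))              # genes x conditions
module_genes, module_conds = isa_module(expr, seed_genes=rng.choice(500, 20, replace=False))
</syntaxhighlight>

Different thresholds and different seed sets yield different, possibly overlapping modules, which is how the context-specific and overlapping assignment of genes described above can be achieved.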
  
=== Data integration and the Ping-pong algorithm ===
The “genomic” revolution in biology will have a fundamental impact on the improvement of diagnosis, prevention and treatment of disease. Yet, while researchers have already started to use gene-expression data for predictive purposes, the next challenge lies in integrating the massive data produced by different high-throughput technologies. We believe that this can be done best at the level of modules. Thus one aim of our research is the development of new modular approaches for the integrative analysis of multiple large-scale datasets.

With the advent of high-throughput data covering different aspects of gene regulation (e.g. post-transcriptional modifications or protein expression), as well as other properties of the samples (e.g. drug response), it is increasingly important to integrate multiple datasets. The information from other datasets is commonly integrated a posteriori: groups of genes that have been assigned to a cluster are tested for “significant enrichment” with genes of predefined groups (e.g. those having the same functional annotation or belonging to a cluster of a different dataset). While this procedure is useful for the automatic annotation of gene groups, it is not really an integrative analysis, which would aim to produce coherent groups of genes by co-analyzing several datasets at the same time rather than sequentially. To this end we devised the Ping-pong algorithm (PPA), which is a powerful tool for uncovering co-occurring modular units in noisy or complex paired datasets. This algorithm generates co-modules through an iterative scheme that refines groups of coherent patterns by alternating between the two datasets.
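
As a rough illustration of this alternating scheme, consider two datasets that share their sample dimension, for instance a gene-expression matrix (genes × samples) and a drug-response matrix (drugs × samples). The sketch below is simplified, with illustrative scoring and thresholds, and is not the published PPA implementation:

<syntaxhighlight lang="python">
import numpy as np

def zscore(m, axis):
    return (m - m.mean(axis=axis, keepdims=True)) / m.std(axis=axis, keepdims=True)

def ping_pong(expr, drug, seed_genes, t=1.5, max_iter=100):
    """Simplified 'ping-pong' refinement between two datasets sharing their
    sample dimension: sample scores are computed from the expression data for
    the current gene set, drugs are selected from the response data using
    these scores, and the refined sample scores are projected back onto the
    genes; the loop stops when the gene set is stable."""
    E, D = zscore(expr, axis=1), zscore(drug, axis=1)   # rows z-scored across samples
    genes = np.zeros(expr.shape[0], dtype=bool)
    genes[seed_genes] = True
    drugs = np.zeros(drug.shape[0], dtype=bool)
    for _ in range(max_iter):
        sample_scores = E[genes].mean(axis=0)                 # ping: genes -> samples
        drug_scores = D @ sample_scores / len(sample_scores)  # samples -> drugs
        drugs = np.abs(drug_scores) > t * drug_scores.std()
        if drugs.any():                                       # pong: drugs -> samples
            sample_scores = D[drugs].mean(axis=0)
        gene_scores = E @ sample_scores / len(sample_scores)  # samples -> genes
        new_genes = np.abs(gene_scores) > t * gene_scores.std()
        if np.array_equal(new_genes, genes):
            break
        genes = new_genes
    return genes, drugs, sample_scores

# Toy usage on random data (illustration only)
rng = np.random.default_rng(2)
expr = rng.normal(size=(300, 50))    # genes x samples
drug = rng.normal(size=(80, 50))     # drugs x samples
co_genes, co_drugs, co_samples = ping_pong(expr, drug, seed_genes=rng.choice(300, 15, replace=False))
</syntaxhighlight>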
  
== Study of small genetic networks ==

A complementary direction of research pertains to relatively small genetic networks, whose components are well known. We collaborate closely with experts in the field to identify biological systems that can be modeled quantitatively. Our goal in developing such models is not only to give an approximate description of the system, but also to obtain a better understanding of its properties. For example, regulatory networks evolved to function reliably under ever-changing environmental conditions. This notion of robustness can guide computational analysis and provide constraints on models that complement those from direct measurements of the system's output.

Our research is focused on such networks in two different organisms: In the fruit fly we study the systems responsible for patterning along the anterior-posterior axis in early embryonic development (governed by the morphogen Bicoid, see [[Robustness in Drosophila embryo patterning]] for details) and for structuring the precursor of the wing (known as the imaginal wing disk, see [[WingX: Systems Biology of the Drosophila Wing]]). The second model organism we are interested in is the plant ''Arabidopsis'', for which we study both [[Asymmetric growth during phototropism]] and [[Shade induced hypocotyl elongation]] within the PlantX collaboration.

== Collaborations ==

Our lab collaborates with experimental groups within and outside our department. In particular, due to our proximity to the CHUV we have close contacts with medical research groups and assist in the analysis of clinical data. Experimentalists who find the approaches outlined above interesting are encouraged to get in contact with us to discuss a possible analysis of their data.
 
