From Computational Biology Group
We are interested in various fields related to Computational Biology, which can be divided into two main directions: The major part of the group is involved in developing and applying methods for the integrative analysis of large-scale biological and clinical data. Yet, we also take a keen interest in the study of small genetic networks, whose components are well-known and which can be modeled quantitatively. More information describing our research and its background is provided below.
Integrative analysis of large-scale biological and clinical data
Our ability to measure the properties and the behavior of biological systems is advancing at a rapid pace. Whole-genome sequencing not only provides an inventory of genes, including their regulatory regions, but has also paved the way for high-throughput technologies that elucidate their genetic variability across populations and their transcriptional response under different genetic and environmental conditions. In particular, DNA microarrays allow for cost-efficient measurements of genome-wide SNP and expression profiles.
Genome-wide association studies
Genome-Wide Association Studies (GWAS) have employed this technology to genotype large cohorts whose individuals have been phenotyped for various clinical parameters. Such studies search for correlations between genetic markers (usually Single Nucleotide Polymorphisms, or SNPs for short) and any measurable trait in a population of individuals. The motivation is that such associations could provide new candidates for causal variants in genes (or their regulatory elements) that play a role in the phenotype of interest. In the clinical context this may eventually lead to a better understanding of the genetic components of diseases and their risk factors, and potentially open new therapeutic avenues.
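The search for a correlation between one marker and one trait can be illustrated with a minimal sketch. It assumes genotypes coded as allele dosage (0, 1 or 2 copies of the minor allele) and a quantitative trait; the function name snp_association and the toy data are hypothetical and serve only as illustration, not as the pipeline used in any particular study.

```python
import math

def snp_association(genotypes, trait):
    """Regress a quantitative trait on allele dosage for a single SNP.

    Returns (slope, r, t): the per-allele effect, the Pearson correlation,
    and the t-statistic with n - 2 degrees of freedom.
    """
    n = len(genotypes)
    if n != len(trait) or n < 3:
        raise ValueError("need at least 3 paired observations")
    g_mean = sum(genotypes) / n
    y_mean = sum(trait) / n
    sxx = sum((g - g_mean) ** 2 for g in genotypes)
    syy = sum((y - y_mean) ** 2 for y in trait)
    sxy = sum((g - g_mean) * (y - y_mean) for g, y in zip(genotypes, trait))
    if sxx == 0 or syy == 0:
        raise ValueError("monomorphic SNP or constant trait")
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy)
    denom = 1.0 - r * r
    # A perfect correlation would give an infinite t-statistic.
    t = math.copysign(math.inf, r) if denom <= 0 else r * math.sqrt((n - 2) / denom)
    return slope, r, t

# Toy cohort (made-up numbers): six individuals, one SNP, one trait.
genotypes = [0, 0, 1, 1, 2, 2]
trait = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]
slope, r, t = snp_association(genotypes, trait)
print(f"effect per allele = {slope:.2f}, r = {r:.3f}, t = {t:.1f}")
```

In a genome-wide scan this test would be repeated for every genotyped marker, which is why very stringent significance thresholds are required.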
From the many GWAS performed in recent years it has become apparent that even well-powered meta-studies with many thousands (or even tens of thousands) of samples could at best identify a few (dozen) candidate loci with highly significant associations. While many of these associations have been replicated in independent studies, each locus explains only a tiny fraction (<1%) of the total genetic variance of the phenotype (as predicted from twin studies). Remarkably, models that pool all significant loci into a single predictive scheme still fall short by at least one order of magnitude in explained variance. Thus, while GWAS already provide new candidates for disease-associated genes and potential drug targets, very few of the currently identified (sets of) genotypic markers are of any practical use for assessing the risk of predisposition to any of the complex diseases that have been studied.
Current challenges and limitations of GWAS
Various solutions to this apparent enigma have been proposed. First, the expected heritabilities have usually been estimated from twin studies, sometimes several decades ago, and it has been argued that these estimates may be problematic. Second, the genotypic information is still incomplete. Most analyses used microarrays probing only around half a million SNPs, which is almost one order of magnitude fewer than the current estimate of about 4 million common variants in populations of European descent. While many of these SNPs can be imputed accurately using information on linkage disequilibrium, a significant fraction remains poorly tagged by the measured SNPs. Furthermore, rare variants with a Minor Allele Frequency (MAF) of less than 1% are not assessed at all with SNP chips, but may nevertheless be the causal agents for many phenotypes. Moreover, other genetic variants like Copy Number Variations (CNVs) may also play an important role. Third, current analyses usually employ only additive models considering one SNP at a time with few, if any, covariates, such as sex, age and principal components reflecting population substructure. This obviously covers only a small subset of all possible interactions between genetic variants and the environment. Even more challenging is taking into account purely genetic interactions, since the number of possible pairwise interactions alone already scales like the number of genetic markers squared.
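To make the quadratic scaling concrete, the following short sketch counts the unordered SNP pairs for the two marker numbers quoted above (half a million genotyped SNPs and about 4 million common variants). The helper name n_pairs is purely illustrative.

```python
def n_pairs(n_markers):
    """Number of unordered marker pairs: n * (n - 1) / 2, i.e. roughly n^2 / 2."""
    return n_markers * (n_markers - 1) // 2

for n in (500_000, 4_000_000):
    print(f"{n:>9,} markers -> {n_pairs(n):,} pairwise tests")
```

Even for the smaller chip this amounts to more than 10^11 tests, before any higher-order or gene-environment interactions are considered.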
Integrating molecular phenotypes
There is a long path from a genetic variant to an “organismal” phenotype (i.e. one that is observed at the level of the organism). A variant nucleotide can have many effects: Exonic variants may disrupt proper translation by generating a premature stop codon, or alter an amino acid that is crucial for protein function, while intronic variants may affect splicing. Variants outside the transcribed region can also modify the level of expression by altering regulatory sites for chromatin state, as well as transcriptional and post-transcriptional regulation.
It is important to realize that regulatory networks have evolved to function robustly under external and internal perturbations. Any effect of a genetic variant on the organismal phenotype is propagated through these networks. This propagation, in particular if it involves crucial cellular functions, is likely to involve compensatory effects mediated by regulatory circuits like feedback loops. Moreover, robust functions are often achieved by “backup systems”, alternative pathways that can at least partially compensate for each other. Thus, for the vast majority of variants segregating in a population the resulting macroscopic phenotypic variation is expected to be small, since variants giving rise to dramatic effects that reduce individual fitness will quickly be purged from the population. Indeed, rare mono- or polygenic diseases mainly arise from variants that alter gene products (or their expression) in a way that cannot be compensated for. In contrast, the propensity to common diseases is likely to be governed by a large number of variants, each of which has a small effect, if any, and only many “weak links” together can lead to a systemic breakdown of homeostasis. Hence, it is not surprising that the effects of genetic variability are more pronounced “up-stream” at the molecular level than “down-stream” at the macroscopic level of the organism. Thus an alternative to the forward genetics approach is the construction of molecular networks defining the molecular states of a system that underlie a particular phenotype or disease. In order to construct these networks from molecular data, large cohorts have to be phenotyped both at the molecular and the macroscopic level.
The need for reduction of complexity
Molecular phenotypes, like transcript and metabolite concentrations, provide much more immediate information on the impact of genotypic variation than the resulting organismal phenotypes. Yet, in general the number of molecular observables (e.g. the number of genes or metabolites) is much larger than the number of organismal phenotypes. Moreover, their measurements are often noisy. Thus assigning genes or metabolites to groups and considering the group average has the following advantages:
1. It reduces the complexity of such data, since the number of groups is typically much smaller than the number of individual elements.
2. It reduces the noise in the data, since fluctuations in the individual (redundant) variables tend to cancel each other out.
3. It may provide biological focus if the individual elements share common features (e.g. genes belonging to the same metabolic pathway).
4. It may provide insights into the structure of the underlying regulatory network (e.g. groups of genes being organized in a hierarchical manner).
These advantages have been well-recognized for large-scale gene-expression data and a multitude of methods has been developed to identify groups (or “modules”) from such data.
Transcription Modules and the Iterative Signature Algorithm
Whenever we face a large number of individual elements that have heterogeneous properties, grouping elements with similar properties together can help to obtain a better understanding of the entire ensemble. For example, we may assign the individuals of a large human cohort to different groups based on their sex, age, profession, etc., in order to obtain an overview of the cohort and its structure. Similarly, individual genes can be categorized according to their properties to obtain a global picture of their organization in the genome. Evidently, in both cases the assignment of the elements to groups (or modules) depends on which of their properties are considered and on how these properties are processed in order to associate different elements with the same module. A major advantage of studying properties of modules, rather than individual elements, rests on a basic principle of statistics: The variance of an average decreases with the number N of independent (statistical) variables used to compute its value like 1/N, because fluctuations in these variables tend to cancel each other out. Thus mean values over the elements of a module, or between the elements of different modules, are more robust measures than the measurements of each single element alone. This is particularly relevant for the noisy data produced by chip-based high-throughput technologies.
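The 1/N behavior can be checked with a small simulation. The sketch below assumes independent measurements with unit-variance Gaussian noise, which is an idealization (real probes in a module are usually correlated, so the gain is smaller); the function name, the number of trials and the seed are arbitrary choices made for illustration.

```python
import random
import statistics

def variance_of_mean(n, trials=4000, seed=0):
    """Empirical variance of the average of n independent N(0, 1) draws."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        draws = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(sum(draws) / n)
    return statistics.pvariance(means)

for n in (1, 4, 16):
    v = variance_of_mean(n)
    # If the variance scales like 1/N, the product v * n stays close to 1.
    print(f"N = {n:2d}: variance of the mean = {v:.4f}, times N = {v * n:.3f}")
```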
In order to identify such modules, the Iterative Signature Algorithm (ISA) was originally conceived in the Barkai group and then further optimized by the CBG. This algorithm was designed to overcome the well-known limitations of standard clustering algorithms (as well as those of other tools relying on correlation matrices, like principal component analysis).
Data integration and the Ping-pong algorithm
With the advent of high-throughput data covering different aspects of gene regulation (e.g. post-transcriptional modifications or protein expression), as well as other properties of the samples (e.g. drug response), it is increasingly important to integrate multiple datasets. The information from other datasets is commonly integrated a posteriori. That is, groups of genes that have been assigned to a cluster are tested for “significant enrichment” with genes of predefined groups (e.g. those having the same functional annotation or belonging to a cluster of a different dataset). While this procedure is useful for automatic annotation of gene groups, it is not really an integrative analysis, which would aim to produce coherent groups of genes by co-analyzing several datasets at the same time rather than sequentially. To this end we devised our Ping-pong algorithm (PPA), which is a powerful tool for uncovering co-occurring modular units in noisy or complex paired datasets. This algorithm generates co-modules through an iterative scheme that refines groups of coherent patterns by alternating between the two datasets.
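The alternating scheme can be illustrated with a deliberately simplified toy sketch. It is not the published PPA implementation: it assumes two small matrices that share their sample dimension (genes × samples and drugs × samples), uses binary memberships, and replaces the actual scoring and thresholding by a hypothetical rule that keeps every entry scoring at least a fraction frac of the top score. All names and numbers are illustrative.

```python
def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def scores(matrix, weights):
    """Project row weights onto the columns (weighted column sums)."""
    return [sum(w * row[j] for w, row in zip(weights, matrix))
            for j in range(len(matrix[0]))]

def threshold(values, frac):
    """Hypothetical rule: keep entries scoring at least frac of the top score."""
    top = max(values)
    if top <= 0:
        return [0] * len(values)
    return [1 if v >= frac * top else 0 for v in values]

def ping_pong(expr, resp, seed_genes, frac=0.5, max_iter=20):
    """Toy co-module search. expr: genes x samples, resp: drugs x samples."""
    genes = [1 if i in seed_genes else 0 for i in range(len(expr))]
    drugs = []
    for _ in range(max_iter):
        samples = threshold(scores(expr, genes), frac)                  # genes -> samples
        new_drugs = threshold(scores(transpose(resp), samples), frac)   # samples -> drugs
        samples = threshold(scores(resp, new_drugs), frac)              # drugs -> samples
        new_genes = threshold(scores(transpose(expr), samples), frac)   # samples -> genes
        if new_genes == genes and new_drugs == drugs:
            break  # memberships no longer change: a fixed point (co-module)
        genes, drugs = new_genes, new_drugs
    return ([i for i, g in enumerate(genes) if g],
            [i for i, d in enumerate(drugs) if d])

# Made-up data: genes 0 and 1 and drug 0 share samples 0 and 1.
expr = [[2, 2, 0, 0], [2, 2, 0, 0], [0, 0, 2, 2], [0, 0, 0, 0]]
resp = [[1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 0]]
print(ping_pong(expr, resp, seed_genes={0}))
```

Starting from a single seed gene, the iteration passes through the shared samples to the second dataset and back until the gene and drug sets stop changing, which is the sense in which a co-module links coherent patterns across both datasets.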
Study of small genetic networks
A complementary direction of research pertains to relatively small genetic networks, whose components are well-known. Our research is focused on such networks in two different organisms: In the fruit fly we study the systems responsible for patterning along the anterior-posterior axis during early embryo development (governed by the morphogen Bicoid, see Robustness in Drosophila embryo patterning for details) and for structuring the precursor of the wing (known as the imaginal wing disc, see WingX: Systems Biology of the Drosophila Wing). The second model organism we are interested in is the plant Arabidopsis, for which we study both Asymmetric growth during phototropism and Shade induced hypocotyl elongation within the PlantX collaboration.