DNA microarrays have firmly established themselves as a standard tool in biological and biomedical research. Together with the rapid advancement of genome sequencing projects, microarrays and related high-throughput technologies have been key factors in the study of the more global aspects of cellular systems biology. While genomic sequence provides an inventory of parts, organizing and eventually understanding these parts and their functions also requires a comprehensive view of the regulatory relations between them. Genome-wide expression data offer such a global view by providing a simultaneous read-out of the mRNA levels of all (or many) genes of the genome.
Most microarray experiments are conducted to address specific biological issues. Beyond the specific questions probed in such focused experiments, it is widely recognized that a wealth of additional information can be retrieved from a large and heterogeneous dataset describing the transcriptional response to a variety of different conditions. Moreover, the relatively high level of noise in these data can be dealt with most effectively by combining many arrays that probe similar conditions.
The modular concept
Whenever we face a large number of individual elements that have heterogeneous properties, grouping elements with similar properties together can help to obtain a better understanding of the entire ensemble. For example, we may assign the individuals of a large cohort to different groups based on their sex, age, profession, etc., in order to obtain an overview of the cohort and its structure. Similarly, individual genes can be categorized according to their properties to obtain a global picture of their organization in the genome. Evidently, in both cases the assignment of the elements to groups (or modules) depends on which of their properties are considered and on how these properties are processed in order to associate different elements with the same module. A major advantage of studying properties of modules, rather than individual elements, rests on a basic principle of statistics: the variance of an average over N independent (statistical) variables decreases as 1/N, because fluctuations in these variables tend to cancel each other out. Thus mean values over the elements of a module, or between the elements of different modules, are more robust measures than the measurement of any single element alone. This is particularly relevant for the noisy data produced by chip-based high-throughput technologies.
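The 1/N behavior is easy to check numerically. The following is a minimal sketch, assuming independent Gaussian noise of unit variance on each element; the function name and parameters are illustrative and not part of any published tool.

```python
import numpy as np

def variance_of_module_mean(n_elements, n_trials=20000, sigma=1.0, seed=0):
    """Empirical variance of the mean over n_elements independent noisy values."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated "module" of n_elements noisy measurements.
    noise = rng.normal(0.0, sigma, size=(n_trials, n_elements))
    return noise.mean(axis=1).var()

# Compare the simulated variance with the theoretical value sigma^2 / N.
for n in (1, 4, 16, 64):
    print(n, variance_of_module_mean(n), 1.0 / n)
```

Real expression measurements within a module are not strictly independent, so the gain in practice is typically smaller than this idealized case suggests.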
Regulatory patterns are context specific
The central problem in the analysis of large and diverse collections of expression profiles lies in the context-dependent nature of co-regulation. Usually genes are coordinately regulated only in specific experimental contexts, corresponding to a subset of the conditions in the dataset. Most standard analysis methods classify genes based on their similarity in expression across all available conditions. The underlying assumption of uniform regulation is reasonable for the analysis of small datasets, but it limits the utility of these tools for large heterogeneous datasets for two reasons. First, conditions irrelevant to a particular regulatory context contribute noise, hampering the identification of correlated behavior over small subsets of conditions. Second, genes may participate in more than one function, resulting in one regulation pattern in one context and a different pattern in another. This is particularly relevant for splice isoforms, which are not distinguished by the probes on the array but may differ in their physiological function or localization. Thus, combinatorial regulation necessitates the assignment of genes to several context-specific and potentially overlapping modules. In contrast, most commonly used clustering techniques yield disjoint partitions, assigning each gene to a single cluster.
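The first point can be illustrated with synthetic data. The sketch below assumes two hypothetical genes that share a regulatory signal in only 50 of 1000 conditions and carry independent unit-variance noise everywhere; all sizes and amplitudes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_conditions, n_context = 1000, 50
context = np.arange(n_context)  # conditions in which the genes are co-regulated

# Shared regulatory signal: present in the context, zero everywhere else.
shared = np.zeros(n_conditions)
shared[context] = 2.0 * rng.normal(size=n_context)

# Each gene sees the shared signal plus its own measurement noise.
gene_a = shared + rng.normal(size=n_conditions)
gene_b = shared + rng.normal(size=n_conditions)

r_all = np.corrcoef(gene_a, gene_b)[0, 1]
r_context = np.corrcoef(gene_a[context], gene_b[context])[0, 1]
print("correlation over all conditions:  ", round(r_all, 2))
print("correlation within the context:   ", round(r_context, 2))
```

The correlation computed over all conditions is diluted by the many conditions in which the genes are unrelated, whereas restricting the comparison to the relevant context recovers the strong co-regulation.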
Co-classification of genes and conditions
To take these considerations into account, expression patterns must be analyzed with respect to specific subsets; genes and conditions should be co-classified. The resulting ‘transcription modules’ (another common term is ‘bicluster’) consist of sets of co-expressed genes together with the conditions over which this co-expression is observed. The naïve approach of evaluating the expression coherence of all possible subsets of genes over all possible subsets of conditions is computationally infeasible, so most analysis methods for large datasets seek to limit the search space in an appropriate way. We (and others) have therefore devised new tools to extract modules from large-scale data. During my postdoc with Prof. Naama Barkai at the Weizmann Institute we developed, together with Dr. Jan Ihmels, the Signature Algorithm and an iterative extension of it, the Iterative Signature Algorithm. These methods have been shown to compete well with others in terms of efficiency and accuracy. Moreover, because these algorithms do not compute correlations, the computation time scales extremely well with the size of the data.
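To make the idea of co-classification concrete, here is a simplified sketch in the spirit of an iterative signature approach. It is not the published implementation: the z-score normalization, the thresholds `t_cond` and `t_gene`, the function names, and the synthetic test data are all assumptions made for illustration. Starting from a seed gene set, it scores conditions by the average expression of the current genes, keeps the outlying conditions, re-scores all genes over those conditions, and repeats until the gene set stops changing. Each step is a matrix-vector product; no gene-gene correlations are computed.

```python
import numpy as np

def zscore(x, axis=0):
    """Centre to zero mean and scale to unit variance along `axis`."""
    mu = x.mean(axis=axis, keepdims=True)
    sd = x.std(axis=axis, keepdims=True)
    return (x - mu) / np.where(sd > 0, sd, 1.0)

def signature_step(e_g, e_c, gene_mask, t_cond=2.0, t_gene=2.0):
    """One refinement: genes -> conditions -> genes (thresholds are assumptions)."""
    n_genes, n_conds = e_g.shape
    empty = np.zeros(n_genes, dtype=bool), np.zeros(n_conds, dtype=bool)
    if not gene_mask.any():
        return empty
    # Condition score: average expression of the current genes in each condition.
    cond_score = e_g[gene_mask].mean(axis=0)
    cond_mask = np.abs(zscore(cond_score)) > t_cond
    if not cond_mask.any():
        return empty
    # Gene score: expression over the selected conditions, weighted by their scores.
    gene_score = e_c[:, cond_mask] @ cond_score[cond_mask]
    return zscore(gene_score) > t_gene, cond_mask

def iterate_signature(expr, seed_mask, t_cond=2.0, t_gene=2.0, max_iter=50):
    """Repeat signature_step until the gene set no longer changes."""
    e_g = zscore(expr, axis=0)  # each condition normalized across genes
    e_c = zscore(expr, axis=1)  # each gene normalized across conditions
    gene_mask = seed_mask
    cond_mask = np.zeros(expr.shape[1], dtype=bool)
    for _ in range(max_iter):
        new_mask, cond_mask = signature_step(e_g, e_c, gene_mask, t_cond, t_gene)
        converged = np.array_equal(new_mask, gene_mask)
        gene_mask = new_mask
        if converged or not gene_mask.any():
            break
    return gene_mask, cond_mask

# Synthetic data: 300 genes x 100 conditions of noise with one planted module
# (genes 0-19 induced in conditions 0-7). Purely illustrative.
rng = np.random.default_rng(0)
expr = rng.normal(size=(300, 100))
expr[:20, :8] += 4.0

# Imperfect seed: half of the module genes plus five unrelated genes.
seed = np.zeros(300, dtype=bool)
seed[:10] = True
seed[100:105] = True

gene_mask, cond_mask = iterate_signature(expr, seed)
print("genes:", np.flatnonzero(gene_mask))
print("conditions:", np.flatnonzero(cond_mask))
```

Because a gene's membership is decided by a threshold on its score rather than by a partition, running the procedure from different seeds can return overlapping modules, which is the behavior motivated in the previous section.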
The "genomic" revolution in biology will have a fundamental impact on the improvement of diagnosis, prevention and treatment of disease. Yet, while researchers have already started to use gene expression data for predictive purposes, the next challenge lies in integrating the massive data produced by different high-throughput technologies. We believe that this is best done at the level of modules. Thus one aim of our research is the development of new modular approaches for the integrative analysis of multiple large-scale datasets.
A complementary direction of research pertains to relatively small genetic networks whose components are well known. We collaborate closely with experts in the field to identify biological systems that can be modeled quantitatively. Our goal in developing such models is not only to give an approximate description of the system, but also to obtain a better understanding of its properties. For example, regulatory networks have evolved to function reliably under ever-changing environmental conditions. This notion of robustness can guide computational analysis and provide constraints on models that complement those from direct measurements of the system's output.
Our lab collaborates with experimental groups within and outside our department. In particular, due to our proximity to the CHUV, we have close contacts with medical research groups and assist with the analysis of clinical data. Experimentalists who find the approach outlined above interesting are encouraged to contact us to discuss possible analyses of their data.