High-throughput data
DNA microarrays have firmly established themselves as a standard tool in biological and biomedical research. Together with the rapid advancement of genome sequencing projects, microarrays and related high-throughput technologies have been key factors in the study of the more global aspects of cellular systems biology. While the genomic sequence provides an inventory of parts, organizing and eventually understanding these parts and their functions also requires a comprehensive view of the regulatory relations between them. Genome-wide expression data offer such a global view by providing a simultaneous read-out of the mRNA levels of all (or many) genes of the genome.
Most microarray experiments are conducted to address specific biological issues. Beyond the specific questions probed in such focused experiments, it is widely recognized that a wealth of additional information can be retrieved from a large and heterogeneous dataset describing the transcriptional response to a variety of different conditions. In addition, the relatively high level of noise in these data is handled most effectively by combining many arrays that probe similar conditions.
The modular concept
Whenever we face a large number of individual elements with heterogeneous properties, grouping elements with similar properties can help us understand the entire ensemble. For example, we may assign the individuals of a large cohort to different groups based on their sex, age, profession, etc., in order to obtain an overview of the cohort and its structure. Similarly, individual genes can be categorized according to their properties to obtain a global picture of their organization in the genome. In both cases, the assignment of elements to groups (or modules) depends on which of their properties are considered and on how these properties are processed in order to associate different elements with the same module. A major advantage of studying properties of modules, rather than individual elements, rests on a basic principle of statistics: the variance of an average of N independent variables decreases as 1/N, because fluctuations in these variables tend to cancel each other out. Thus mean values over the elements of a module, or between the elements of different modules, are more robust measures than the measurement of any single element alone. This is particularly relevant for the noisy data produced by chip-based high-throughput technologies.
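The 1/N scaling can be checked numerically. The following is a minimal sketch, assuming independent Gaussian measurement noise with equal variance; the function name and its parameters are illustrative and not part of any tool described here.

```python
import random
import statistics

def variance_of_mean(n, trials=4000, sigma=1.0, seed=0):
    """Empirical variance of the mean of n independent Gaussian measurements.

    Assumption: every measurement has the same noise level `sigma` and the
    measurements are independent, which is what the 1/N rule requires.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.pvariance(means)

if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        # The empirical value should lie close to sigma**2 / n.
        print(n, variance_of_mean(n), "theory:", 1.0 / n)
```

Correlated noise, for instance a batch effect shared by all elements of a module, would not average out in this way, so the 1/N rule is best read as the ideal case.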
Regulatory patterns are context specific
The central problem in the analysis of large and diverse collections of expression profiles lies in the context-dependent nature of co-regulation. Genes are usually coordinately regulated only in specific experimental contexts, corresponding to a subset of the conditions in the dataset. Most standard analysis methods classify genes based on their similarity in expression across all available conditions. The underlying assumption of uniform regulation is reasonable for the analysis of small datasets, but it limits the utility of these tools for large, heterogeneous datasets for two reasons. First, conditions irrelevant to a particular regulatory context contribute noise, hampering the identification of correlated behavior over small subsets of conditions. Second, genes may participate in more than one function, resulting in one regulation pattern in one context and a different pattern in another. This is particularly relevant for splice isoforms, which are not distinguished by the probes on the array but may differ in their physiological function or localization. Thus, combinatorial regulation necessitates the assignment of genes to several context-specific and potentially overlapping modules. In contrast, most commonly used clustering techniques yield disjoint partitions, assigning each gene to a single cluster.
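The first point, dilution of a context-specific signal by irrelevant conditions, can be illustrated with synthetic data. In this sketch two hypothetical genes follow a shared signal in only 8 of 200 conditions and fluctuate independently elsewhere; all sizes, amplitudes, and noise levels are assumptions chosen for illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def context_vs_global(n_context=8, n_other=192, seed=1):
    """Return (correlation over all conditions, correlation within the context)."""
    rng = random.Random(seed)
    # Shared regulatory signal, present only in the first n_context conditions.
    shared = [2.0 if i % 2 == 0 else -2.0 for i in range(n_context)]
    gene_a = [s + rng.gauss(0.0, 0.2) for s in shared]
    gene_b = [s + rng.gauss(0.0, 0.2) for s in shared]
    # Outside the context the two genes vary independently.
    gene_a += [rng.gauss(0.0, 1.0) for _ in range(n_other)]
    gene_b += [rng.gauss(0.0, 1.0) for _ in range(n_other)]
    r_all = pearson(gene_a, gene_b)
    r_context = pearson(gene_a[:n_context], gene_b[:n_context])
    return r_all, r_context

if __name__ == "__main__":
    r_all, r_context = context_vs_global()
    print("all conditions:", r_all)
    print("context only:  ", r_context)
```

Restricted to the relevant conditions the two profiles are almost perfectly correlated, whereas the correlation over all conditions is much weaker, so a method that compares profiles across the full dataset is likely to miss the relationship.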
Co-classification of genes and conditions
To take these considerations into account, expression patterns must be analyzed with respect to specific subsets; genes and conditions should be co-classified. The resulting ‘transcription modules’ (another common term is ‘bicluster’) consist of sets of co-expressed genes together with the conditions over which this co-expression is observed. The naïve approach of evaluating expression coherence of all possible subsets of genes over all possible subsets of conditions is computationally infeasible, and most analysis methods for large datasets seek to limit the search space in an appropriate way. Thus extensive research has been focused on devising new tools to extract modules from large-scale data.
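To get a feeling for why the exhaustive search is infeasible, one can count the candidate modules directly. The sketch below counts all pairs of a non-empty gene subset and a non-empty condition subset; the dataset size in the example is hypothetical.

```python
def candidate_module_count(n_genes, n_conditions):
    """Number of (non-empty gene subset, non-empty condition subset) pairs."""
    return (2 ** n_genes - 1) * (2 ** n_conditions - 1)

if __name__ == "__main__":
    # Hypothetical dataset size, chosen only for illustration.
    count = candidate_module_count(6000, 1000)
    print("decimal digits in the number of candidates:", len(str(count)))
```

Because the count grows exponentially in both dimensions, even modest datasets rule out enumeration, which is why practical methods restrict or guide the search.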
Iterative Signature Algorithm
The Iterative Signature Algorithm (ISA) has become a standard tool for the analysis of large sets of gene expression data. It rapidly identifies coherent subsets in large data tables. For example, when applied to a gene-expression dataset, it decomposes the data into transcription modules, each consisting of a set of genes that are co-expressed in a specific subset of the samples.
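The general idea of such an iterative refinement can be sketched as follows. This is a simplified illustration and not the published ISA implementation: it assumes that genes and conditions are alternately scored and thresholded, starting from a seed gene set, until the gene set stops changing. The normalization, the one-sided thresholds, their default values, and all function names are assumptions made for this example.

```python
import random
import statistics

def zscore(values):
    """Standardize a list to zero mean and unit variance (constant lists map to 0)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values) or 1.0
    return [(v - mean) / sd for v in values]

def select(scores, threshold):
    """Indices whose score lies more than `threshold` SDs above the mean score."""
    return [i for i, z in enumerate(zscore(scores)) if z > threshold]

def isa_sketch(data, seed_genes, t_gene=1.8, t_cond=1.8, max_iter=50):
    """Refine a seed gene set into a (genes, conditions) module.

    data[g][c] is the expression of gene g under condition c.
    """
    n_genes, n_conds = len(data), len(data[0])
    # Each gene profile standardized across conditions.
    by_gene = [zscore(row) for row in data]
    # Each condition standardized across genes, stored as by_cond[c][g].
    by_cond = [zscore([data[g][c] for g in range(n_genes)]) for c in range(n_conds)]
    genes, conds = sorted(set(seed_genes)), []
    for _ in range(max_iter):
        if not genes:
            return [], []
        # Score every condition by the mean expression of the current genes.
        cond_scores = [statistics.fmean(by_cond[c][g] for g in genes)
                       for c in range(n_conds)]
        conds = select(cond_scores, t_cond)
        if not conds:
            return [], []
        # Score every gene by its score-weighted mean over the selected conditions.
        gene_scores = [statistics.fmean(cond_scores[c] * by_gene[g][c] for c in conds)
                       for g in range(n_genes)]
        new_genes = select(gene_scores, t_gene)
        if not new_genes:
            return [], []
        if new_genes == genes:  # fixed point reached
            break
        genes = new_genes
    return genes, conds

def make_demo_data(n_genes=80, n_conds=64, module_genes=10, module_conds=8, seed=0):
    """Synthetic noise matrix with one implanted module in the top-left corner."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 0.5) for _ in range(n_conds)] for _ in range(n_genes)]
    for g in range(module_genes):
        for c in range(module_conds):
            data[g][c] += 5.0
    return data

if __name__ == "__main__":
    genes, conds = isa_sketch(make_demo_data(), seed_genes=[0, 1, 2, 3, 4])
    print("genes:", genes)
    print("conditions:", conds)
```

In this toy setting the seed contains only half of the implanted genes, and the iteration is meant to recover the remaining genes together with the conditions that define the module. Running the procedure from many different seeds, and keeping the distinct fixed points, would yield a set of possibly overlapping modules.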
Ping Pong Algorithm
There is an increasing trend to analyze sets of biological samples with multiple high-throughput assays probing different aspects of their phenotype or genotype. The efficient integration of such data has become a central challenge in computational biology. We extended the ISA to two datasets, in which we identify consistent modular patterns termed co-modules. Our Ping-Pong Algorithm (PPA) is most powerful for uncovering co-occurring modular units in noisy or complex paired datasets. For example, using the DrugBank database and the Connectivity Map as references, we showed that the PPA predicts known drug-gene associations from the NCI-60 data significantly better than other methods.