
This is an internal page providing additional information about our long-term vision for "A Novel Framework for the Integrated Analysis of large-scale biomedical data" (NOFIA).


Vast amounts of financial and human resources have been invested into clinical and genomic profiling of large cohorts creating enormous amounts of data. While genome-wide association studies (GWAS) have already successfully revealed new candidate loci that potentially affect human disease or related phenotypes, they still fail to predict a significant portion of the heritable component of phenotypic variability. We believe that part of this failure may be overcome by developing novel analysis concepts and methodologies.

The main goal of this proposal is to develop and apply a new analysis framework for the integrated analysis of large-scale medical data. Such data include molecular phenotypes as well as large collections of organismal or clinical observables. Molecular phenotypes, like expression or metabolomics profiles, are now becoming available for many cohorts, but efficient methods to integrate these data into association studies are still missing. We propose to adapt and extend the modular technologies we have developed in recent years in order to address this challenge. Specifically, we plan to (1) perform modular analyses generating meta-phenotypes of metabolomics, transcriptomics and large-scale clinical data from genotyped individuals in order to facilitate the identification of genetic variants associated with these traits, (2) perform coupled co-module decompositions for the unsupervised integrated analysis of distinct large sets of molecular and clinical phenotypes in order to generate modular links between the various types of data, and (3) develop predictive models using (co-)modules as features and explore practical applications aimed at predicting disease risks or response to treatment with better accuracy than classical approaches based on individual biomarkers.

Our work will synthesize our expertise with modular analysis (including our well-established state-of-the-art tools) and our ample experience with GWAS. While our methodological developments will be set within concrete bio-medical questions and applied to real data from the Cohorte Lausannoise and other large data collections, they will be relevant for a large field of data-driven bio-medical research.

Extended Synopsis of the project proposal

Context and state of the art

Figure 1: Standard GWAS aim to identify genotypic variants (G) that are significantly associated with a phenotypic trait (P) in order to improve annotation (A) (a). The large number of variants imposes a huge burden of multiple hypothesis testing, which is even more severe when associating multiple phenotypes (b) or high-dimensional molecular (M) traits (c).

Genome-wide association studies (GWAS) search for significant correlations between genetic markers (most commonly Single Nucleotide Polymorphisms, or SNPs for short) and any measurable trait in a population of individuals (see Ref. [1] for review). The motivation is that such associations could provide new candidate loci for variants in genes (or their regulatory elements) that play a causal role for the phenotype of interest. In the clinical context there is hope that this would eventually lead to a better understanding of the genetic components of diseases and their risk factors, and potentially to more accurate diagnostics and novel therapeutic avenues.

From the hundreds of GWAS that have been performed for complex traits in recent years, it has become apparent that for most complex traits the elucidated loci explain only a very small fraction of the phenotypic variance, even for highly heritable traits that are known to have a significant genetic component to their variability. This applies not only to individual SNPs, where the most significantly associated ones rarely account for more than one percent of the variability, but also to additive combinations thereof, which even in the case of meta-studies with extremely high power (like GIANT [2,3], integrating data from >100,000 individuals) usually explain less than 20%. This so-called “missing variance” enigma [4] has triggered some disappointment among those who expected that GWAS could rapidly become of practical use for assessing predisposition to any of the complex diseases that have been studied.

Several explanations for the lack of predictive power have been proposed [4-6]. Firstly, many traits may be influenced by genetic variants that are not yet routinely measured, including copy number variants (CNVs) [5,7,8] and rare variants [9] that are not captured by SNP arrays. New genotyping approaches (including whole genome sequencing) will eventually overcome this technical limitation, but this will only increase the number of explanatory variables. Indeed, the more fundamental challenge of current GWAS is rooted in the enormous size of this feature space (i.e. around a million non-redundant SNPs and potentially many more rare variants and CNVs). Within the standard GWAS approach each variant within the genotypic data (G) is independently tested for association with the phenotype (P) of interest (Fig. 1a). This imposes a huge burden of multiple hypothesis testing, and only extremely significant associations survive stringent Bonferroni correction (i.e. those “low-hanging fruits” above the line in the Manhattan plots in Fig. 1), while there may be many more relevant genetic variants whose contributions are too small to be detected yet [10,11]. In some cases existing annotation (A) from previous GWAS, or data about the implicated gene’s function or expression, like those provided by the ENCODE [12] project, may help to prioritize marginally significant associations. Yet, the burden of multiple testing is even more severe when considering sizable collections of phenotypic traits (Fig. 1b), let alone the high-dimensional features of molecular data (M), like those generated by metabolomics or transcriptomics assays (Fig. 1c).
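To make the multiple-testing burden concrete, the following minimal sketch computes the Bonferroni-corrected per-test threshold. The family-wise error rate of 0.05, the count of one million independent tests and the 100 traits are illustrative assumptions, not figures taken from a specific study.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

alpha = 0.05            # assumed family-wise error rate
n_snps = 1_000_000      # assumed number of non-redundant SNPs
n_traits = 100          # assumed number of phenotypic traits tested jointly

# One phenotype: each SNP is tested once.
single_trait = bonferroni_threshold(alpha, n_snps)

# Many phenotypes: every SNP-trait pair is a separate test.
many_traits = bonferroni_threshold(alpha, n_snps * n_traits)

print(f"single trait: {single_trait:.1e}")
print(f"{n_traits} traits: {many_traits:.1e}")
```

Under these assumptions the single-trait threshold is the familiar genome-wide level of 5e-8, and each additional trait tightens it further, which is the motivation for testing a small number of meta-features instead.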

A complementary limitation relates to the fact that most models used in GWAS allow only for linear effects of single variants. Moreover, models including multiple variants usually combine their effects in an additive manner, ignoring possible interactions. Indeed, the number of possible pairwise interactions alone grows quadratically with the number of variants, so even gigantic cohorts are underpowered to overcome the combinatorial complexity within any brute-force modeling approach.
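The quadratic growth can be quantified with a short calculation. This is a sketch only: the variant count of one million is the rough figure used above, and applying a Bonferroni correction to all pairs is an illustrative assumption.

```python
import math

def n_pairwise_interactions(n_variants: int) -> int:
    """Number of unordered variant pairs, n * (n - 1) / 2."""
    return math.comb(n_variants, 2)

n_variants = 1_000_000                       # assumed number of variants
n_pairs = n_pairwise_interactions(n_variants)

# Bonferroni threshold if every pair were tested once at a family-wise rate of 0.05.
pairwise_threshold = 0.05 / n_pairs

print(f"{n_pairs:,} pairs, per-test threshold {pairwise_threshold:.1e}")
```

With roughly 5 x 10^11 pairs, the per-test threshold drops by about five orders of magnitude relative to single-variant testing.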

Ground-breaking nature of this project

Figure 2: Novel analysis framework for medical data integration

I surmise that the linear analysis pathway of current GWAS is central to their failure to achieve predictive power. What is needed to overcome the current impasse is an integrated approach with the following hallmarks (illustrated in Fig. 2):

1) Use all potentially relevant phenotypic information available for a cohort. This means that rather than considering one phenotype at a time, our framework will integrate many relevant traits in a single analysis.

2) Integrate intermediate molecular features whenever feasible. Molecular data provide valuable information on how genetic variability is transmitted to organismal traits and how this process is modulated by the environment. Thus establishing links between molecular features and both the available genotypic and phenotypic information is crucial for elucidating the causal pathways bridging from one to the other.

3) Reduce the complexity of all involved high-dimensional data. The idea is to identify meta-features p, m and g, which have significantly lower dimensionality than the corresponding full datasets (P, M and G). This applies in particular to the organismal phenotypes and the molecular data, which often contain redundant information (e.g. from closely related traits or molecular features) and for which various tools for dimensional reduction already exist. Yet, it is also potentially relevant for the enormous genotypic space, where little is known about how to reduce the effective number of variants beyond combining proximal ones which are in very high linkage disequilibrium (LD).

4) Use existing annotation to help the identification of relevant meta-features. The available annotation should be used to prioritize the potential relevance of the various meta-features. While for organismal traits there are sometimes well-established heuristics on how to combine elementary traits (like the BMI from weight and height), much less is known about how to effectively integrate the large amount of information on genes that can help to prioritize the genetic variants impacting their function, or the molecular traits they affect.

5) Generate new annotation by combining these features. Any pair of meta-features can be used to create new knowledge. For example, testing models that explain molecular meta-phenotypes in terms of meta-genotypes can identify sets of genetic variants that have a molecular phenotypic effect. Prioritizing these variants can in turn improve power for modeling the response of down-stream organismal traits. Finally, connecting molecular and organismal meta-features is likely to provide interesting links between these different levels that can be used to further refine these features.

6) Perform an iterative analysis that progressively identifies the most relevant meta-features needed for a particular biomedical question. This implies that the analysis should not stop once interesting links between the different data have been identified. Rather, these links should inform the integrative model to further refine and prioritize the meta-features within a specific analysis. For example, starting from a particular set of organismal phenotypes, one may identify the most relevant molecular traits and/or genotypes, which in turn may implicate additional phenotypes, and so on.
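As an illustration of hallmarks 3 and 6, the sketch below derives one meta-feature (a module of co-varying features and the samples in which they co-vary) by alternating between scoring samples and scoring features. It is written in the spirit of iterative signature approaches, but it is a simplified stand-in and not our published implementation: the function names, the plain z-score thresholds and the toy data are all assumptions made for illustration.

```python
import random
import statistics

def zscores(values):
    """Standardize a list of scores using the population standard deviation."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]

def refine_module(data, seed_rows, t_row=1.0, t_col=1.0, max_iter=50):
    """Alternate between sample and feature scoring until both sets stop changing.

    data: list of feature rows, each a list of per-sample values.
    Returns (feature indices, sample indices) of the resulting module.
    """
    n_rows, n_cols = len(data), len(data[0])
    rows, cols = set(seed_rows), set()
    for _ in range(max_iter):
        # Score every sample by its mean over the current feature set.
        col_scores = [statistics.fmean([data[r][c] for r in rows]) for c in range(n_cols)]
        new_cols = {c for c, z in enumerate(zscores(col_scores)) if z > t_col}
        if not new_cols:
            return set(), set()
        # Score every feature by its mean over the selected samples.
        row_scores = [statistics.fmean([data[r][c] for c in new_cols]) for r in range(n_rows)]
        new_rows = {r for r, z in enumerate(zscores(row_scores)) if z > t_row}
        if not new_rows:
            return set(), set()
        if new_rows == rows and new_cols == cols:
            break  # fixed point reached
        rows, cols = new_rows, new_cols
    return rows, cols

# Toy data (assumption): 40 features x 30 samples of noise with one planted module
# in features 0-4 and samples 0-5.
rng = random.Random(0)
data = [[rng.gauss(0.0, 0.5) for _ in range(30)] for _ in range(40)]
for r in range(5):
    for c in range(6):
        data[r][c] += 4.0

# The seed deliberately contains one feature (20) that is not part of the module.
module_rows, module_cols = refine_module(data, seed_rows=[0, 1, 2, 20])
print(sorted(module_rows), sorted(module_cols))
```

The per-sample scores of such a module form a single low-dimensional meta-phenotype that can be tested for association in place of its many constituent features.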

This integrated analysis framework is conceptually very different from the conventional GWAS pipeline, and has the potential to overcome some of its limitations. It builds on existing analysis tools developed previously by my group (see Early Achievements on page 7), which will be adapted and extended.

Importantly, as for any innovative approach, it will have to be evaluated rigorously within a concrete setting to demonstrate its potential benefits. We are in a unique position to have direct access to genotypic, phenotypic and molecular data from the Cohorte Lausannoise (CoLaus) [13], a population-based study of 6182 participants from Lausanne, Switzerland.

Project Objectives

1) Uncoupled generation of meta-features

a) Perform modular analyses generating meta-features from all molecular profiles: Using our Iterative Signature Algorithm [14-16] (ISA) and other standard tools (like PCA or clustering) we will first analyze existing metabolomics data from ~1000 CoLaus samples to assess whether metabolomics meta-features reflect any annotated compounds or pathways. Using RNAseq we will also generate transcriptomics profiles for lymphoblastoid cell-lines derived from the same samples to enable the analogous analysis for expression data. We will then perform standard GWAS to assess which meta-features have a significant genetically determined component and whether the association is stronger than that of any of their constituent metabolomics or transcriptomics features.

b) Perform modular analyses for phenotypic traits, including both the clinical phenotypes gathered for CoLaus and the mental health parameters obtained within its sub-study PsyCoLaus [17]: We will analyze which traits co-aggregate in the same module and perform standard GWAS to test for a stronger genetic component of any of the phenotypic meta-features (as in 1a). We will also check systematically whether linear models for major cardio-vascular risk factors explain more of the data when including certain meta-features related to environmental conditions as co-variables (similar to correcting for population stratification using genotypic PCs).

c) Develop new methods for aggregating genotypes: We will explore new ways to reduce the complexity of the genotypic data. PCA has been successful in capturing the population structure [18], but these very global features usually reflect shared environmental factors (like diet) and are therefore considered as co-variables that can mask the causal effects of individual genotypes. What is needed are new approaches to bundle relatively small groups of genotypes that co-segregate more often than expected. This may include LD blocks, but more interestingly long-range interactions, on which there is an increasing body of complementary information from new genomics tools unravelling chromosome architecture [19]. This will allow for reducing the burden of multiple hypothesis testing, because all constituent genotypes can be discarded at once if their representative “meta-genotype” exhibits no association signal with a phenotype of interest.
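The simplest form of the genotype aggregation in (1c), bundling adjacent markers in high LD, can be sketched as follows. This is a minimal illustration under stated assumptions: the greedy grouping rule, the r² cut-off of 0.8, the mean-dosage summary and the toy dosages are all hypothetical choices, and the long-range interactions mentioned above are not covered.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker carries no LD information
    return sxy * sxy / (sxx * syy)

def group_adjacent_snps(genotypes, min_r2=0.8):
    """Greedy blocks: a SNP joins the current block if its r^2 with the
    block's first SNP is at least min_r2, otherwise it opens a new block.

    genotypes: list of per-SNP dosage vectors, ordered by genomic position.
    """
    blocks = [[0]]
    for j in range(1, len(genotypes)):
        lead = blocks[-1][0]
        if r_squared(genotypes[lead], genotypes[j]) >= min_r2:
            blocks[-1].append(j)
        else:
            blocks.append([j])
    return blocks

def meta_genotype(genotypes, block):
    """Mean dosage over the SNPs of a block, per individual.
    Assumes alleles are coded so that SNPs in a block correlate positively."""
    n_individuals = len(genotypes[0])
    return [sum(genotypes[j][i] for j in block) / len(block) for i in range(n_individuals)]

# Toy dosages (assumption): 5 SNPs x 8 individuals.
genotypes = [
    [0, 1, 2, 0, 1, 2, 0, 1],
    [0, 1, 2, 0, 1, 2, 0, 1],
    [0, 1, 2, 0, 1, 2, 0, 2],
    [2, 0, 1, 1, 0, 0, 2, 1],
    [2, 0, 1, 1, 0, 0, 2, 1],
]
blocks = group_adjacent_snps(genotypes)
meta = [meta_genotype(genotypes, b) for b in blocks]
print(blocks)
```

Testing one meta-genotype per block instead of every SNP reduces the number of hypotheses from five to two in this toy case.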

2) Coupled and iterative generation of meta-features

a) Perform coupled analysis of distinct sets of molecular and clinical phenotypes using our Ping-Pong Algorithm [20] (PPA) and other tools (like Partial-Least-Squares) in order to generate modular links between the various types of data: We will use this approach to co-analyze pairs of phenotypic datasets, including:

i) NMR vs mass-spec metabolomics data to characterize the overlap and complementarity of these two technologies, and derive robust metabolomics signatures using coherent features from both types of data;

ii) Metabolomics vs transcriptomics data to reveal relationships between gene expression and metabolite concentrations;

iii) Blood chemistry data vs metabolomics and transcriptomics data to better understand the relation between the relatively inexpensive measurements routinely used in the clinics and the features of high-resolution molecular profiles;

iv) Organismal traits vs blood chemistry data, metabolomics and transcriptomics data to identify potential molecular signatures for disease-related abnormal organismal profiles.

b) Score genotypic markers for their relevance to any of the meta-features derived in the previous analyses. This will be done using three strategies:

i) Within the annotation-based approach genotypic markers will receive scores (or “priors” within a Bayesian statistical framework) if they are in LD with a gene (or its regulatory region) that can be linked to the meta-feature based on existing annotation (e.g. a known enzyme involved in the metabolism of a particular compound tagged by a metabolomics meta-feature);

ii) Within the model-based approach genotypic markers will receive priors based on the likelihood ratio of a specific model (e.g. a (set of) marker(s) explaining the meta-phenotype against some null model) using a regression or machine learning framework, c.f. point (3);

iii) Iteratively refine all meta-features: The sets of most relevant genotypic meta-features (i.e. sets of markers with the highest scores) will be used as new cues to update and refine the organismal and molecular meta-features (c.f. Fig. 2). This process will be repeated as long as there is a measurable increase in predictive power, see point (3).
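The model-based scoring in (2b-ii) can be sketched for the simplest case: a single marker in a Gaussian linear model compared against an intercept-only null model. For this case the log-likelihood ratio statistic reduces to n ln(RSS_null / RSS_alt). The function name and the toy data are assumptions for illustration; multi-marker or machine-learning models would need a different scoring function.

```python
import math

def lr_statistic(genotype, phenotype):
    """Likelihood ratio statistic n * ln(RSS_null / RSS_alt) for a
    one-marker linear regression against an intercept-only null model."""
    n = len(phenotype)
    mean_g = sum(genotype) / n
    mean_p = sum(phenotype) / n
    s_gg = sum((g - mean_g) ** 2 for g in genotype)
    s_gp = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotype, phenotype))
    rss_null = sum((p - mean_p) ** 2 for p in phenotype)
    if s_gg == 0 or rss_null == 0:
        return 0.0  # monomorphic marker or constant phenotype: no evidence
    # Residual sum of squares after fitting the least-squares slope s_gp / s_gg.
    rss_alt = rss_null - s_gp ** 2 / s_gg
    if rss_alt <= 0:
        return math.inf  # perfect fit
    return n * math.log(rss_null / rss_alt)

# Toy data (assumption): one meta-phenotype and two candidate markers.
meta_phenotype = [1.0, 2.1, 2.9, 1.2, 1.9, 3.2, 0.8, 2.0]
markers = [
    [0, 1, 2, 0, 1, 2, 0, 1],   # tracks the meta-phenotype closely
    [2, 0, 1, 1, 0, 0, 2, 1],   # only weakly related
]
scores = [lr_statistic(m, meta_phenotype) for m in markers]
ranking = sorted(range(len(markers)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

Under the null model the statistic is asymptotically chi-squared distributed with one degree of freedom, so the scores can be turned into priors or ranked directly, as done here.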

3) Benchmarking

It is important to combine this framework with a rigorous benchmarking procedure, since the identification and refinement procedure for meta-features in (1) and (2) will unavoidably include heuristic elements. Here we take a practical point of view with regard to this general challenge: Ultimately the goal of any framework for medical data integration should be the generation of new knowledge and the ability to predict clinically relevant endpoints, based on the available data.

a) As for the first goal, we will investigate systematically whether our novel analysis framework is able to elucidate genetic variants whose relevance for certain phenotypes has been demonstrated by extremely large meta-studies (like GIANT [2,3]) using only CoLaus data. In other words, we will ask whether data from a moderately sized cohort, if analyzed in a more sophisticated manner (e.g. using the scores in 2b), would be able to recapitulate (at least some of) the results of extremely well-powered studies.

b) As for the second goal, we will take advantage of the fact that CoLaus recently has become a longitudinal study, allowing for prospective analyses. Specifically, one can try to predict various clinically relevant parameters measured at follow-up (including cardio-vascular events, development of diabetes and even death) based on the data that were available at the baseline investigation (i.e. about five years earlier). We will apply well-developed machine learning tools, like Support-Vector Machines [21] (SVM) and Random Forests [22], to compare the predictive power using our meta-features with that based on the unprocessed raw data (using a cross-validation methodology).
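The cross-validated comparison in (3b) can be sketched with a small harness that scores two feature sets on the same folds. To keep the sketch free of dependencies, a nearest-centroid classifier stands in for the SVM and Random Forest models named above; the classifier, the helper names and the synthetic data are all assumptions for illustration.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the sample indices once and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nearest_centroid_predict(train_x, train_y, test_x):
    """Stand-in classifier: assign each test sample to the closest class mean."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(centroids, key=lambda lab: dist2(x, centroids[lab])) for x in test_x]

def cv_accuracy(features, labels, k=5, seed=0):
    """Fraction of samples classified correctly when each fold is held out once."""
    correct = 0
    for fold in k_fold_indices(len(labels), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(labels)) if i not in held_out]
        preds = nearest_centroid_predict(
            [features[i] for i in train],
            [labels[i] for i in train],
            [features[i] for i in fold],
        )
        correct += sum(p == labels[i] for p, i in zip(preds, fold))
    return correct / len(labels)

# Synthetic data (assumption): 40 individuals with a binary follow-up endpoint.
labels = [i % 2 for i in range(40)]
# One clean meta-feature per individual ...
meta_features = [[2.0 * labels[i] + ((i * 7) % 5 - 2) * 0.1] for i in range(40)]
# ... versus 20 noisy raw features, each carrying only a weak signal.
rng = random.Random(1)
raw_features = [[0.3 * labels[i] + rng.gauss(0.0, 1.0) for _ in range(20)] for i in range(40)]

acc_meta = cv_accuracy(meta_features, labels)
acc_raw = cv_accuracy(raw_features, labels)
print(f"meta-features: {acc_meta:.2f}, raw features: {acc_raw:.2f}")
```

Using the same seed for both calls ensures that the two feature sets are evaluated on identical folds, so any difference in accuracy reflects the features rather than the split.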


Figure 3: The trade-off relation between innovation and feasibility for our three main analysis goals.

Devising new strategies for medical data analysis is very timely given the current data deluge. Central to our proposal is our vision of the integrative framework illustrated in Fig. 2, which departs radically from the canonical analysis pipelines used by most GWAS. Nevertheless, it is important to realize that the impasses of this linear and brute-force approach are increasingly recognized, and that a growing community is moving towards a more integrated approach (sometimes termed “Systems Genetics” [23,24]). This approach has already made remarkable progress for model organisms [25-27], but is less established for human data. Thus, while my proposal derives its strengths and uniqueness from the available resources outlined above (including the first massive collection of already existing metabolomics data, which will be matched with transcriptomics data), it is well aligned and likely to cross-pollinate with other research in this field.

The feasibility of our proposal rests primarily on the well-established nature of the three components we aim to synthesize: (i) our expertise with (modular) analysis of large-scale phenotypic data [15,16,20,28-33], (ii) our experience with GWAS [2,3,34-38], and (iii) our direct access to existing data from the CoLaus study. The challenge lies in combining these assets, and connecting them with new methodologies. The trade-off relation between innovation and feasibility is determined by this difficulty and increases in a balanced manner for our three main analysis objectives (see Fig. 3 for illustration): For objectives (1a) and (1b) we can rely largely on our existing resources in terms of data and analysis tools. Objective 1c is a bit more challenging, because it calls for new ideas to reduce the genotypic complexity (like the use of information on chromosomal architecture [19]). Objective 2a has great potential to yield new insights of high methodological (2a-i/ii) or clinical (2a-iii/iv) relevance, but requires the integration of external annotation. We have ample experience in using gene annotation (like GO term enrichment analysis). We also profit from the close proximity to our colleagues at the Lausanne University Hospital, with whom we can consult on clinical matters. Since the analysis of metabolomics data is not within our direct expertise, we are fortunate to have an on-going collaboration with the Steinbeck Chemoinformatics group at the European Bioinformatics Institute (EBI), which has great experience in the analysis of mass- and NMR-spectra for structure elucidation. This support structure will also be invaluable for objective 2b-i, which also relies on the integration of external information. The most significant challenge in the remaining objectives is the integration of machine-learning approaches with our modular analysis tools. We have a solid background in non-linear classification theory, so we are confident that we can apply the well-established SVM [21] and Random Forests [22] to the problem at hand.


1. McCarthy, M.I. et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9, 356-69 (2008).

2. Lango Allen, H. et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467, 832-8 (2010).

3. Heid, I.M. et al. Meta-analysis identifies 13 new loci associated with waist-hip ratio and reveals sexual dimorphism in the genetic basis of fat distribution. Nat Genet 42, 949-60 (2010).

4. Maher, B. Personal genomes: The case of the missing heritability. Nature 456, 18-21 (2008).

5. Eichler, E.E. et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet 11, 446-50 (2010).

6. Manolio, T.A. et al. Finding the missing heritability of complex diseases. Nature 461, 747-53 (2009).

7. McCarroll, S.A. Extending genome-wide association studies to copy-number variation. Hum Mol Genet 17, R135-42 (2008).

8. Beckmann, J.S., Sharp, A.J. & Antonarakis, S.E. CNVs and genetic medicine (excitement and consequences of a rediscovery). Cytogenet Genome Res 123, 7-16 (2008).

9. Goldstein, D.B. Common genetic variation and human traits. N Engl J Med 360, 1696-8 (2009).

10. Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet 42, 565-9 (2010).

11. Visscher, P.M., Brown, M.A., McCarthy, M.I. & Yang, J. Five years of GWAS discovery. Am J Hum Genet 90, 7-24 (2012).

12. Dunham, I. et al. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57-74 (2012).

13. Firmann, M. et al. The CoLaus study: a population-based study to investigate the epidemiology and genetic determinants of cardiovascular risk factors and metabolic syndrome. BMC Cardiovasc Disord 8, 6 (2008).

14. Bergmann, S., Ihmels, J. & Barkai, N. Iterative signature algorithm for the analysis of large-scale gene expression data. Phys Rev E Stat Nonlin Soft Matter Phys 67, 031902 (2003).

15. Ihmels, J., Bergmann, S. & Barkai, N. Defining transcription modules using large-scale gene expression data. Bioinformatics 20, 1993-2003 (2004).

16. Ihmels, J. et al. Revealing modular organization in the yeast transcriptional network. Nat Genet 31, 370-7 (2002).

17. Preisig, M. et al. The PsyCoLaus study: methodology and characteristics of the sample of a population-based survey on psychiatric disorders and their association with genetic and cardiovascular risk factors. BMC Psychiatry 9, 9 (2009).

18. Novembre, J. et al. Genes mirror geography within Europe. Nature 456, 98-101 (2008).

19. van Steensel, B. & Dekker, J. Genomics tools for unraveling chromosome architecture. Nat Biotechnol 28, 1089-1095 (2010).

20. Kutalik, Z., Beckmann, J.S. & Bergmann, S. A modular approach for integrative analysis of large-scale gene-expression and drug-response data. Nat Biotechnol 26, 531-9 (2008).

21. Cristianini, N. & Shawe-Taylor, J. An introduction to Support Vector Machines : and other kernel-based learning methods, xi, 189 p. (Cambridge University Press, Cambridge, 2000).

22. Breiman, L. Random forests. Machine Learning 45, 5-32 (2001).

23. Li, H. Systems genetics in "-omics" era: current and future development. Theory Biosci 132, 1-16 (2013).

24. Nadeau, J.H. & Dudley, A.M. Genetics. Systems genetics. Science 331, 1015-6 (2011).

25. Atwell, S. et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465, 627-31 (2010).

26. Mackay, T.F. et al. The Drosophila melanogaster Genetic Reference Panel. Nature 482, 173-8 (2012).

27. Bloom, J.S., Ehrenreich, I.M., Loo, W.T., Lite, T.L. & Kruglyak, L. Finding the sources of missing heritability in a yeast cross. Nature 494, 234-7 (2013).

28. Bergmann, S., Ihmels, J. & Barkai, N. Similarities and Differences in Genome-Wide Expression Data of Six Organisms. PLoS Biol 2, E9 (2004).

29. Henrichsen, C.N. et al. Using transcription modules to identify expression clusters perturbed in Williams-Beuren syndrome. PLoS Comput Biol 7, e1001054 (2011).

30. Ihmels, J., Bergmann, S., Berman, J. & Barkai, N. Comparative gene expression analysis by differential clustering approach: application to the Candida albicans transcription program. PLoS Genet 1, e39 (2005).

31. Ihmels, J., Levy, R. & Barkai, N. Principles of transcriptional control in the metabolic network of Saccharomyces cerevisiae. Nat Biotechnol 22, 86-92 (2004).

32. Brawand, D. et al. The evolution of gene expression levels in mammalian organs. Nature 478, 343-8 (2011).

33. Piasecka, B., Kutalik, Z., Roux, J., Bergmann, S. & Robinson-Rechavi, M. Comparative modular analysis of gene expression in vertebrate organs. BMC Genomics 13, 124 (2012).

34. Genick, U.K. et al. Sensitivity of genome-wide-association signals to phenotyping strategy: the PROP-TAS2R38 taste association as a benchmark. PLoS One 6, e27745 (2011).

35. Hor, H. et al. Genome-wide association study identifies new HLA class II haplotypes strongly protective against narcolepsy. Nat Genet 42, 786-9 (2010).

36. Kapur, K., Schupbach, T., Xenarios, I., Kutalik, Z. & Bergmann, S. Comparison of strategies to detect epistasis from eQTL data. PLoS One 6, e28415 (2011).

37. Kutalik, Z. et al. Methods for testing association between uncertain genotypes and quantitative traits. Biostatistics 12, 1-17 (2011).

38. Kutalik, Z., Whittaker, J., Waterworth, D., Beckmann, J.S. & Bergmann, S. Novel method to estimate the phenotypic variation explained by genome-wide association studies reveals large fraction of the missing heritability. Genet Epidemiol 35, 341-9 (2011).


I have strong support for the NOFIA project from my colleagues Prof. Peter Vollenweider (PI of CoLaus, see media:letter_PV.pdf) and Prof. Martin Preisig (PI of PsyCoLaus, see media:letter_MP.pdf).