Robust inference of gene regulatory networks using bootstrapping


Revision as of 10:20, 28 May 2014

Background: Genome-scale inference of transcriptional gene regulation has become possible with the advent of high-throughput technologies such as microarrays and RNA sequencing, as they provide snapshots of the transcriptome under many tested experimental conditions. From these data, the challenge is to computationally predict direct regulatory interactions between a transcription factor and its target genes; the aggregate of all predicted interactions comprises the gene regulatory network. A wide range of network inference methods have been developed to address this challenge. We have previously organized a competition (the DREAM network inference challenge), where we rigorously assessed the state-of-the-art in gene network inference (see our paper to learn more). However, robustness of predictions to variability in the input data has so far not been characterized.

Goal: The aims of this project are to: (1) investigate the performance robustness of top-performing network inference methods from the DREAM5 challenge to variability in the input data, (2) improve the quality of predicted networks using a bootstrapping approach, (3) generate an improved prediction for the transcriptional regulatory network of E. coli and analyze its structural properties.

Mathematical tools: This project has a computational flavor. Students will familiarize themselves (at a high level) with gene network inference approaches, ensemble based approaches in machine learning (bootstrapping, bagging), and basic network properties such as degree distribution. A programming environment such as R or Matlab will be used. Network inference tools may have to be run from the command line (Unix console).

Biological or Medical aspects: The students will predict and analyze a genome-wide transcriptional regulatory network for E. coli.

Slides: Project introduction slides (pdf)

Supervisor: Daniel Marbach

Students: Laure Décombaz, Sashka Kaufmann, Yosra Zhang

Background: Our project builds on the data used in the DREAM challenge (in vivo E. coli expression data, etc.) and on the method used by the teams that produced the best predicted networks: bootstrapping. In summary, these teams proceeded as follows: microarray data (E. coli expression data, etc.) --> bootstrapping --> inference methods --> predicted networks.
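The pipeline above (resample the data, infer a network on each resample, combine the runs into a consensus) can be sketched as follows. This is a minimal illustration in Python (the project itself used R or Matlab), with a toy expression matrix and absolute correlation as a stand-in edge-scoring method; the gene/TF split and all parameters are hypothetical, not the actual DREAM5 data or tools.

```python
import numpy as np

def score_edges(expr, n_tf):
    # Toy inference method: absolute Pearson correlation between each
    # transcription factor (first n_tf rows) and every gene.
    c = np.corrcoef(expr)
    return np.abs(c[:n_tf, :])

def bootstrap_consensus(expr, n_tf, n_runs=10, fraction=1.0, seed=0):
    # Resample experimental conditions (columns) with replacement,
    # score edges on each resample, and average the edge ranks across
    # runs to build a consensus network.
    rng = np.random.default_rng(seed)
    n_cond = expr.shape[1]
    k = int(fraction * n_cond)  # fraction of the data per bootstrap run
    rank_sum = np.zeros((n_tf, expr.shape[0]))
    for _ in range(n_runs):
        cols = rng.integers(0, n_cond, size=k)
        scores = score_edges(expr[:, cols], n_tf)
        # Rank all edges within this run (higher score -> higher rank).
        ranks = scores.ravel().argsort().argsort().reshape(scores.shape)
        rank_sum += ranks
    return rank_sum / n_runs

# Toy data: 20 genes (the first 3 treated as TFs) x 30 conditions.
expr = np.random.default_rng(1).normal(size=(20, 30))
consensus = bootstrap_consensus(expr, n_tf=3, n_runs=5, fraction=0.8)
print(consensus.shape)  # (3, 20)
```

The `fraction` and `n_runs` parameters correspond directly to goals 1) and 2) below; swapping out `score_edges` corresponds to goal 3).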


Goals: The best teams in the DREAM challenge used bootstrapping, so we want to explore this method further. Our goals are therefore to see whether the performance varies when we:

1) change the fraction of the data used for the bootstrap runs

2) change the number of bootstrap runs

3) change the inference method

Inference methods used to reach our goals:

- Correlation

- Multiple regression
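Multiple regression scores edges by regressing each target gene's expression profile on the transcription factor profiles and taking the magnitude of the regression coefficients. A minimal sketch in Python (the project used R or Matlab), with toy data and a hypothetical gene/TF split:

```python
import numpy as np

def regression_scores(expr, n_tf):
    # Regress each gene's expression profile on the TF profiles and
    # use the absolute regression coefficients as edge scores.
    tf = expr[:n_tf].T                               # conditions x TFs
    X = np.column_stack([np.ones(tf.shape[0]), tf])  # add an intercept
    scores = np.zeros((n_tf, expr.shape[0]))
    for g in range(expr.shape[0]):
        coef, *_ = np.linalg.lstsq(X, expr[g], rcond=None)
        scores[:, g] = np.abs(coef[1:])              # drop the intercept
    return scores

# Toy data: 20 genes (the first 3 treated as TFs) x 30 conditions.
expr = np.random.default_rng(0).normal(size=(20, 30))
scores = regression_scores(expr, n_tf=3)
print(scores.shape)  # (3, 20)
```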

Precision-recall curve: To evaluate the performance of the bootstrap runs with the different inference methods, we create a plot in R with the precision on the vertical axis and the recall on the horizontal axis. The formulas are: Recall = TP(k)/P, where TP(k) is the number of true positives among the first k edges of the ranked list and P is the total number of positives (true edges) in the whole list. Precision = TP(k)/(TP(k)+FP(k)) = TP(k)/k. Example:
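Both formulas can be computed directly from a ranked edge list. A small sketch in Python (the project's plots were made in R), assuming a toy gold standard given as a set of true edges:

```python
def precision_recall(ranked_edges, true_edges):
    # ranked_edges: edges sorted by decreasing confidence.
    # true_edges: set of gold-standard edges (the positives, P in total).
    P = len(true_edges)
    tp = 0
    precision, recall = [], []
    for k, edge in enumerate(ranked_edges, start=1):
        if edge in true_edges:
            tp += 1
        precision.append(tp / k)   # Precision = TP(k) / k
        recall.append(tp / P)      # Recall = TP(k) / P
    return precision, recall

# Toy example: 5 predicted edges, 3 of which are in the gold standard.
ranked = [("tf1", "g1"), ("tf1", "g2"), ("tf2", "g1"),
          ("tf2", "g3"), ("tf3", "g2")]
gold = {("tf1", "g1"), ("tf2", "g3"), ("tf3", "g2")}
p, r = precision_recall(ranked, gold)
print(p)  # [1.0, 0.5, 0.3333333333333333, 0.5, 0.6]
print(r)  # [0.3333333333333333, 0.3333333333333333, 0.3333333333333333, 0.6666666666666666, 1.0]
```

Plotting `p` against `r` gives the precision-recall curve used to compare methods.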



Results:

1)

2)

3)

Conclusions:

- Bootstrapping does not improve the performance compared to using all the data, whether we vary the inference method, the fraction of resampled data, or the number of bootstrap runs.

- The consensus improves over individual bootstrap runs, but does not improve over using all the data for the tested methods (Spearman correlation and multiple regression).

- As we increase the number of bootstrap runs, the performance of the consensus network converges to the performance obtained using all the data (without bootstrapping).

Further exploration: We could examine whether the performance varies with other inference methods (SVM, Lasso, random forest, etc.) when we bootstrap the data.