% \VignetteIndexEntry{ISA module trees} \documentclass{article} \usepackage{ragged2e} \usepackage{url} \newenvironment{narrow}[2]{% \begin{list}{}{% \setlength{\topsep}{0pt}% \setlength{\leftmargin}{#1}% \setlength{\rightmargin}{#2}% \setlength{\listparindent}{\parindent}% \setlength{\itemindent}{\parindent}% \setlength{\parsep}{\parskip}}% \item[]}{\end{list}} \newcommand{\Rfunction}[1]{\texttt{#1()}} \newcommand{\Rpackage}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{\texttt{#1}} \newcommand{\Rargument}[1]{\textsl{#1}} \newcommand{\filename}[1]{\texttt{#1}} \newcommand{\variable}[1]{\texttt{#1}} \SweaveOpts{cache=TRUE} \SweaveOpts{prefix.string=plot} \setlength{\parindent}{2em} \begin{document} \title{Transcription Module Trees from Gene Expression Data} \author{G\'abor Cs\'ardi} \maketitle \tableofcontents \RaggedRight <>= options(width=60) options(continue=" ") @ \section{Introduction} The Iterative Signature Algorithm (ISA) finds transcription modules in gene expression data. A module is a block of the reordered gene expression matrix. The ISA has two threshold parameters: the gene (or feature) threshold and the conditions (or sample) threshold. These essentially specify the stringency of the biclustering. If the gene threshold is high, then modules will have less genes, only the one that are consistently far above or below the mean expression level will be included. The same is true for the sample threshold and the samples in the modules. It is possible to run the ISA using different threshold parameters and define a hierarchical organization of the found modules. We call such a description a sweep graph or sweep tree. The vertices (or nodes) of the sweep graph are the modules and there is a directed edge from module A to module B if the ISA converges to module B, when started from module A, and the thresholds that were used to find module B are used. A sweep tree can be generated by keeping one threshold parameter fixed, and vary the other one. We first find modules at a stringent threshold value and then check where these converge at less stringent thresholds. This document shows how it is possible to create a sweep tree with the \Rpackage{eisa} package and the tree can be visualized. \section{Running the ISA} We will use the acute lymphoblastic leukemia data set from the \Rpackage{ALL} package. <>= library(ALL) library(hgu95av2.db) library(genefilter) library(eisa) library(GO.db) data(ALL) @ <>= varLimit <- 0.5 kLimit <- 4 ALimit <- 5 @ First we filter the genes and keep only the ones that have an expression level of at least \Sexpr{ALimit} in at least \Sexpr{kLimit} samples; and that have a variance of at least \Sexpr{varLimit} across the samples. <>= flist <- filterfun(function(x) var(x)>varLimit, kOverA(kLimit,ALimit)) ALL.filt <- ALL[genefilter(ALL, flist), ] @ We remove the T-cell samples as well, and use only the B-cell leukemia samples. <>= ALL.filt2 <- ALL.filt[, grepl("^B", ALL.filt$BT)] @ Next, we run the ISA, with keeping the condition threshold fixed, but varying the gene threshold. <>= set.seed(2) modules <- ISA(ALL.filt2, flist=NA, thr.gene=seq(2,4,by=0.5), thr.cond=1) @ \section{Generating the module tree} The \Rfunction{ISASweep} function can be used to create a sweep tree from a set of ISA modules. This essentially means finding the edges between the modules and occasionally involves finding new modules as well. <>= modules2 <- ISASweep(ALL.filt2, modules) @ \Rfunction{ISASweep} returns an \Rclass{ISAModules} object, but adds some new seed data that defines the edges of the sweep graph, plus possibly some new modules: <>= modules modules2 colnames(seedData(modules)) colnames(seedData(modules2)) @ The \texttt{father} column contains pointers to the parent vertices in the graph and the \texttt{level} column gives levels defined based on the gene thresholds, higher levels indicate less stringent thresholds. This data can be converted into a graph and then plotted using the \Rpackage{igraph} package. The \Rfunction{ISASweepGraph} function creates the graph. <>= G <- ISASweepGraph(modules2) @ $G$ has a lot of attributes: <>= list.graph.attributes(G) list.vertex.attributes(G) @ The \texttt{layout} graph attribute contains a two-column matrix with carefully calculated coordinates for plotting the graph. The \texttt{width} and \texttt{height} graph attributes contain the suggested size of the plot in inches. <>= G$width G$height @ The \texttt{id} vertex attribute contains the ids of the modules. The other vertex attributes essentially define the look of the graph, when it is plotted. See the documentation of the \Rpackage{igraph} package for more about igraph graph objects. \section{Plotting the sweep tree} \subsection{Simple plots} The \Rfunction{ISASweepGraphPlot} function can be used to plot the graph returned by \Rfunction{ISASweepGraph}. We will use the suggested size for the plot, and leave all plotting options at their default values. The result is in Fig.~\ref{fig:moduletree}. <>= X11(width=G$width, height=G$height) @ <>= ISASweepGraphPlot(G) @ \begin{figure} \begin{narrow}{-1cm}{-1cm} \setkeys{Gin}{width=\linewidth} <>= ps.options(fonts=c("serif", "mono")) <> @ \caption{An ISA module tree. Each rectangle symbolizes a transcription module, modules in the same column were found at the same gene threshold. An arrow between two modules indicates that one is the ``generalization'' of the other, in the sense that ISA converges to the more general (=bigger) module at the milder threshold.} \label{fig:moduletree} \end{narrow} \end{figure} \subsection{Customized plots} Let us show a little example on how to customize the module tree plot. We will add some annotation to the module, based on Gene Ontology enrichment calculations. Let us perform the hypergeometric test for the enrichment calculation. The $p$-values are automatically corrected using the Benjamini-Hochberg method. <>= GO <- ISAGO(modules2) @ We will color the vertices according to their enrichment in the Biological Process GO ontology. We also create some labels for them, the abbreviated GO terms will be used for this. <>= p.bp <- sapply(summary(GO$BP), function(x) as.integer(-log10(x$Pvalue[1]))) c.bp <- d.bp <- sapply(sigCategories(GO$BP), function(x) x[1]) d.bp[!is.na(c.bp)] <- sapply(mget(na.omit(c.bp), GOTERM), Term) d.bp <- abbreviate(d.bp, 6) colbar <- hcl(h=260, c=35, l=seq(30, 100, length=20)) colbar <- c("#FFFFFF", rev(colbar)) col <- colbar[p.bp] @ <>= X11(width=G$width, height=G$height) @ We are ready to plot the sweep tree now. We put the abbreviated GO terms to the top left corner, and the size of the module (i.e. number of genes and number of samples) in the top right corner. We also increase the height of the vertices a bit. <>= ISASweepGraphPlot(G, vertex.color=col, vertex.size2=50, vertex.label.topleft=d.bp, vertex.label.topright=paste(V(G)$noFeatures, sep=",", V(G)$noSamples)) @ Finally, we add a key for the abbreviated names at the top right corner. The result is in Fig.~\ref{fig:moduletree2}. <>= key <- na.omit(d.bp)[unique(names(na.omit(d.bp)))] key2 <- paste(key, sep=": ", names(key)) legend("topright", key2) @ \begin{figure} \begin{narrow}{-1cm}{-1cm} \setkeys{Gin}{width=\linewidth} <>= <> <> @ \caption{The same module tree, but with some annotation added. The modules are colored according to their Gene Ontology Biological Process enrichment $p$-values, darker colors correspond to more significant enrichment. The enriched GO terms and the sizes of the modules were also added.} \label{fig:moduletree2} \end{narrow} \end{figure} \section{More information} For more information about the ISA, please see the references below. The ISA homepage at \url{http://www.unil.ch/cbg/homepage/software.html} has example data sets, and all ISA related tutorials and papers. \section{Session information} The version number of R and packages loaded for generating this vignette were: <>= toLatex(sessionInfo()) @ \bibliographystyle{apalike} \bibliography{EISA} \nocite{*} \end{document}