% \VignetteIndexEntry{ISA internals} \documentclass{article} \usepackage{ragged2e} \usepackage{url} \newcommand{\Rfunction}[1]{\texttt{#1()}} \newcommand{\Rpackage}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{\texttt{#1}} \newcommand{\Rargument}[1]{\textsl{#1}} \newcommand{\filename}[1]{\texttt{#1}} \newcommand{\variable}[1]{\texttt{#1}} \SweaveOpts{prefix.string=plot} \begin{document} \title{ISA internals} \author{G\'abor Cs\'ardi} \maketitle \tableofcontents \RaggedRight \setlength{\parindent}{2em} <>= options(width=60) options(continue=" ") .getmaxmem <- function() { ml <- system("free", intern=TRUE)[2] m <- sub(" .*$", "", sub("^[^1-9]*", "", ml)) format(as.numeric(m) / 1024 / 1024, digits=4) } .getnoproc <- function() { pc <- system("cat /proc/cpuinfo | grep processor | tail -1", intern=TRUE) as.numeric(sub("^[^1-9]*", "", pc)) + 1 } .getproctype <- function() { mn <- system("cat /proc/cpuinfo | grep 'model name' | head -1", intern=TRUE) sub("^.*:[ ]*", "", gsub("[\t ]+", " ", mn)) } @ <>= load("ISA_internals.Rdata") @ \section{Introduction} The Iterative Signature Algorithm (ISA) is a biclustering method, it finds consistent blocks (modules, biclusters) in tabular data, e.g. gene expression measurements across a number of samples. Please see the introductory tutorials at the ISA homepage, \url{http://www.unil.ch/cbg/ISA}, and also \cite{bergmann03} for details. In this document we specifically deal with the implementation of the ISA in the \Rpackage{isa2} and \Rpackage{eisa} R packages. \section{Why two packages?} We implemented the ISA in two R packages, \Rpackage{isa2} and \Rpackage{eisa}. ISA is a very general algorithm, that can be used for any tabular data to find correlated blocks. Examples for such tabular data are gene expression \cite{ihmels04}, response of different cell lines to a number of drugs \cite{kutalik08}. It can also be used as a general classifier, for biological or other data. It is also true, however, that the ISA is frequently used for gene expression analysis. Thus, we decided to provide two different user interfaces to the ISA. One interface, provided by \Rpackage{isa2} package, is very general, the input of the \Rpackage{isa2} functions is a numeric matrix, and the output contains two matrices, defining the ISA modules. The \Rpackage{isa2} package can be used for the modular analysis of tabular data of any kind, from any source. The \Rpackage{eisa} package (the leading `e' stands for expression) provides a second interface, for people that specifically deal with gene expression data. This package builds on the infrastructure created by the BioConductor project \cite{BioC}. The input of the \Rpackage{eisa} functions is an \Rclass{ExpressionSet} object, the standard BioConductor data structure for storing gene expression data. BioConductor provides functions to create such an \Rclass{ExpressionSet} object from raw data and to download data from public repositories, such as the Gene Expression Omnibus \cite{davis07,barret09} and transform it into an \Rclass{ExpressionSet}. The output of the \Rpackage{eisa} functions is an object that contains the ISA modules, the annotation of the genes and samples in the data set and possibly also further experimental meta data. The \Rpackage{eisa} package provides functions for calculating enrichment statistics against various databases for the ISA modules. The \Rpackage{eisa} package uses already existing BioConductor annotation packages, so it works for any organism that is supported by BioConductor. Having two ISA packages, however, does not mean two implementations of the ISA. The \Rpackage{eisa} package is fully built on top of the services of the \Rpackage{isa2} package, only the latter one contains the implementation of the ISA iteration. The two packages allow ease of installation and use: users dealing with gene expression data install the \Rpackage{eisa} package, and this automatically installs the \Rpackage{isa2} package as well. Users analyzing other data install the \Rpackage{isa2} package only, this does not need any BioConductor packages. The \Rpackage{isa2} package is part of the standard R package repository (CRAN), the \Rpackage{eisa} package has been accepted as an official BioConductor package and is included in BioConductor from the 2.6 release, due in April, 2010. \section{Speeding up the ISA iteration} ISA is an unsupervised, iterative, randomized algorithm. It starts with a seed vector. $r_0$. This vector is an initial guess for the rows of the input matrix that form a single module. This guess is then refined, by iterating itself and another vector, $c_i$, that defines the columns of the module. During the iteration, the ISA uses two matrices, $E_r$ and $E_c$, derived from the input matrix, by standardizing it row-wise and column-wise, respectively. One step of the ISA iteration involves (1) multiplying $E_r$ by $r_n$ and then (2) thresholding the result to keep elements that are further away from its mean than a prescribed value, $\Theta_c$, in standard deviation units. This gives $c_{n+1}$. Next, (3) $E_c$ is multiplied by $c_{n+1}$ and (4) the result is thresholded with $\Theta_r$. This gives $r_{n+1}$. The iteration is finished if $r_{n+1}$ is close to $r_n$ and $c_{n+1}$ is close to $c_n$. Considerable speedup can be achived, if the ISA iteration is performed in batches of seed vectors, instead of handling them individually. The reason for this is the availability of the highly optimized linear algebra libraries that perform matrix-matrix multiplication much faster than all the corresponding matrix-vector multiplications individually. The seed vectors can be merged into a seed matrix. This can be done, even if the ISA thresholds are different for the different seeds, the \Rpackage{isa2} and \Rpackage{eisa} packages support this. We considered using sparse matrices during the ISA iteration, because the matrix of seed vectors is sparse, but according to our measurements, sparse matrices are only marginally faster, and only in some cases. The reason for this is, that the product of the two matrices is always dense, and gets sparse only after the thresholding; the dense-sparse matrix multiplication, plus the conversion to a sparse matrix again, takes about the same time as the dense-dense matrix multiplication. Different input seeds converge in different number of steps, in parctice many seeds tend to converge quickly, and very few seeds need a lot of steps. Because of this, it is essential to remove the seeds that have already converged, from the seed matrix, so that only a smaller seed matrix needs to be iterated for many steps. As loops are typically slow in R, we implemented the thresholding operations in C. \section{Running time analysis} In this section we show an experimental running time analysis for our ISA implementation. \subsection{The hardware and software} The code in the following sections were run under \Sexpr{Sys.info()["sysname"]} operating system, release '\verb+\Sexpr{Sys.info()["release"]}+', version `\verb+\Sexpr{Sys.info()["version"]}+', on an `\verb+\Sexpr{Sys.info()["machine"]}+' machine with \Sexpr{.getnoproc()} processors of type `\texttt{\Sexpr{.getproctype()}}' and \Sexpr{.getmaxmem()} GiB memory. \subsection{Getting the data} We use subsets of the same data set for the test, this is a data set that is publicly available in the Gene Expression Omnibus (GEO) repository, its id is \Sexpr{"GSE18858"}. Note, that for the analysis of gene expression data, the \Rpackage{eisa} package is a better choice. It makes sense, however, if this tutorial can be run without any BioConductor packages, so we will use the \Rpackage{isa2} package. The non-normalized data set of the experiment is available from GEO as a numeric matrix. To spare time and bandwidth, we only download it once, even if the code of this tutorial is run multiple times. We store the data file in the current local directory. If the file is already there, then there is nothing to do. <>= GEO <- "GSE18858" GEOfile <- paste(sep="", GEO, "_series_matrix.txt.gz") GEOurl <- paste(sep="", "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SeriesMatrix/", GEO, "/", GEOfile) if (!file.exists(GEOfile)) { download.file(GEOurl, GEOfile) } @ Next, we are read in the data. Lines that are part of the file header start with an exclamation mark in GEO files, so we skip them by setting the \verb+comment.char+ argument of \Rfunction{read.table}. The table has a header, the first lines are the sample names; it also has the probeset names in the first column, we use these as row names. <>= data <- read.table(gzfile(GEOfile), comment.char="!", header=TRUE, row.names=1) data <- as.matrix(data) @ Let's check the size of the data matrix. <>= dim(data) @ It has \Sexpr{nrow(data)} rows (=probesets) and \Sexpr{ncol(data)} columns. We load the \Rpackage{isa2} package, to perform the ISA. No BioConductor packages are needed. <>= library(isa2) @ \subsection{Measuring running time} We define a simple function first, that runs the various steps of the ISA toolchain and measures the running time of each step. We use the \Rfunction{system.time} function for this. The results are returned in table. <>= mesISA <- function(E, thr.row, thr.col, no.seeds) { t1 <- system.time({ NE <- isa.normalize(E) }) t2 <- system.time({ seeds <- generate.seeds(length=nrow(E), count=no.seeds) }) t3 <- system.time({ mods <- isa.iterate(NE, row.seeds=seeds, thr.row=thr.row, thr.col=thr.col) }) t4 <- system.time({ mods2 <- isa.unique(NE, mods) }) cbind(normalization=t1, seeds.generation=t2, isa.iteration=t3, module.merge=t4, full=t1+t2+t3+t4) } @ We quickly test this function, on a small subset of our data, with 1000 rows and 30 columns, chosen randomly. <>= mydata <- data[ sample(nrow(data), 1000), sample(ncol(data), 30)] mesISA(mydata, 3, 3, 100) @ The table has one column for each step of the ISA toolchain: normalization, seed generation, performing the ISA iteration, merging the modules, and there is a column for the total time of these operations as well. The first row shows the processor time used by process itself, the second row the time spent in system calls. In the following we will use the total of both of them to measure the speed of the implementation. \subsection{Number of rows and columns} First, we increase the number of rows in the data matrix gradually, and measure the running time of ISA. We do this for variuos (row) thresholds, to see if the trend is threshold-dependent. The column threshold will be fixed now. <>= row.thresholds <- seq(1,3,by=0.5) col.thresholds <- 2 @ We create a function, \Rfunction{do.row.thr}, that runs the ISA for fixed threshold parameters, and different number of rows in the data matrix. The function also does replications, 5 by default. <>= no.rows <- seq(5000, min(40000, nrow(data)), by=5000) do.row.thr <- function(thr, rep=5) { res <- lapply(no.rows, function(x) { lapply(1:rep, function(xxx) { mydata <- data[sample(nrow(data), x),] mesISA(mydata, thr, col.thresholds, 100) }) }) res } @ We are ready to do the running time measurement now; separately for each row threshold parameter. This takes about three hours to run, on the platform mentioned above. <>= by.rows <- lapply(row.thresholds, do.row.thr) @ Next, we define a function to plot the results, with error bars. Error bars are not supported by the builtin R plotting functions, so we put them together from line segments. <>= myplot <- function(x, y, sd, xlim=range(x), ylim=c(min(y-sd),max(y+sd)), xlab="", ylab="", ...) { plot(NA, type="n", xlim=xlim, ylim=ylim, xlab=xlab, ylab=ylab) xmin <- par("usr")[1] xr <- par("usr")[2]-xmin bw <- xr/200 segments(x,y-sd,x,y+sd) segments(x-bw,y-sd,x+bw,y-sd) segments(x-bw,y+sd,x+bw,y+sd) points(x, y, ...) } @ The following two functions calculate the mean and standard deviation of the running times in the result lists, we will use them later. <>= get.mean <- function(xx) { sapply(xx, function(x) mean(sapply(x, function(y) sum(y[1:2,5])))) } get.sd <- function(xx) { sapply(xx, function(x) sd(sapply(x, function(y) sum(y[1:2,5])))) } @ We are ready to create a plot, the running times in the function of the number of rows in the input matrix. The results can be seen in Fig.~\ref{fig:by-rows}. <>= layout(rbind(1:2,3:4,5:6)) for (i in 1:length(row.thresholds)) { par(mar=c(5,4,1,1)+0.1) y <- get.mean(by.rows[[i]]) s <- get.sd(by.rows[[i]]) myplot(no.rows, y, s, type="b", pch=20, xlab="# of rows", ylab="running time [s]") rt <- row.thresholds[i] text(min(no.rows), max(y+s), substitute(Theta[r]==rt, list(rt=rt)), adj=c(0,1), cex=1.3) } @ \begin{figure} \centering \setkeys{Gin}{width=\linewidth} <>= <> @ \caption{Running time of the ISA, in the function of the number of rows in the input matrix, for various row thresholds. Crearly, the running time increases linearly with the number of rows, for all row thresholds. It is also true, that for higher thresholds the running times tend to be smaller, this is because in these cases more seeds converge quickly to the null vector, and these don't need to be iterated further. Each data point is the mean of five runs. } \label{fig:by-rows} \end{figure} In the following, we perform a similar analysis for the number of columns in the input matrix. ISA is a symmetric algorithm, rows are treated the same way as columns. The reason for discussing them separately here, is that gene expression matrices have usually much more rows than columns and this difference might affect the running time. For these runs, the row threshold is fixed and the column threshold changes from 1 to 3. <>= row.thresholds2 <- 2 col.thresholds2 <- seq(1,3,by=0.5) @ We create a function to perform all the runs for a given column threshold, with replications five by default. <>= no.cols <- seq(30, ncol(data), by=30) do.col.thr <- function(thr, rep=5) { res <- lapply(no.cols, function(x) { lapply(1:rep, function(xxx) { mydata <- data[, sample(ncol(data), x)] mesISA(mydata, row.thresholds2, thr, 100) }) }) res } @ We are ready to do the measurements. On the above mentioned hardware configuration, this takes about 4 hours to run, depending on the load of the system. <>= by.cols <- lapply(col.thresholds2, do.col.thr) @ The mean running times are plotted, the results are shown in Fig.~\ref{fig:by-cols}. <>= layout(rbind(1:2,3:4,5:6)) for (i in 1:length(col.thresholds2)) { par(mar=c(5,4,1,1)+0.1) y <- get.mean(by.cols[[i]]) s <- get.sd(by.cols[[i]]) myplot(no.cols, y, s, type="b", pch=20, xlab="# of cols", ylab="running time [s]") rt <- col.thresholds2[i] text(min(no.cols), max(y+s), substitute(Theta[c]==rt, list(rt=rt)), adj=c(0,1), cex=1.3) } @ \begin{figure} \centering \setkeys{Gin}{width=\linewidth} <>= <> @ \caption{Running times, in the function of the number of columns in the input matrix, for various column thresholds. Interestingly, the running time can be considered as independent of the number of columns. This is simply because the row seed matrix is two orders of magnitude larger than the column seed matrix, and the former dominates the running time. Each data point is the mean of five runs.} \label{fig:by-cols} \end{figure} \subsection{Number of seeds} Finally, we also check the running time in the function of the number of starting seeds. We define three threshold configurations to test. The first has intermediate thresholds for both the rows and the columns, the second has a low threshold for the rows and a high threshold for the columns, the third is the opposite of the second. <>= thr.comb <- list( c(2,2), c(1,3), c(3,1) ) no.seeds <- seq(50, 400, by=50) @ The following function does all the runs for a given threshold combination. The size of the data matrix is fixed here, 20000 times 100. <>= do.no.seeds <- function(thr, rep=5) { nr <- min(20000, nrow(data)) nc <- min(100, ncol(data)) res <- lapply(no.seeds, function(x) { lapply(1:rep, function(xxx) { mydata <- data[sample(nrow(data), nr), sample(ncol(data), nc)] mesISA(mydata, thr[1], thr[2], x) }) }) res } @ We are ready to do the measurement now. <>= by.no.seeds <- lapply(thr.comb, do.no.seeds) @ The results are plotted in Fig.~\ref{fig:no-seeds}. <>= layout(rbind(1:2,3:4)) for (i in 1:length(thr.comb)) { par(mar=c(5,4,1,1)+0.1) y <- get.mean(by.no.seeds[[i]]) s <- get.sd(by.no.seeds[[i]]) myplot(no.seeds, y, s, type="b", pch=20, xlab="# of seeds", ylab="running time [s]") th <- thr.comb[[i]] text(min(no.seeds), max(y+s), substitute(paste(Theta[r]==r1, ", ", Theta[c]==r2), list(r1=th[1], r2=th[2])), adj=c(0,1), cex=1.3) } @ \begin{figure} \centering \setkeys{Gin}{width=\linewidth} <>= <> @ \caption{The running time, in the function of the number of starting seeds, for various row and column threshold combinations. Each point is the average of five ISA runs. The running time is increasing linearly with the number of seeds. } \label{fig:no-seeds} \end{figure} \section{Running ISA in parallel} It is trivial to run an ISA analysis in parallel, on a multi-processor machine, or a computer cluster: one just runs different threshold-combinations and/or different seeds on the different processors. The achived speedup is close to linear, since the ISA iteration step dominates in the toolchain. Please see more about this on the ISA homepage at \url{http://www.unil.ch/cbg/ISA}. \section{Session information} The version number of R and packages loaded for generating this vignette were: <>= toLatex(sessionInfo()) @ \bibliographystyle{apalike} \bibliography{tissues} \end{document}