% \VignetteIndexEntry{The Iterative Signature Algorithm}
\documentclass{article}
\usepackage{ragged2e}
\usepackage{url}
\newcommand{\Rfunction}[1]{\texttt{#1()}}
\newcommand{\Rpackage}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{\texttt{#1}}
\newcommand{\Rargument}[1]{\textsl{#1}}
\newcommand{\filename}[1]{\texttt{#1}}
\newcommand{\listentry}[1]{`\texttt{#1}'}
\SweaveOpts{prefix.string=graphics/plot}
\begin{document}
\setkeys{Gin}{width=\textwidth}
\title{The Iterative Signature Algorithm}
\author{G\'abor Cs\'ardi}
\maketitle
% \RaggedRight
\tableofcontents
<>=
## Some options for R
options(width=60)
options(continue=" ")
if (require(Cairo)) { png <- CairoPNG }
dir.create("graphics")
@
\section{For the impatient}
To run the typical ISA work flow with default parameters on your data
matrix, just load the \Rpackage{isa2} package and call the
\Rfunction{isa} function with your matrix as the single argument. The
return value of \Rfunction{isa} is a named list, the members
\texttt{rows} and \texttt{columns} contain the biclusters ISA have
found. Every bicluster is given by a pair of columns from
\texttt{rows} and \texttt{columns} (i.e. the first columns define the
first bicluster, etc.) and the elements of the biclusters are the
non-zero elements in the columns of \texttt{rows} and
\texttt{columns}.
Please continue reading for a less dense tutorial.
\section{Introduction}
The Iterative Signature Algorithm (ISA)~\cite{sa,isa,isamod} is
biclustering method. Its input is a matrix and its output is a set of
biclusters: blocks of the potentially reordered input matrix, that
fulfill some predefined criteria. A biclustering algorithm typically
tries to find blocks that are different from the rest of the matrix,
e.g. the values covered by the bicluster are all above or below the
background.
The ISA is developed to find biclusters (or modules as most of the ISA
papers call them) that have correlated rows and columns. More
precisely, the rows in the bicluster need to be only correlated across
the columns of the bicluster and vice versa.
Fig.~\ref{fig:twoclusters} shows possibly the simplest
example of a (rather artificial) data matrix with very strong modular
structure. It is a $20\times 20$ matrix and---after reordering its rows
and columns---it has two correlated blocks,
each of size $10\times 10$.
\begin{figure}
<>=
library(isa2)
library(Matrix)
data <- data2 <- isa.in.silico(20,20,2,10,10)[[1]]
perm1 <- sample(seq_len(nrow(data)))
data2 <- data2[perm1,]
perm2 <- sample(seq_len(ncol(data)))
data2 <- data2[,perm2]
alldata <- structure(c(data2, data), dim=c(nrow(data),ncol(data),2),
dimnames=list(NULL,NULL,c("Original matrix",
"Reordered matrix")))
rx <- range(alldata, finite = TRUE)
nn <- 100
n0 <- min(nn, max(0, round((0 - rx[1])/(rx[2] - rx[1]) * nn)))
col.regions <- c(colorRampPalette(c("blue3", "gray80"))(n0),
colorRampPalette(c("gray75", "red3"))(nn - n0))
lp <- levelplot(alldata, xlab="", ylab="",
scale=list(y=list(draw=FALSE),
x=list(draw=FALSE)),
col.regions=col.regions, between=list(x=0.5))
print(lp)
@
\caption{An artificial data matrix, on the left. On the right the
reordered data matrix with two blocks.}
\label{fig:twoclusters}
\end{figure}
\section{How ISA works}
Before showing an actual ISA tool chain, a few words about how the
algorithm works are in order.
\subsection{ISA iteration}
ISA works in an iterative way. For an $E$ $(m\times n)$
input matrix it starts from a seed vector $r_0$, which is
typically a sparse 0/1 vector of length $m$. The non-zero elements in
the seed vector define a set of rows in $E$. First, the transposed of
$E$, $E'$ is multiplied by $r_0$ and the result is thresholded.
The thresholding is an important step of the ISA, without thresholding
ISA would be equivalent to a (not too effective) numerical singular
value decomposition (SVD) algorithm. Currently thresholding is done by
calculating the mean and standard deviation of the vector and keeping
only elements that are further than a given number of standard
deviations from the mean. Based on the \Rargument{direction} parameter,
this means keeping values that are significantly higher than the
mean (direction=``\emph{up}''), or keeping the ones that are
significantly lower than the mean (direction=``\emph{down}''); or
keeping both (direction=``\emph{updown}'').
The thresholded vector $c_0$ is the (column)
\emph{signature} of $r_0$. Then the (row) signature of
$c_0$ is calculated, $E$ is multiplied by $c_0$ and then thresholded
to get $r_1$.
This iteration is performed until it converges, i.e. $r_{i-1}$
and $r_i$ are \emph{close}, and $c_{i-1}$ and
$c_i$ are also close. The convergence criteria,
i.e. what \emph{close} means, is by default defined by high Pearson
correlation.
The \Rfunction{isa.iterate} function performs the ISA iteration from a
given set of seed vectors.
It is very possible that the ISA finds the same module more than once;
two or more seed vectors might converge to the same module. The function
\Rfunction{isa.unique} eliminates every module from the result of
\Rfunction{isa.iterate} that is very similar (in terms of
Pearson correlation) to one that was already found from a different
seed.
The \Rfunction{isa} function performs the whole ISA workflow, this
includes running \Rfunction{isa.iterate} and \Rfunction{isa.unique}.
It might be also apparent, that the ISA
biclusters are soft, i.e. they might have an overlap in their rows,
columns, or both. It is also possible that some rows and/or
columns of the input matrix are not found to be part of any
biclusters. Depending on the stringency parameters in the
thresholding (i.e. how far the values should be from the mean), it
might even happen that ISA does not find any biclusters.
\subsection{Parameters}
The two main parameters of ISA are the two thresholds (one for the
rows and one for the columns). They basically define the stringency of
the modules. If the row threshold is high, then the modules will have
very similar rows. If it is mild, then modules will be bigger, with
less similar rows than in the first case. The same applies to the
column threshold and the columns of the modules.
\subsection{Random seeding and smart seeding}
By default (i.e. if the \Rfunction{isa} function is used) the ISA is
performed from random sparse starting seeds, generated by the
\Rfunction{generate.seeds} function. This way the algorithm is
completely unsupervised, but also stochastic: it might give different
results for different runs.
It is possible to use non-random seeds as well. If you have some
knowledge about the data or are interested in a particular subset of
rows/columns, then you can feed in your seeds into the
\Rfunction{isa.iterate} function directly. In this case the
algorithm is deterministic, for the same seed you will always get the
same results. Using smart (i.e. non-random) seeds can be considered as
a semi-supervised approach.
\subsection{Normalization}
Using in silico data we observed that ISA has the best performance if the
input matrix is normalized (see \Rfunction{isa.normalize}). The
normalization produces two matrices: $E_r$ and
$E_c$. $E_r$ is calculated by transposing $E$, then
centering and scaling its rows (see the \Rfunction{scale} R function). $E_c$ is
calculated by centering and scaling the rows of $E$. $E_r$ is
used to calculate the (column) signature of rows and $E_c$ is used
to calculate the (row) signature of the columns.
It is possible to use another normalization. In this case the user is
requested to supply the normalized input data in a named list,
including the two matrices of appropriate
dimensions to the \Rfunction{isa.iterate} function.
The \texttt{Er} entry of the list will be used for calculating the
signature of the rows, \texttt{Ec} will be used for the signature of the
columns. If you want to use the same matrix in both steps, then supply
it twice, the first one transposed.
\subsection{Row and column scores}
In addition to finding biclusters in the input matrix, the ISA also
assigns scores to the rows and columns, separately for each
module. The scores are between minus one and one and they are by
definition zero for the rows/columns that are not included in the
module. For the non-zero entries, the further the score of a
row/columns is from zero, the stronger the association between the
row/column and the module (i.e. the other rows/columns of the
module). If the sign of two rows/columns are the
same, then they are correlated, if they have opposite signs, then they
are anti-correlated.
\subsection{Robustness}
As ISA is an unsupervised algorithm, it may very well find some
modules, even if you feed in noise as the input matrix. To avoid these
spurious modules we defined a robustness measure, a single number for
a module that measures how well the rows and the columns are
correlated. The robustness score is a generalization of the first
singular value of the matrix. If there would be no thresholding during
the ISA iteration, then
the ISA would (almost always) converge to the leading SVD vector and
the robustness score would be the corresponding singular value.
It is recommended that the user uses \Rfunction{isa.filter.robust} to
run ISA on the scrambled input matrix with the same threshold
parameters and then drop every module, which has a robustness score
lower than the highest robustness score among modules found in the
scrambled data.
\section{A simple work flow}
The simplest way to use ISA on your data is by calling the
\Rfunction{isa} function with your input matrix as the single
argument. This function uses random seeding and has three optional
arguments:
\begin{description}
\item[\Rargument{thr.row}] A numeric vector, the row thresholds to
use. By default this is
\Sexpr{paste(eval(formals(isa2:::isa.default)$thr.row), collapse=", ")}. % $
\item[\Rargument{thr.col}] A numeric vector, the column thresholds
to use. By default this is
\Sexpr{paste(eval(formals(isa2:::isa.default)$thr.col), collapse=", ")}. % $
\item[\Rargument{no.seeds}] The number of random seed vectors to
generate. By default
\Sexpr{formals(isa2:::isa.default)$no.seeds} seeds are generated. % $
\end{description}
The \Rfunction{isa} function runs the ISA algorithm for all threshold
combinations of the supplied row and column thresholds. For every
single run the same set of random seeds are used.
Let us see an example. We load the \Rpackage{isa2}
package~\cite{isapackage}, and generate some in
silico data with the \Rfunction{isa.in.silico} function first:
<>=
set.seed(10)
library(isa2)
data <- isa.in.silico(200,200,2,50,50)[[1]]
@
The function call above generates a data matrix similar to the one in
Fig.~\ref{fig:twoclusters}, but bigger. The first two arguments define the size of
the data matrix ($200\times 200$), the third the number of modules (two)
and the fourth and fifth give the size of the modules, both are
$50\times 50$ submatrices in this case. By default
\Rfunction{isa.in.silico} creates non-overlapping modules with some
background noise.
\Rfunction{isa.in.silico} returns a list with three
elements, we are most interested in the first now, that is the actual
artificial data matrix. The second and third entries in the list give
the correct module memberships of the elements, see the manual page of
\Rfunction{isa.in.silico} for the details. Let's take a look at the
input data, see Fig.~\ref{fig:block}.
<>=
png("graphics/blockplot.png", 600, 400)
@
<>=
images(list(data), xlab="", ylab="", strip=FALSE)
@
<>=
dev.off()
@
\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{graphics/blockplot.png}
\caption{Another artificial data matrix with two modules and a lot of
noisy elements.}
\label{fig:block}
\end{figure}
All we have to do now is calling the \Rfunction{isa} function on the
data matrix.
<>=
modules <- isa(data)
names(modules)
@
The \Rfunction{isa} function returns a list with four
elements. \listentry{rows} and \listentry{columns} define the
modules. The non-zero elements in the first column of both define the
first module, the second columns define the second module, etc.
The number of columns in \listentry{rows} (and \listentry{columns})
correspond to the number of modules.
<>=
ncol(modules$rows)
@
ISA has found \Sexpr{ncol(modules$rows)} %$
modules in the artificial data. This seems a bit surprising, as we
expected only two, so let us take a closer look. We
simply plot the row scores of the individual modules
(Fig.~\ref{fig:simpleplot}). The first two modules corresponds to the
two blocks in the input matrix. The third module is the union of them,
with opposite score signs. The \Rfunction{isa} function also finds modules
containing anticorrelated rows/columns, as long as the
anti-correlation is coordinated. I.e. in the third module, the first
50 rows \emph{always} behave in an opposite way compared to the second
50 rows. This is why they are collected into a single module. The
following code creates Fig.~\ref{fig:simpleplot}.
<>=
layout(cbind(seq(ncol(modules$rows))))
sapply(seq(ncol(modules$rows)), function(x) {
par(mar=c(3,5,1,1))
plot(modules$rows[,x], ylim=c(-1,1),
ylab="scores", xlab=NA,
cex.lab=2, cex.axis=1.5)
})
@
\begin{figure}
\centering
\setkeys{Gin}{width=0.7\textwidth}
<>=
<>
@
\caption{The row scores of the three modules found in the two-block
artificial data matrix. A score of zero means that the row is not
included in the module. Notice that in the third module the first
50 and the second 50 rows have opposite score signs, i.e. they are
anti-correlated.}
\label{fig:simpleplot}
\end{figure}
The \listentry{rundata} entry of the return value of \Rfunction{isa}
contains information about the ISA run, all the parameters (with the default
values in our case) are included here. It has the following members:
<>=
names(modules$rundata)
@
See the documentation of \Rfunction{isa} for more about these.
Finally, the \listentry{seeddata} entry is a data frame, it contains
various information about the individual modules and the seeds that
were used to find them. There is one row for each module. It has the
following columns:
<>=
colnames(modules$seeddata)
@
The columns \listentry{thr.row} and \listentry{thr.col} contain the
row and columns thresholds that were used to find the module, while
the \listentry{rob} column contains its robustness score:
<>=
modules$seeddata
@
See the documentation of the \Rfunction{isa} function for more
information about these and other fields.
Finally, let us show how to transform the result of the
\Rfunction{isa} function to a list of biclusters. Each entry of the
list will have two sublists, the first -- named \listentry{rows} --
will contain the rows indices of the module, the second -- named
\listentry{columns} -- the column indices:
<>=
mymodules <- lapply(seq(ncol(modules$rows)), function(x) {
list(rows=which(modules$rows[,x] != 0),
columns=which(modules$columns[,x] != 0))
})
length(mymodules)
mymodules[[1]]$rows
mymodules[[1]]$columns
@
\section{A detailed work flow}
In this section we will perform each step of the ISA analysis
individually. This makes sense only if the user wants to adjust the
parameters of some steps. Otherwise a simple call to the
\Rfunction{isa} function would do.
Let us create some in-silico data to be analyzed:
<>=
data <- isa.in.silico(200, 100, 10)
@
This will be a $\Sexpr{nrow(data[[1]])}\times \Sexpr{ncol(data[[1]])}$
matrix, with \Sexpr{ncol(data[[2]])} modules. By default ``half'' of
the matrix is filled with noise, so each module will be a
$\Sexpr{round(nrow(data[[1]])/2/10)}\times
\Sexpr{round(ncol(data[[1]])/2/10)}$ submatrix. See the
input matrix on Fig.~\ref{fig:tenmodules}.
<>=
png("graphics/tenmods.png", 600, 400)
@
<>=
images(list(data[[1]]), strip=FALSE, xlab="", ylab="")
@
<>=
dev.off()
@
\begin{figure}
\centering
\includegraphics{graphics/tenmods.png}
\caption{In-silico data with ten non-overlapping modules.}
\label{fig:tenmodules}
\end{figure}
\subsection{Preparing the data}
One ISA iteration consists of two thresholded matrix
multiplications. The data preparation step means producing these two
matrices, $E_r$ and $E_c$ from the input matrix, $E$. The default
behavior (i.e. what the \Rfunction{isa} function does) is to
row-wise scale and center the transposed of $E$ to get $E_r$; and to
row-wise scale and center $E$ to get $E_c$. The
\Rfunction{isa.normalize} function performs the appropriate scaling
and returns $E_r$ and $E_c$ in a list:
<>=
normed.data <- isa.normalize(data[[1]])
names(normed.data)
dim(normed.data$Er)
dim(normed.data$Ec)
@
Let us do a quick check that the rows of the matrices are indeed
centered and scaled.
<>=
max(abs(rowSums(normed.data$Er)))
max(abs(rowSums(normed.data$Ec)))
max(abs(apply(normed.data$Er, 1, sd)-1))
max(abs(apply(normed.data$Ec, 1, sd)-1))
@
\subsection{Running the ISA}
To run the ISA, we need to create some starting seeds. The
\Rfunction{isa} function uses sparse random seeds, produced by
\Rfunction{generate.seeds}. We also have the choice of using
non-random seeds, e.g. if you have a matrix with gene expression data
measured on many samples you can use ``gene'' seeds that correspond to
biological pathways; in this case ISA does a biased search to find
modules related to the input pathways. Or, in a case/control
experiment one can use seed vectors that correspond to cases and look
for modules that are different in the control and the
case samples.
For now, we will use thousand random seeds, with different sparseness:
<>=
row.seeds <- generate.seeds(length=nrow(data[[1]]), count=1000,
sparsity=c(2,5,10,100))
@
Note, that vector giving the sparsity is recycled, thus we will have
four kind of seeds, with 2, 5, 10 and 100 rows; 250 of each kind.
We are ready to run the ISA now. Let us assume that we only want to
find modules that are ``above'' the background. (In this artificial
case we actually know that there are no modules below the background.)
The \Rargument{direction} argument is thus set to \verb+"up"+.
<>=
modules <- isa.iterate(normed.data, row.seeds=row.seeds,
thr.row=2, thr.col=2, direction="up")
@
Unlike \Rfunction{isa}, \Rfunction{isa.iterate} does not merge or
filter the modules the input seeds converge to. Consequently, the
\verb+modules+ object contains 1000 modules, but these are not
necessarily unique:
<>=
ncol(modules$rows)
@
It is also possible that some seeds converge to an all-zero vector, we
have \Sexpr{sum(apply(modules$rows==0, 2, all))} such modules: % $
<>=
sum(apply(modules$rows==0, 2, all))
@
The \Rfunction{isa.unique} function finds duplicate or very similar
modules in a list returned by \Rfunction{isa.iterate} and keeps only a
single one of them. It also eliminates all-zero modules and seeds
that did not converge; \verb+NA+
columns in the \listentry{rows} and \listentry{columns} matrix
correspond to non-convegent seeds:
<>=
modules2 <- isa.unique(normed.data, modules, cor.limit=0.9)
@
The \Rargument{cor.limit} argument specifies the correlation limit
above which two modules are considered to be the same. Let us see how
many modules are left:
<>=
ncol(modules2$rows)
@
We still have \Sexpr{ncol(modules2$rows)} modules. Let us check %$
whether each real module was find by the ISA. Remember that
\Rfunction{isa.in.silico} stores the correct modules in the second and
third entry of its return value.
<>=
found.rows <- cor(data[[2]], modules2$rows!=0)
found.cols <- cor(data[[3]], modules2$columns!=0)
found <- pmin(found.rows,found.cols)
apply(found, 1, max)
@
Out of the ten modules, \Sexpr{sum(apply(found, 1, max)-1<1e-14)} were
correctly identified by the ISA.
\subsection{Robustness of biclusters, filtering the results}
There are two features that we expect from a good biclustering
algorithm. The first is sensitivity: it is able to find all
biclusters in the input data. The second is specificity: it should not
generate spurious biclusters, or in other words biclusters that are
correlated just by chance.
To address this problem, we have developed a robustness measure, a
single scalar number that gives how correlated the rows and columns of
a given module are. This measure is a simple generalization of the
singular value decomposition (SVD) singular value, and can be
calculated with the \Rfunction{robustness} function:
<>=
rob <- robustness(normed.data, modules2$rows, modules2$columns)
summary(rob)
@
The robustness of the \Sexpr{ncol(modules2$rows)} %$
modules varies considerably, from \Sexpr{round(min(rob),2)} to
\Sexpr{round(max(rob), 2)}. For more insight we create a boxplot for
the robustness scores (Fig.~\ref{fig:robbox}).
<>=
par(cex.lab=1.5, cex.axis=1.5)
boxplot(rob, ylab="Robustness")
@
\begin{figure}
\centering
\setkeys{Gin}{width=0.7\textwidth}
<>=
<>
@
\caption{Box-plot for the robustness scores of the modules found by
the ISA. Some modules have a much higher score than the rest.}
\label{fig:robbox}
\end{figure}
One can use the robustness measure to filter a list of modules, in
the following way. We permute the input matrix and run the same
module finding procedure that we did for the original matrix. Then we
calculate the robustness scores for both the real and the scrambled
modules and eliminate the ``real'' modules having a robustness score
that is less than the score of at least one scrambled module. Note,
that this filtering can be done only if both sets of modules were
found using the same threshold parameters. The procedure is
implemented in the \Rfunction{isa.filter.robust} function:
<>=
modules3 <- isa.filter.robust(data[[1]], normed.data, modules2,
perms=1, row.seeds=row.seeds)
@
We used the very same row seeds, as for the original matrix; this is
not strictly required. The \Rargument{perms} argument sets the number
of permutations to perform.
Note that the ISA run on the scrambled matrix usually takes longer,
than on the real data, because it takes more steps for the ISA to
converge.
After the robustness filtering we have \Sexpr{ncol(modules3$rows)} %$
modules left. This is still more than the ten correct modules, so let
us take a closer look. First, let us create another plot with the
robustness scores of the remaining modules.
\Rfunction{isa.filter.robust} places the robustness scores in the seed
data, so we don't need to calculate them again. See Fig.~\ref{fig:robbox2}.
<>=
plot(modules3$seeddata$rob, ylab="Robustness", cex.lab=1.5)
@
\begin{figure}
\centering
\setkeys{Gin}{width=0.7\textwidth}
<>=
<>
@
\caption{The robustness scores of ISA modules after filtering out some
modules with low robustness. Ten modules
have a higher score than the rest.}
\label{fig:robbox2}
\end{figure}
The scatterplot shows that ten modules have in fact higher
robustness scores than the rest. Let us check that these are indeed the
ten ``real'' modules of the data.
<>=
bestmods <- order(modules3$seeddata$rob, decreasing=TRUE)[1:10]
mod.cor <- pmin(cor(modules3$rows[,bestmods]!=0, data[[2]]),
cor(modules3$columns[,bestmods]!=0, data[[3]]))
apply(mod.cor, 1, max)
apply(mod.cor, 2, max)
@
Indeed, we have found the ten real modules, the ISA is both specific
and sensitive. If one analyzes real data, then it is much more
difficult to filter the modules found by the biclustering algorithm,
but the robustness based filtering still helps getting rid of spurious
modules.
Finally, let us take a look at some of the other modules, that were
also kept after the robustness-based filtering. Here we plot five of
them, together with the input matrix.
<>=
png("graphics/othermods.png", 400, 400)
@
<>=
othermods <- order(modules3$seeddata$rob)[1:5]
print(plotModules(modules3, othermods, data=data[[1]]))
@
<>=
dev.off()
@
\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{graphics/othermods.png}
\caption{The input matrix and five modules with relatively low
robustness scores. As it turns out, each of these modules contains
two blocks of the input matrix.}
\label{fig:othermods}
\end{figure}
It turns out that these are double-modules, each is a union of two
real modules. ISA finds these, since they also have correlated rows
and columns, although the correlation is lower than for the
single-modules. The double-modules are typically found at lower
thresholds.
\subsection{Visualize the results}
Visualizing overlapping biclusters is a challenging task. We
show simple methods that usually visualize a single bicluster
at a time. For some of these we will use the \Rpackage{biclust} R
package~\cite{biclust}.
\subsubsection{The \Rpackage{biclust} package}
The \Rpackage{biclust} R package implements several biclustering
algorithms in a unified framework. It uses the class \Rclass{Biclust}
to store a set of biclusters. The \Rfunction{isa.biclust} function
converts ISA modules to a \Rclass{Biclust} object. This requires the
binarization of the modules, i.e. the ISA scores are lost, they are
converted to zeros and ones:
<>=
library(biclust)
five.mods <- isa.in.silico(200,200,5,20,20)
modules <- isa(five.mods[[1]],thr.row=2, thr.col=2)
Bc <- isa.biclust(modules)
Bc
@
\subsubsection{Image plots}
There are some examples in this document that show how to create image
plots for the modules and potentially also the input data. See
e.g. Figs~\ref{fig:othermods} and
\ref{fig:overlapplot}. These plots use the \Rfunction{plotModules}
function, that directly takes the output of \Rfunction{isa} (and other
functions with the same output format: \Rfunction{isa.iterate},
\Rfunction{isa.unique}, etc.) In fact \Rfunction{plotModules} calls the
\Rfunction{images} function to do its job. \Rfunction{images} takes a
list of matrices, see Fig.~\ref{fig:noise} for an example.
The \Rfunction{drawHeatmap} function of the \Rpackage{biclust} package
can be also used to draw an image plot. This visualizes a single
bicluster on top of the input matrix. It reorders the rows and columns
of the input matrix to move the block of the bicluster to the top
left corner of the input matrix (Fig.~\ref{fig:heatmap}).
<>=
png("graphics/drawheatmap.png", 400,400)
@
<>=
drawHeatmap(five.mods[[1]], Bc, number=1,local=FALSE)
@
<>=
dev.off()
@
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth]{graphics/drawheatmap.png}
\caption{Heatmap plot of the reordered input matrix. The rows and
columns corresponding to the first bicluster are moved to the
top-left corner of the matrix.}
\label{fig:heatmap}
\end{figure}
\subsubsection{Profile plots}
The \Rfunction{parallelCoordinates} function of the \Rpackage{biclust}
package plots the profile of the rows (or columns) that are included
in a bicluster with a different color. These plots visualize the
difference between the bicluster and the rest of the matrix, see
Fig.~\ref{fig:parallel} for an example.
<>=
parallelCoordinates(five.mods[[1]], Bc, number=1, plotBoth=TRUE)
@
\begin{figure}
\centering
\setkeys{Gin}{width=0.6\textwidth}
<>=
<>
@
\caption{Profile plots for the first bicluster found by ISA. In this
artificial data set there is a clear distinction between the
background and the rows/columns of the module.}
\label{fig:parallel}
\end{figure}
\subsubsection{Contrast bar plots}
The \Rfunction{plotclust} function of the \Rpackage{biclust} package
creates barplots for one or more biclusters. A single bar is the
mean of the columns of the bicluster for a given row of the bicluster;
or the mean of the columns of the background (i.e. the rest of the
input matrix). The bigger the difference between the bars of the two
colors, the better the bicluster. The results of the following code
are in Fig.~\ref{fig:plotclust}.
<>=
plotclust(Bc, five.mods[[1]], Titel="")
@
\begin{figure}
\centering
\setkeys{Gin}{width=0.8\textwidth}
<>=
<>
@
\caption{Bar plots show the difference between the biclusters and
the rest of the input matrix, for the first five modules found by
ISA.}
\label{fig:plotclust}
\end{figure}
\section{Features of ISA}
In this section we show examples that highlight key features
of the ISA: its resilience to noise and its ability to find
overlapping modules.
\subsection{Resilience to noise}
To test the behavior of ISA with noisy inputs we generate a number of
in-silico data sets with different noise levels
(Fig.~\ref{fig:noise}).
<>=
noise <- seq(0.1,1,by=0.1)
data <- lapply(noise, function(x) isa.in.silico(500,200,10,noise=x))
@
<>=
png("graphics/noise.png", width=600, height=400)
@
<>=
images(lapply(data, "[[", 1), names=as.character(noise),
xlab="", ylab="")
@
<>=
dev.off()
@
\begin{figure}
\centering
\includegraphics[width=\textwidth]{graphics/noise.png}
\caption{Modular data sets with different levels of noise. The data
sets are generated by adding normally distributed background noise
with given variance to the checkerboard matrix.}
\label{fig:noise}
\end{figure}
Next, we run ISA with the default parameters on all the data
sets. This might take a while.
<>=
modules <- lapply(data, function(x) isa(x[[1]]))
@
Let us check the sensitivity of ISA, for every real module, we pick
one from the ones ISA has found, the one that is the most correlated
to it.
<>=
best <- lapply(seq_along(modules), function(i) {
cc <- pmin(cor(modules[[i]]$rows!=0, data[[i]][[2]]),
cor(modules[[i]]$columns!=0, data[[i]][[3]]))
apply(cc, 2, max)
})
best.mean <- sapply(best, mean)
@
Let's create boxplots for the different noise
levels. (Fig.~\ref{fig:sensbox}.)
<>=
boxplot(best, names=noise, xlab="Noise level", ylab="Sensitivity",
cex.lab=1.5)
@
\begin{figure}
\centering
<>=
<>
@
\caption{The sensitivity of the ISA in the function of noise. The
sensitivity score is calculated by finding the best (in terms of
Pearson correlation) ISA module for each block in the input matrix.
Each boxplot is contains ten points, one for each module.
ISA performs very well, even in the presence of relatively high
noise levels.
}
\label{fig:sensbox}
\end{figure}
To test the specificity, we create boxplots for the correlation
coefficients of all modules found by ISA and their closest real
module. (Fig.~\ref{fig:specbox}.)
<>=
spec <- lapply(seq_along(modules), function(i) {
cc <- pmin(cor(modules[[i]]$rows != 0, data[[i]][[2]]),
cor(modules[[i]]$columns != 0, data[[i]][[3]]))
apply(cc, 1, max)
})
@
<>=
boxplot(spec, names=noise, xlab="Noise level", ylab="Specificity",
cex.lab=1.5)
@
\begin{figure}
\centering
<>=
<>
@
\caption{Specificity of the ISA modules, in the function of
noise. Specificity was calculated by finding the best -- again,
by highest Pearson correlation -- matching block in the input matrix
for each ISA module. These correlation coefficients are plotted in a
boxplot for each noise level.
Specificity of ISA decreases approximately linearly with the
increase of noise.
}
\label{fig:specbox}
\end{figure}
\subsection{Finding overlapping biclusters}
ISA biclusters might overlap in their rows, columns, or both. To
illustrate this, we create an artificial data set with two overlapping
blocks.
<>=
set.seed(1)
two.over <- isa.in.silico(100,100,2,40,40,overlap.row=10)
@
We run ISA on this input matrix, with two threshold parameters, the
first thresholds are very mild, they are both zero.
In other words, in each iteration step all rows/columns are kept in each
that have a higher score than the average. The second set of
thresholds are stricter, the scores have to be at least one standard
deviation away from the mean to keep them.
<>=
ov.normed <- isa.normalize(two.over[[1]])
ov.seeds <- generate.seeds(count=100, length=100)
ov.modules.1 <- isa.iterate(ov.normed, ov.seeds, convergence="cor",
thr.row=0, thr.col=0, direction="up")
ov.modules.1 <- isa.unique(ov.normed, ov.modules.1)
ov.modules.1 <- isa.filter.robust(two.over[[1]], ov.normed, ov.modules.1)
ov.modules.2 <- isa.iterate(ov.normed, ov.seeds, convergence="cor",
thr.row=1, thr.col=1, direction="up")
ov.modules.2 <- isa.unique(ov.normed, ov.modules.2)
ov.modules.2 <- isa.filter.robust(two.over[[1]], ov.normed, ov.modules.2)
ncol(ov.modules.1$rows)
ncol(ov.modules.2$rows)
@
ISA found \Sexpr{ncol(ov.modules.1$rows)} %$
modules with the mild thresholds and \Sexpr{ncol(ov.modules.2$rows)} %$
modules with the strict ones. Let us plot the original data and the
modules found by ISA (Fig.~\ref{fig:overlapplot}).
<>=
png("graphics/overlapplot.png", 600, 400)
@
<>=
no.modules <- ncol(ov.modules.1$rows) + ncol(ov.modules.2$rows)
plotModules(data=two.over[[1]],
list(rows=cbind(ov.modules.1$rows, ov.modules.2$rows),
columns=cbind(ov.modules.1$columns,
ov.modules.2$column)),
names=c("Input matrix", paste("Module", seq_len(no.modules))))
@
<>=
dev.off()
@
\begin{figure}
\centering
\includegraphics{graphics/overlapplot.png}
\caption{The input matrix with two overlapping blocks and the ISA
modules for this data set. The first three modules were found at
milder threshold parameters, that is why they are bigger. As you can
see for Modules \#1 and \#2, at milder thresholds there is a higher
probability that ISA will pick up some incorrect rows/columns.
}
\label{fig:overlapplot}
\end{figure}
As expected the ISA modules are in general bigger with the mild
thresholds. ISA correctly identifies the two overlapping modules and
also the union of them. With the stricter thresholds it finds the
non-overlapping parts of the two modules, plus their overlap as a
separate module.
\section{More information}
For more information about the ISA please see the references at the
end of this paper. The ISA homepage is at
\url{http://www.unil.ch/cbg/homepage/software.html}.
For analyzing gene expression data with ISA, we suggest using
BioConductor~\cite{bioc} and the \Rpackage{eisa}~\cite{eisapackage} R
package.
If you want to run ISA on a computer cluster or a multi-processor
machine, then see the vignette titled ``Running ISA in parallel with
the \Rpackage{snow} pacakge'', in the \Rpackage{isa2} package.
\section{Session information}
The version number of R and packages loaded for generating this
vignette were:
<>=
toLatex(sessionInfo())
@
\bibliographystyle{apalike}
\begin{thebibliography}{}
\bibitem[Bergmann et~al., 2003]{isa}
Bergmann, S., Ihmels, J., and Barkai, N. (2003).
\newblock Iterative signature algorithm for the analysis of large-scale gene
expression data.
\newblock {\em Phys Rev E Nonlin Soft Matter Phys}, page 031902.
\bibitem[Gentleman, 2004]{bioc}
R. Gentleman, V. J. Carey, D. M. Bates, B.Bolstad, M.
Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, and others (2004).
\newblock Bioconductor: Open software development for computational biology and
bioinformatics.
\newblock {\em Genome Biology}, Vol. 5, R80
\bibitem[Cs\'ardi, 2009]{isapackage}
Cs\'ardi, G. (Apr 1, 2009).
\newblock {\em isa2: The Iterative Signature Algorithm}.
\newblock R package version 0.1.1.
\bibitem[Cs\'ardi, 2009]{eisapackage}
Cs\'ardi, G. (Sep 15, 2009).
\newblock {\em eisa: The Iterative Signature Algorithm for Gene
Expression Data}.
\newblock R package version 0.1.
\bibitem[Ihmels, 2002]{sa}
Ihmels, J., Friedlander, G., Bergmann, S., Sarig, O., Ziv, Y.,
Barkai, N. (2002).
\newblock Revealing modular organization in the yeast transcriptional
network.
\newblock { \em Nat Genet }, page 370--7.
\bibitem[Ihmels, 2004]{isamod}
Ihmels, J., Bergmann, S., Barkai, N. (2004).
\newblock Defining transcription modules using large-scale gene
expression data.
\newblock { \em Bioinformatics}, page 1993--2003.
\bibitem[Kaiser, 2009]{biclust}
Sebastian Kaiser, Rodrigo Santamaria, Roberto Theron, Luis Quintales
and Friedrich Leisch. (2009).
\newblock {\em biclust: BiCluster Algorithms}.
\newblock R package version 0.7.2.
\end{thebibliography}
\end{document}