Biological systems are orchestrated by heterogeneous regulatory programs that control complex processes and adapt to a dynamic environment. Recent advances in high‐throughput experimental methods provide genome‐wide perspectives on such regulatory programs. A considerable amount of data on the behavior of model systems in a variety of conditions is rapidly accumulating. Still, the dominant paradigm is to analyze new genome‐wide experiments separately from any other extant data, for example, by clustering the new data alone. Here we introduce a new methodology for analyzing the results of a new functional genomic study vis‐à‐vis a large compendium of previously published results from heterogeneous experimental techniques. We demonstrate our methodology on Saccharomyces cerevisiae, using a compendium of some 2000 experiments from 60 different publications. Most importantly, we show how the integrated analysis reveals unexpected connections among biological processes, and differentiates between novel and known effects in the analyzed experiments. Such characterization is impossible when new data sets are studied in isolation. Our results exemplify the power of the integrative approach in the analysis of genomic scale data sets and call for a paradigm shift in their study.
Some of the greatest “success stories” in modern biology can be attributed to coordinated community efforts that tackled an overwhelmingly large problem using a web of semi‐independent efforts. In the most prominent and recent of these efforts, the emergence of genomics was facilitated by the ability to share and compare sequence data, by the availability of these data to extensive search, and by the aggregation of data into one body of knowledge. A major challenge of today's biology is the functional characterization of biological systems. This problem (in any of several alternative forms) is probably one of the largest ever attempted by biologists, and is thus a natural candidate for being tackled by such a community‐based scheme. With this long‐term goal in mind, the present study proposes a methodology that can help to exploit large‐scale compendia of functional genomics data as part of the routine analysis of high‐throughput experiments. Using a large collection of different types of data obtained for the baker's yeast S. cerevisiae, we demonstrate how fruitful such a combined approach may be in characterizing responses to specific conditions from a system level perspective.
We focus on a relatively simple building block of biological systems ‐ the functional module. Following the pioneering studies of gene expression profiles (Eisen et al., 1998), researchers have extensively used clusters of co‐expressed genes to gain insights into the organization of regulatory processes. Clustering, in its simple form, partitions the genome into disjoint gene sets (possibly obeying hierarchical organization), such that each set manifests a different characteristic expression pattern across all the experimental conditions. A natural generalization of a co‐expressed gene cluster is a transcriptional module (Ihmels et al., 2002) ‐ a set of genes that are co‐expressed in some (but not necessarily all) experimental conditions. Transcriptional modules are a more flexible and realistic building block for biological systems. A certain gene may belong to more than a single transcriptional module, as it can be expressed (or may exhibit different genetic and physical interactions) under different conditions. Transcriptional modules can be detected using bicluster analysis of gene expression datasets. In bicluster analysis, the output is not a set of disjoint clusters, but a collection of (possibly overlapping) transcriptional modules that can represent phenomena like pleiotropy or context‐dependent regulation. Finally, a functional module (FM) generalizes a transcriptional module by taking into account other heterogeneous sources of biological information in addition to gene expression (e.g., protein interactions, synthetic lethality, etc.). A functional module is thus a set of genes that are correlated with each other across a set of biological properties. In previous work (Tanay et al., 2004) we have introduced the SAMBA algorithm for detecting FMs in very large scale and highly heterogeneous datasets. Biological properties can represent any source of information on genes and their products, including gene expression, phenotype and protein interactions.
What can be gained from dissection of biological systems using FMs? FMs simplify the understanding of biological systems by representing cellular processes in terms of the activity of a modest number of modules instead of thousands of genes. As we show here, a comprehensive set of FMs for a model system, built by integrating data from many different studies and sources, may form a valuable foundation when analyzing the results of a new experiment. For example, O'Rourke and Herskowitz (2004) studied the response of several key S. cerevisiae mutant strains to variable levels of hyper‐osmotic stress. By analyzing the resulting gene expression dataset using standard clustering and extensive expert analysis, the process of hyper‐osmotic adaptation was dissected into several clusters containing hundreds of genes each. These clusters represent groups of genes that exhibit typical response patterns in the osmotic shock treatments. On the other hand, by adding the O'Rourke‐Herskowitz gene expression profiles to the vast compendium of available yeast functional properties accumulated so far (including almost 2000 different conditions from 60 different studies) and analyzing the combined dataset using the methods described here, we can characterize the osmo‐adaptation process in terms of the activity of a small number of well‐defined, highly specific FMs (Fig. 4a, b). By using more data, which shed light on different aspects of the biological system, we can better separate modules that seem to respond alike under the limited number of particular conditions of the original study.
We envision a framework (Fig. 5) in which the set of currently available functional profiles for an organism will be continuously updated in one of today's public repositories (www.ncbi.nlm.nih.gov/geo/, www.ebi.ac.uk/arrayexpress/). In addition to the data itself, a set of characterized and annotated functional modules will be maintained and incrementally refined. The analysis of a new dataset will be performed in light of the entire compendium and annotated modules (using, for example, the algorithms we describe here). In this way, researchers performing even modest‐size experiments will be able to probe into the system level effects of the conditions they study, and benefit from the cumulative community knowledge base. Upon publication, each new dataset will become part of the compendium, contributing to the robustness of future analyses by refining the compendium and the FMs.
Functional genomics is naturally evolving into a multidisciplinary collaborative effort, and the development of tools that facilitate the communication and the use of published data has become an active field of research in recent years. Several efforts to perform genome‐wide characterization of biological systems using heterogeneous data are underway (a beautiful example is the integrated analysis of cancer gene expression studies by Segal et al., 2004). We believe that the methods we present here, and more importantly, their implementation, make progress in this direction. They may prove useful for dissecting functional genomics experiments and for integrating results obtained by several different types of genome‐wide methodologies. A prototype website that demonstrates this methodology is available at www.cs.tau.ac.il/~samba.
A new methodology is proposed for the analysis of gene expression experiments in the context of a large and diverse compendium of prior published data. Examples on yeast show how such an approach allows detailed dissection of single experiments from the system level perspective.
Analyzing the results of a single microarray study in the context of all prior studies makes a deeper and more detailed analysis of the conditions in the study possible.
By allowing the routine analysis of new experiments in the context of large functional genomics compendia, the communication of new results and the agglomeration of diverse datasets into one body of knowledge are simplified.
The new methodology can accelerate the evolution of functional genomics into a collaborative community effort, by proposing a common language for describing and analyzing the accumulated experimental data.
Over the last several years, a variety of methods have been used to monitor biological processes on a genomic scale. In a typical study, the researchers define a target cellular response (e.g., a signaling pathway, response to a certain environment or disease), select a set of conditions (genetic perturbations, environmental stimulations, a new drug, etc.) and perform high‐throughput experiments in these settings. A rapidly increasing pool of techniques allows measurements of gene expression (DeRisi et al, 1997), transcription factor binding (Ren et al, 2000; Iyer et al, 2001), selection from a mutant collection (Birrell et al, 2002), two‐hybrid analysis (Schwikowski et al, 2000), synthetic lethality (Tong et al, 2004) and more. The typical computational analysis, following Eisen et al (1998), clusters the data obtained and then tries to characterize each cluster's gene set using known gene functions and promoter analysis. Usually, a laborious expert scrutiny of specific genes and their behavior is needed in order to reach meaningful biological conclusions. This methodology has proven very effective in identifying primary trends in the experimental results. Following the publication of many dozens of high‐throughput studies, a very rich resource is now available, containing thousands of different molecular snapshots of wild‐type and mutant cells under different conditions. The current analysis paradigm does not directly take full advantage of this vast resource. By analogy, the current method for analyzing microarray profiles is similar to trying to find structural motifs in a small number of new cDNA clones without using homology searches in appropriate sequence databases.
In this work, we introduce a new methodology that takes advantage of a large data compendium in the analysis of novel high‐throughput experiments. Previously, we have shown how data from different sources can be integrated using the SAMBA biclustering algorithm (Tanay et al, 2004b). Here we develop a method to characterize the response of biological pathways to various stimuli at the system level, capturing not only the dominant primary responses but also finer and less‐easily tractable processes. Previous approaches to integrated analysis of functional genomics data focused on predicting single gene functions (Kemmeren et al, 2002; Wu et al, 2002; Troyanskaya et al, 2003), studied the global organization of molecular networks and transcriptional programs (Ihmels et al, 2002; Beer and Tavazoie, 2004; Segal et al, 2004; Tanay et al, 2004b) or combined several experimental approaches to construct and test networks for specific systems (Ideker et al, 2001; Prinz et al, 2004). The approach introduced here is aimed at the analysis of a data set from one new study in the context of a large compendium of data from many diverse and heterogeneous prior studies. We combine the broad perspective of global analysis with a focused and easy‐to‐use dissection of a single experimental data set, very much like standard clustering‐based analysis. Our approach thus allows for rapid interpretation of novel data sets in terms of the activity of known and novel biological modules.
To illustrate the use of the new methodology, we reanalyzed publicly available data on the yeast Saccharomyces cerevisiae. We assembled a comprehensive collection of data from 60 different studies and close to 2000 different experiments. We show that using this compendium and our algorithms, it is possible to greatly extend the understanding of complex regulatory mechanisms, beyond what can be done using single studies. Our tools, data compendium and comprehensive results are available through a new web interface (www.cs.tau.ac.il/~rshamir/simba/).
An integrated compendium of yeast functional data
We have built a compendium of yeast functional data including profiles from 52 gene expression studies, five transcription factor location studies, three synthetic lethality studies and data on protein interactions from the GRID database (http://biodata.mshri.on.ca/yeast_grid/servlet/SearchPage). The complete list of references for all data sources is available on the Supplementary website (www.cs.tau.ac.il/~rshamir/simba/). Our algorithmic framework (Tanay et al, 2004b) transforms all sources of information into generalized conditions and analyzes them together (Materials and methods). We applied biclustering to the combined data set and derived a set of ∼1200 statistically significant modules. A module consists of a set of genes and a set of conditions, such that the genes have significant and correlated values over the set of conditions. For example, a bicluster may be defined by a set of genes that (1) are coexpressed in several conditions, (2) are targeted by the same specific transcription factors and (3) have protein products that are likely to interact with a certain protein. To understand the biology behind specific modules, we automatically associated them with known processes and regulatory mechanisms. We assigned modules to biological processes using functional enrichment tests based on the SGD GO annotation (Materials and methods). We searched for known and novel enriched cis‐elements in the promoters of the genes in each module and manually annotated the discovered motifs (Materials and methods). When discussing modules, we use the module number, the primary biological process associated with it (when available) and the module's number of genes and properties, for example, module #524 (RNA processing, 76 × 211). A single biological process may be represented by several modules of varying sizes and specificities, but our algorithm guarantees that no two modules are similar.
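The functional enrichment step described above can be illustrated with a hypergeometric tail test. This is a minimal sketch only, not the implementation used to annotate the compendium: the function name, the gene counts in the example and the omission of any multiple‐testing correction are assumptions made for illustration.

```python
from math import comb

def hypergeom_enrichment_p(genome_size, annotated, module_size, overlap):
    """Probability of seeing at least `overlap` annotated genes in a module.

    Models the module as `module_size` genes drawn without replacement from
    `genome_size` genes, of which `annotated` carry the GO term of interest.
    """
    max_overlap = min(annotated, module_size)
    total = comb(genome_size, module_size)
    tail = sum(
        comb(annotated, k) * comb(genome_size - annotated, module_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total

# Hypothetical numbers (not taken from the study): a 76-gene module in which
# 30 genes carry a term that annotates 200 of 6000 genes genome-wide.
p_value = hypergeom_enrichment_p(6000, 200, 76, 30)
```

In practice, one such test would be run per module and per GO term, so the resulting P‐values would need correction for multiple testing before being reported.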
Synergism between different sources of data
We first asked how much synergism exists among the experimental data from different studies. The distribution of module dimensions (Figure 1B and Supplementary Figure 1) indicates that the comprehensive compendium gives rise to highly specific modules, with 10–50 genes supported by 20–100 conditions. The distribution of the number of studies contributing properties to each module (Figure 1C) demonstrates a high level of synergism in the multistudy data compiled. A total of 86% of the modules used data from more than one study and 68% used data from three studies or more, showing that indeed, information was extracted from multiple data sets and is not biased by one predominant source. A global representation of the compendium and its dissection into modules is obtained by clustering the mean module expression across all experimental conditions. Since the same gene may be part of several modules, such clustering allows the ‘unfolding’ of the function of pleiotropic genes and differs substantially from a standard gene‐by‐condition clustering. The resulting representation (Figure 1D) shows how two opposite environmental stress responses (ESRs; Gasch et al, 2000) dominate the entire compendium. This response to stress is so strong and widespread that other, condition‐specific regulatory programs are hard to detect without the combination of multiple studies and the application of sensitive algorithms. As we shall see below, separating the general stress response into specific modules and comparing their activities in different conditions provides further insights into the complex regulation of this biological process.
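The global representation described above starts from one mean expression profile per module. The sketch below shows that averaging step on toy data, assuming expression values are stored as per‐gene lists with one entry per condition; the data layout and all names are hypothetical, and the subsequent clustering of the profiles is not shown.

```python
def module_mean_profiles(expression, modules):
    """Average the expression of each module's genes in every condition.

    expression: {gene: [value per condition]}
    modules:    {module_id: [gene, ...]}; a gene may appear in several modules
    Returns     {module_id: [mean value per condition]}
    """
    profiles = {}
    for module_id, genes in modules.items():
        rows = [expression[g] for g in genes if g in expression]
        if not rows:
            continue  # no measured genes for this module
        n_conditions = len(rows[0])
        profiles[module_id] = [
            sum(row[c] for row in rows) / len(rows) for c in range(n_conditions)
        ]
    return profiles

# Toy data: gene G2 is shared by both modules, so it contributes to two profiles.
toy_expression = {"G1": [1.0, -1.0], "G2": [3.0, 1.0], "G3": [-1.0, 3.0]}
toy_modules = {"A": ["G1", "G2"], "B": ["G2", "G3"]}
toy_profiles = module_mean_profiles(toy_expression, toy_modules)
```

Because a shared gene is averaged into every module that contains it, the module‐level matrix can separate the different roles of a pleiotropic gene in a way that a single row in a gene‐by‐condition matrix cannot.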
The cytokinesis transcriptional module
Defined by data from many different experiments, modules can characterize highly specific biological phenomena. Module #126 (Figure 2A and Supplementary website) consists of 11 genes related to cytokinesis and daughter‐specific expression. Of these genes, DSE1‐4, SCW11, CTS1, EGT2, AMN1 and BUD9 are known to be localized to the daughter cell during late mitosis, and are associated with cell wall separation and exit from mitosis (Colman‐Lerner et al, 2001). SUN4 is also known to be involved in cell septation (Velours et al, 2002) and PRY3 encodes a cell wall‐specific protein of unknown function. The association of these genes into a single module was based on gene expression data from 261 conditions taken from 30 different studies, and the transcription factor location profiles of the cell cycle regulators Ace2, Swi5 and Fkh2. Indeed, Ace2 and Swi5 are known to have positive and negative effects, respectively, on the transcription of some of the genes in this module (Doolin et al, 2001). Fkh2 is known to regulate genes required for the G2/M transition and has been implicated (together with Ndd1 and Mcm1) in the regulation of the SWI5 and ACE2 genes (Simon et al, 2001), but its direct association to cytokinesis genes, to the best of our knowledge, was not noted before. This possible role for Fkh2 is supported by evidence for its involvement in the regulation of pseudohyphal growth (Zhu et al, 2000) and by its synthetic lethality with CLA4 (Goehring et al, 2003), a gene involved in polarization and budding, which functions in a cascade regulating exit from mitosis (Hofken and Schiebel, 2002). The association of Fkh2 with cytokinesis genes may reflect the need to inhibit the function of these genes until mitosis is completed or during transition to pseudohyphal growth.
The wealth of functional information used to construct the module enabled us to explore the behavior of this important transcriptional program across many different experimental conditions. In particular, we analyzed the behavior of the module genes in experiments perturbing different transcriptional coactivators and corepressors (Sudarsanam et al, 2000; Angus‐Hill et al, 2001; Geisberg et al, 2001) to try to refine our understanding of the mechanisms of transcriptional regulation used in timing the mitotic events. The module exhibits a statistically significant response in several such experiments (Figure 2B). Strong induction is observed upon perturbation of the SWI/SNF chromatin remodeling complex (t‐test, P<0.0001 in minimal media, P<0.001 in rich media, for both mutants). Strong repression was observed in an experiment that inactivated the RSC factor Rsc3 (P<0.0004), but no effect was detected when the RSC factor Rsc30 was inactivated (P<0.89). In addition, a strain knocked out for NC2 activity (BUR6 deletion) exhibited a strong increase in the expression of this module (P<0.0002). Interestingly, the behavior of module #126 in the SWI/SNF, RSC and NC2 experiments is unique among all the modules (Supplementary Table 1), suggesting that the particular combination of cofactors uncovered may define the particular regulatory behavior of this module. Taken together, our analysis suggests that the module is controlled by an extended regulatory program that includes the well‐known Ace2/Swi5 and Fkh2 transcription factors and a unique combination of coactivators and corepressors (Figure 2C). The cytokinesis module thus exemplifies the power of our methodology to unravel the complex regulation network of a group of coordinated genes.
Regulation of the galactose system
We next turned to the analysis of a single high‐throughput data set versus the entire compendium. The yeast galactose utilization pathway is among the best‐characterized biological systems. In a systematic set of experiments, Ideker et al (2001) measured the transcriptional response of yeast strains knocked out for a set of enzymes and regulators involved in galactose metabolism. The data were then clustered and analyzed in light of the known Gal4–Gal80–Gal3 regulatory circuit. We used the galactose data set as a test case for our methodology. Instead of clustering yeast genes given their expression in the galactose data set only, we screened our complete set of modules, which are based on almost 2000 experiments, for modules that are responsive in at least one of the conditions analyzed by Ideker et al. Since the data defining our modules are relevant to many different aspects of the yeast regulatory network, we were able to interpret galactose‐related conditions from a broad perspective. We depict the effect of galactose‐related conditions on several central modules in Figure 3A (interactive visualization of all modules is available on the website). As expected, the strongest effects are well known and were easily observed using clustering of the galactose data set alone. For example, module #389 (Galactose metabolism, 20 × 160), the classical Gal4 regulon, consists mainly of enzymes required for the utilization of galactose (GAL1,2,7,10) and is strongly repressed when galactose is lacking from the medium or when knockouts in the GAL pathway compromise its yield. The response of other modules, however, is less predictable and reveals novel regulatory relations between different processes.
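The screening step described above, which keeps modules that respond in at least one condition of the new data set, can be sketched as a simple filter over module mean profiles. The fixed threshold on the absolute mean log ratio is an assumption made for illustration; the text does not state the responsiveness criterion at this point, and a statistical test could be substituted. All names are hypothetical.

```python
def responsive_modules(module_profiles, threshold=1.0):
    """Return modules whose mean response is strong in at least one condition.

    module_profiles: {module_id: [mean log ratio per condition of the new study]}
    threshold:       assumed cutoff on the absolute mean log ratio
    Returns          {module_id: [indices of responsive conditions]}
    """
    hits = {}
    for module_id, profile in module_profiles.items():
        responsive = [i for i, value in enumerate(profile) if abs(value) >= threshold]
        if responsive:
            hits[module_id] = responsive
    return hits

# Toy profiles over three conditions of a new study (values are invented).
toy_new_study = {
    "module_a": [-2.1, -0.3, 0.1],   # repressed in the first condition
    "module_b": [0.2, -0.1, 0.4],    # no strong response anywhere
    "module_c": [0.0, 1.5, -1.2],    # induced, then repressed
}
toy_hits = responsive_modules(toy_new_study)
```

Only the modules returned by the filter need to be inspected further, which is what keeps the dissection of a single study manageable against a compendium of roughly 1200 modules.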
A first surprising effect revealed by our analysis is the repression of module #524 (RNA processing, 76 × 211) in gal4 strains, in both galactose‐containing (paired t‐test, gal4+galactose/wt+galactose, P<10^−21) and galactose‐free media (gal4−galactose/wt−galactose, P<10^−22). The repression of this module in mutants lacking structural enzymes is much weaker, and so is the response of the wild‐type strain to lack of galactose (gal4+galactose/wt−galactose, P<10^−10). Moreover, in three strains knocked out for Gal80 (the Gal4 inhibitor), grown in medium lacking galactose, we observe induction of module #524 (gal80/wt, P<10^−17; gal80gal2/wt, P<10^−25; gal80gal4/wt, P<10^−20). This result includes the double mutant gal4gal80, implying that the effect is Gal4‐independent. The induction of module #524 is particularly interesting given the slow growth and transcriptional repression of module #232 (Ribosomal proteins, 145 × 269) in the gal80 strains. Across the entire compendium, the expression of modules #524 and #232 is tightly coupled, as both are strongly repressed under general stress conditions (Gasch et al, 2000). The correlation between the mean expressions of the two modules across 1500 gene expression conditions is indeed very high (Pearson=0.73; Supplementary Figure 2). The marked difference between the expression of the two modules in the gal80 and gal4 experiments (Figure 3B) represents a regulatory discrepancy whose mechanistic causes are still unclear. Module #524 is regulated by the two highly enriched cis‐elements PAC (GCGATGAG) and RRPE (GAAAATTTT) (Hughes et al, 2000), but it is still not known which factors bind these sites. Module #232 is regulated by Rap1 and possibly by additional factors (Marion et al, 2004). Some interaction between these factors, their coactivators/repressors and the Gal4/Gal80 circuit may account for the mutants' altered response.
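The coupling between two modules across the compendium is summarized above by the Pearson correlation of their mean expression profiles. A minimal sketch of that computation follows; the function name and the toy profiles are hypothetical and do not reproduce the reported value.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long numeric profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)

# Invented mean profiles of two modules over five conditions.
toy_profile_a = [-1.2, -0.8, 0.1, 0.4, -2.0]
toy_profile_b = [-0.9, -1.1, 0.3, 0.2, -1.5]
toy_r = pearson(toy_profile_a, toy_profile_b)
```

A high correlation over the full compendium is what makes a divergence in a specific set of conditions, such as the gal80 experiments, stand out as a departure from the usual coupling.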
Mutations in genes of the galactose pathway and changes in the carbon source have an extensive effect on the yeast metabolism as a whole. The transcriptional regulation of nonfermentative metabolism involves a complex network of transcriptional regulators, coactivators and corepressors (Schuller, 2003). Many of the modules that were associated with the galactose data set are linked to different metabolic activities. Using data from different studies, we can dissect the general metabolic response into basic building blocks, thereby shedding light on the regulatory interactions that gave rise to it (Figure 3). Overall, we observe two types of behavior. Modules #1215 (Gluconeogenesis, 54 × 86) and #536 (Respiration, 44 × 156) are generally induced in conditions in which the yield of the galactose pathway is compromised. Modules #503 (Purine metabolism, 13 × 198), #686 (Amino‐acid biosynthesis, 18 × 150) and #967 (Methionine metabolism, 15 × 156) are repressed under these conditions. This general trend fits well with our understanding of the yeast regulatory program. Yeast cells respond to the lack of galactose‐based energy by increasing the activity of the respiratory pathway and adapt to slower growth by reducing biomass production. Given these general, well‐documented trends, the behavior of the gal80 strains again remains unexplained. Module #536 (Respiration), for example, is repressed in gal80, gal2gal80 and gal4gal80 strains in the absence of galactose (Figure 3C), although there is no yield from the galactose pathway under these conditions. The repression cannot be explained by constitutive expression of GAL genes, given that expression is reduced also in the gal4gal80 double mutant. Module #536 is regulated by the Hap2–5 complex, and HAP4 is itself part of the module (Schuller, 2003). There is a strong correlation between Hap4 expression and expression of module #536 across the entire compendium (Pearson=0.65; Supplementary Figure 3). 
Nevertheless, in the three conditions in which GAL80 is inactivated, Hap4 is strongly induced while its module exhibits significant repression, suggesting the involvement of other factors in the repression of the respiratory genes. Other modules show different deviant responses to the gal80 knockout. For example, a Met4/31 module (#967) is induced in the gal80 strain, in contrast to its general repression in other conditions with reduced energy flux. Given the involvement of Gal80 in the repression of SAGA recruitment to Gal4‐binding sites (Carrozza et al, 2002) and the similar acetylation patterns found in the Gal4‐, Hap4‐ and Met4‐activation sites (Deckert and Struhl, 2001), we hypothesize that in media without galactose addition, Gal80 is capable of affecting the recruitment of coactivators or corepressors for factors other than Gal4. Overall, our results provide an explanation of the slow growth phenotype of the gal80 strain, suggesting that deletion of this central regulator has far‐reaching implications, most notably breaking the coupling between the Ribosomal proteins and RNA processing modules, and blocking Hap4‐dependent activation of the Respiration module.
Response to hyperosmotic stress
In response to hyperosmotic stress, yeast cells activate a combination of signaling pathways and transcriptional programs (Hohmann, 2002). We applied our analysis framework to a set of 129 expression profiles obtained in experiments that tested the response of S. cerevisiae to varying levels of osmotic stress in strains knocked out for Hog1, Ssk1 and Ste11, three important proteins in the HOG pathway (O'Rourke and Herskowitz, 2004). The response to high levels of osmotic stress is widespread and involves at least one‐fifth of the yeast genome. We found that this massive response can be dissected into finer transcriptional programs that govern specific modules (Figure 4A). For example, modules #232 (Ribosomal proteins, 145 × 269) and #524 (RNA processing, 76 × 211) are strongly repressed in 0.5 M KCl. In the wild type, repression peaks at 20 min and is alleviated in a HOG1‐dependent manner after 40 min. This joint effect was noted before, based on standard clustering analysis. Using the compendium, we uncover a refined regulatory program. In module #524, the hog1 and ssk1 strains exhibit reduced repression in the presence of 0.5 M KCl (paired t‐test; hog1/wt, P<10^−20; ssk1/wt, P<10^−14; Supplementary Figure 4A), but no reduction is observed for ste11 (ste11/wt, P<0.14). Derepression by a hog1/ssk1 knockout is also noticeable in a medium containing 0.125 M KCl (hog1/wt, P<10^−28; ssk1/wt, P<10^−20; Supplementary Figure 4B), and the effect is almost identical for the two knockouts (hog1/ssk1, P<0.002). Our analysis thus suggests that in medium/low osmotic shock, an Ssk1/Hog1‐transmitted signal represses the RNA processing module activity, whereas during high osmotic shock, a Hog1‐independent pathway is repressing the module additively to the Ssk1/Hog1‐mediated effect (Figure 4B and C).
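The mutant versus wild‐type comparisons above are reported as paired t‐tests. The sketch below computes the paired t statistic for the genes of one module, assuming that each pair consists of the same gene measured in the mutant and in the wild type under the same treatment; that pairing, the function name and the toy values are our assumptions. Converting the statistic to a P‐value requires the Student t distribution, which is left to a statistics library.

```python
from math import sqrt

def paired_t_statistic(mutant, wild_type):
    """Paired t statistic and degrees of freedom for matched measurements.

    mutant, wild_type: expression values of the same genes, in the same order.
    """
    if len(mutant) != len(wild_type) or len(mutant) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [m - w for m, w in zip(mutant, wild_type)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("t statistic is undefined when all differences are equal")
    return mean_diff / sqrt(variance / n), n - 1

# Invented log ratios for three module genes under the same treatment:
# the mutant is less repressed than the wild type for every gene.
toy_mutant = [0.0, -0.5, -1.0]
toy_wild_type = [-1.0, -2.0, -3.0]
toy_t, toy_df = paired_t_statistic(toy_mutant, toy_wild_type)
```

A large positive statistic in this toy case corresponds to reduced repression in the mutant, the pattern described above for the hog1 and ssk1 strains in module #524.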
Similar decomposition of the general stress response into components is possible for the set of stress‐induced genes. Two of the transcriptional modules that are activated in general stress conditions (and specifically in the 0.5 M KCl experiments) are module #536 (Respiration, regulated by Hap4) and module #1215 (Gluconeogenesis). Interestingly, while the response of both modules is remarkably similar in the early phases of the osmoregulation program (0–40 min), only the Respiration module shows a strong secondary induction after 60 min (Figure 4B). Examination of the expression of the HAP4 gene, which is generally coupled to the module's expression level (Supplementary Figure 3), also reveals an increase after 60 min (Figure 4D), supporting the hypothesis that module #536 undergoes two consecutive inductions, one via some common mechanism (which also affects module #1215) and a second that occurs later and is facilitated by the increased levels in Hap4 expression. This second wave of regulation is the adaptive response of yeast cells that have recovered from the osmotic shock, in preparation for further growth.
Analysis of the behavior of module #985 (Ergosterol biosynthesis, 18 × 69) provides another example of the power of the integrative approach. A clear Hog1‐dependent repression is observed. This result is in sharp contrast to the general ESR pattern, in which only derepression is Hog1 dependent. Previous work has shown that ergosterol‐related genes respond strongly to osmotic shock (Hohmann, 2002). Our analysis suggests that their repression directly depends on Hog1 through an unknown signaling pathway that does not involve Ssk1 or Ste11.
After almost a decade of microarray‐based experiments, a revision of the paradigm for their computational analysis is appropriate. In this work, we have introduced a new method for the simultaneous analysis of new high‐throughput data sets given a large compendium of diverse functional data. We have shown that the integrative approach greatly extends our understanding of the regulation of biological processes and allows the decomposition of seemingly global responses into characterized regulatory programs of specific biological modules. The methodology we envision (Figure 5) relies on a growing compendium of public data sets and on our robust algorithms for revealing biological correlations present within these data. Given the data of a new study, its integration with the large body of prior data allows us to recast the new experiments in terms of (a) the behavior of already characterized modules and (b) new modules that are discovered for the first time upon the addition of the new data. Using this approach, backed by appropriate community effort for module nomenclature (e.g., based on Gene Ontologies), the results of high‐throughput experiments will be easier to assess and share, as it will be clear what in the new experiments is new and what confirms previously published evidence. We are constructing an interactive web interface that will provide the infrastructure for this suggested methodology (www.cs.tau.ac.il/~rshamir/simba).
In this paper, we have focused on the analysis of yeast data. Functional genomics resources are available for many other model systems, and are rapidly accumulating in repositories such as GEO (www.ncbi.nlm.nih.gov) and ArrayExpress (www.ebi.ac.uk). Recently, a set of literature‐based and preprocessed gene sets were used to analyze a large cancer‐related data compendium. The integration of data was shown to be synergistic, even across different cancer types (Segal et al, 2004). Our methodology allows integrated analysis together with the discovery of new modules, making it an effective approach for the routine analysis of new high‐throughput data. Finally, recent studies have shown that transcriptional modules are sometimes highly conserved among species (Stuart et al, 2003; Bergmann et al, 2004; Tanay et al, 2004a). Having established deeper understanding of this conservation, it will be desirable to seek further integration of functional data across different species.
Materials and methods
We used data from 60 publications encompassing 1767 conditions. The complete list of publications and experiments is available on the Supplementary website. Data were downloaded from the papers' web supplements. For Affymetrix array experiments, we divided each condition's profile by a common reference condition (typically the zero time point of the experiment; see the Supplementary website for more details). For other experiments, we used the normalization reported in the original papers.
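The reference normalization described above can be illustrated with a minimal sketch. The condition names and values are hypothetical, and taking the log2 of each ratio is an assumption made here for convenience; the text states only that each profile was divided by the reference profile.

```python
import math

def normalize_to_reference(profiles, reference):
    """Divide each condition's profile by the reference profile, gene by gene.

    `profiles` maps a condition name to a list of per-gene intensities.
    Returns log2 ratios (the log transform is an assumption, see lead-in);
    non-positive intensities yield None instead of a ratio.
    """
    ref = profiles[reference]
    normalized = {}
    for condition, values in profiles.items():
        if condition == reference:
            continue  # the reference itself carries no information after division
        normalized[condition] = [
            math.log2(v / r) if v > 0 and r > 0 else None
            for v, r in zip(values, ref)
        ]
    return normalized

# Hypothetical time course with the zero time point as the reference.
profiles = {
    "t0": [100.0, 200.0, 50.0],
    "t30": [200.0, 100.0, 50.0],
}
ratios = normalize_to_reference(profiles, "t0")
```

In this toy example the first gene doubles, the second halves and the third is unchanged relative to the zero time point.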
In the SAMBA framework, each experiment defines one or several properties. For example, a gene expression experiment can be transformed into four properties, representing strong upregulation, weak upregulation, weak downregulation and strong downregulation in the tested condition. We assign properties to genes by applying translation functions that map experimental values to probabilities of having a property. For example, a gene with a high expression readout in condition X will be assigned the property ‘strong upregulation in condition X’ with high probability. The notion of a property is very flexible and can accommodate diverse sources of data. For example, protein interaction data can be transformed into properties of the form ‘interacting with protein X’ and phenotypes can be transformed into properties of the form ‘mutant is slowly growing on medium Y’. See Tanay et al (2004a, 2004b) for more details. We optimized the performance of SAMBA by testing the effect of changes in the parameters of the translation functions, and selected parameters that were robust to the addition of new data and are thus expected to provide good results as the compendium grows. Note that when generating protein interaction properties, we retained only properties for proteins with at least 15 targets and discarded all others, as they bias the statistical model. SAMBA detected statistically significant biclusters that included about half (133 out of 265) of the protein interaction properties. The remaining properties were either too distinct to be correlated with other properties or too noisy for the statistical stringency of the model. All the parameters defining the translation functions we used are available on our website.
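To make the idea of a translation function concrete, the following minimal sketch maps a log expression ratio to probabilities for the four expression properties. The logistic shape, the midpoints and the steepness are all hypothetical choices for illustration; the actual functions and parameters used with SAMBA are those listed on the website and in Tanay et al (2004a, 2004b).

```python
import math

def logistic(x, midpoint, steepness):
    """Smooth step rising from 0 to 1 around `midpoint`."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))

def translate_expression(log_ratio, strong=1.5, weak=0.5, steepness=4.0):
    """Map a log expression ratio to probabilities of four properties.

    All parameter values are hypothetical and serve only as an illustration.
    """
    up_any = logistic(log_ratio, weak, steepness)       # at least weakly up
    up_strong = logistic(log_ratio, strong, steepness)
    down_any = logistic(-log_ratio, weak, steepness)    # at least weakly down
    down_strong = logistic(-log_ratio, strong, steepness)
    return {
        "strong_up": up_strong,
        "weak_up": up_any - up_strong,
        "weak_down": down_any - down_strong,
        "strong_down": down_strong,
    }

# A gene with a high readout receives 'strong upregulation' with high probability.
props = translate_expression(3.0)
```

A smooth function of this kind avoids hard thresholds, so that a value near a cutoff contributes partial evidence to both neighboring properties.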
We applied the SAMBA 2.0 program to the entire compendium with standard parameters. The program searches the combined data set and outputs a set of modules, each of which is a set of genes that are correlated over a set of properties. Each gene may be part of several modules, allowing us to reveal multiple functions for it. Similarly, each property may belong to several modules, allowing us to associate it with different biological processes. The modules generated were then subjected to additional analysis. We associated biclusters with functional annotation terms using SGD GO associations and functional enrichment tests as previously described (Tanay et al, 2004b). We searched for enriched cis‐elements in all bicluster gene sets using promoters comprising 600 bp upstream of each ORF, as described (Tanay and Shamir, 2004). Visualizations of modules and the effects of specific experiments on them are available on our prototype website (www.cs.tau.ac.il/~samba/). More information on the algorithms and their parameters is available on the Supplementary website.
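As an illustration of a functional enrichment test, the sketch below computes a one-sided hypergeometric tail probability for the overlap between a module and an annotation term. The hypergeometric test is a common choice that is assumed here; the exact procedure used in this study is the one described in Tanay et al (2004b). The function name and the numbers in the usage line are hypothetical.

```python
from math import comb

def hypergeom_tail(genome_size, annotated, module_size, overlap):
    """P(X >= overlap) when `module_size` genes are drawn without replacement
    from a genome of `genome_size` genes, `annotated` of which carry the term."""
    total = comb(genome_size, module_size)
    upper = min(annotated, module_size)
    hits = sum(
        comb(annotated, k) * comb(genome_size - annotated, module_size - k)
        for k in range(overlap, upper + 1)
    )
    return hits / total

# Hypothetical module of 50 genes, 10 of which carry a term that
# annotates 100 of 6000 genes in the genome.
p_value = hypergeom_tail(6000, 100, 50, 10)
```

Because many terms are tested against many modules, such raw p-values would need a multiple-testing correction before interpretation.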
Given a target data set and the compendium, we define the set of responding modules as those that contain at least one property from the target data set. SAMBA adds a property to a module only when the module's genes have significant and correlated levels over that property; this approach therefore extracts only modules that respond significantly in the analyzed experiments. We can then study the behavior of the responding modules across the entire analyzed data set.
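The selection of responding modules amounts to a set intersection, as the following minimal sketch shows. The module identifiers and property names are hypothetical, and the data structure (a mapping from module identifier to its set of properties) is an assumption about how SAMBA output might be represented.

```python
def responding_modules(modules, target_properties):
    """Return the modules that contain at least one property of the target
    data set, together with the shared properties (sorted for readability)."""
    target = set(target_properties)
    return {
        module_id: sorted(target & set(props))
        for module_id, props in modules.items()
        if target & set(props)
    }

# Hypothetical modules and a hypothetical heat shock target data set.
modules = {
    "m1": {"up_in_heat_30min", "binds_Msn2"},
    "m2": {"up_in_dtt_60min"},
}
target = ["up_in_heat_30min", "down_in_heat_30min"]
responding = responding_modules(modules, target)
```

Here only the first module is retained, because it is the only one that shares a property with the target data set.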
Testing the significance of changes in modules' mean expression
To evaluate changes in the mean expression of a module between two conditions, we used a standard two‐tailed paired t‐test. To test the significance of an induction or repression of a module in a single condition, we performed two‐tailed, two‐sample unpaired t‐tests comparing the module's genes with the entire genome. To compute the compendium trends of modules #524 and #232, we used a standard best linear fit and computed the standard deviation of the samples' deviations from the fitted line. We also computed statistics using nonparametric tests, with similar results. The above hypothesis testing procedures were applied after running SAMBA, as additional tests of claims on the regulation of specific modules.
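The two computations described above can be sketched with the Python standard library alone. This is a minimal illustration with hypothetical inputs, not the code used in the study. It returns the paired t statistic and its degrees of freedom; converting these to a two‐tailed p-value requires the Student t distribution (for example via `scipy.stats.ttest_rel`). The spread around the linear fit is taken here as the sample standard deviation of the residuals, which is one reading of the description above.

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """t statistic and degrees of freedom of a paired t-test on the
    per-gene differences between two conditions."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

def linear_fit_residual_sd(x, y):
    """Least-squares line y = slope * x + intercept and the sample
    standard deviation of the residuals around it."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    return slope, intercept, stdev(residuals)

# Hypothetical expression values of four module genes in two conditions.
t_stat, dof = paired_t([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 3.0, 7.0])

# Hypothetical trend across four conditions.
slope, intercept, spread = linear_fit_residual_sd(
    [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
)
```

Conditions whose residual exceeds a few multiples of the computed spread are natural candidates for deviations from the compendium trend.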
AT was supported by a Horvitz complexity fellowship. MK was supported in part by the ISF, the Recanati Fund and the Israeli Ministry of Health. RS holds the Raymond and Beverly Sackler Chair for Bioinformatics at Tel Aviv University, and was supported by the Israel Science Foundation (Grant 309/02).
Supplementary Material [msb4100005-sup-0001.doc]
Supplementary Figures [msb4100005-sup-0002.pdf]
Copyright © 2005 EMBO and Nature Publishing Group