Here, we develop computational methods to assess and consolidate large, diverse protein interaction data sets, with the objective of identifying proteins involved in the coupling of multicomponent complexes within the yeast gene expression pathway. From among ∼43 000 total interactions and 2100 proteins, our methods identify known structural complexes, such as the spliceosome and SAGA, and functional modules, such as the DEAD‐box helicases, within the interaction network of proteins involved in gene expression. Our process identifies and ranks instances of three distinct, biologically motivated motifs, or patterns of coupling among distinct machineries involved in different subprocesses of gene expression. Our results confirm known coupling among transcription, RNA processing, and export, and predict further coupling with translation and nonsense‐mediated decay. We systematically corroborate our analysis with two independent, comprehensive experimental data sets. The methods presented here may be generalized to other biological processes and organisms to generate principled, systems‐level network models that provide experimentally testable hypotheses for coupling among biological machines.
Experimental biology has provided significant insight into the individual subprocesses involved in eukaryotic gene expression: transcription, pre‐mRNA capping, splicing and polyadenylation, mRNA export, and translation. However, only recently has the extensive coupling among these subprocesses been recognized (Maniatis and Reed, 2002; Orphanides and Reinberg, 2002), and much work remains to elucidate the interactions between molecular machineries involved.
To study coupling between gene expression subprocesses in yeast, we chose to take advantage of the large amount of protein interaction data available. High‐throughput protein interaction assays are a potentially rich source of information to identify mechanisms that facilitate cooperation and communication between individual members of multiprotein complexes. While the sheer size and scope of these data sets render them a valuable resource, care must be taken because of the high error rates and biases inherent in high‐throughput data.
We developed a three‐part computational strategy to infer coupling among gene expression machineries from available interaction data that minimizes the impact of data set errors and biases, as outlined in Figure 1. First, we integrated source data sets to create a comprehensive, weighted protein interaction network (Figure 1A). As densely connected proteins are likely to correspond to molecular machineries, we clustered the resulting protein interaction network to suggest such groupings (Figure 1B). Finally, to evaluate and present hypotheses regarding coupling among gene expression subprocesses, we searched for intercluster network motifs—patterns in the arrangement of links and clusters—that are signatures of biological coupling between protein complexes (Figure 1C).
We collected 13 yeast data sets for our analysis of the gene expression process, but our method can be generalized to any number of interaction data sets, biological processes, and organisms. Among the challenges of integrating a variety of diverse data sets are the varying coverage and quality of each one and the lack of a comprehensive gold standard. To address these difficulties, we developed a relative data set quality (RDQ) score based on measures of pairwise mutual data set overlap.
This concept of evaluating quality based solely on mutual comparison has been used by search engines for ranking web pages (Page et al, 1998). We weighted the contribution of each data set to the total network by its RDQ, so that the weight of each network edge reflects the calculated reliability of its source data. We compared different RDQ calculation methods, and selected one that both appropriately penalized data sets corrupted by false positives and corresponded to independent reliability criteria.
Because false positives and negatives nevertheless persist in the integrated network, we also derived for each pair of proteins a novel pairwise clustering coefficient (CC). Our CC definition provides a measure of the local, weighted network neighborhood around a pair of proteins, including pairs lacking a direct link. Heuristically, for each pair of proteins, links to common neighbors increase the CC, whereas links to uncommon neighbors indicate promiscuous binding and decrease the CC. The CC thus presents a powerful metric toward the identification of network regions that correspond to physical protein complexes. We then developed a method, based on the k‐means algorithm, to cluster proteins within the integrated network using their CC‐weighted links. The parameters used for both CC calculation and for our clustering algorithm were independently selected based on the ability to discern biologically interpretable clusters.
Within this network of interconnected protein clusters, supervised identification of motifs corresponding to known patterns of process coupling yielded hypotheses about coupling within the gene expression pathway. Our three motif types were as follows: direct coupling, identifying strong links between separate clusters; cluster‐mediated coupling, identifying small clusters that link two larger clusters; and adaptor‐mediated coupling, identifying proteins that may belong to either of two clusters, such as scaffolding linker proteins or proteins that shuttle between complexes and transiently associate with each. Ranking all occurrences of these motifs in the network yielded a prioritized list of experimentally verifiable biological hypotheses.
Our analysis was corroborated with independent, experimental protein interaction data from new, more comprehensive genome‐wide complex precipitation data sets. Interactions defining higher ranking direct coupling motifs were drastically more likely to correspond to links in higher quality validation sets, as were links within identified clusters.
Results recapitulated previous knowledge about gene expression mechanisms and suggested new possibilities. Several of our clusters were found to correspond to well‐characterized biological complexes, such as the spliceosome, the mRNA cleavage/polyadenylation factors, and the chromatin remodeling machineries SAGA, Swi/Snf, ISWI, and RSC (Figure 3A and B). Other clusters corresponded to functional modules that may represent sets of conditionally associated or interchangeable parts with shared interaction partners (Figure 3B and C), suggesting the architecture and coordination for dynamic, macromolecular structures such as the basal transcription machinery.
Cluster analysis also suggests specific hypotheses, such as the possibility of coupling among elongation, capping, and splicing machineries through cap‐binding proteins associated with elongation complexes and poised to interact with splicing components (Figure 3D). Involvement in motifs may suggest functional roles for unknown proteins: for example, the unknown proteins YJL015Cp and YDR154Cp are implicated in mRNA quality control through cluster‐mediated coupling of mRNA degradation to elongation, in a mechanism that possibly involves recruitment at transcription initiation (Figure 3E). A number of motifs support coupling of export with transcription termination, elongation, and chromatin modification: the export factors Pab1p and Crm1p, for example, are implicated in adaptor‐mediated coupling of export to transcription termination (Figure 3F). Consistent with previous results, adaptor‐mediated and direct coupling motifs identify the export factors Yra1p and Sub2p, respectively, in linking mRNA export to transcriptional elongation (Figure 3F and G). Interestingly, motifs involving the chromatin silencing proteins Sir2p, Sir4p, and Rap1p suggest a possible coupling between silencing and mRNA export (Figure 3H). Finally, a number of motifs suggest the coupling of translation and nonsense‐mediated decay (NMD) factors with nuclear events, including co‐transcriptional recruitment of NMD factors Upf1p and Nmd5p at the nuclear pore (Figure 3I), possible nuclear mRNA circularization facilitated in part by direct coupling between Cdc33p and Tif4632p (Figure 3F), and cluster‐mediated coupling of translation and NMD to transcription and mRNA export (Figure 3G).
Besides providing hypotheses regarding coupling of gene expression subprocesses, this work introduces several techniques for manipulation and analysis of interaction data. As more data become available, they may be integrated into the analysis, allowing ongoing evolution of the accuracy of the model. Finally, our method is readily extendable to other biological processes and organisms and, thus, provides a novel, useful strategy to elucidate the molecular basis of coordination of cellular events.
Coupling among subprocesses of the gene expression pathway in yeast can be inferred from multiple, large‐scale protein‐interaction data sets.
We integrated diverse protein interaction data sets and identified significant coupling among protein clusters involved in distinct subprocesses
Two new independent, high‐throughput protein interaction data sets corroborate the identification of clusters and of significant coupling links between them.
Our approach may be generalized to identify molecular interactions among coupled subprocesses in other cellular contexts.
Gene expression is a stepwise process involving distinct cellular complexes that carry out each subprocess, including transcription, pre‐mRNA capping, splicing and polyadenylation, mRNA export, and translation. Recent studies have revealed both physical and functional interactions between many proteins involved in separate subprocesses (Maniatis and Reed, 2002; Orphanides and Reinberg, 2002; Curr Opin Cell Biol 2005; 7: 239–339). This coupling is thought to improve quality, efficiency, and timing. The tight coupling of gene expression machines can also facilitate rapid changes in gene expression patterns (Misteli, 2001). For example, the carboxy‐terminal domain of RNA polymerase II is able to sequentially associate with diverse components of mRNA processing machinery, thus localizing key components to the nascent pre‐mRNA in a coordinated fashion (Proudfoot et al, 2002; Bentley, 2005).
Although the experimental evidence for the coupling of gene expression subprocesses is mounting, systematic searches for individual components that facilitate this coupling have just begun to be conducted (Burckin et al, 2005). Recent advances in high‐throughput experimental technologies have led to the creation of genomic‐scale protein–protein interaction data sets, which carry with them the possibility of discovering many new functional coupling relationships. However, these data sets also bring challenges due to their large size, high false‐positive rates of reported interactions (Edwards et al, 2002), and experimental biases leading to variable coverage of the proteome. One approach to ameliorate the biases and gaps inherent to each source is to combine multiple data sets, allowing a survey of experimental techniques and conditions (Bader and Hogue, 2002). However, this approach risks masking true coupling connections with the accumulation of noise from false positives (Gerstein et al, 2002). As with any automated analysis of large data sets, the challenge lies in discerning prioritized and biologically interpretable results.
Here, we present a multistep approach to identify coupling between gene expression subprocesses, designed to overcome the above challenges. We needed to (1) select, assess, and combine data sets, (2) cluster the interaction network to suggest protein groupings, and (3) find motifs that indicate coupling among the clusters. To avoid skewing disparate steps in analysis to observations of the final results, we did not choose parameters based on the overall final result. Rather, our guiding principle was to select parameters at intermediate steps to satisfy appropriate, objective criteria. For example, criteria for data integration include minimizing the impact of adding data sets with high false‐positive rates to the combined data set; criteria for computing protein clusters include maximizing functional consistency.
For step 1, we focused our analysis on yeast to take advantage of the large amount of available yeast protein interaction data and of the high degree of conservation of yeast gene expression mechanisms to their well‐studied mammalian counterparts. The construction of interactome networks from diverse data sources is an important area of active research (Grigoriev, 2003; Lee et al, 2004; Li et al, 2004). Our collection of 13 protein interaction data sets (Table I) aims for as complete a survey of publicly available interaction data as reasonably possible. First, we included the results of high‐throughput, whole‐proteome screens (Table I, data set IDs 1–4); the currently available data sets use the yeast two‐hybrid and protein complex immunoprecipitation techniques. We supplement these with data sets of computationally predicted protein interactions (Table I, data set IDs 5–7). Finally, we included results from traditional, hypothesis‐driven experiments described in the literature (Table I, data set IDs 8–13). These last sets were collected from the interaction repository database at the Munich Information Center for Protein Sequences (MIPS) (Mewes et al, 2002) with high‐throughput experiments (1–4) removed. Proteins not linked to proteins functionally annotated to gene expression were excluded from the final data set (Supplementary information 1 and 2).
Current methods for relative data set reliability assessment include comparisons to ‘gold standards’ (Jansen et al, 2003), expression profiles, and functional conservation (Deane et al, 2002). Multiple data sets can be integrated in several ways, such as the Bayesian (Jansen et al, 2003) and random forest (Qi et al, 2005) methods used in other studies. Because we wanted to be able to flexibly adapt to growing numbers of data sets without reliance on a gold standard, we developed a relative method for determining data set reliability based solely on mutual comparison. We then cumulated evidence for each individual network edge/protein pair interaction in proportion to the computed reliability of its source.
For step 2, multiple methods have been developed to identify ‘modules’ in protein interaction networks, including clustering the interaction network alone (Samanta and Liang, 2003; Spirin and Mirny, 2003; Tornow and Mewes, 2003), or introducing orthogonal classes of data (Yeger‐Lotem and Margalit, 2003; Lee et al, 2004; Tanay et al, 2004; Gunsalus et al, 2005). We chose to use the unsupervised k‐means algorithm, which is intuitive to understand and interpret, and is commonly used to cluster continuous‐valued data, and modified it to work on network data. Finally, for step 3, biologically significant motifs within networks have been defined by methods such as the least or most connected subgraphs (Vazquez et al, 2004), or discovered using metrics such as recurrence frequency (Shen‐Orr et al, 2002). Motivated by these approaches, we developed a method to identify motifs of coupling among clusters, rather than among individual proteins. Our approach used a supervised method to find instances of motifs modeled on known patterns of biological coupling, rather than empirically discovering de novo coupling motifs.
Summary of method
We began by consolidating diverse binary protein interaction data sets into a single network, where nodes represent proteins and edge weights reflect the accumulation of evidence supporting their interaction (Figure 1A). First, each of the 13 data sets was formatted, if necessary, to indicate only the presence or absence of a pairwise protein interaction. Each data set was then reduced to include only interactions among the 2100 proteins that may be involved in gene expression, based on annotations in Gene Ontology and augmented by MIPS data (Figure 1A(1); Supplementary information 1 and 2; Supplementary Table S1). We then assessed the reliability of each data set to compute a relative data set quality (RDQ) score, based on pairwise comparisons of its mutual overlap with every other data set (Figure 1A(2); Supplementary information 5; Supplementary Table S2). This concept of evaluating quality based solely on mutual comparison has been used by search engines for ranking web pages (Page et al, 1998). Notably, as this method relies on checks and balances among data sets, RDQ evaluations perform better when larger numbers of data sets with diverse biases are compared. Each data set contributes to the final, integrated network, by adding links appropriately weighted by the corresponding RDQ (Figure 1A(3) and Supplementary information 3).
To reduce the weights of links due to false positives, and to reconstruct links missed due to false negatives, we generalized the pairwise clustering coefficient (CC) (Goldberg and Roth, 2003) to weighted networks. Our CC formula is a measure of the local, weighted network neighborhood around a pair of proteins, including pairs lacking a direct link (Figure 1A(4) and Supplementary information 8). Heuristically, for each pair of proteins, links to common neighbors increase the CC, whereas links to uncommon neighbors indicate promiscuous binding and decrease the CC.
To find molecular machineries corresponding to clusters of densely connected proteins in our weighted network, we developed a modification of the k‐means clustering algorithm (Figure 1B). First, k initial ‘centroid’ proteins are randomly chosen, and each remaining protein is assigned to the centroid to which it is most strongly linked based on cumulative path weight, defined as the sum of CC‐weighted links along the shortest path to the centroid. Each centroid thus defines a discrete cluster of associated proteins (Figure 1B(1) and 1B(2), right), and Supplementary information 10). We then (1) reassign the centroid of each cluster to the protein of maximal average CC to all other proteins within the cluster (hence, ‘centroid’) (Figure 1B(2), right), (2) reassign proteins into clusters about the new set of centroids (Figure 1B(2), left), and (3) iterate (1, 2) until convergence (Figure 1B(3)). We repeated this procedure with many sets of randomly chosen initial centroids and chose the clustering that maximized CC‐weighted paths to centroids for subsequent steps (Figure 1B(4)). We then annotated this final set of clusters based on functional enrichment in gene expression subprocesses (Figure 1B(5) and Supplementary information 12) (Tavazoie et al, 1999; Wu et al, 2002).
Finally, we searched the network for three types of motifs, or patterns in the arrangement of links and clusters that reflect observed manners of coupling between protein complexes (Figure 1C). Direct coupling identifies strong links between proteins tightly grouped within separate clusters, such as the interactions between subunits of the Mediator complex and the basal transcriptional machinery (Kuras et al, 2003) (Figure 1C(1)). Cluster‐mediated coupling identifies small clusters that link two larger clusters, such as the exon junction complex coupling splicing and export machineries in mammals (Reed, 2003) (Figure 1C(2)). Adaptor‐mediated coupling identifies proteins that may belong to either of two clusters, such as scaffolding linker proteins or proteins that shuttle between complexes and transiently associate with each (Figure 1C(3)). For example, the yeast transcription termination and polyadenylation machineries are thought to be linked by the protein Ctk1p, which is involved in both processes (Kim et al, 2004). Our method ranked all instances of these motifs in the network to present a prioritized list of experimentally verifiable biological hypotheses.
A novel algorithm for protein data set reliability calculation
Current approaches to assessing the quality of high‐throughput data sets generally determine the rate of false positives and/or false negatives of each test data set with respect to a benchmark ‘gold standard’ (Jansen and Gerstein, 2004). A typical benchmark has been the MIPS data set, a repository for protein interaction data from the literature (Mewes et al, 2002). Our challenge was to develop a method for assessing RDQ scores of our data sets independent of such a gold standard. Intuitively, an RDQ score for a data set can be determined as a function of how well it corroborates with each of the other data sets, modulated by the RDQ scores of each of the corroborating data sets. Thus, RDQ scores can be computed by starting with an initial vector of RDQ scores, iteratively corroborating data sets with each other, and dynamically updating their RDQ scores until the scores converged. In fact, given a matrix M reflecting the pairwise corroboration of the data sets, this converged RDQ weight vector is given by the principal eigenvector of M. Given a set of RDQ scores wi, reflecting the relative trust of each data source with respect to the others, the individual data sets of protein interactions Si (represented by matrices with binary entries) can be combined by simple matrix addition to form a final graph S=∑wiSi (Figure 1A).
We compared RDQ scores generated using three different measures of data set overlap (Supplementary information 5) by their ability to down‐weight data sets with noise due to false positives (Supplementary Figure S1 and Supplementary information 6), a problem pervasive in some genomic‐scale data sets (von Mering et al, 2002). The set of RDQ scores chosen according to this criterion also rewards data sets enriched in links between functionally related proteins (Supplementary Figure S2 and Supplementary information 7). Based on these scores, the MIPS affinity and MIPS co‐purification data sets were shown to be the most reliable, whereas the Rosetta fusion and paralog‐predicted data sets were the least reliable, as might be expected (Table I).
As discussed above, the pairwise CC measures the local neighborhood cohesiveness around a pair of proteins by assessing relative connectivity to common and distinct neighbors, allowing us to account for false negatives and compensate for false positives (Goldberg and Roth, 2003). Our challenge was to generalize the concept of CC to weighted graphs to incorporate link weight as well as topology. We developed and compared a range of CC formulas and selected the formula with scores that reflect the strongest correlation to functional relationships between proteins. By this measure, the selected CC scores significantly outperform even the original network weights (Supplementary Figure S3; Supplementary Table S3; Supplementary information 8 and 9).
Biological interpretation of clusters generated
We used a modification of the k‐means deterministic clustering algorithm (Hartigan, 1975) to partition our network of 2100 proteins (Supplementary information 10). This algorithm produces k discrete clusters of proteins linked by strong CC weights and is well suited to the identification of protein complexes corresponding to hypothesized molecular machines. We surveyed different numbers of desired final clusters, k, and found that k=70 consistently generated biologically interpretable clusters (Supplementary information 11), with an average of 30 proteins per cluster. The resulting 70 clusters used for further analysis (Figure 2A(ii) and Supplementary Table S5) were annotated based on enrichment in our nine defined subprocesses of gene expression (Figure 2A(iii) and Supplementary Table S6). Out of the 70 clusters, 48 showed significant enrichment in one or more subprocesses of gene expression (P‐value<0.05; see Materials and methods), and all nine subprocesses were represented by at least one cluster (Figure 2C). Among these 70 clusters, ∼25% were densely interconnected as expected. However, ∼50% of the clusters were sparsely linked internally but contained proteins linked through a small number of common binding partners (some of which appear in different clusters). This was due to the use of pairwise CC scores in the clustering process, which assigns high scores to pairs of proteins having common neighbors regardless of the existence of a direct link between them.
Dense internal linkage.
Many of the densely internally linked clusters were found to correspond to well‐characterized biological complexes in yeast (Figure 3A), the members of which are highly similar to their counterparts in mammalian complexes. For example, 64% of the known yeast spliceosomal proteins were automatically grouped together in cluster #21 (abbreviated as C21). Other co‐clustered complex members include the mRNA cleavage/polyadenylation factors (C26), subunits of the CCR–NOT complex (C27), and the chromatin remodeling machineries SAGA, Swi/Snf, ISWI, and RSC (C1, Figure 3B, left).
Sparse internal linkage.
Many of the more sparsely internally linked clusters reconstruct known functional modules that may represent sets of conditionally associated or interchangeable parts that bind to common partners. For example, our clustering grouped the general transcription factor (GTF) TFIIA in cluster C8 together with subunits of the GTF TFIID despite lack of interaction in the original data sets (Figure 3B). Indeed, the interaction between one of these TFIID subunits with TFIIA has just recently been validated experimentally (Robinson et al, 2005). Interestingly, the subunits of TFIID are all either in the sparsely interconnected cluster C8 or in the densely interconnected cluster C1. The interconnected subunits in C1 represent a TFIID core, which includes the yeast TBP homolog Spt15p, subunits common to TFIID and SAGA (TAFs), and two other subunits central to the TFIID assembly (Taf1p/Taf2p) (Auty et al, 2004). Spt15p is shown to bind each TFIID subunit by an interaction in at least one of our data sets, suggesting that the mutually unconnected subunits in C8 may conditionally associate with the TFIID core in C1. This is consistent with the fact that Spt15p is sufficient for basal transcription (Kim et al, 1994), whereas different TFIID subunits are required for transcription of distinct gene subsets (Walker et al, 1997). Similarly, use of the CC co‐clustered the seven DExD/H‐box helicases (Figure 3C), which independently carry out similar biological roles (oligoribonucleotide unwinding) in nearly every step of mRNA processing (de la Cruz et al, 1999), in the sparsely interconnected cluster C43.
Co‐clustering due to direct physical interaction or links to common external partners is consistent with subprocess coupling. The co‐clustering of the cap‐binding protein Cbp20p with splicing factors in cluster C21 due to physical interactions, for example, suggests coupling of capping and splicing (Figure 3D). Cbp20p is a common binding partner for co‐clustering of the elongation protein Rlr1p, the cap‐binding protein Cbp80p, and other proteins in the sparsely interlinked cluster C17, suggesting the coupling of capping and elongation. Indeed, Cbp20p and Cbp80p have been shown to be involved in splicing experimentally (Colot et al, 1996; Lewis et al, 1996; Hirose and Manley, 2000). Together, these results suggest that Cbp20p and Cbp80p (along with Cbc33p, another cap‐binding protein in C21), bound to the nascent 5′ cap, are associated with elongation complexes and poised to interact with splicing components. This is consistent with previous studies showing coupling among elongation, capping, and splicing machineries (Maniatis and Reed, 2002; Orphanides and Reinberg, 2002).
Intercluster interaction motifs
To identify potential coupled protein machineries, we identified cluster pairs or triples with interactions that satisfy motifs identified as hallmarks of process coupling. To measure the degree to which cluster pairs represent distinct (yet potentially coupled) protein complexes, rather than single molecular machineries artificially separated by our clustering process, we developed a pairwise cluster separability score (Supplementary information 13). For a pair of clusters, the cluster separability score is simply the sum of the individual k‐means clustering scores divided by the clustering score of their merger; the bigger the ratio, the more ‘natural’ the separation of the clusters. We imposed a threshold so that only cluster pairs with separability score in the top 50% (thus more likely to represent distinct machineries) were searched for coupling motifs. From these, we identified and ranked 2029, 517, and 276 direct, cluster‐mediated, and adaptor‐mediated coupling motifs, respectively (Figure 1C and Supplementary Tables S7–S9), for further biological investigation. Rankings were based on network link strength and topology to prioritize motifs by potential biological relevance. Top 25‐ranked coupling motifs and cluster protein compositions are visualized in Supplementary Figure S4. The robustness of identification and ranking of motifs to the selection of an optimum clustering run is evaluated in Supplementary information 14.
Previous studies have led to the establishment of yeast complex precipitation data sets using tandem‐affinity purification followed by mass spectrometry (Gavin et al, 2002; Ho et al, 2002). From these, we inferred pairwise protein interactions as described in Supplementary information 2. New, more comprehensive genome‐wide complex precipitation data sets were recently obtained in which proteins were identified using gel‐free liquid chromatography mass spectrometry (LCMS) or matrix‐assisted laser desorption/ionization (MALDI) mass spectrometry (Krogan et al, in preparation). A database of pairwise protein interactions was derived for each method in which the interactions were quantified by confidence or reliability scores. Because these data sets were not used to generate our model of coupled protein clusters, they provide an opportunity to corroborate our analysis with independent, experimental protein interaction data.
For each of the two independent data sets, this corroboration can be illustrated by the fold enrichment: the ratio of the number of interactions in the data set that define direct coupling motifs to the number of these interactions defining direct coupling motifs in a randomized model (Supplementary information 15). For a complete analysis, we carried out this corroboration after applying four different thresholds on confidence scores in the independent data sets (Figure 4). More stringent thresholds, which presumably select higher quality data, provide a better fit to our model by demonstrating greater fold enrichments. Fold enrichment was improved by considering successively higher ranked coupling links in our model. Using the most stringent threshold, enrichments as high as 350‐fold were observed for the top 1% of direct coupling links; the top 30%, however, still showed a ∼50‐fold enrichment. The peak around 15–25% is likely due to the fact that many direct interactions ranked in the top 8–13% have equal ranking score and are ranked arbitrarily; of these, many may be matched by false negatives in the corroborating data sets, especially the LCMS set.
Similar corroboration was performed for interactions defining cluster‐mediated and adaptor‐mediated motifs, and significant, although lower, fold enrichments were obtained (Supplementary Figure S5). Furthermore, interactions from the independent data sets with highest stringency thresholds applied were approximately seven times more likely to fall within clusters, two times more likely to occur between coupled clusters, and three times more likely to occur between non‐coupled, co‐annotated clusters in our model compared to randomized models. Interestingly, comparable fold enrichments were found using either the LCMS or the MALDI data sets. Therefore, independent data quantitatively support the predictive power of our approach, both in the formation of coupled, functionally annotated clusters and especially in the identification and ranking of coupling links between them.
Top‐ranking results confirm and infer coupling among gene expression machineries
The distribution of motifs ranked in the top 30% of their respective motif pattern among pairs of gene expression subprocesses is shown in Figure 2B. In the paragraphs below, we assess the biological significance of coupling interactions indicated by top‐ranked motifs with existing literature and suggest potential new coupling mechanisms in the contexts of chromatin dynamics, quality control, and mRNA export.
Pre‐mRNA quality control via coupling of mRNA degradation to transcription
While accurately transcribed and processed mRNAs are targeted for export, abnormal pre‐mRNAs are degraded due to transcriptional quality control mechanisms. Studies suggest that quality control machinery may associate with nascent pre‐mRNAs as soon as transcriptional elongation begins, and travel with RNA polymerase to the termination and polyadenylation site (Minvielle‐Sebastia and Keller, 1999). The TREX (transcription and export) complex could provide a potential scaffold to facilitate this surveillance. Notably, the TREX complex also provides a specific example of differences in coupling mechanisms between yeast and mammals (Masuda et al, 2005), as its recruitment to nascent mRNA is coupled to transcription in yeast and to splicing in mammals.
The top‐ranking cluster‐mediated coupling motif implicates C54 as a mediator between the mRNA degradation cluster C40 and the elongation cluster C15, which includes members of TREX (Figure 3E). In particular, the unknown essential protein YJL015Cp, linked to the elongation cluster C15, and the unknown protein YDR154Cp, linked to the mRNA degradation cluster C40, are candidates for involvement in quality control. As all the proteins in sparsely interlinked cluster C54 interact with the transcription initiation protein Srb2p as a common binding partner, this suggests their potential recruitment to sites of transcription initiation. Overall, the existence of this motif suggests that the switch between transcription initiation and elongation is subject to quality control.
Alternate methods of recruitment of the mRNA export machinery to nascent RNA
Nuclear export factors are recruited to nascent pre‐mRNA during various steps of RNA processing and transcription (Reed, 2003; Tange et al, 2004; Aguilera, 2005; Darzacq et al, 2005; Reed and Cheng, 2005). Coupling of export to RNA processing provides a quality control mechanism to prevent the export of incorrectly processed mRNAs (Tange et al, 2004). The correct completion of the processing steps of polyadenylation (involved in transcription termination) and splicing, for example, has been shown to be necessary for efficient export in metazoans (Reed, 2003; Sommer and Nehrbass, 2005). Coupling of export to transcription, on the other hand, may poise the machinery for timely and efficient transport. Export is coupled to elongation for pre‐mRNAs lacking introns in mammals (Lei et al, 2001); this is thought to be a major mechanism in yeast, where over 96% of genes are free of introns. In addition, new results indicate the coupling of export to chromatin modification at the initiation of transcription (Rodriguez‐Navarro et al, 2004). Our motifs reveal additional support for coupling of export with transcription termination, elongation, and chromatin modification as shown below.
Coupling of mRNA export and transcription termination
First, clusters C31 and C30 both co‐cluster termination and export factors, and Hrp1p in C30 is in fact annotated to both processes (Figure 3I). Second, an adaptor‐mediated coupling motif (ranked 7th) (Figure 3F, right) links the core termination cluster C26 to the adaptor protein Pab1p, a poly(A)‐binding protein in C43, annotated to export, translation, splicing, and mRNA degradation. This coupling role for Pab1p, a member of the termination complex CF IA, is consistent with previous studies in yeast showing that mRNA export is inefficient in CF IA mutants (Hammell et al, 2002). Third, another adaptor‐coupled motif (ranked 21st) (Figure 3F, top right) links the transcription termination and export module C31 to the adaptor protein Crm1p, an mRNA and protein export factor. The coupling between transcription termination and mRNA export revealed by our analysis is experimentally supported by the observation that export proteins participate in termination (Jensen et al, 2001; Hammell et al, 2002).
Coupling of mRNA export and transcription elongation
A direct coupling motif (top 2%‐ranked; Figure 3F, bottom) involves the interaction between the mRNA export factor Sub2p (C43) and the putative transcription elongation factor Tho1p in elongation‐annotated cluster C19. Also, an adaptor‐mediated coupling motif (top 30%‐ranked; Figure 3G) links the export protein Yra1p in C70 to C27, containing members of the CCR–NOT elongation complex, thus implicating Yra1p in transcription‐coupled export as well. This is consistent with previous observations of elongation‐coupled export in which the export proteins Sub2p and Yra1p associate with the THO elongation complex to form TREX, the yeast transcription/export complex (Reichert et al, 2002; Strasser et al, 2002). In fact, Tho1p is functionally similar to the TREX component Tho2p (Piruat and Aguilera, 1998). Moreover, Sub2p was shown to be essential for export of intronless genes in Drosophila (Gatfield et al, 2001), and was implicated in coupling elongation and export in a previous genome‐wide study (Burckin et al, 2005).
Coupling of export and chromatin silencing
The silencing protein Sir4p is assigned to the export‐annotated cluster C30 (Figure 3H), based primarily on its physical interaction in the complex data set with Kap60p, an mRNA export factor (Liu et al, 1999) that serves as a common binding partner for the members of C30. Top‐ranked direct coupling highlights the interactions of Sir4p with two of its known binding partners in clusters enriched in chromatin silencing and DNA‐binding proteins: the top direct‐coupling motif links Sir4p to the silencing protein Sir2p in C18, and the fourth‐ranked direct‐coupling motif links Sir4p to the telomere‐silencing protein Rap1p in C11. Notably, heterochromatic DNA regions are often found associated with the nuclear periphery in the proximity of nuclear pores (Laroche et al, 2000), and Sir2p and Sir4p have previously been localized to the nuclear periphery as well (Huh et al, 2003). Although the significance of this localization was previously thought to be a direct functional requisite for gene silencing, recent studies in yeast are inconsistent with this hypothesis (Gartenberg et al, 2004). The physical coupling between export and silencing could have other explanations, such as the existence of proteins that participate in both processes, or even localization of silenced regions in proximity to the nuclear pore in order to facilitate export of transcripts from nearby euchromatic regions.
Coupling translation and nonsense‐mediated decay with other gene expression processes
Nonsense‐mediated decay (NMD) is a translation‐dependent quality control mechanism for recognizing and degrading mRNAs containing premature termination codons. NMD in mammalian cells is generally thought to occur in the cytoplasm, although recent studies suggest that it may also occur in the nucleus (Brogna et al, 2002; Iborra et al, 2004a). Evidence that nuclear translation may occur in yeast exists as well. For example, a potential nuclear pioneer round of translation in yeast is suggested by the binding of the cap‐binding complex Cbp20p/Cbp80p to mRNA in the nucleus (Ishigaki et al, 2001). In addition, recent studies have shown that the yeast NMD factor Upf1p associates with the nuclear pore, consistent with the possibility that NMD occurs as the mRNA exits the nucleus (Nazarenus et al, 2005). However, evidence against nuclear translation in yeast was provided by the observation that blocking export prevents NMD (Kuperwasser et al, 2004). While this observation was interpreted in favor of cytoplasm‐only NMD, an alternative explanation is that nuclear NMD only occurs when coupled to mRNA export. For example, a ‘pioneer round’ of translation could occur as the mRNA exits the nuclear pore. Recognizing that nuclear translation in both mammals and yeast is controversial (see Iborra et al, 2004b; Dahlberg et al, 2004 for contrasting views), we have identified coupling motifs consistent with the possibility that NMD may be coupled to pre‐mRNA processing in yeast.
NMD is initiated upon recognition of an untranslated coding mRNA sequence downstream of a termination codon (Hilleren and Parker, 1999; Maquat, 2002). In mammals, a complex of NMD proteins mark coding mRNA sequences by assembling on the exon junctions of newly spliced mRNA to form exon junction complexes (Le Hir et al, 2000). By contrast, in yeast, NMD proteins assemble on downstream elements of the coding mRNA (Ruiz‐Echevarria et al, 1998). Ribosomes participating in a pioneer round of translation in both yeast and mammals are thought to displace the NMD proteins from mRNA. Ribosomes stalled at premature termination codons interact with the NMD complex assembled on downstream mRNA coding sequences, an interaction thought to lead to recruitment of mRNA degradation factors (Czaplinski et al, 1998).
Coupling of translation and NMD to transcription and mRNA export.
A direct interaction motif (ranked 16th and validated by copurification data; Figure 3I) involves Nmd5p (C31), a protein previously implicated in the NMD process (He and Jacobson, 1995), and the nuclear pore protein Nup159p (C30), consistent with the possibility that NMD occurs at the nuclear pore. Nup159p is known to be involved with the final stage of mRNA export as a docking site for proteins that interact with mRNPs exiting into the cytosol, such as the DEAD‐box helicase Dbp5p (Schmitt et al, 1999; Weirich et al, 2004). Surprisingly, Dbp5p and Nmd5p interact physically with transcriptional elongation factors Dst1p (a TFIIS homolog) (Albertini et al, 1998) and Ssl1p (a TFIIH subunit) (Estruch and Cole, 2003), respectively. This suggests co‐transcriptional recruitment of factors involved in possible NMD processes at the nuclear pore. Coupling of transcription and NMD is consistent with observations that NMD proteins copurify with several different PolII subunits (NJ Krogan, unpublished data). Nmd5p is known to interact with the key NMD protein Upf1p (Czaplinski et al, 1998), and may thus be involved in concentrating Upf1p at sites of pioneer‐round translation and NMD. These sites may be nuclear, since in human cells at least, Upf1p has been shown to shuttle to the nucleus (Mendell et al, 2002).
Coupling of transcription and translation, and nuclear mRNA circularization.
Translation initiation in eukaryotes requires the circularization of mRNA through the binding of cap‐ and poly(A)‐binding proteins with translation initiation factors at the loop junction (Ishigaki et al, 2001). The second‐ranked direct coupling motif in our analysis suggests that this may occur in the nucleus (Figure 3F). Here, the cap‐binding translation initiation factor Cdc33p (C21) binds the poly(A) tail‐binding protein Tif4632p (eIF‐4F in mammals, C26) (Tarun et al, 1997), also involved in translation initiation (Goyer et al, 1993; Lang et al, 1994). The motif is consistent with the possibility that Cdc33p and Tif4632p associate with the mRNA cap and tail, respectively, then with each other to lead to the circularization of mRNA in the nucleus. Yeast mRNAs associated with either the nuclear Cbp20p/Cbp80p or the cytoplasmic mRNA cap‐binding translation initiation factor Cdc33p (eIF‐4E) have been shown to be subject to NMD (Gao et al, 2005). Yeast Cdc33p has, furthermore, been found to occur in the nucleus as well (Lang et al, 1994). Thus, unlike in mammalian cells, both Cbp20p/80p and Cdc33p‐bound mRNA may mediate a nuclear pioneer round of translation coupled to NMD in yeast.
Although both proteins were previously characterized as cytoplasmic, Cdc33p and Tif4632p are clustered in predominantly nuclear clusters due to strong physical interactions. This suggests that they could be involved in or recruited through nuclear gene expression events. Experimental results indeed support the presence of Cdc33p in the nucleus in both yeast (Lang et al, 1994) and mammals (Wilkinson and Shyu, 2002). The clustering of Cdc33p in the spliceosome cluster C21 is consistent with the possibility that its recruitment to mRNA is coordinated with splicing, as observed for other cap‐binding proteins (Lewis et al, 1996). Moreover, this motif suggests that the exchange of the (mostly nuclear) Cbp20p/Cbp80p cap‐binding translation initiation complex for the (mostly cytoplasmic) Cdc33p occurs in the yeast nucleus in a manner coordinated with splicing and nuclear circularization. This exchange was previously suggested to be nuclear in mammals (Wilkinson and Shyu, 2002). Finally, the poly(A)‐binding Tif4632p belongs to the transcription termination cluster C26, perhaps indicating loading onto the nascent poly(A) tail in a manner coordinated with termination.
Coupling of translation and NMD to transcription and mRNA export.
The second‐ranked cluster‐mediated coupling motif involves the mediating cluster C70, identified as a translation module because it is enriched in protein components of the ribosome and ribosome processing factors (Figure 3G). The motif couples C13, a transcription elongation and capping cluster, with C30, an mRNA export cluster. The RNA polymerase subunit Rpb9p in C13 interacts with Spr6p, a protein of unknown function and localization in the mediating cluster, suggesting the participation of Spr6p in coupling transcription to translation in the nucleus. In the mediating cluster, the tRNA synthetase GluRS further links to another protein necessary for translation, the tRNA delivery protein Arc1p, in the export cluster C30. This could be explained in several ways. The first is the possible co‐export of NMD proteins with other RNA‐associated proteins, leading to a perceived coupling motif. These proteins could also be directly involved in mRNP packaging for export. Other simple mechanisms that would explain why these diverse components may be brought in proximity with each other include nuclear ribosome biogenesis or retrograde tRNA transport. Finally, the coupling motif may suggest something beyond simple juxtaposition, presenting candidate proteins for the coupling of transcription to translation, and then from translation to mRNA export. This is strengthened further by the membership of the key mRNA export protein Yra1p, a protein previously implicated in coupling export to transcription, in the mediating cluster C70.
Our analyses suggest an extensive coupling of nuclear gene expression events, including the potential pioneer round of translation and NMD, tightly coupled to nuclear events from transcription to export through the nuclear pore. Coupling NMD to events that precede exit into the cytoplasm is a plausible possibility that presents significant benefits in efficiency, such as reducing the production of truncated polypeptides in the cytoplasm. The dedication of a nuclear fraction of translation and NMD machinery to quality control furthermore frees up the cytoplasmic fractions for mass protein production (Iborra et al, 2001). Furthermore, while the mammalian nuclear translation initiation factor 4A‐like factor eIF4AIII is required for NMD, its cytoplasmic homolog eIF4A1/II is not—but is required for bulk translation (Ferraiuolo et al, 2004; Palacios et al, 2004; Shibuya et al, 2004). The yeast homolog of eIF4AIII is Fal1p, but to our knowledge its possible role in nuclear translation has not been addressed. Regardless, investigation of the mechanisms suggested by the motifs described above may offer further insight into the orchestration of translation and NMD.
Eukaryotic gene expression, once viewed as a stepwise process involving distinct cellular machines, is now generally viewed as a series of highly coupled subprocesses (Ares and Proudfoot, 2005). While both biochemical and genetic experiments have identified functional interactions among a limited number of components of gene expression complexes, the immediate need for a systematic search for proteins involved in coupling has been recognized (Hieronymus and Silver, 2004). Our goal was to confirm and predict proteins coupling gene expression machines in yeast, and to extrapolate where possible to mammalian genomes. To accomplish this, we have created a modular and extensible framework to integrate large sets of data and generate a priority list of human‐interpretable and readily testable hypotheses, each of which may be difficult or impossible to predict by manual inspection. We sought to develop a general method that is applicable to all organisms, although here we made use of yeast protein interaction data because they are by far the most extensively available databases. We recognize that the increased complexity of mammalian gene expression clearly involves coupling mechanisms that are distinct from those in yeast. Still, our analysis over the yeast data and the development of computational methods provide the framework for future analyses of coupling of gene expression in mammalian cells.
Comparison of protein interaction data set integration and network clustering methods and outcomes is an ongoing challenge (Gerstein et al, 2002; Hart et al, 2005). We took a principled approach to identifying potentially long‐range coupling among distinct machines given noisy data. Significantly, two of the methods presented are novel contributions that should be applicable to other research applications: (1) our RDQ method of interaction data set quality calculation introduces a principled way to evaluate data sets independent of reference data and (2) the development of a pairwise CC for weighted graphs significantly improves upon existing ways to measure local neighborhood cohesiveness in an interaction network.
There are several limitations of our approach. First, we incorporate diverse data sets that include a broad spectrum of cell and experimental conditions, and therefore lack the ability to distinguish behaviors and protein interactions specific to various cell states such as meiosis, which involves degradation of the nuclear membrane. Second, the RDQ method presented here may be used to evaluate any number of data sets of similar type (i.e., protein interaction); however, if it were desired to extend the method to include dissimilar and uncorrelated data sets such as mRNA coexpression (Jansen et al, 2002), their RDQ values must be evaluated separately to avoid underweighting due to lack of data set overlap. Third, in our data representation, each protein is represented exactly once, a simplification that overlooks complicating factors such as copy number and appearance of each individual protein at various locations in the cell. Finally, our approach necessitates demarcation of discrete clusters, although the composition of protein complexes may be dynamic. Other approaches to data clustering (Samanta and Liang, 2003; Spirin and Mirny, 2003) may help address these challenges, and as methods for comparing clustering results are developed (Hart et al, 2005), this step in our approach may be improved.
As in all computational analyses, care must be taken in interpreting the results through a lens of informed biological skepticism. Coupling motifs identified in the static network model may be due to true coupling (spatiotemporal conjunction along with functional cooperation), but may also be due to simple spatial conjunction, or multiple interactions of possibly multiple copies of a single protein. For example, in the coupling predicted for translation or NMD, alternative explanations include coupling of cytoplasmic translation with transcription through co‐export of factors with mRNPs—an example of conjunction without cooperation. In an example addressing multiple independent interactions, cluster #23 includes two distinct protein subsets: one nuclear and one cytoplasmic. These are likely related to the two spatially and functionally distinct roles of their common binding partner Sas10p: chromatin silencing (Kamakaka and Rine, 1998) and ribosomal processing (Dragon et al, 2002).
Our analysis includes binding partners of proteins annotated to the gene expression pathway. This allows the possibility of identifying functional roles or coupling links for proteins in non‐nuclear cellular locations (for a review of surprising nuclear roles for cytoplasmic and membrane proteins, see Benmerah et al, 2003). However, top‐ranking coupling motifs implicating these diverse proteins may also stem from spurious interaction data or ideosyncracies of the analysis. As a monomer, for example, the actin protein Act1p plays a role in gene expression as a catalytic part of the chromatin‐modifying machineries INO80 (Shen et al, 2003), Swr1 (Mizuguchi et al, 2004), and NuA4 (Galarneau et al, 2000). In our analysis, the elongation and chromatin remodeling‐annotated cluster #20 contains 49 proteins, which all bind actin according to at least one of the protein interaction data sets used in this study. The cluster additionally contains extranuclear proteins such as cofilin, twinfilin, and myosin, which interact with actin in the context of its ubiquitous role as a structural polymer. The assignment of these actin polymer‐related proteins to cluster #20 and the significant coupling links to the actin protein that they define result from the disparate functions of actin in various contexts, a critical difference that cannot be captured by our model.
Despite these limitations, we were able to confirm known coupling in yeast, such as that of transcription and RNA processing with export. Our results also predicted highly significant connections among the processes of transcription, translation, mRNA export, and NMD, providing a computational blueprint for composition and organization of machinery in the gene expression pathway. In addition, we used new, independently generated biological data to verify both the formation of functional clusters in the network and the identification of significant links between them. Thus, the key product of this automated and objective coupling analysis is a set of high‐confidence predictions that provide a prioritized agenda for experimental validation.
New data sets and annotations can be readily incorporated into our analysis to allow ongoing improvements in the precision and accuracy of the coupled cluster map and predictions derived from it. Recent studies of protein interaction networks indicate a high degree of conservation in network structures that are not apparent from other genome features (Sharan et al, 2005), suggesting that our results in yeast may be extrapolated to other organisms. Above all, our methods are general and may be applied directly to data from studies on other organisms, as well as to study cellular pathways and processes in addition to the gene expression pathway.
Materials and methods
We measured pairwise data set overlap by defining M(g, h) as the percentage of data set g covered by data set h, or zero where g=h. RDQ values are given by the entries of the vector XR, given by solving the equation MXR=λRXR, where λR and XR are the dominant eigenvalue and the corresponding (principal) eigenvector, respectively, of M (Supplementary information 5).
Our formula measures the sum total weights of edges to each other and to neighbors in common (to account for false negatives), normalized by the sum total weights of all edges containing either protein (to compensate for false positives) (Supplementary information 8).
k‐means network clustering
k initial centroids were randomly chosen. Each node was assigned to the centroid to which it had the greatest link weight, and each centroid was chosen to maximize the cluster score. The total network score was defined as the sum of cluster scores
where k is the number of clusters, ri, ci, and nij are the size, centroid, and the jth member, respectively, of the ith cluster, and CC(x,y) gives the pairwise CC between nodes x and y. Whenever a node has a CC score of zero with all of the current centroids, it is assigned to the centroid separated from it by the shortest path along links in the interaction network, determined using Dijkstra's algorithm (Cormen, 2001). Iterations of reassigning nodes to centroids and reassigning centroids to clusters were performed until convergence of the total network score. Convergence of total network score to within four digits was used in place of absolute convergence. The clustering process was repeated 70 times and the resulting clustering with highest total network score was selected.
Functional enrichment of clusters
The hypergeometric P‐value (Tavazoie et al, 1999; Wu et al, 2002) was used to quantify the enrichment of each cluster in proteins annotated to categories corresponding to either the subprocesses of gene expression or GO‐Slim categories. The P‐value was calculated as follows:
where G is the total number of proteins in the network, C the number of proteins in the cluster, n the number of proteins annotated to the tested category, and k the number of proteins in the cluster annotated to that subprocess. For each cluster, the expected value of the number of proteins in each category was calculated as well. When the actual number of proteins in a category was less than the expected value, the calculated P‐value instead reflected the statistically significant lack of enrichment, and was replaced by 1–P to correct for this fact.
The cluster separability score was defined as
for a pair of clusters i and j, where si, ci, and nir are the size, centroid, and the rth member, respectively, of the ith cluster, and l is the cluster formed by merging clusters i and j and finding its new centroid.
We thank Natalie Thompson and Thanuja Premwardena for extraction of validation data, Naoko Tanese for comments, George Church, Aviv Regev, and June Oshiro for helpful discussion, and Guocheng Yuan, Nicole Francis, and Barbara Wold for critical reading of the manuscript. We also thank the anonymous reviewers at Nature MSB for valuable and constructive critiques. We thank the Bauer Center for Genomics Research for its generous support of LFW, SJA, and KM.
KM, SJA, LFW, and TM designed the study and analyzed the results. KM and MS implemented the computational framework. NJK, AE, and JFG supplied the validation data. KM, SJA, LFW, and TM wrote the paper.
Supplementary Materials [msb4100045-sup-0001.pdf]
Supplementary Table S12 as Excel file [msb4100045-sup-0002.xls]
Supplementary Table S11 [msb4100045-sup-0003.xls]
Supplementary Table S12 [msb4100045-sup-0004.htm]
Supplementary Table S5 [msb4100045-sup-0005.htm]
- Copyright © 2006 EMBO and Nature Publishing Group