Transcription regulation has been responsible for organismal complexity and diversity in the course of biological evolution and adaptation, and it is determined largely by the context‐dependent behavior of cis‐regulatory elements (CREs). Therefore, understanding principles underlying CRE behavior in regulating transcription constitutes a fundamental objective of quantitative biology, yet these remain poorly understood. Here we present a deterministic mathematical strategy, the motif expression decomposition (MED) method, for deriving principles of transcription regulation at the single‐gene resolution level. MED operates on all genes in a genome without requiring any a priori knowledge of gene cluster membership, or manual tuning of parameters. Applying MED to Saccharomyces cerevisiae transcriptional networks, we identified four functions describing four different ways that CREs can quantitatively affect gene expression levels. These functions, three of which have extrema in different positions in the gene promoter (short‐, mid‐, and long‐range) whereas the other depends on the motif orientation, are validated by expression data. We illustrate how nature could use these principles as an additional dimension to amplify the combinatorial power of a small set of CREs in regulating transcription.
Transcription regulation plays a critical role in the development, complexity, diversity, and homeostasis of all living organisms (Davidson, 2001; Levine and Tjian, 2003), as transcription is the first step in the universal pipeline of biological information flow from genome, where all genetic programs are stored, to proteome, through which these programs are executed. As a result, understanding the principles underlying the quantitative control of the transcriptional process constitutes a fundamental objective of quantitative biology, yet these remain poorly understood. Although transcription can be controlled at different levels (e.g., chromatin structure level), at the most fundamental level, first discovered by Jacob and Monod (1961), the production of transcripts of a given gene is determined by complex combinatorial interplay of cis‐regulatory elements (CRE) (or motifs) present in the gene's promoter region and associated regulatory proteins (transcription factors (TFs)) present in the cell. Therefore, because TFs are gene products themselves, transcription of a gene is fundamentally regulated by the set of motifs present in its promoter. The principles that govern transcription regulation can thus be defined by a quantitative description of how motif strength—that is, the motif's influence on gene expression—depends on promoter context.
In spite of major efforts aimed at identifying motifs in different species using a variety of approaches and analyzing their precise influence on gene expression (McGuire and Church, 2000; McGuire et al, 2000; Stormo, 2000; Bussemaker et al, 2001; Pilpel et al, 2001; Sudarsanam et al, 2002; Guhathakurta et al, 2002a, 2002b; Beer and Tavazoie, 2004; Siggia, 2005; Tompa et al, 2005; Xie et al, 2005), little is known about the principles by which a gene's motifs translate into an expression level. In other words, quantitative effects of motifs on gene expression as a function of their promoter context remain poorly understood. Here we present a deterministic mathematical strategy, the motif expression decomposition (MED) method, that provides a framework for deriving principles of transcription regulation at the single gene level. The main feature of the MED method is that, unlike other methods used to measure the effect of motifs on gene expression (Pilpel et al, 2001; Beer and Tavazoie, 2004), MED provides a metric to infer from genome‐wide expression data both the context‐dependent influence of each motif on gene expression and the levels of activity of each TF under a set of environmental conditions. In addition, it operates on all genes in a genome without requiring any a priori knowledge of gene cluster/module membership or manual tuning of parameters.
Applying MED to yeast Saccharomyces cerevisiae transcriptional networks using a combined gene expression data set covering 255 conditions involving different environmental stresses (Gasch et al, 2000) and multiple stages of the cell cycle (Spellman et al, 1998), we found that motif strength can have a complex dependence on the motif's geometry—one of the attributes of promoter context—such as distance from the translation start site or motif orientation. We identify four classes of regulatory principles, all of which were validated by expression data (Figure 3). These are short‐range, mid‐range, long‐range, and orientation‐dependent effects of motifs with respect to gene expression levels. In addition, we illustrate how evolution can use these principles as an additional dimension to amplify the combinatorial power of a small set of CREs in regulating transcription (Figure 4) using the PAC and RRPE motifs as an example.
We present a deterministic mathematical framework for deriving principles of transcription regulation
We identify four classes of regulatory principles, all of which were validated by expression data; and they are short‐range, mid‐range, long‐range, and orientation‐dependent effects of motifs on gene expression levels.
We illustrate how evolution can use these principles as an additional dimension to amplify the combinatorial power of a small set of CREs in regulating transcription using the PAC and RRPE motifs as an example.
Transcription is the first step in the universal pipeline of the biological information flow from genome to proteome. Accordingly, the regulation of transcription is critical for the development, complexity, and homeostasis of all living organisms (Davidson, 2001; Levine and Tjian, 2003). Although transcription can be regulated at different levels (e.g., chromatin structure level), one fundamental level, first discovered by Jacob and Monod (Jacob and Monod, 1961), is that the production of transcripts of a given gene is governed by a complex combinatorial interplay of cis‐regulatory elements (CREs) (henceforth referred to as motifs) present in the gene's promoter region, and associated transcription factors (henceforth referred to as regulators) present in the cellular environment. Because regulators are gene products, their productions in principle are also controlled by motifs. Therefore, transcription of a gene is fundamentally regulated by the motif set present in such gene's promoter, acting as the gene's condition‐independent signal receivers, and the set of functions describing the dependency of motif strength—the quantitative level of motif's influence on gene expression–on promoter context constitutes the set of principles of transcription regulation.
Major efforts have been made in identifying motifs in different species using a variety of approaches (McGuire and Church, 2000; McGuire et al, 2000; Guhathakurta et al, 2002a, 2002b; Siggia, 2005; Tompa et al, 2005; Xie et al, 2005). Of those organisms, the yeast Saccharomyces cerevisiae has gained the most attention owing to the availability of multiple yeast genomes and high‐quality mRNA. In fact, many methods developed for finding motifs and determining condition‐dependent motif (or associated transcription factor) activity have used yeast as the model organism (Roth et al, 1998; Tavazoie et al, 1999; Bussemaker et al, 2000, 2001; Hughes et al, 2000; Wang et al, 2002; Conlon et al, 2003; Liao et al, 2003; Segal et al, 2003; Gao et al, 2004; Pritsker et al, 2004; Tompa et al, 2005) (also see Siggia, 2005 for a more complete list of references). However, less attention has been paid to the effects of motifs on gene expression as a function of their promoter context, and such effects remain poorly understood. Works by Pilpel et al (2001) and Sudarsanam et al (2002) studied the effect of motif cooccurrence on gene expression by measuring the degree of coexpression within the set of genes containing motif combinations of interest. Although their work could infer the combinatorial effects of motif–motif interactions on gene expression, it did not address how such effect is influenced by other factors that determine the properties of the promoter context such as geometric constraints. A recent study by Beer and Tavazoie (2004) began to take geometric features into account by way of a Bayesian network model of yeast expression profiles in order to learn the effect of motif position and orientation on gene expression. Although this later approach works quite well, it does not consider the individual expression patterns of each single gene, but instead analyzes the expression profiles of gene clusters, a process that can potentially cause loss of information and may not be suitable for modeling genes in the genome that do not belong to any well‐defined cluster. Because the common assumption underlying these works is that coexpression implies coregulation, these approaches are limited by the need to detect motif influence from statistically aggregated expression data rather than from individual genes, and this typically restricts their application to subsets of genes with large gene expression signals, or those in predefined clusters, or with specific promoter properties. Furthermore, although metrics for measuring the degree of gene coexpression using expression coherence (Pilpel et al, 2001; Sudarsanam et al, 2002) or average of pairwise correlation (Beer and Tavazoie, 2004) employed in these works can infer the effects of motifs on gene expression well, such metrics do not provide a direct quantitative measure of motif influence on gene expression.
In this article, we present a deterministic mathematical strategy, the motif expression decomposition (MED) formalism, whose framework provides just such a quantitative measure—motif strength. MED operates on all genes in the genome of a particular organism under consideration, and assigns a strength to each motif in the promoter of each individual gene, without depending on averaging or clustering of gene expression profiles. Motif strength as a function of promoter context can then be derived using the concept of gene ensemble and gene ensemble instance illustrated in Figure 1 and discussed below. To demonstrate the method, we applied MED to the yeast S. cerevisiae transcriptional networks. We identified four functions describing four different ways that motifs can quantitatively affect gene expression levels, and validated these predicted functions by expression data. We will show examples where the computed measure of motif strength can be used to dissect the appearance of motif synergy in the yeast S. cerevisiae transcriptional networks.
Results and discussion
The MED computational framework for deriving principles of transcription regulation
From the physical standpoint, the effect of a given motif on gene expression—motif strength—must depend on its context such as its exact sequence, geometry (i.e. location or orientation), and cooccurrence with other motifs, simply because these parameters underlie the physical nature of the complex combinatorial interactions between motifs and regulators at the atomistic level for regulating transcription. Similar to the concept of the potential of mean force in statistical mechanics (McCammon and Harvey, 1987), each of these attributes of the motif context can be considered as a reaction coordinate along which the observed motif strength—a multivariable function—can be projected on. To this end, we propose the concept of gene ensemble and gene ensemble instance (Figure 1) as a way of describing quantitatively the relationship between motif strength and its context. A gene ensemble is defined as a collection of genes containing a specific motif set of interest, whereas one of its instances comprises the subset of genes in such collection containing the motif set that fits a specific promoter context, which can be motif's geometry, sequence, multiplicity, cooccurrence with other motif set, etc., or combination of these. Within this conceptual framework, a function representing the dependency of motif strength on its context in the promoter can then be readily established from the average motif strength of each gene ensemble instance a posteriori from the motif strength derivation process. To calculate the strength of each motif in each individual gene promoter, we determine the extent to which each motif contributes to the expression level of each gene it regulates using equation (1), which reflects Jacob and Monod's fundamental transcriptional model, without assuming motif context a priori, and the by means of a matrix decomposition technique. These two steps together constitute the framework of the MED formalism. The detailed description of the MED method is presented in the Materials and methods section.
Transcriptional regulatory principles derived from S. cerevisiae transcriptional networks
We applied MED to yeast S. cerevisiae transcriptional networks with a combined gene expression data set covering 255 conditions involving different environmental stresses (Gasch et al, 2000) and multiple stages of the cell cycle (Spellman et al, 1998). We used crossvalidation (see Materials and methods) as an unbiased way to measure MED's ability to fit the biological data contained in the data set. We obtained an average correlation coefficient of 0.52 (Figure 2 , blue diamond) between predicted and actual expression for all 5719 genes (Figure 2, blue curve). To put this number into perspective, a previous study using a Bayesian network analysis on the same expression data set reported an average correlation coefficient of 0.51 on a subset of 2587 genes in 49 expression clusters (Beer and Tavazoie, 2004). On this same gene subset, MED achieves an average correlation coefficient of 0.72 (Figure 2, red curve and red circle). However, direct comparison is complicated by the fact that, unlike MED, Beer and Tavazoie do not reconstruct individual gene expression patterns, but rather consider only the profiles of 49 gene clusters, and then assess the correlations between genes' actual cluster profiles and the cluster profiles predicted by their Bayesian network. Nevertheless, this latter result clearly shows that the more fine‐grained MED approach does achieve a good fit to the expression data without overfitting (see Materials and methods section). Furthermore, we also compared MED to the multiple regression method (Bussemaker et al, 2001; Beer and Tavazoie, 2004) in a similar manner as above. We obtained the corresponding average correlation of 0.14 for all 5719 genes and 0.22 for 2587 genes. MED's better performance in the latter comparison is expected, as our model introduces variables (which, when solved, correspond to our motif strengths) where the multiple regression method uses constants (i.e. number of motif instances), and thus MED should generate better fit against gene expression profiles.
To demonstrate the proof‐of‐concept that MED is capable of deriving principles of transcription regulation, in this study we chose to focus primarily on motif position and orientation with respect to the start codon, two geometric constraints known to play a role in gene coexpression (Beer and Tavazoie, 2004). We found that motifs do not always have the same level of influence on the gene expression simply owing to their presence in the gene promoter, nor exert the largest influence on the gene expression when they are near the start codon in yeast, but rather follow a function of a complex shape. Here we illustrate existence of four functions of motif strength (Figure 3 ), distinguishing themselves by their extrema with respect to motif position and orientation, and describing four different ways that motifs can affect gene expression levels given their geometric context. The first function describing the dependency of motif strength on motif position has the maximum when the motif is within 150 bp from the start codon (Figure 3A and B), henceforth referred to as the so‐called short‐range type. The second (Figure 3C and D) and third (Figure 3E and F) functions, also describing the dependency of motif strength on motif position, have the maximum when the motif is spaced at an intermediate (150–300 bp) or longer (300–450 bp) distance from the start codon, and henceforth referred to as the so‐called mid‐range and long‐range types, respectively. Unlike the first three functions, the fourth function (Figure 3E and F) describes the dependency of motif strength on motif's relative orientation with respect to the start codon rather than the position, henceforth referred to as the so‐called orientation‐dependent type. In order to derive this kind of function, one needs to establish the functions of motif strength for a given motif's both orientations.
The PAC and RRPE motifs (Tavazoie et al, 1999; Hughes et al, 2000), which are found in promoters of genes encoding ribosomal proteins, are examples of the short‐range motifs (Figure 3A). Both of these motifs exhibit significantly higher average motif strength within 150 bp of the start codon than would be expected from randomized expression data (PShuffling≪0.01), and significantly higher than at positions further upstream (PWilcoxon<10−16 for PAC, <1.83 × 10−5 for RRPE) (see Materials and methods section for definitions of PShuffling and PWilcoxon). To validate this form of regulatory principle, we computed the corresponding function describing the dependency of the degree of gene coexpression, as measured by the average pairwise expression correlation (Beer and Tavazoie, 2004) (e.g., average of expression correlation coefficients of all gene pairs in a given gene set), on motif position for the same set of gene ensemble instances (Figure 3B) using the expression data. The comparison between these two functions shows that MED's prediction of regulatory principles for these short‐range motifs agrees very well with the experimental data, despite some discrepancy while PAC is far from the start codon. As shown in Figure 3A and B, the PAC motif retains its strength at such a far distance, whereas the degree of gene coexpression of such PAC‐containing ensemble instances becomes insignificant. However, our further analysis shows that the loss of correlation in such PAC‐containing gene ensemble instances arises because the genes split into clusters with anti‐correlated expression profiles, whose average correlation is therefore close to zero (see Supplementary information 1). This illustrates how MED's analysis of motif strength reveals different information about gene expression than can be obtained from average correlations.
The MCB motif (Koch et al, 1993), which plays a role in DNA synthesis and replication during the S1 phase of the cell cycle, is an example of the mid‐range motif (Figure 3C). Unlike the PAC and RRPE motifs, this motif achieves the greatest strength when it is further upstream, spaced between 150 and 300 bp (PWilcoxon<1.15 × 10−4; PShuffling≪0.01). Furthermore, the MCB motif also exhibits a small degree of orientational effect around the position of its maximum strength; hence, it may also weakly belong to the orientation‐dependent motif type (PWilcoxon<0.1). These predicted forms of regulatory principles are validated by expression data (Figure 3D) in the similar manner as being carried out for the PAC/RRPE motifs. Note that although the core of the MCB motif (ACGCGT) is invariant and palindromic, its full sequence we used in this work is not, as there is a slight non‐palindromic signature in its flanking bases (see Figure 3C for sequence logo), and individual instances of this motif often deviate from the palindrome. For example, MCB exact sequences like AGACGCGTAA, CAACGCGTAA, and CGACGCGTAA, which have the top ScanACE scores (16.17, 15.93, and 15.82, respectively), are clearly not palindromic. Therefore, MED's ability to distinguish MCB orientation‐dependent behavior is entirely due to the non‐palindromic signature induced by flanking bases, so that any orientation effect detected by MED suggests a possible role for these flanking bases in motif function.
Finally, the RAP1 motif (Lascaris et al, 1999), which controls the production of ribosomal proteins, is an example of the long‐range and orientation‐dependent motif (Figure 3E). Unlike the MCB motif, RAP1 acquires the largest strength when it is even further upstream, spaced between 300 and 450 bp (PWilcoxon<7.90 × 10−7; PShuffling≪0.01). In addition, it has a clear preferential orientation for regulating gene expression almost over the entire promoter length (PWilcoxon<4.84 × 10−8). As with the PAC, RRPE, and MCB motifs, these predicted forms of regulatory principles by RAP1 are also validated by expression data (Figure 3F).
To cope with both a myriad of environmental conditions and the internal complexity of cellular functions, eukaryotes are known to employ combinatorial strategies to generate a variety of expression patterns from a relatively small set of regulatory motifs (Kellis et al, 2003; Levine and Tjian, 2003). The combinatorial potential has been understood primarily in terms of motif cooccurrence and synergy. However, the transcriptional regulatory principles described here suggest several avenues of research into how nature may also exploit motif geometry as another dimension of combinatorial power for regulating transcription. For instance, given the observation that PAC motif strength varies along the length of the promoter, we foresee an experiment that explores the effect of PAC motif location on a reporter gene in relation to the hypothesis that reporter expression level should vary as indicated in Figure 3A in conditions for which the PAC‐binding protein is predicted to be active. It would also be of great interest to look for evidence that shifts of PAC motif location have actually been selected over the course of evolution. While one possibility is to look for PAC location shifts in gene promoters in related yeast strains, this is complicated by the difficulty of assessing the functional significance of any observed shifts, as individual genes may have different requirements for control by PAC that themselves changed over evolutionary history, while the PAC transcriptional regulatory principle of Figure 3A may have changed as well. Therefore, a clearer initial (if indirect) path towards exploring the possible evolutionary significance of motif location might be to perform a high‐throughput competitive growth experiment (Winzeler et al, 1999; Giaever et al, 2002) looking for differential fitness effects of large numbers of engineered mutations in PAC location across many PAC‐containing genes. Finally, it would be of great interest to elucidate genetic and biochemical mechanisms by which motif geometry controls transcription, for instance, by identifying protein domains or cofactors that might be responsible for establishing the distance at which a motif has maximal effect. In some cases, however, including PAC, this may need to await identification of the proteins that bind these motifs.
To further illustrate the potential use of motif geometry by nature for regulating transcription, we use MED to dissect the appearance of synergism between the PAC and RRPE motifs (Pilpel et al, 2001; Sudarsanam et al, 2002; Beer and Tavazoie, 2004). This notion of synergism was based on the higher coherence of gene expression in the gene ensemble containing both, compared to the PAC‐only and RRPE‐only ensembles (Supplementary information 2). However, MED analysis, as shown in Figure 4A , shows a surprising fact that in the PAC/RRPE‐containing gene ensemble, whereas the strength of PAC decreases with distance from the start codon, the strength of RRPE is close to zero at every distance. Analysis of variance (ANOVA) (see Materials and methods) of PAC motif strengths obtained from PAC/RRPE‐containing gene ensemble supports these observations, finding a significant main effect of PAC position on PAC motif strength (PANOVA‐PAC=2.6 × 10−5), no effect from RRPE position (PANOVA‐RRPE=0.96), and no interaction (PANOVA‐PAC‐‐RRPE=0.52). A similar analysis of RRPE strengths finds insignificant or marginal effects (PANOVA‐PAC=0.19, PANOVA‐RRPE=0.1, and PANOVA‐PAC–RRPE=0.18). These findings indicate that there is no actual synergistic effect between the two motifs, an analysis consistent with the reported inability by Das et al (2004) to detect PAC–RRPE interaction in a cell cycle data set also used in this work. Instead, the appearance of synergism in expression level is sufficiently explained by the proximity of the PAC motif to the start codon (Figure 4 and Table 1) in the PAC/RRPE‐containing gene ensemble, and does not require invoking any functional interaction. Similarly, proximity of the PAC motif to the start codon also explains an apparent synergy in the order of the two motifs (Figure 4B and D). Even for the motif arrangement that favors the closeness of PAC to the start codon (an arrangement associated with a high degree of gene coexpression), MED analysis shows that RRPE possesses no significant strength (Figure 4B). As for why the RRPE motif behaves almost the same way as PAC in terms of contributing its influence to the gene expression in general (Figure 3A) but does not have a significant role on the expression of PAC–RRPE‐containing gene ensemble, recent work by Tanay et al (2005) observes that the PAC motif is relatively younger than the RRPE motif and also provides evidence for mechanisms by which one motif can replace another by passing through an intermediate stage in which both are present in gene promoters. We hypothesize that the PAC motif may have been evolved to be better suitable for S. cerevisiae than the RRPE motif, and to smoothly assume RRPE's functional role in genes containing both of them in this way. These two examples clearly show that nature could use motif geometry as an additional dimension to regulate transcription.
We have demonstrated a novel mathematical strategy for deciphering principles of transcription regulation using S. cerevisiae as a model system. We identify four regulatory principles that motifs obey in order to regulate transcription. These principles reveal the complexity of how a motif can exert its influence on gene expression beyond its mere presence, absence, or closeness to the start codon. In addition, we have also illustrated an example showing how nature could exploit geometry as another means for regulating transcription, hence increasing the combinatorial power of a relatively small set of motifs. With the emergence of new research paradigms in modern biology, where the process of biological research begins with a system‐level theoretical prediction followed by experimental validation (Gilbert, 1991), we believe that MED can play an important role in fostering the development of biological theory necessary for explaining how regulatory motifs can control transcription. Furthermore, with the technology recently available to allow the high‐throughput synthesis of oligomers (Tian et al, 2004), we foresee a new research direction aiming at engineering new and improved biological systems with desired properties. To this end, we believe that MED can be a valuable tool for such bioengineering process by providing necessary knowledge and parameters regarding motif's behaviors.
Materials and methods
The MED method
MED is composed of two main steps. In the first step, each individual gene is analyzed for the strength of each motif in its promoter without taking into account any information about motif's context. The way in which such motif strength is derived in MED is based on the Jacob and Monod's model of transcription (Jacob and Monod, 1961), whereby the log ratio expression level of a gene is a function of a motif set present in its promoter and regulators' activities in the cellular environment (see equation (1)). The outcome of this step consists of two matrices: a matrix of motif strength, where each element represents the condition‐independent strength of each motif in each gene promoter; and a matrix of regulator activity, where each element represents the global proxy activity of each regulator under a particular environmental condition. In the second step (the regulatory rule deduction step), regulatory principles are derived from the matrix of motif strength using the gene ensemble concept as illustrated in Figure 1.
Step 1: Derivation of motif strength.
For a given gene g, let Ωg be a set of motifs occurring in its promoter; then its log ratio expression level Egc in a specific environmental condition c can be approximated using the following :
where Mgj represents strength of motif jth on the expression level of gene g and Ajc represents a global proxy for the regulator activity associated with motif jth under condition c. Unlike previous works (Bussemaker et al, 2001; Gao et al, 2004), where the matrix element Mgj is a known constant and equal to the number of instances that motif jth occurs in the promoter of gene g or ChIP log ratio for transcription factor jth binding to the promoter of gene g, MED optimizes both Mgj and Ajc to best fit the expression data. Therefore, for all genes and conditions, equation (1) becomes
where E is an m genes by n conditions expression matrix, M is an m genes by k motifs matrix of condition‐independent motif strengths, and A is a k regulators by n conditions matrix of condition‐dependent global proxy activity of regulators for k motifs. Note that if a particular motif jth does not exist in the promoter of gene ith, then the matrix element Mij of the above matrix M is zero and remains so. The problem posed in equation (2) becomes a matrix decomposition problem. This portion of the MED algorithm consists of the procedure for decomposing the data matrix E into a product of matrices M and A uniquely using the motif–gene relationship as constraints (see proof in Supplementary information 4). The procedure we employed here is based on the factor analysis (Anderson, 1984; Gifi, 1990; Paatero et al, 2002; Liao et al, 2003) with Tikhonov regularization (Tikhonov and Arsenin, 1977) imposed on the matrix M to ensure uniqueness. To compute matrices M and A:
a. Initialize the non‐zero elements of the motif matrix M using a weighted sum of the number of motif instances if motifs are represented in the position‐specific weight matrix form, or simply the number of motif occurrence in each gene promoter (see Supplementary information 5 for details).
b. Given E, M, and let E be the product M·A, use least squares (see Supplementary information 6 for exact formula) to find the matrix A, the current global proxy activity of regulators for all motifs with current estimate of matrix M that minimizes
c. Normalize the matrix A in such a way that each row has unit norm.
d. Given E and A computed in (c), find the optimal strength Mgj for each motif jth in the promoter of each gene g that minimizes
for each gene g=1, …, m. Alternatively, note if a predefined target Mgi* for the strength of motif jth in the promoter of gene g is known a priori, one may wish to use the following instead:
e. Repeat step (b) with the newly computed matrix M until convergence condition is met.
In the above algorithm, steps (b) to (d) are sufficient to ensure a unique solution M and A from the expression matrix E (see proof in Supplementary information 4). In equations (4) and (5), the second term is critical for producing a unique matrix M regardless of the linear dependency or near linear dependency of the rows of matrix A (see proof in Supplementary information 4). It can also be used to constrain the strength of motif jth in the promoter of gene g to a predefined value Mgi* if such value is known a priori. Although the parameter λ can be chosen using more sophisticated methods (Shock, 1984; Engl and Neubauer, 1985; Guacaneme, 1988; Wahba, 1990), in this work it is chosen in such a way that it does not noticeably affect the test error computed from crossvalidation (Supplementary information 7). We used equation (4) to compute the motif strength and λ was set to a scalar value of 10−4, although it can be a vector quantity in general for weighting motifs in different gene promoters differently. The convergence criterion used in this work is the total variance of the residual matrix defined in Supplementary information 7. Note that, as each motif has its own binding strength to regulators and hence having its own scale of influence on gene expression, only relative motif strengths of the same motif across different instances of gene ensemble are meaningful for comparison purposes. Finally, equation (1) can be extended to include the nonlinear term accounting for the motif–motif interactions (Supplementary information 8) and the MED formalism shown above can still be applied transparently.
Step 2: Deduction of regulatory principles.
We construct the gene ensemble containing a specific motif set of interest, partition this ensemble into instances based on the specific promoter properties of interest, and calculate the average motif strength and standard error across these instances (Figure 1) using motif strength data obtained from the previous step. Regulatory principles can then be derived from the relationship between motif strength and its context in the promoters (or constraints). Apart from the geometric constraints illustrated in this work, other constraints could include motif multiplicity (number of motif instances in a promoter), spacing, exact motif sequence, motif–motif cooccurrence, or any combination of these. Note that as the space between transcription start site and translation start site is usually fixed in yeast S. cerevisiae (Hurowitz and Brown, 2003), it is equally good to choose either one of them as the origin for geometric constraints. For convenience, we chose the latter. As for the discretization of the promoter length into bins for projecting motif strength in deriving distance‐based regulatory principles, the choice of bin size (i.e. how many base pairs in each bin) is a non‐trivial task. A large bin size will effectively bury all important signals, whereas a small bin size will allow noise to be manifest. Therefore, the goal in choosing a good bin size should be to choose the one that maximizes extractable signals contained in the data set as possible whereas minimizing noise. In this work, we used a bin size of 150 bp, which seems to be an optimal one, for deriving data presented in Figure 3 and the PWilcoxon values confirm our choice of the bin size.
We used a combined gene expression data set obtained from environmental stresses (Gasch et al, 2000) and cell cycle (Spellman et al, 1998) with a total of 255 conditions. Ideally, we want to use motifs that are derived directly from the ChIP‐chip data without depending on the clustering in the gene expression space (Harbison et al, 2004); however at the time of this work, such data were not available. Therefore, we used 62 DNA regulatory motifs, represented as position‐specific weight matrices, that were generated using literature (37 motifs) and the multiple sequence alignment program AlignACE (Roth et al, 1998) (25 motifs) as described previously (Roth et al, 1998; Hughes et al, 2000; Pilpel et al, 2001). We used ScanACE (Roth et al, 1998; Hughes et al, 2000; Pilpel et al, 2001) to find motif occurrences in promoter regions up to 1000 bp upstream. The expression data matrix E was centralized to remove column and row means.
To analyze the performance of MED, we used crossvalidation, in which we partitioned the expression data matrix into 100 blocks, each of which consists of 20% of random genes and 5% of random conditions (of these genes). For each run, we left out one of these blocks and trained the model on the remaining data. This allowed us to use gene expression data on all 255 conditions (but only across 80% of the genes) in order to compute matrix A in step (b) of the MED algorithm, and likewise, to use information on all the genes (but only across 95% of the conditions) to compute the motif matrix M in step (d) of the MED algorithm. Upon convergence, we then used the resulting matrices M and A to predict gene expression of the block of 20% genes and 5% condition the model has not been trained on. This process was repeated for each of the 100 blocks, each time predicting expression on the block of data that was left out, in order to obtain a complete expression matrix, each element was predicted by this crossvalidation scheme. The result presented in Figure 2 was computed by plotting the histogram of correlation coefficients between predicted and actual expression. Although we have a large number of parameters (i.e. the total number of non‐zero elements in matrices M and A), we still have roughly 40 times more data points in our data set, and at each step of the algorithm we only fit a small number of parameters. In addition, crossvalidation ensures that the model performance is always tested on data that were not used to train the model. We also repeated the whole crossvalidation procedure as outlined above 10 times. Each time, all the rows of the input data matrix E were randomly permuted. We obtained the average and standard deviation (s.d.) of these 10 average correlation coefficients of 0.0014 and 0.0057 for all 5719 genes, respectively, making the average correlation coefficient derived from real data about 91 s.d. away. For the subset of 2587 genes early work used (Beer and Tavazoie, 2004), we obtained the corresponding average and s.d. of −0.0012 and 0.0101, respectively, making the corresponding average correlation coefficient derived from real data about 71 s.d. away. These results are shown as the black square in Figure 2. Therefore, the results obtained from crossvalidation to measure MED's predictive power are without the risk of overfitting to the training data.
To further ensure that the type of each motif presented in this work is statistically significant in addition to the degree of gene coexpression, we performed two additional statistical tests: one is the Wilcoxon rank sum test (Wilcoxon, 1945; Lehmann, 1975) and the other is from the 100 random shuffling of complete gene expression profiles. In the Wilcoxon test, we determined if the motif strength at the position of extremum is statistically different from the motif strengths elsewhere. For the MCB and RAP1 motifs, we also determined if the strength of a motif oriented along one direction is statistically different compared to that of reversed direction. The level of statistical significance in the Wilcoxon rank sum test is measured by the Wilcoxon P‐value (PWilcoxon). In the random shuffling test, we permuted all elements of the expression matrix E 100 times, generating 100 expression matrices Ei, i=1, …, 100, for computing the strengths of each motif presented in this work. In this test, we determined if the strength of a motif at a particular promoter position obtained from the actual expression data is statistically more significant than the corresponding one derived from the random shuffling of expression data. The level of statistical significance in this test is measured by the P‐value (Pshuffling): the fraction of motif strengths obtained from the random shuffling of expression data larger than the corresponding one obtained from the actual expression data. Note that, as there are 100 random shuffling runs, the smallest P‐value, Pshuffling, attainable in this test is 0.01 if no assumption is made about the distribution of motif strengths derived from the randomly shuffling of expression data. However, as shown in Supplementary information 10 and Supplementary Figure SF6a–c, the P‐values for these observed motif strengths can be much smaller than 0.01 owing to the Chebyshev's inequality (PChebyshev) (Abramowitz and Stegun, 1972), as the computed motif strengths of the PAC, RRPE, MCB, and RAP1 motifs at the promoter location of their extremum derived from the actual data are far away from the mean of the distribution of the corresponding ones derived from the random shuffling of expression data (by at least 28 s.d.).
We also performed two‐way ANOVA on PAC and RRPE motif strengths derived from the PAC/RRPE‐containing gene ensemble using the MatLab's anovan command (MatLab) using the model=‘full’ and default ss‐type parameters. Two factors were specified, PAC distance from start codon, whose P‐value is denoted as PANOVA‐PAC, and RRPE distance from start codon, whose P‐value is denoted as PANOVA‐RRPE, where each consisted of three levels corresponding to the distance bins in Figure 4. The P‐value for the interaction between these two factors is denoted as PANOVA‐PAC‐‐RRPE.
Competing interest statement
The authors declare that they have no competing financial interests.
We thank Nikos Reppas, Zhou Zhu, Xiaoxia Lin, Dana Pe'er, Saeed Tavazoie, Eric Siggia, and Joel Bader for critical reading of the manuscript. We thank John Aach for critical reading of the manuscript and useful suggestions on statistical tests. We are indebted to George M Church for his guidance and support of this work. Dat H Nguyen acknowledges support from the Alfred P Sloan and US Department of Energy Postdoctoral Fellowship in Computational Molecular Biology and Bioinformatics, and travel fellowships provided by the National Science Foundation Institute for Pure and Applied Mathematics at UCLA. George M Church was supported by US Department of Energy GTL Grant No. DE‐FG02 02ER63461. PD was supported by PhRMA/Harvard CEIGI grant, and is currently supported by an LDRD grant at Lawrence Livermore National Laboratory.
Supplementary Material [msb4100054-sup-0001.pdf]
Supplementary Material [msb4100054-sup-0002.zip]
- Copyright © 2006 EMBO and Nature Publishing Group