The complementarity of gene expression and protein–DNA interaction data led to several successful models of biological systems. However, recent studies in multiple species raise doubts about the relationship between these two datasets. These studies show that the overwhelming majority of genes bound by a particular transcription factor (TF) are not affected when that factor is knocked out. Here, we show that this surprising result can be partially explained by considering the broader cellular context in which TFs operate. Factors whose functions are not backed up by redundant paralogs show a fourfold increase in the agreement between their bound targets and the expression levels of those targets. In addition, we show that incorporating protein interaction networks provides physical explanations for knockout effects. New double knockout experiments support our conclusions. Our results highlight the robustness provided by redundant TFs and indicate that in the context of diverse cellular systems, binding is still largely functional.
We show that backup in regulatory networks partially explains why the overwhelming majority of genes bound by a particular transcription factor are not affected when that factor is knocked out.
Factors whose functions are backed up by redundant paralogs show a fourfold decrease in the agreement between their bound targets and the expressions levels of those targets.
Protein interaction networks explain why many of the knockout targets are not physically bound by the knocked out factor.
Double knockout expression experiments show that when both the factor and its paralog are knocked out, the agreement between binding and knockout data increases significantly, supporting our conclusions.
Many successful studies in systems biology focus on integrating complementary datasets to model systems in the cell. Several computational methods have been developed and applied to combine mRNA expression data and protein–DNA interaction data (using DNA‐binding motifs, ChIP‐chip experiments, or both) (Lee et al, 2002; Liao et al, 2003; Beer and Tavazoie, 2004; Yeang et al, 2005; Ernst et al, 2007). These methods assume that transcript levels are largely driven by binding of transcription factors (TFs) to DNA leading to either expression or repression of the bound genes. Indeed, by some estimates close to 60% of binding sites are actively driving expression of their bound genes (Gao et al, 2004).
This assumption was recently challenged by several studies that compared the set of genes bound by a TF with the set of genes affected when that factor is knocked out or knocked down. One of the earliest reports of this phenomenon involved the yeast cell cycle (Horak et al, 2002). Using ChIP‐chip experiments, researchers looked at the set of genes bound by 11 TFs and concluded that complementary knockout experiments did not affect the same set of genes. In mouse, it was reported that only 11% of those genes that were differentially expressed after glucocorticoid dexamethasone injection were also bound by the glucocorticoid receptor (Phuc Le et al, 2005). An estrogen‐response study in human reported that 6% of E2‐induced genes were bound by ERα, and 13% of ERα‐bound genes were regulated by E2 (Kwon et al, 2007). A human study in which p63 was depleted led to similar conclusions (Yang et al, 2006).
Although the above experiments looked at only one, or few, TFs, a recent study in yeast examined the overlap for the entire set of TFs and surprisingly concluded that the overlap was even smaller than the overlaps reported above for individual factors. In a comprehensive analysis of the agreement between binding and knockout experiments, 269 budding yeast TFs were knocked out one at a time (Hu et al, 2007), and the differentially expressed gene targets were compared with the protein–DNA binding data generated previously for 188 of those TFs (Harbison et al, 2004). It was determined that only 3% of bound genes were affected by the knockout, and similarly only 3% of affected genes were bound by the corresponding TF. Although this low overlap is statistically significant, the percentage is still very low. A possible explanation for this small overlap, which was examined by Hu et al, is that indirect regulatory interactions can explain at least one side of this overlap (why genes affected by a knockout are not directly bound by that TF). However, their analysis of these indirect binding effects, in which pathways of protein–DNA binding interactions were allowed as supporting evidence for the knockout effects, resulted in negligible improvements to the overlap and its significance.
To further examine these findings and to determine whether the expression and TF‐gene binding interaction datasets are indeed complementary, we undertook a systems approach by studying the dependence of their agreement on the TFs’ homology relationships and on the protein interaction network context of the TF. As we show, both play a major role in the low overlap. Accounting for these contexts increases both the percentage overlap and its significance, indicating that the difference may be explained by backup mechanisms used when cells lose specific TFs.
Results and discussion
P‐value threshold analysis
In both the binding and knockout studies, a P‐value threshold was used to identify significant genes. We examined the sensitivity of the overlap to these P‐value thresholds by testing P‐value cutoffs from 10−0.05 to 10−10 (see Materials and methods). For each P‐value, we calculated the overlap and its significance using the hypergeometric distribution (Figure 1). The range of P‐value thresholds from 0.008 to 0.001 yielded the most significant overlaps. Thus, we looked more closely at P‐values of 0.001 and 0.005, which have been used in the past for binding and expression data (Lee et al, 2002; Bar‐Joseph et al, 2003). The P‐value threshold of 0.005 generally yields slightly better overlap than the threshold of 0.001, and all overlap values reported hereafter are based on a P‐value threshold of 0.005 unless otherwise noted.
Each of the two papers used a different method to compute the P‐values, which may contribute to their disagreement. To test whether this influenced the results, we computed the overlap between the two datasets based on rankings rather than P‐value cutoffs. We sorted the TF‐gene interactions by P‐value in both datasets and selected the first k (where k ranges from 1 to 1000) from each dataset. The overlap significance peaks when the top 56 interactions per TF are taken to be significant (P‐value of 10−52, Supplementary Figure 1). This significance value is much worse than that of the threshold‐based method (P‐value 10−124), suggesting that the rank‐based method introduces substantial noise because high‐ranking interactions may still have insignificant P‐values for some factors. We also tried several other methods for selecting lists of bound and affected genes but these did not improve the overlap (Supplementary Tables I–III).
The above computations highlight the importance of the results presented by Hu et al, indicating that they cannot be explained by issues related to the analysis of the data but are rather likely to represent specific biological phenomena.
Cleaning the data
To lessen the extent to which experimental and biological noise affected the disagreement between the knockout and binding data, we cleaned the datasets in several ways (Supplementary Tables IV–VI; Supplementary Figures 2 and 3). We first removed genes that were affected by the knockout of a large number of TFs (see Materials and methods). We termed these ‘general KO genes’ because they are likely responding to the general stress of the knockout experiments rather than the specific TF deletions, and thus are not expected to be bound by the deleted TFs. In addition, we restricted the set of TF‐binding targets to those with sequence motifs conserved in two other species (Harbison et al, 2004).
After these cleanup steps, the agreement between the two datasets increases to 6.7% of binding data and 4.5% of knockout data (P‐value of 10−133 versus the original P‐value of 10−114). As above, we examined different P‐value cutoffs for the cleaned dataset and found that the results did not improve when using the stricter threshold of 0.001. Similarly, knowledge regarding the function of the TFs did not improve the overlap. Specifically, we found that TFs that are expected to be active only in YPD had the same average overlap as the entire set of TFs (Supplementary Table VII).
Redundancy explains binding interactions absent from the knockout data
We next tested whether redundancy can help explain the small overlap observed. Following Kafri et al (2005) we used BLASTP to identify gene pairs with varying levels of homology. We divided the set of TFs into four groups: those with a paralogous TF with an E‐value of E‐20 or less, between E‐20 and E‐10, between E‐10 and E‐3 and those with no homolog at E‐3 or less (see Materials and methods). TFs with the most similar paralogs had no overlap between their binding and knockout data. In contrast, those with the least similar paralogs had an overlap of more than 12%, nearly twofolds higher than the average overlap. The other groups followed a similar trend in which the overlap increased as the similarity to the closest paralog decreased (Figure 2).
To further test our finding that redundancy impacts the expression outcome we used Pfam, which focuses on the binding domain only, as a measure of similarity and obtained similar division into four groups. As with the BLASTP value, for groups with similar paralogs, the overlap was lower than for those with more distant homologs (4 versus 10%, Supplementary Figures 4 and 5).
Another component that may impact how well one TF can compensate for the loss of another TF is shared protein–protein interactions (PPIs) (Reguly et al, 2006). We divided each of the homology groups defined above based on the percentage of protein interaction partners the TF shares with another TF in that homology group (see Materials and methods). Similar to the trend we saw using sequence homology, within each group the overlap decreases as the percentage of shared PPIs increases (Figure 2; Table I). For TFs with the least similar homologs and the fewest shared interactions, we observed an overlap greater than 13%.
For all BLASTP E‐value thresholds, TFs that shared a larger portion of PPI with their paralog had lower binding overlap. This indicates that putative paralogs with many common PPI are better able to compensate for the deletion effects (Table I).
Protein interaction networks provide physical support for knockout effects
To help explain the other direction (why genes affected by a knockout are not bound by the TF), we used interaction networks. Recent studies have shown that knockout effects can be mediated by PPI networks as well (Yeang et al, 2005; Workman et al, 2006). Thus, we constructed a network that includes both PPI and protein–DNA edges (Supplementary information). We considered a gene affected by the knockout to be explained by the network if (1) the TF directly binds the gene or (2) there is a path leading from the TF to another TF that directly binds the gene. For the indirect result we vary the maximum path length (number of edges from the initial TF to the last TF). As can be seen in Figure 3, using a path length of 2 leads to an overlap of 22% while significantly increasing the P‐value of the overlap (from 10−133 to 10−211). Path lengths greater than 2 increased the percentage of the overlap but reduced the P‐value indicating that we are likely overfitting the data. Randomization tests and further analysis using different sets of PPI data confirmed the significance of the increase in overlap due to the PPI network (Supplementary Tables VIII and IX; Supplementary Figures 6 and 7).
As a further test, we repeated the PPI analysis using data from expression and binding experiments in human cells studying the TF p63 (Yang et al, 2006). A genome‐wide TF‐gene binding dataset is not available for human TFs so we used an analysis (Xie et al, 2005) to construct an approximate binding dataset for 71 TFs. As with yeast, the overlap greatly increased when using a network with a path length of 2 (P‐values 10−15 and 10−5 compared with 10−12 and 10−2 for the original data, Supplementary Figure 8). Randomization tests of the human PPI network also indicated that the improvement is significant (Supplementary Table X).
To further validate our results regarding the backup mechanisms used in regulatory networks, we collected expression data from three double knockout experiments involving pairs of factors we predicted could compensate for the loss of each other (Fkh1‐Fkh2, Yhp1‐Yox1, Ace2‐Swi5, all from the E‐20 set, Supplementary Table XI). We also carried out new experiments for an additional pair (Pdr1‐Pdr3, also in the E‐20 set). As predicted, when the paralogous partner is not present to compensate for the effect of a single knockout, the overlap of the knockout and binding data increases significantly. For example, the overlap between genes affected by the single knockout of Yhp1 and Yox1, two cell cycle TFs, and the genes bound by these factors is 0 and 1%, respectively (both are not significant). In contrast, the overlap for the double knockout and the binding targets of Yhp1 and Yox1 is 8 and 9%, respectively (P‐values of 10−4 and 10−7.5). Similar results were obtained for the other two double knockouts we collected (Supplementary Tables XII). For our Pdr1‐Pdr3 double knockout experiment, we again observed a large increase in the percentage of overlap for Pdr1 compared with the single knockout experiment. The overlap increased from 1% (not significant) to 19% (P‐value of 10−5). For Pdr3 we saw a large increase in binding percentage (from 0 to 4%) though this overlap is still not significant (Supplementary Table XII). Thus, these experiments support our claim of backup provided by these pairs of factors and can also provide clues to the mechanisms used as we discuss below.
Mechanisms leading to TF redundancy
A subset of the homologous TFs we identified bind to an overlapping group of targets, and thus it is not surprising that knocking out one of them has small effect on the expression of its targets. One such example is the two homologous TFs involved in methionine metabolism, Met31 and Met32 (Blaiseau et al, 1997). These TFs have a large overlapping set of target genes (>60%), and neither has any target genes that are differentially expressed after deletion. Another example is the two forkhead TFs, Fkh1 and Fkh2. These only bind a partially overlapping set of target genes. However, it has been shown (Hollenhorst et al, 2001) that the binding of Fkh1 to Fkh2 targets is enhanced in the absence of Fkh2 and vice versa, suggesting that a compensation can occur beyond the common targets as predicted by our findings.
This type of compensation may happen due to competition between the two TFs that is resolved in the absence of one of them. Another possibility is that the activity of one TF is enhanced in the absence of its homolog due to a feedback mechanism between the two TFs (Kafri et al, 2005). To check this idea, we looked at the expression levels of the TFs believed to be compensating for the knockout (most similar based on BLASTP). As expected, we have not found any example in which the expression level of the homologous TF was significantly decreased (Supplementary Table XIII). However, a significant increase was observed in only a few cases. Thus, it appears that these changes are mainly driven by post‐transcriptional events, perhaps by the protein interaction networks mentioned above.
One of the first studies looking at the overlap between knockout and binding (Horak et al, 2002) hypothesized that redundancy may be part of the disagreement. In that paper, the authors conclude that ‘in some cases targets are not significantly affected, presumably because of transcriptional redundancy’. However, it was hard to substantiate this claim without a comprehensive knockout and expression data. The recent availability of such data allowed us to show that redundancy (both in sequence and in interactions) indeed plays a major role in the disagreement between the two types of data. Our results suggest that paralogs can compensate for the loss of TFs providing backup mechanisms and robustness for eukaryotic cells.
Materials and methods
Naming conventions and gene synonyms
Many of the overlap calculations are sensitive to the manner in which yeast gene names are mapped from their standard name to their systematic name (also known as the ORF name) and vice versa. Although Harbison et al (2004) provides a list of ORF names for the standard names used in their experiments, the gene name mapping is constantly evolving, and their list was out‐of‐date at the time the knockout experiments were run. Therefore, we relied on the SGD (http://www.yeastgenome.org) gene name mapping with the manual addition of 13 retired mappings that appeared in older datasets we used. Any TF targets that could not be mapped to an SGD ORF name were ignored, as were ORFs on the mitochondrial chromosome or the 2‐μm plasmid.
For a given TF t, we define the set of genes significantly bound by t to be GB and the set of genes significantly affected by the knockout of t as GK. The binding overlap B and knockout overlap K are calculated in the following manner:
We use the hypergeometric distribution, also known as the one‐tailed version of Fisher's exact test, to calculate a P‐value for the overlap of the binding and knockout targets:
where is the choose function of n and k, GA is the set of all possible genes targets in the binding or knockout datasets, and o is the size of the overlap. When calculating the P‐value for the entire network of TFs or a subset of TFs sharing some property, we replace GB in the above equations by IB, the set of TF‐gene binding interactions and similarly replace GK with IK, the set of TF‐gene knockout effects.
Varying P‐value thresholds
Data for Figure 1 were generated by considering the set of P‐value thresholds defined by:
At each threshold, the interactions in the binding and knockout datasets with significance less than or equal to the threshold were obtained, and the overlap and its significance were calculated. For each threshold, the expected binding and knockout overlaps were calculated using the formulas:
where th is the threshold, t is a TF in the set of all TFs F, GA is the set of all genes in the binding and knockout datasets, GB(th,t) is the number of genes bound by t at threshold th, and GK(th, t) is the number of genes affected by the knockout of t at threshold th.
To purge the knockout data of instances in which a gene was differentially expressed due to non‐specific effects instead of targeted regulatory mechanisms, we removed genes that were affected by the knockout of a large number of TFs. Specifically, we eliminated 161 ‘general KO genes’ that were differentially expressed in 20 or more TF knockout experiments at a P‐value of 0.001, which we consider to be genes affected by the stress of the deletions. Of the 14 427 knockout interactions at this P‐value, 4920 were removed. As any TF‐gene binding interaction involving one of the removed general KO genes is guaranteed to no longer have a corresponding knockout effect, we removed these same genes from the binding dataset.
The presence of a sequence motif can provide additional evidence that a protein–DNA binding interaction is functional, especially if that motif is conserved in other species. Therefore, to reduce noise we used Harbison et al's alternate version of the binding dataset in which binding interactions not supported by a motif that is conserved in at least two other yeast species have been removed (Harbison et al, 2004). Many of the original 203 TFs do not have a known or conserved motif and were removed from subsequent analysis. Of the 102 remaining TFs, 97 were also knocked out by Hu et al (2007). This version of the binding data contains interactions observed in both YPD and non‐YPD conditions.
To obtain a list of putative paralogs, the FASTA sequences for the 188 TFs present in both the original binding dataset and the knockout dataset were obtained from SGD (http://www.yeastgenome.org). Next, the NCBI netblast program (http://www.ncbi.nlm.nih.gov/blast/download.shtml) was run on the sequences with the parameters ‘‐p blastp ‐d refseq_protein ‐b 200000 ‐u ‘saccharomyces cerevisiae’[Organism]’ to use BLASTP (Altschul et al, 1997) to search the RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq) database for up to 200 000 sequences from Saccharomyces cerevisiae proteins. The default values were used for all other netblast parameters. The netblast results were filtered so that only TF–TF pairs between unique TFs remained, in which a TF was considered to be any of the 284 factors present in the either the binding or the knockout dataset. We post‐processed the pairs of putative paralogs to separate them into distinct subsets so that pairs belonging to a set defined by a particular threshold (e.g. E‐3) did not also appear in the set of pairs defined by a stricter threshold (e.g. E‐10). Any TF that was not a member of the E‐20, E‐10 or E‐3 paralog set was placed in the set of TFs without a paralog. The complete assignment of TFs to paralog sets can be found in Supplementary Table XIV (Excel file).
For the Pfam‐based putative paralogs, we assigned one or more Pfam domains to a TF by aligning each single sequence with a domain from the Pfam database (a library of HMMs, version 22.0) (Finn et al, 2006) using the HMMER software package (2.3.2 release) (Durbin et al, 1998). We only considered those alignments involving Pfam domains classified as DNA‐binding domains. Moreover, Pfam domains were assigned to one of the three different E‐value thresholds, E‐8, E‐3 and E‐1, yielding three sets of putative paralogs and a fourth set of TFs without a paralog. As with the BLASTP‐based paralogs, we removed paralogs that already belonged to a stricter set from the less strict sets so that the subsets contain distinct TFs.
For a pair of TFs T1 and T2, we define the percentage of shared PPI to be
where S is the percentage of shared PPI and PT is the set of PPI partners for TF T. The min(*) function guarantees the resulting percentage of T1's protein interaction partners are shared by T2 and vice versa. For each TF that did not have a putative paralog, we calculated shared PPI using approximate paralogs, which are TF–TF pairs with a BLASTP E‐value greater than E‐3 but less than 10.
For the PPI network analysis, we used two different sets of reported PPIs. The first was a literature‐curated PPI dataset (Reguly et al, 2006) and the second is the BioGRID dataset (version 2.0.48) (Stark et al, 2006) that includes data from high throughput interaction studies. For BioGRID, we removed all genetic interactions as well as those inferred from co‐localization. In the main text, we report the results for the literature‐curated dataset. Results for the BioGRID dataset are reported in the Supplementary information.
RNA extraction and labeling for expression profiling
The double KO strain YYA100 (pdr1Δ∷KanMx6, pdr3Δ∷His3Mx6) was kindly provided by Florian Zwolanek (Schuller et al, 2007). Cultures were grown to OD600 1.0–1.5 and total RNA was extracted using MasterPure Yeast RNA Purification Kit (Epicentre Biotechnologies). cDNA synthesis and labeling was performed as in http://www.md.huji.ac.il/units/tzabam/microarray/Labeling.htm. In brief, cDNA was generated with oligo‐dt primers (2 μg) (Amersham Biosciences) from total RNA (20 μg) using the reverse transcription enzyme superscript II (Invitrogen). The reverse transcription reaction was carried out at 42°C for 2 h with aminoallyl‐dUTP. Removal of unincorporated aa‐dUTP and free amines was carried out using Microcon YM‐30 (Millipore) filters according to the manufacturer's recommendations. Coupling of aminoallyl labeled cDNA to Cye dye esters was performed in 0.1 M sodium carbonate buffer (pH 8.6) for 1 h at room temperature. Removal of free dyes was accomplished with Qiagen PCR purification columns (Qiagen Ltd). The labeling was then quantified using a ND‐1000 spectrophotometer (Nanodrop Ltd). Equal amounts of both samples were mixed and concentrated by speed vacuum for 1 h.
Double spotted microarrays containing 6240 Yeast ORFs printed as cDNA (+ controls, total 6.4 K spots), manufactured by the Genomics Centre, University of Toronto, were pre‐hybridized in 5x SSC, 0.1% SDS, 1% BSA for 45 min. The probes were resuspended in hybridization buffer (25% formamide, 5x SSC, 0.1% SDS, 0.4 μg/μl Yeast tRNA) and applied to the slides. Hybridization was carried out overnight at 42°C in a hybridization chamber (Corning). Arrays were scanned using GenePix 4000B scanner (Axon Instruments) with settings adjusted to obtain a similar green and red overall intensity. The resulting images were analyzed using GenePix pro 4.0 (Axon Instruments). The experiment was done in duplicates using dye swapping. Microarray data have been deposited at the ArrayExpress database under accession number E‐MEXP‐2150.
We thank Karl Kuchler for the Pdr1/3 double KO strain. We also thank Jason Ernst for helpful discussions about this work. This work was supported in part by NIH grant 1RO1 GM085022 and NSF CAREER award 0448453 to ZBJ. In addition, this material is based on work supported under a National Science Foundation Graduate Research Fellowship to AG.
Conflict of Interest
The authors declare that they have no conflict of interest.
Supplementary Results and Methods
This file includes Supplementary Results as referenced in the main text. The Supplementary Methods describe the techniques used to obtain the findings reported in the Supplementary Results and also provide additional details that were omitted from the Materials and Methods section of the main text. [msb200933-sup-0001.pdf]
BLASTP‐defined paralog sets
This table contains the assignment of TFs to paralog sets defined using the BLASTP E‐value and percentage of shared PPI along. In addition, it provides the binding and knockout overlaps for each TF. Only TFs for which we have conservation‐supported binding data and knockout data are included. For TFs without a putative paralog, we calculated shared PPI using approximate paralogs. [msb200933-sup-0002.xls]
YPD and non‐YPD GO terms
This table contains the list of GO terms that were used to assess whether a TF is active only in YPD or if it may be active in non‐YPD conditions as well. [msb200933-sup-0003.xls]
Source data for figure 1 [msb200933-sup-0004.txt]
Source data for figure 2B [msb200933-sup-0005.txt]
Source data for figure 3B [msb200933-sup-0006.txt]
This is an open‐access article distributed under the terms of the Creative Commons Attribution License, which permits distribution, and reproduction in any medium, provided the original author and source are credited. This license does not permit commercial exploitation without specific permission.
- Copyright © 2009 EMBO and Nature Publishing Group