A prime goal in systems biology is the comprehensive use of existing high‐throughput genomic datasets to gain a better understanding of chromatin organization and genome function. In this report, we use chromatin immunoprecipitation (ChIP) data that map protein‐binding sites on the genome, and Hi‐C data that map interactions between DNA fragments in the genome in an integrative approach. We first reanalyzed the contact map of the human genome as determined with Hi‐C and found that long‐range interactions are highly nonrandom; the same DNA fragments are often found interacting together. We then show using ChIP data that these interactions can be explained by the action of the CCCTC‐binding factor (CTCF). These CTCF‐mediated interactions are found both within chromosomes and in between different chromosomes. This makes CTCF a major organizer of both the structure of the chromosomal fiber within each individual chromosome and of the chromosome territories within the cell nucleus.
Recent progress in high‐throughput sequencing has opened new avenues for studying genome structure and its implications for gene regulation. Lieberman‐Aiden et al (2009) recently presented the first contact map of the human genome. They obtained this map using Hi‐C, a method that probes the spatial proximity of DNA fragments in the nucleus. Their results confirmed the existence of chromosome territories and showed that open and closed chromatin compartments are spatially segregated. They determined the distribution of the genomic distances between interacting DNA fragments within chromosomes and proposed that this distribution is compatible with a fractal globular organization of the chromosomal fiber.
Another interesting outcome of Hi‐C experiments is that the interactions are highly nonrandom; the same DNA fragments are often found to interact with each other. In this study, we examine this observation in detail and ask whether these specific interactions can be explained by the action of the CCCTC‐binding factor (CTCF). CTCF is a protein that is highly conserved from fly to human and was recently presented as the ‘master weaver’ of the genome (Phillips and Corces, 2009). Chromosome conformation capture (3C) techniques have highlighted its role in organizing long DNA loops within chromosomes at specific loci (Phillips and Corces, 2009; Zlatanova and Caiafa, 2009; Ohlsson et al, 2010). Evidence of CTCF‐mediated intra‐ and inter‐chromosomal interactions has also been obtained using 4C (an advanced 3C technique) on the mouse Igf2/H19 locus (Kurukuti et al, 2006; Ling et al, 2006; Zhao et al, 2006). In addition to this architectural role, this versatile protein is involved in gene regulation (Phillips and Corces, 2009; Zlatanova and Caiafa, 2009; Ohlsson et al, 2010). Over 13 000 CTCF‐binding sites (CTCF sites) on the human genome have been identified using chromatin immunoprecipitation on microarrays (ChIP‐on‐Chip), enabling the characterization of the specific binding sequence (Kim et al, 2007). This library of binding sites has been extended using ChIP followed by deep sequencing (ChIP‐Seq; Barski et al, 2007) and computational predictions (Xie et al, 2007), yielding an extensive inventory of over 40 000 locations (Bao et al, 2008). We set out to determine whether fragments found to interact in the Hi‐C experiments are associated with a CTCF site. We show that the presence of CTCF sites is highly correlated with the ability of fragments to make strong interactions, both within the same chromosome and between different chromosomes.
Results and discussion
We based our analysis on a Hi‐C experiment (Lieberman‐Aiden et al, 2009) conducted on a human lymphoblastoid cell line (GM06990). The restriction enzyme used (HindIII) cuts the human genome into ∼800 000 fragments, ∼37 000 of which bear at least one CTCF site. The experimental procedure yields an inventory of binary interactions between fragments. Eight million interaction reads were produced. Almost all the fragments in the genome are found in at least one of these reads, and some fragments are found in many interaction reads.
The first question we addressed is whether observing the same interaction many times in the experiment reflects nonrandomness. To answer this question, we first noted that the results of the experiment can be represented as a network in which each node is a DNA fragment and each link represents an interaction between two fragments (Figure 1A). Any two nodes taken at random from this network can be unlinked, linked by one link, or linked by many links. The number of links emanating from a node is called the node degree. Nodes with a high degree correspond to fragments that are found to interact frequently in the experiment, and we can expect such high‐degree nodes to share many links by chance alone. To quantify the statistical significance of the number of interactions between two fragments (i.e. the number of links between two nodes), we created samples of randomized networks (n=100) that preserve the linkage characteristics of the original network, that is, the number of nodes, the number of links, and the node degrees. We then inspected the number of interactions between any two nodes that arose by pure chance and contrasted it with the observed value. We observed significantly higher numbers of interactions between nodes in the actual data than in the randomized networks. Figure 1B shows the distributions of the number of links between node pairs for both the actual network and the randomized networks. The two distributions are significantly different (Kolmogorov–Smirnov test, P<2.2 × 10⁻¹⁶). The same difference is found when considering only interchromosomal interactions (Supplementary Figure 1). This means that the nonrandomness of an interaction between two fragments is not only due to the genomic proximity between those two fragments. At this point, we decided to test the hypothesis that these nonrandom interactions are due to specific factors, the most widely known being CTCF.
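The network representation and the distribution comparison described above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names and the toy reads are hypothetical, and the Kolmogorov–Smirnov statistic is computed directly from the two empirical distributions without deriving a P value.

```python
from bisect import bisect_right
from collections import Counter

def link_multiplicity(reads):
    """Count how many reads link each unordered pair of fragments."""
    return Counter(tuple(sorted(pair)) for pair in reads)

def node_degrees(reads):
    """Count the number of link ends attached to each fragment (node degree)."""
    degrees = Counter()
    for a, b in reads:
        degrees[a] += 1
        degrees[b] += 1
    return degrees

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    values = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in values
    )

# Hypothetical toy reads: each tuple is one interaction read between two fragments.
toy_reads = [("f1", "f2"), ("f2", "f1"), ("f1", "f3"), ("f2", "f3"), ("f1", "f2")]
multiplicity = link_multiplicity(toy_reads)
degrees = node_degrees(toy_reads)
print(multiplicity[("f1", "f2")], degrees["f1"])
```

In such a sketch, the values of `multiplicity` for the observed network and for a randomized network would be the two samples passed to `ks_statistic`.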
We therefore set out to determine whether the fragments that are found in many interaction reads are more likely to have a CTCF‐binding site. We took the following approach:
First, we removed from the data all binary interactions that were present only once, as some of these may well be attributable to noise in the experiment.
Second, we set a threshold n and considered only the fragments that are present in at least n interaction reads.
Lastly, for each value of n ranging from 1 to 100, we computed the corresponding number of fragments (Figure 2A, black line) and the percentage of those fragments that contain at least one CTCF site (Figure 2B, black line).
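The three steps above can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the original analysis code: `ctcf_enrichment`, the toy reads, and the set of CTCF‐bearing fragments are hypothetical, and each read is assumed to join two distinct fragments.

```python
from collections import Counter

def ctcf_enrichment(reads, ctcf_fragments, thresholds):
    """For each threshold n, return (number of fragments in at least n reads,
    percentage of those fragments bearing a CTCF site)."""
    # Step 1: drop binary interactions that were observed only once.
    pair_counts = Counter(tuple(sorted(pair)) for pair in reads)
    reads_per_fragment = Counter()
    for (a, b), count in pair_counts.items():
        if count < 2:
            continue
        reads_per_fragment[a] += count
        reads_per_fragment[b] += count
    results = {}
    for n in thresholds:
        # Step 2: keep fragments present in at least n interaction reads.
        kept = [f for f, c in reads_per_fragment.items() if c >= n]
        # Step 3: count the fragments and the share that contain a CTCF site.
        with_ctcf = sum(1 for f in kept if f in ctcf_fragments)
        percentage = 100.0 * with_ctcf / len(kept) if kept else None
        results[n] = (len(kept), percentage)
    return results

# Hypothetical toy data: fragment "a" is the only one bearing a CTCF site.
toy_reads = [("a", "b")] * 3 + [("a", "c")] * 2 + [("c", "d")]
results = ctcf_enrichment(toy_reads, {"a"}, thresholds=[1, 3, 5])
print(results)
```

Returning `None` when no fragment passes the threshold avoids a division by zero at large n.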
From Figure 2A, we estimate that about 200 000 fragments are found in at least two interaction reads; however, only ∼100 are found in at least 100 reads. Interestingly, the decline in the number of fragments for increasing n does not follow a single rate but clearly has two different components: a fast one for n<10 and a slow one for n>10. In other words, two different kinds of fragments can be distinguished: strongly interacting fragments that correspond to the slow component and weakly interacting fragments that correspond to the fast component. Strong interactions can result either from a stable interaction in a subpopulation of cells or from a weaker but more frequent interaction in a majority of cells. We then computed the proportion of fragments containing CTCF‐binding sites for increasing n and found that strongly interacting fragments are enriched in CTCF sites with respect to weakly interacting fragments (Figure 2B, black line). As n becomes higher than 20, the percentage of fragments containing CTCF reaches ∼40%. These results strongly support the proposed role of CTCF as a major factor in mediating long‐range interactions among distant DNA elements (Phillips and Corces, 2009; Zlatanova and Caiafa, 2009; Ohlsson et al, 2010) and show that hundreds of such interactions are formed within the nucleus of human lymphoblastoid cells.
We then repeated the same analysis considering only interchromosomal interactions. The results are presented in Figure 2A and B with green lines. Out of the ∼200 000 fragments found to interact with another fragment, ∼100 000 are involved in interchromosomal interactions (Figure 2A, green line). The same high proportion of interchromosomal interactions holds for the strong interactions found in the Hi‐C experiment. To verify whether these strong interchromosomal interactions are mediated through CTCF, we computed the percentage of fragments containing CTCF sites involved in these interactions (Figure 2B, green line). We observed that as n increases, the percentage of fragments containing CTCF sites continues to increase, eventually reaching ∼60%. These results suggest that strong interchromosomal interactions found in the human genome can be mediated by CTCF. They point toward CTCF being a key factor in mediating chromosome–chromosome interactions and in organizing chromosome territories in the cell nucleus.
The genomic coordinates of CTCF‐binding sites that we used to compute these correlations come from three different human data sets (Supplementary Table I). These data sets were obtained from different cell types and using different methods. As shown in Figure 3A, the two experimental data sets (Barski et al, 2007; Kim et al, 2007) have an overlap of about 50%, whereas the computationally predicted positions (Xie et al, 2007) for CTCF sites correlate more weakly with the experimentally determined positions. To check whether these three data sets contribute differently to the correlation we observed (Figure 2B), we computed the proportion of fragments containing CTCF‐binding sites for increasing n for each data set separately (Figure 3B). To our surprise, only one (Barski et al, 2007) of these three data sets accounts for all the observed correlation. This difference might be explained either by the technique used (ChIP‐Seq versus ChIP‐on‐Chip or computational predictions) or by the difference in cell type used in the different experiments (Supplementary Table I). In fact, it is likely that both contribute. First, differences in CTCF sites have been reported between fibroblast and erythroid cell lines using the exact same protocol (Hou et al, 2010). Lymphoblastoid cells on which interactions were determined (Lieberman‐Aiden et al, 2009) are more closely related to the CD4+ T lymphocytes used in the ChIP‐Seq analysis (Barski et al, 2007) than to the fibroblast cells used in the ChIP‐on‐Chip experiment (Kim et al, 2007). Second, deep sequencing, which allows probing of the entire genome, is used in both Hi‐C and ChIP‐Seq, whereas ChIP‐on‐Chip can only probe positions predetermined by the oligomers present on the microarray. We noticed that many interacting fragments were found in regions that were not covered by the microarray used in the experiment by Kim et al (2007).
To contextualize the correlation we found between strongly interacting fragments and the presence of CTCF, we repeated the same analysis with other DNA‐binding factors. First, we used six ChIP‐Seq data sets from two factors known to activate transcription (SRF and GABP) in three different cell lines: HeLa cells, lymphoblastoid cells and liver carcinoma cells (Valouev et al, 2008). The results presented in Figure 4A do not show a correlation similar to the one seen with CTCF. This suggests that the correlation we found with CTCF is not simply due to the matching of experimental conditions between ChIP‐Seq and Hi‐C protocols. Second, we mapped on the genome all 132 known DNA‐binding factors that have a specific consensus binding sequence longer than 15 bp (see Materials and methods section) and used the top 50 000 genomic coordinates to repeat the same analysis for each factor.
Three conclusions can be drawn from the results (Figure 4B):
None of these factors’ presence on a fragment correlates with the ability of this fragment to interact strongly as much as CTCF presence, as determined in the experiment of Barski et al (2007), does.
For most of the factors, this correlation is comparable to the same correlation computed for a random 20 bp sequence (Figure 4B, in black). This is also the case for the correlation computed with the consensus sequence for CTCF determined from the experiment by Kim et al (2007). This is consistent with the fact that the experimentally determined positions for CTCF from the Kim et al (2007) data did not correlate with strong interactions.
For some of the factors, this correlation is greater than the same correlation computed for a random sequence. The three factors for which the correlation is the highest are: HNF4, PPARγ and Freac4. Interestingly, these three transcription factors are all known to be expressed in lymphocytes (Su et al, 2004; Jo et al, 2006; Humphreys et al, 2009) and to activate gene transcription. This agrees with the concept that these transcription factors would trigger the formation of transcription factories that recruit active genes, thus mediating strong interactions between the different genes expressed in lymphocytes.
Lastly, we looked in more detail at the eight strongest interactions detected in lymphoblastoid cells (Table I). Seven of these interactions are found between fragments that both contain CTCF sites (for more details on the number of pairs of interacting fragments having both CTCF sites see Supplementary Figure 2). One interaction involves only one fragment containing CTCF sites. Five of these interactions are interchromosomal interactions. We observed that some fragments (such as chr10:11 184 and chr4:14 220) are found to interact with multiple fragments on different chromosomes. These fragments contain several CTCF sites (nine for chr10:11 184 and four for chr4:14 220), suggesting that CTCF can mediate the formation of chromosomal hubs of interactions across chromosomes. Analyzing the eight interactions listed in Table I, we identified one hub gathering fragments from chromosomes 1, 3, 4, 10, and 19. This hub involves centromeres and telomeres, suggesting that repeat sequences have a central function in genome folding and ordering as proposed by Kumar et al (2010). Many examples of repetitive DNA sequence clustering have indeed been reported (de Laat and Grosveld, 2007).
In conclusion, our results show that the Hi‐C data can be used together with ChIP data to characterize the role of CTCF as the master weaver of the human genome and to identify chromosomal hubs of interactions and factors participating in the formation of those hubs.
Materials and methods
Randomization of the interaction network
To create a randomized network, we used a random rewiring procedure on the original network described as follows:
(1) For each node A, we rewired each emanating edge.
(2) For each of these edges (A, B), we picked a random node B′ (A ≠ B′) in the network and rewired the edge to connect A to B′.
(3) If B′ was different from B, B′ had one extra edge and B had one less edge. We then randomly removed an edge connecting B′ to A′ (A′ ≠ B) and created a new edge connecting A′ to B.
(4) We repeated steps 1–3 until all edges had been rewired.
After each rewiring run, we inspected the pairs of nodes that were connected in the original network and gathered the number of interactions found between them in each randomized instance of the network. We then compared the number found in the original network with the average over the randomized networks (see Supplementary Figure 3 for a schematic of the procedure).
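A degree‐preserving randomization in the spirit of the procedure above can be sketched in Python. Note that this sketch uses a standard double‐edge swap, (A, B), (A′, B′) → (A, B′), (A′, B), rather than reproducing the exact node‐by‐node order of the steps listed above. The function names, the toy edge list, and the number of swaps are hypothetical choices made for illustration.

```python
import random
from collections import Counter

def degree_counts(edges):
    """Node degrees of a multigraph given as a list of (node, node) edges."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts

def rewire(edges, n_swaps, rng):
    """Randomize a multigraph with double-edge swaps that keep every degree."""
    edges = list(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        a2, b2 = edges[j]
        # Skip swaps that would create a self-loop.
        if a == b2 or a2 == b:
            continue
        edges[i] = (a, b2)
        edges[j] = (a2, b)
    return edges

def compare_to_random(edges, n_networks, n_swaps, seed=0):
    """Map each originally linked pair to (observed links, mean links at random)."""
    rng = random.Random(seed)
    observed = Counter(frozenset(e) for e in edges)
    totals = Counter()
    for _ in range(n_networks):
        randomized = Counter(frozenset(e) for e in rewire(edges, n_swaps, rng))
        for pair in observed:
            totals[pair] += randomized[pair]
    return {pair: (observed[pair], totals[pair] / n_networks) for pair in observed}

# Hypothetical toy network with one pair linked three times.
toy_edges = [("f1", "f2")] * 3 + [("f3", "f4"), ("f5", "f6"), ("f1", "f5"), ("f2", "f6")]
comparison = compare_to_random(toy_edges, n_networks=100, n_swaps=50)
print(comparison[frozenset({"f1", "f2"})])
```

Each swap leaves the number of nodes, the number of links, and every node degree unchanged, which are the linkage characteristics the randomization is meant to preserve.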
Computing the correlation between strongly interacting fragments and CTCF‐binding sites
Hi‐C data sets were downloaded from Gene Expression Omnibus; GEO accession: GSE18199. CTCF genomic locations were taken from CTCFBSDB (http://insulatordb.uthsc.edu/help.php).
Fragments obtained from Hi‐C were labeled according to the presence of CTCF‐binding sites. The CTCF‐binding sites that span multiple contiguous fragments were assigned to each of those fragments. For each n, we then computed the percentage of fragments involved in at least n interaction reads that contain at least one CTCF‐binding site. We tested for two possible biases in our analysis: fragment length and repetitive sequences (see Supplementary Figures 4 and 5).
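The labeling step, including the assignment of a site to every fragment it spans, can be sketched in Python as follows. The coordinate conventions are assumptions made for this illustration: fragments on one chromosome are described by a sorted list of cut positions, fragment i covers the half‐open interval from `cuts[i]` to `cuts[i + 1]`, and each site is an inclusive (start, end) pair. The function names and toy coordinates are hypothetical.

```python
from bisect import bisect_right

def fragments_for_site(cuts, start, end):
    """Indices of all fragments overlapped by a site spanning start..end."""
    last_index = len(cuts) - 2
    # Locate the fragments holding the first and last base of the site.
    first = min(max(bisect_right(cuts, start) - 1, 0), last_index)
    last = min(max(bisect_right(cuts, end) - 1, 0), last_index)
    return range(first, last + 1)

def label_fragments(cuts, sites):
    """Set of fragment indices that contain at least part of a binding site."""
    labeled = set()
    for start, end in sites:
        labeled.update(fragments_for_site(cuts, start, end))
    return labeled

# Hypothetical toy chromosome: four fragments, two binding sites.
toy_cuts = [0, 100, 250, 400, 600]
toy_sites = [(90, 110), (300, 320)]
labeled = label_fragments(toy_cuts, toy_sites)
print(sorted(labeled))
```

Here the first site straddles the cut at position 100 and is therefore assigned to both neighboring fragments.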
Computing the correlation between strongly interacting fragments and other transcription factor binding sites
The GABP and SRF data sets were downloaded from Gene Expression Omnibus, GEO accession: GSE8489. Position‐specific scoring matrices (PSSM) were downloaded from TRANSFAC release 10.2 (Wingender et al, 1996).
We selected all human transcription factors that have a consensus sequence longer than 15 bp. This resulted in a total of 132 transcription factors. Each TF matrix was used to find binding sites on the human genome (assembly hg18) using PATSER (Hertz and Stormo, 1999), and the 50 000 top‐scoring matches were retained and mapped onto the fragments. This resulted in 29 634 to 46 778 fragments containing at least one TF‐binding site.
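The idea of scoring every genomic position with a matrix and retaining only the top‐scoring matches can be illustrated with a short Python sketch. It does not reproduce PATSER, its scoring scheme, or its options: the matrix is assumed to be a list of per‐position score dictionaries, only the forward strand is scanned, and `scan_top_matches` and the toy matrix are hypothetical.

```python
import heapq

def scan_top_matches(sequence, pssm, top_k):
    """Return the top_k (score, position) matches of a scoring matrix."""
    width = len(pssm)
    scored = (
        (sum(pssm[i][sequence[pos + i]] for i in range(width)), pos)
        for pos in range(len(sequence) - width + 1)
    )
    # Keep only the best-scoring windows without sorting all of them.
    return heapq.nlargest(top_k, scored)

# Hypothetical toy matrix of width 2 that favors the dinucleotide "AC".
toy_pssm = [
    {"A": 1, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 0},
]
hits = scan_top_matches("ACGTAC", toy_pssm, top_k=2)
print(hits)
```

Using `heapq.nlargest` keeps memory use proportional to the number of retained matches, which matters when the scanned sequence is a whole genome.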
The random hypothesis was assessed using 10 random sequences (with the same GC content as the human genome). The black line in Figure 4B presents the average and s.d. of the percentages obtained using those 10 random sequences.
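Generating random sequences with a fixed GC content and summarizing the resulting percentages can be sketched in Python. The GC fraction of 0.41 and the length of 20 bp are assumptions made for this illustration, and the function names and random seed are hypothetical.

```python
import random
import statistics

def random_sequence(length, gc_content, rng):
    """Draw a random DNA sequence whose expected GC fraction is gc_content."""
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    # Weights are given in the order A, C, G, T.
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))

def mean_and_sd(percentages):
    """Average and sample standard deviation across the random sequences."""
    return statistics.mean(percentages), statistics.stdev(percentages)

ASSUMED_GC = 0.41  # assumed approximate GC fraction of the human genome
rng = random.Random(42)
random_sequences = [random_sequence(20, ASSUMED_GC, rng) for _ in range(10)]
print(random_sequences[0])
```

Each random sequence would then be mapped onto the genome in the same way as a real consensus sequence, and `mean_and_sd` applied to the 10 resulting percentages at each n.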
We thank Rob White for careful reading of the paper. This study was supported by funds from EMBO and ANR (ASTF 277‐2009 and ANR‐09‐PIRI‐0024). IL and PL acknowledge financial support from the EC FP7 SOCIALNETS project, 217141 and Queens’ College, Cambridge.
Conflict of Interest
The authors declare that they have no conflict of interest.
Supplementary Materials File #1
Supplementary table S1, Supplementary figures S1–5 [msb201079-sup-0001.doc]
† Joint first authors
This is an open‐access article distributed under the terms of the Creative Commons Attribution License, which permits distribution, and reproduction in any medium, provided the original author and source are credited. This license does not permit commercial exploitation without specific permission.
Copyright © 2010 EMBO and Macmillan Publishers Limited