While the number and identity of proteins expressed in a single human cell type are currently unknown, this fundamental question can be addressed by advanced mass spectrometry (MS)‐based proteomics. Online liquid chromatography coupled to high‐resolution MS and MS/MS yielded 166 420 peptides with unique amino‐acid sequence from HeLa cells. These peptides identified 10 255 different human proteins encoded by 9207 human genes, providing a lower limit on the proteome in this cancer cell line. Deep transcriptome sequencing revealed transcripts for nearly all detected proteins. We calculate copy numbers for the expressed proteins and show that the abundances of >90% of them are within a factor of 60 of the median protein expression level. Comparisons of the proteome and the transcriptome, and analysis of protein complex databases and GO categories, suggest that we achieved deep coverage of the functional transcriptome and the proteome of a single cell type.
An inventory of the building blocks of a biological system is a prerequisite for a systems‐wide understanding of its functions. For human genes this was enabled by the sequencing of the human genome, which yielded the unexpected result that the genome is comprised of a mere 20 000 protein‐coding genes (Clamp et al, 2007). In contrast, the number of distinct transcripts has increased drastically due to the development of very deep—‘next generation’—shotgun sequencing of transcriptomes, termed RNA‐Seq (Mortazavi et al, 2008; Wang et al, 2009). Depending on the nature of the data and analysis criteria (Guttman et al, 2010; Haas and Zody, 2010; Trapnell et al, 2010), transcripts of between 8000 and 16 000 protein‐coding genes expressed from a single cell type can be detected.
High‐resolution mass spectrometry (MS)‐based proteomics has improved at a rapid pace in recent years (Aebersold and Mann, 2003; Mallick and Kuster, 2010; Schwanhausser et al, 2011). These advances have allowed us to quantify an essentially complete proteome of the model organism yeast as judged by comparison with genomic tagging methods (de Godoy et al, 2008). In mammalian systems, in contrast, our depth of analysis in single cell types has typically been limited to 4000–6000 protein groups (proteins distinguishable by identified peptides) (Graumann et al, 2008; Lundberg et al, 2010; Wisniewski et al, 2009a). Here, we set out to explore a human proteome in the depth achievable with current technology and to compare it with the corresponding transcriptome.
Results and discussion
We chose to investigate HeLa cells, a human cervical carcinoma cell line, because it is widely used in research and because a cell line is a more homogeneous system compared with tissues. To achieve maximum proteome coverage while maintaining a reasonable measurement time, we investigated the effects of protein fractionation, proteolytic digestion, peptide fractionation and reverse phase chromatography on the number of proteins identified (Figure 1). We employed moderate fractionation at the protein level by gel filtration, digestion by three specific proteases, combined with pipette‐based prefractionation at the peptide level by strong anion exchange (Wisniewski et al, 2009a) before online LC MS/MS analysis in 4 h gradients with relatively long columns (40 cm, 1.8 μm bead material). Peptide MS spectra as well as fragment MS/MS spectra were measured with high resolution and mass accuracy (Mann and Kelleher, 2008; Olsen et al, 2007; Olsen et al, 2009).
On the basis of initial results (‘Experiment 1’), we generated a data set (‘Experiment 2’), involving 72 fractions and a total measuring time of 288 h, which is the basis of all subsequent discussion. All data files were analyzed together in the MaxQuant computational proteomics environment (Cox and Mann, 2008). A total of 2 337 336 high‐resolution fragmentation spectra, together with the corresponding high‐accuracy precursor masses, were submitted to the Andromeda search engine (Cox et al, 2011). Median peptide score was 121, with only 6% below a score of 60 (Supplementary Figure S1), and the average identification rate of the fragmentation spectra was 43%. Average absolute mass deviation was 1.2 p.p.m. for the precursors and 4.8 p.p.m. for the matched fragment masses. The search identified and quantified 163 784 peptides with unique amino‐acid sequence at a false discovery rate (FDR) of 1%, many of them fragmented multiple times (seven on average). Of these, 84 051 were from tryptic digestion, 52 108 from LysC and 44 704 from GluC. From these data, MaxQuant identified 10 255 proteins with 99% confidence (Figure 1B; Supplementary Table S1), providing a lower bound on the number of proteins expressed in HeLa cells. Trypsin digestion produces peptides in an ideal size range for MS/MS and, consequently, it yielded the highest number of identifications. Of the proteins identified after LysC digestion, 85% overlapped with the trypsin data set, and the GluC data only added another 5.2% of novel identifications. Fewer than 5% of all proteins were identified by only one peptide. Taken together, the three proteases resulted in >24% median sequence coverage of identified proteins.
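The sequence coverage figure quoted above can be illustrated with a minimal sketch. It assumes that coverage is the fraction of residues of a protein that lie inside at least one identified peptide; the function name, the toy sequence and the toy peptides are hypothetical and are not taken from the data set or from MaxQuant.

```python
def sequence_coverage(protein_seq, peptides):
    """Fraction of residues covered by at least one identified peptide."""
    covered = [False] * len(protein_seq)
    for pep in peptides:
        # Mark every occurrence of the peptide in the protein sequence.
        start = protein_seq.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein_seq.find(pep, start + 1)
    return sum(covered) / len(protein_seq) if protein_seq else 0.0


# Toy protein (33 residues) and peptides, hypothetical values for illustration.
toy_protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
toy_peptides = ["TAYIAK", "SHFSR"]
coverage = sequence_coverage(toy_protein, toy_peptides)
```

Pooling the peptide lists obtained with different proteases before calling the function is one way to see how complementary digests can raise the coverage of a protein.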
The 10 255 proteins were mapped to 9207 Ensembl‐annotated human protein‐coding genes (Hubbard et al, 2002). These genes were equally distributed across the different human chromosomes with most and least number of genes identified in chromosomes 1 and 21, respectively (Supplementary Figure S2; Supplementary Table S2). Further, the MS/MS spectra were searched against the ENSEMBL database together with the GENSCAN predictions. This led to >1900 peptides mapping only to the GENSCAN predictions and not to the known ENSEMBL genes. We provide a list of the highest scoring of these peptides, as they may point to as yet unannotated exons (Supplementary Table S3).
To compare the proteome with the transcriptome and to evaluate the completeness of our results, we performed RNA‐Seq on the same cells. Briefly, we acquired 50 million single‐end 76 bp cDNA reads on the Illumina GAIIx platform. Reads were mapped to the human reference genome sequence and assembled into 49 000 unique transcripts (Trapnell et al, 2010) that mapped to 16 554 different protein‐coding genes (Supplementary Table S4). The abundance of the non‐filtered data expressed as Fragments Per Kilobase of exon per Million fragments mapped (FPKM) shows a bimodal distribution (Figure 2A) where about 33% of the transcripts have low signals below one FPKM. When excluding transcripts expressed at abundances lower than one FPKM, the number of genes identified was reduced to about 11 000, and genes corresponding to hundreds of low abundance proteins identified by MS were lost (Figure 2A). We therefore excluded transcripts for which the estimated abundance is lower than their 95% confidence interval (FPKM >Δ95 FPKM). Using this criterion, transcripts for 11 936 protein‐coding genes were detected including a considerable number of transcripts in the low abundance region for which no proteins were detected. These include many genes that are not expected to be functionally relevant in HeLa cells, such as olfactory receptors (Supplementary Figure S3). The distribution of protein abundance values is broader than the filtered mRNA abundance distribution but has the same general shape (Figure 2B). Recently, the bimodal distribution of the transcriptome has been investigated in detail. The transcripts in the left part of the distributions appear to be present at less than one copy per cell and often code for functions not represented in the cell type (Hebenstreit et al, 2011). Therefore, it is possible that many of these transcripts are not expressed as proteins.
Together, the data suggest that the detected proteome covers a very large part of the transcripts coding for functional proteins.
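The FPKM measure and the confidence‐interval filter described above can be sketched as follows. The FPKM formula follows directly from its name (fragments per kilobase of exon per million fragments mapped). The filter is an interpretation: it assumes that Δ95 denotes the half‐width of the 95% confidence interval reported for each transcript, which the text does not state explicitly. All function and parameter names are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million fragments mapped."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def passes_ci_filter(fpkm_value, ci_low, ci_high):
    """Keep a transcript only if its FPKM exceeds delta95.

    Assumption: delta95 is taken here as the half-width of the 95%
    confidence interval; the source only writes 'FPKM > delta95 FPKM'.
    """
    delta95 = (ci_high - ci_low) / 2.0
    return fpkm_value > delta95


# Toy example: 500 fragments on a 2 kb transcript, 50 million mapped fragments.
example_fpkm = fpkm(500, 2000, 50_000_000)
well_supported = passes_ci_filter(5.0, 4.0, 6.0)   # narrow interval
poorly_supported = passes_ci_filter(0.2, 0.0, 0.9)  # interval wider than estimate
```

Unlike a fixed cutoff of one FPKM, a filter of this kind can retain a low‐abundance transcript as long as its estimate is well constrained.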
We compared the transcriptome and proteome on the basis of the ENSEMBL gene annotation. For 94% of genes for which a protein was identified by MS, a corresponding mRNA was detected (Figure 2C). Analysis of membrane proteins and regulatory proteins is often challenging in proteomics but Gene Ontology (GO) analysis showed similar percentages of transcripts and proteins for these categories, demonstrating that there were no such biases in the proteomic data (Figure 2D). This is likely the result of essentially complete solubilization of the proteome in SDS in the FASP procedure (Wisniewski et al, 2009b) combined with the overall depth of analysis.
The MS signal of peptides identifying each protein can be used to estimate its absolute cellular abundance (de Godoy et al, 2008; Malmstrom et al, 2009; Silva et al, 2006) in a similar way that the FPKM is a proxy for the abundance of transcripts. To calculate the approximate abundance of each protein we used the iBAQ algorithm (Schwanhausser et al, 2011), which normalizes the summed peptide intensities by the number of theoretically observable peptides of the protein. These normalized protein intensities are translated to protein copy number estimates based on the overall protein amount in the analyzed sample. We obtained good agreement with independently determined absolute copy numbers of 37 HeLa proteins (Zeiler et al, 2011; Supplementary Table S5). FPKM‐based transcript abundance values correlate well with iBAQ‐based protein abundance values (Spearman's correlation 0.6; Figure 2E). The use of high‐resolution MS and RNA‐Seq may account for the fact that higher correlations between transcriptomes and proteomes are observed here than in previous studies (Cox and Mann, 2011; de Sousa Abreu et al, 2009; Maier et al, 2009), where technical imperfections in the quantification of both the proteome and the transcriptome are likely to have reduced their apparent correlations.
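The iBAQ normalization described above can be sketched in a few lines. The sketch assumes that the theoretically observable peptides are fully tryptic peptides of 6 to 30 residues obtained by cleaving after every lysine or arginine, with missed cleavages and the proline rule ignored; these limits and simplifications are assumptions of the sketch, and the function names are hypothetical rather than part of MaxQuant.

```python
import re


def tryptic_peptides(seq, min_len=6, max_len=30):
    """In silico digest: cleave after K or R, keep peptides in the length window.

    Assumption: no missed cleavages and no proline rule.
    """
    pieces = re.findall(r"[^KR]*[KR]|[^KR]+$", seq)
    return [p for p in pieces if min_len <= len(p) <= max_len]


def ibaq(peptide_intensities, protein_seq):
    """Summed peptide intensity divided by the number of observable peptides."""
    n_observable = len(tryptic_peptides(protein_seq))
    if n_observable == 0:
        return 0.0
    return sum(peptide_intensities) / n_observable


# Toy protein: two observable peptides (6 and 7 residues) and one too short.
toy_seq = "AAAAAK" + "CCCCCCR" + "DDK"
toy_ibaq = ibaq([4.0, 6.0, 2.0], toy_seq)
```

Dividing by the number of observable peptides corrects for the fact that longer proteins produce more peptides and therefore more summed signal at the same molar amount.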
To assess the completeness of the detected proteome, we first inspected macromolecular complexes for which all core members are presumably functionally necessary. Most such complexes, including the proteasome, spliceosome, histone‐modifying complexes and respiratory chain complexes, were completely represented according to the Corum protein complex database (Supplementary Figure S4A). Mean proteome coverage of all Corum complexes was >95%, slightly less than the corresponding transcriptome coverage (96.5%). The sarcoglycan–sarcospan complex (normally expressed in the muscle), SNARE complexes (abundant in neuronal tissue) and the ITGA2b–ITGB3 complex (normally expressed in platelets) were among the complexes with lower coverage (20, 40 and 50%, respectively), likely due to cell type specificity. Even though only 5% of our HeLa cell population was in mitosis, we covered 61 of 63 proteins in a reference set of cell cycle‐specific proteins (Jensen et al, 2006). Our data set also has a very high coverage of most metabolic pathways pertaining to basic cellular functions. Comprehensiveness of the proteome is difficult to determine by comparison with pathway databases because they contain cell type‐specific proteins. Nevertheless, judged against the coverage of pathways achieved by deep‐sequencing transcriptomics, the proteomics data were >90% complete (Supplementary Figure S4B). Together, the transcriptome and proteome data suggest that at least 10 000–12 000 genes are expressed in HeLa cells.
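The complex coverage values discussed above amount to a simple set calculation. The following sketch assumes that coverage of a complex is the fraction of its annotated member genes that were detected; the complex names and gene identifiers are hypothetical placeholders, not entries from the Corum database.

```python
def complex_coverage(complexes, detected):
    """Return {complex name: fraction of member genes present in `detected`}."""
    return {
        name: len(members & detected) / len(members)
        for name, members in complexes.items()
        if members  # skip complexes without annotated members
    }


# Hypothetical toy annotation and detection list.
toy_complexes = {
    "complex_A": {"GENE1", "GENE2", "GENE3", "GENE4"},
    "complex_B": {"GENE5", "GENE6"},
}
toy_detected = {"GENE1", "GENE2", "GENE3", "GENE4", "GENE5"}

per_complex = complex_coverage(toy_complexes, toy_detected)
mean_coverage = sum(per_complex.values()) / len(per_complex)
```

Running the same function once with the genes detected by MS and once with the genes detected by RNA‐Seq gives the two mean coverage values that are compared in the text.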
The iBAQ values determined above estimate the absolute amount of each protein, incorporating individual peptide signals in MS and normalized by the number of observable peptides of the protein. The 40 most abundant proteins comprised 25% of the proteome (Figure 3A; Supplementary Table S6) with filaminA, pyruvate kinase, enolase, vimentin and Hsp 60 contributing >1% each. The most abundant 600 proteins constitute 75% of HeLa cell proteome mass (sum of all iBAQ values). The individual contribution of each protein to the total mass in combination with the knowledge of number of cells in the initial sample was used to roughly estimate the absolute copy number of the proteins in HeLa cells. The ranked distribution of all individual proteins revealed that 90% of the quantified proteome is contained within a range of a factor of 60 above or below the median protein copy number of 18 000 molecules per cell (Figure 3B; Supplementary Table S7). The lower half of the proteome accounts for <2% of its total mass. The abundance distribution of the transcriptome is generally similar but its range is compressed compared with the proteome with 90% of the transcriptome contained in a 500‐fold expression range and 2000 transcripts accounting for 75% of the total transcriptome mass.
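The two summary statistics used above, the share of total signal contributed by the most abundant proteins and the fraction of proteins within a given fold range of the median, can be computed as in the following sketch. The function names and the toy abundance values are hypothetical; the real values are in the supplementary tables.

```python
from statistics import median


def top_n_share(values, n):
    """Share of the summed abundance contributed by the n largest values."""
    ranked = sorted(values, reverse=True)
    return sum(ranked[:n]) / sum(ranked)


def fraction_within_fold(values, fold):
    """Fraction of values lying within `fold` above or below the median."""
    med = median(values)
    inside = [v for v in values if med / fold <= v <= med * fold]
    return len(inside) / len(values)


# Toy copy numbers spanning four orders of magnitude (hypothetical).
toy_copies = [1, 10, 100, 1000, 10000]
share_top1 = top_n_share(toy_copies, 1)
within_10fold = fraction_within_fold(toy_copies, 10)
```

Applied to the iBAQ‐derived copy numbers, `fraction_within_fold(values, 60)` corresponds to the factor‐of‐60 statement, and `top_n_share(values, 40)` to the share of the 40 most abundant proteins.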
The protein abundance values can also be used to estimate the proportional contribution of any individual protein, protein complex and protein class to the total proteome. For example, ribosomes, which are encoded by only 1% of human genes and for which we identified 195 different proteins, contributed 6% to total protein mass in our data (Figure 3C). Similarly, the actin cytoskeleton, as classified by GO (Ashburner et al, 2000) annotation, contributes four‐fold more to the proteome mass than expected from the number of genes and proteins. ‘Protein folding’ is achieved by <2% of the identified proteome by numbers but requires 8% of proteome mass, in line with the high abundance of heat‐shock and similar proteins (Figure 3D). In contrast, integral membrane proteins account for 25% of the genome but contribute much less to the transcriptome and the proteome (7.6% of total protein mass). This presumably reflects the often cell type‐specific functions of these proteins (Lundberg et al, 2010; Ramskold et al, 2009).
Structural proteins and proteins in basic cellular machineries are known to be much more abundant than regulatory proteins; however, the generality of this rule could not previously be evaluated. Ribosomal proteins indeed formed a tight cluster at the top end of the distribution of transcript and protein expression levels (Figure 3E). This was also true of the core components of the proteasome, but not its regulatory subunits, which were up to a factor of 100 less abundant. Interestingly, the abundance of cytoskeletal proteins extended over a broad range from the most abundant proteins and transcripts to the medium and low abundance parts of the distribution. Metabolic enzymes are likewise generally considered to be an abundant class of proteins, but we found that they extend over almost the entire distribution of the transcriptome and proteome expression (Figure 3F). Enolase was the protein with the highest expression value, while glycogen phosphorylase (muscle form) was expressed 100 000‐fold less at the protein level and 10 000‐fold less at the transcript level. Large differences in expression levels of different metabolic enzymes have also been observed in recent targeted proteomics experiments in yeast (Picotti et al, 2009). As expected, our data show that regulatory proteins such as protein kinases and transcription factors have, on average, lower expression than the structural proteins discussed above. However, each of these categories spans a large expression range and surprisingly many of their members are in the top 25% of the proteome. Allowing these and similar comparisons of estimated expression levels of individual proteins and protein classes, as well as the corresponding transcripts, our data can provide starting points for systems biological modeling of the cell.
RNA‐Seq already covers virtually the entire functional transcriptome. Ultra‐deep mapping of the proteome is now also becoming possible with proteins identifiable for nearly all transcripts with an expected biological function in the cell type. Thus, both transcriptomics and proteomics are approaching completeness. Given the rapid technological progress in both fields, we predict that the required depth of 10 000–12 000 genes will be routinely reachable soon.
Materials and methods
HeLa cell lysate
Cell pellets were flash frozen in liquid nitrogen and stored at −80°C. Cells were lysed in a buffer consisting of 0.1 M Tris–HCl, pH 8.0, 0.1 M DTT, and 2% SDS at 99°C for 5 min. After chilling to room temperature, the lysates were sonicated using a Branson type sonicator and then were clarified by centrifugation at 16 100 g for 10 min. Protein content was determined using a Fluorescence Spectrometer.
Protein fractionation by gel filtration
In all, 0.100 ml of the cell lysate containing 10 mg of total protein was loaded onto a Superdex 200 10/300 GL column (GE Healthcare Bio‐Sciences AB, Uppsala) equilibrated with TNS buffer composed of 0.1 M Tris–HCl, pH 8 buffer, 0.1 M NaCl and 0.2% SDS. Proteins were eluted with TNS buffer and 2 ml fractions were collected.
Protein digestion and peptide fractionation
Detergent was removed from the lysates and the proteins were digested with trypsin, LysC, or GluC following the FASP protocol (Wisniewski et al, 2009b) with ultrafiltration units of nominal molecular weight cutoff of 30 000 (Cat No. MRCF0R030, Millipore). The eluted peptides were fractionated according to the previously described pipette tip protocol (Wisniewski et al, 2009a).
The peptides were purified on StageTips (Rappsilber et al, 2007). Eluted peptides were separated on a reverse phase C18 column (40 cm long, 75 μm i.d., 1.8 μm beads, Dr Maisch GmbH, Germany) using the EASY‐nLC system (Proxeon Biosystems, now Thermo Fisher Scientific). MS analysis was performed using an LTQ‐Orbitrap Velos instrument (Thermo Fisher Scientific; Olsen et al, 2009). Data were acquired in data‐dependent mode. The survey scans were acquired at a resolution of 30 000 at m/z=400 in the Orbitrap analyzer followed by up to 10 fragmentation events (HCD) in the collision cell. The fragment ions were also detected in the Orbitrap analyzer, resulting in high‐resolution and high‐accuracy fragmentation spectra.
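The data‐dependent acquisition scheme described above (one survey scan followed by up to 10 fragmentation events) can be illustrated with a minimal sketch. It assumes that precursors are chosen in order of decreasing intensity, which is a common convention but is not stated in the text; the instrument's actual selection logic is more involved. The function name and the toy scan are hypothetical.

```python
def select_precursors(survey_peaks, top_n=10):
    """Pick up to `top_n` precursor m/z values from one survey scan.

    `survey_peaks` is a list of (mz, intensity) pairs.
    Assumption: the most intense peaks are fragmented first.
    """
    ranked = sorted(survey_peaks, key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _intensity in ranked[:top_n]]


# Toy survey scan with 12 peaks of increasing intensity (hypothetical).
toy_scan = [(400.0 + i, float(i * 100)) for i in range(12)]
chosen = select_precursors(toy_scan)
```

With 12 candidate peaks and a limit of 10, the two weakest peaks are left unfragmented in this cycle.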
Total RNA was extracted from HeLa cell pellets using the RNeasy Mini Spin columns protocol from Qiagen and an elution volume of 50 μl. RNA quality (RIN 10) and quantity (∼1 μg/μl) were assessed using an Agilent RNA 6000 LabChip. The RNA extracts were stored at −80°C. The Illumina RNA‐seq sample preparation protocol and kit (RS‐100‐0801) as well as the Illumina Paired End library preparation protocol and kit (PE‐102‐1001) were used for library preparation. Briefly, total RNA was enriched for poly‐A tailed transcripts using magnetic beads with poly‐T oligonucleotide coating. The enriched RNA was fragmented into small pieces using divalent cations and elevated temperature (94°C, 5 min). RNA fragments were copied into cDNA using a reverse transcriptase and random priming (Invitrogen SuperScript II). Second‐strand synthesis was performed in the same reaction using RNaseH and DNA polymerase I. Overhangs were converted into blunt ends using T4 DNA polymerase (5′ overhang fill‐in) and Klenow DNA polymerase (3′–5′ exonuclease activity). A deoxyadenosine was added to the 3′ end of the blunt and phosphorylated DNA fragments using the polymerase activity of Klenow fragment. T4 DNA ligase was used to ligate forked adapters and a gel length selection was performed (∼200 nt insert size). Molecules were then amplified with overhanging primers that extend the adapters to the final length required for sequencing.
The library was sequenced on two Illumina Genome Analyzer IIx lanes following vendor instructions for Multiplex Single Read sequencing and using 76+7 cycles. Protocols were followed except that an indexed φX174 control library was spiked into each lane, yielding about 1% of sequencing reads per lane. The φX174 control reads were aligned to the corresponding reference sequence to obtain a training data set for the base caller Ibis (Kircher et al, 2009), which was then used to generate base calls and quality scores.
All RNA‐seq sequence data is available from the European Nucleotide Archive (ENA) under the study accession ERP000959, and from ArrayExpress under accession number E‐MTAB‐823. All mass spectrometric raw files are uploaded to TRANCHE and can be accessed using the following hash codes: Hela_01_trypsin;phajxUWNFSW8gBCd3Hash:dLuhvyddHELlkrXVJa1QYTHGOdFDttpFksh8iBqBT4kNyESmVFznzAtXe4qS+9OCtJ//9y7DfdlcEIotcGCerr/ytCUAAAAAAAAWwQ==;Hela_01_LysC;GRtGG4GkZoo6pYZEbyd0 Hash:r6G4xDnc8deuSSpRMDkYk7hJsjvuWrMFoJGenuTEdtYN3zMhGDXaOl/QheYipLUoe/37f1lrYS+GQhRgDH+K5gfKns4AAAAAAAAWNg==;Hela_01_GluC;34NGEzbCmXHXr09aPqOV;Hash:GGDWG1xveOYXVD5DkiSVybfbp41fzZzeNiDJdVCcOmmXaFjLTNdOzOIPO0aCXkvnInsZ2kO4hvq3WZ9IW+O8yenB+NQAAAAAAAAY7Q==Hela_02_trypsin;gfAYWK0ljixAdVddEQH5;Hash:6YBO0zZhlHORAXzJ;+UqC4i6tlnlLw5OAV5lOzkoW1dYVueWQD9M6k+4YvQ/43iE7kalH+3LPJT5wqq27TlG/zdXNJeAAAAAAAAAsfg==;Hela_02_LysC;hUU1ZRgB61kmdtEJHmX4;Hash:Bz9hlKJ5EaEq/rgoVH0+fHehRgTSaCcD2;879Q1JnJm3d9sFaCpNgFnPPZT9WFu5K5mXKz8o1B9qaK7WBFxdFPu2ThkAAAAAAAAPmA==Hela_02_GluC;qEFG57NWsYggbpjHmQ5H;Hash:LEqiT5pWYpusY/SWaXJw8A3GcRAspRucqyb6L/nKSG9AywRpBL8hkBn8r+sZP3fXTWC2PoLNmhOpqkbg6lQR63GHeyAAAAAAAAAftQ==.
Gene and transcript quantification
Raw reads of two sequencing lanes were combined, adapters were trimmed, and reads shorter than 70 nt, or with more than five bases below a quality score of 15 (PHRED‐scale), were removed. The processed reads were aligned to the human reference genome (hg19/GRCh37 excluding additional haplotypes) using TopHat v1.0.13 (Trapnell et al, 2009) and transcripts and genes of the Ensembl (Hubbard et al, 2009) release 59 were quantified using Cufflinks v0.8.3 (Trapnell et al, 2010). This method allows up to 40 equally good mappings of a read. In cases where a read can be mapped to multiple transcripts, each transcript is assigned a weight of one divided by the number of mappings in the quantification step. If >40 potential mapping locations are identified, then the read is not considered for quantification.
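The read filter and the treatment of multi‐mapping reads described above can be sketched as follows. The thresholds (70 nt, quality score 15, five low‐quality bases, 40 mappings) come from the text; the function names and data layout are hypothetical, and the sketch is an illustration of the stated rules rather than the code of TopHat or Cufflinks.

```python
def keep_read(qualities, min_len=70, min_q=15, max_low=5):
    """Keep a read if it is long enough and has few low-quality bases.

    `qualities` holds one PHRED score per base of the trimmed read.
    """
    if len(qualities) < min_len:
        return False
    low_quality_bases = sum(1 for q in qualities if q < min_q)
    return low_quality_bases <= max_low


def mapping_weights(transcripts, max_mappings=40):
    """Split one read equally across the transcripts it maps to.

    Reads with no mapping, or with more than `max_mappings` mappings,
    contribute nothing. `transcripts` is assumed to hold unique identifiers.
    """
    n = len(transcripts)
    if n == 0 or n > max_mappings:
        return {}
    return {t: 1.0 / n for t in transcripts}


# Toy cases (hypothetical reads and transcript identifiers).
good_read = keep_read([30] * 76)
short_read = keep_read([30] * 69)
noisy_read = keep_read([30] * 70 + [10] * 6)
weights = mapping_weights(["t1", "t2", "t3", "t4"])
```

A read that maps equally well to four transcripts therefore adds a quarter of a count to each of them.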
Raw files from MS analysis were processed using the MaxQuant computational proteomics platform (Cox and Mann, 2008) version 220.127.116.11. The peak list generated was searched against the IPI human database (ipi.HUMAN.v3.68.fasta) with initial precursor and fragment mass tolerance set to 7 and 20 p.p.m., respectively. Peptides with minimum of six amino‐acid length were considered with both the peptide and protein FDR set to 1%.
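The mass tolerances above are relative errors expressed in parts per million. A minimal sketch of how such a tolerance can be checked is given below; the function names and the example m/z values are hypothetical and do not reproduce the search engine's internal scoring.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass deviation in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz, theoretical_mz, tol_ppm):
    """True if the absolute deviation does not exceed `tol_ppm`."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm


# Toy values: a deviation of 0.0025 at m/z 500 corresponds to about 5 p.p.m.
example_error = ppm_error(500.0025, 500.0)
precursor_ok = within_tolerance(500.0025, 500.0, 7)   # about 5 p.p.m., accepted
precursor_bad = within_tolerance(500.005, 500.0, 7)   # about 10 p.p.m., rejected
```

Because the tolerance is relative, the same p.p.m. window corresponds to a wider absolute m/z window for heavier ions.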
All MS data were mapped to gene identifiers obtained from Ensembl for comparison with the RNA‐seq data. For the quantitative analysis, the iBAQ intensity and the FPKM values were used for proteome and transcriptome data, respectively.
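The gene‐level comparison described above can be sketched as a join on gene identifiers followed by a rank correlation. The sketch implements Spearman's correlation as the Pearson correlation of average ranks, which is the standard definition; the gene identifiers and abundance values are hypothetical toy data, and the function names are not from any of the tools used in the study.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rank correlation."""
    return pearson(average_ranks(x), average_ranks(y))


def paired_by_gene(ibaq_by_gene, fpkm_by_gene):
    """Abundance pairs for genes present in both data sets."""
    shared = sorted(set(ibaq_by_gene) & set(fpkm_by_gene))
    return [ibaq_by_gene[g] for g in shared], [fpkm_by_gene[g] for g in shared]


# Hypothetical toy data keyed by gene identifier.
toy_ibaq = {"g1": 10.0, "g2": 200.0, "g3": 3000.0, "g4": 5.0}
toy_fpkm = {"g1": 1.0, "g2": 4.0, "g3": 9.0, "g5": 2.0}
protein_values, transcript_values = paired_by_gene(toy_ibaq, toy_fpkm)
rho = spearman(protein_values, transcript_values)
```

A rank‐based measure is a natural choice here because iBAQ and FPKM values are on different scales and both span several orders of magnitude.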
We thank Peter Bandilla for MS assistance, Birgit Nickel and Ayinuer Aximu for preparing the RNA‐seq library and Marlis Zeiler for helpful discussions. This work was funded by the Max Planck Society, by the European Commission's 7th Framework Program PROteomics SPECification in Time and Space (PROSPECTS, HEALTH‐F4‐2008‐201648), by the Munich Center for Integrated Protein Science (CIPSM) and by the Knut and Alice Wallenberg Foundation.
Author contributions: NN performed the MS analysis, analyzed the data and wrote the manuscript; JRW designed the experiment and prepared the samples; TG analyzed the data and wrote the paper; JC analyzed the data; MK, JK and SP performed the RNA‐Seq and analyzed the data; MM initiated and supervised the work and wrote the manuscript.
Conflict of Interest
The authors declare that they have no conflict of interest.
Supplementary figures 1–4 [msb201181-sup-0001.pdf]
Supplementary Table S1
The version of the supplementary data set originally posted online on 8 November 2011 did not contain the full information due to a formatting error. The full file was uploaded as of 20 December 2012. [msb201181-sup-0002.xlsx]
Supplementary Table S2 [msb201181-sup-0003.xls]
Supplementary Table S3 [msb201181-sup-0004.xls]
Supplementary Table S4 [msb201181-sup-0005.xls]
Supplementary Table S5 [msb201181-sup-0006.xls]
Supplementary Table S6 [msb201181-sup-0007.xls]
Supplementary Table S7 [msb201181-sup-0008.xls]
This is an open‐access article distributed under the terms of the Creative Commons Attribution License, which permits distribution, and reproduction in any medium, provided the original author and source are credited. This license does not permit commercial exploitation without specific permission.
- Copyright © 2011 EMBO and Macmillan Publishers Limited