Large‐scale cancer genome sequencing has uncovered thousands of gene mutations, but distinguishing tumor driver genes from functionally neutral passenger mutations is a major challenge. We analyzed 800 cancer genomes of eight types to find single‐nucleotide variants (SNVs) that precisely target phosphorylation machinery, important in cancer development and drug targeting. Assuming that cancer‐related biological systems involve unexpectedly frequent mutations, we used novel algorithms to identify genes with significant phosphorylation‐associated SNVs (pSNVs), phospho‐mutated pathways, kinase networks, drug targets, and clinically correlated signaling modules. We highlight increased survival of patients with TP53 pSNVs, hierarchically organized cancer kinase modules, a novel pSNV in EGFR, and an immune‐related network of pSNVs that correlates with prolonged survival in ovarian cancer. Our findings include multiple actionable cancer gene candidates (FLNB, GRM1, POU2F1), protein complexes (HCF1, ASF1), and kinases (PRKCZ). This study demonstrates new ways of interpreting cancer genomes and presents new leads for cancer research.
Phosphorylation sites of human proteins are frequently mutated in cancer. Statistical analysis of phosphorylation‐associated single nucleotide variants (pSNVs) predicts novel cancer drivers and phospho‐mutation mechanisms in known cancer genes.
We designed the ActiveDriver method to identify significantly mutated signaling regions in proteins. ActiveDriver is complementary to standard frequency‐based methods of mutation significance and helps interpret rare, but site‐specific mutations.
Analysis of somatic mutations in 800 cancer genomes reveals dozens of known and novel cancer genes, including potential drivers that are apparent only when integrating multiple cancer types.
Pathway and network analysis identifies systems with significantly enriched pSNVs, including kinase modules and protein complexes.
Clinical data analysis identifies phospho‐mutations of TP53 that correlate with prolonged patient survival in ovarian and brain cancer. Kinase network analysis highlights multiple survival‐associated signaling modules with pSNVs.
A major goal of cancer research is to characterize somatic molecular alterations in tumor cells and identify systems driving cancer progression. These aberrations range from small DNA mutations to genomic copy number alterations, changes in chromatin structure and gene expression. To map such changes, international research consortia are collecting thousands of genome‐wide molecular profiles of dozens of cancer types (Collins and Barker, 2007; The International Cancer Genome Consortium, 2010). Tumor genome sequencing has revealed a complex landscape of somatic DNA mutations in cancers of multiple types and tissues, including breast, colon, lung, liver, brain, ovary, pancreas, and blood (Wood et al, 2007; Cancer Genome Atlas Research Network, 2008, 2011; Ding et al, 2008; Jones et al, 2008; Parsons et al, 2008; Puente et al, 2011; Totoki et al, 2011). Most identified somatic mutations are thought to be functionally neutral ‘passengers’, caused by the increased mutation rate in cancer cells, whereas relatively few driver mutations provide selective advantages to tumor cells and are responsible for tumor initiation, maintenance, progression, and metastasis. Discovery of cancer drivers will provide insight into the biology of tumor development, and reveal diagnostic or predictive markers and new avenues of therapy development.
While a number of drivers are well recognized (e.g., TP53, KRAS, and EGFR), due to their frequent mutation rate in tumors and biological characterization, they are not sufficient to explain the phenotypic diversity of cancer and many more drivers likely exist. Of particular interest are driver mutations that occur in multiple cancer types; for instance, the druggable BRAF V600E mutation in melanoma and hairy cell leukemia (Chapman et al, 2011; Tiacci et al, 2011). Targeted drug development for such mutations may lead to effective multi‐cancer therapies. However, most cancer mutations are not highly recurrent, and the observed long tail of infrequently mutated genes indicates the heterogeneous and complex nature of the disease. One likely explanation is that a cancer‐related cellular system can be modified in multiple ways to create a neoplastic advantage, and therefore different, rare mutations can have similar phenotypic effects. Comprehensive systems‐oriented analysis of integrated cancer data sets may therefore reveal novel, therapeutically relevant cancer driver genes, protein complexes, and pathways.
Tumor development involves a number of pathways with specific protein interactions and post‐translational amino‐acid modifications (Hanahan and Weinberg, 2011). In particular, protein phosphorylation is central in many hallmark cancer processes and is often misregulated in the disease. Phosphorylation is a dynamic, reversible post‐translational modification involving three major activities. Protein kinases act as writers by adding phosphate groups to serine (S), threonine (T), and tyrosine (Y) residues in substrate proteins, and phosphatases are erasers involved in dephosphorylation. Proteins with phosphorylated residue binding domains (e.g., SH2, PTB, 14‐3‐3) are readers that mediate context‐specific protein interactions (Lim and Pawson, 2010). Phosphorylation can also regulate proteins by causing a conformation change or interfering with interactions with other molecules. Products of known cancer genes are enriched in kinases such as EGFR and SRC, while others such as TP53 and CTNNB1 are regulated by phosphorylation (Morin et al, 1997; Chao et al, 2006). Phosphorylation is a prime target of drug development, in particular via protein kinases, and a growing class of phosphorylation‐related targeted agents such as EGFR inhibitors are now in clinical use (Hynes and Lane, 2005). Many focused studies and large‐scale mass spectrometry experiments have mapped thousands of phosphorylation sites in human proteins (Dephoure et al, 2008; Nagano et al, 2009; Van Hoof et al, 2009; Olsen et al, 2010). This information helps establish a link between phosphorylation signaling and genomic alterations that can be exploited to identify mutations that affect signaling systems in cancer.
We performed a comprehensive analysis of somatic cancer mutations affecting protein phosphorylation. We hypothesize that statistically unexpected mutation rates in phosphorylation‐specific regions within genes and pathways identify these as cancer drivers. We systematically analyzed 800 cancer genomes and thousands of somatic mutations in the context of phosphorylation sites, protein kinase domains, pathways, protein complexes, and kinase–substrate networks. We applied a novel, sensitive statistical model to find genes with unexpected phosphosite and kinase mutations, and carried out pathway and network analyses to identify significantly altered phosphorylation systems. We also detected network modules that significantly correlate with clinical outcome that may be helpful for diagnosis or prognosis.
Global analysis reveals enrichment of cancer mutations in phosphosites and signaling networks
To investigate alterations of phosphorylation signaling in cancer, we integrated multiple data sets of somatic cancer mutations, protein phosphorylation events, kinase–substrate interactions and clinical data (Figure 1A–C). We accumulated 73 872 experimentally determined phosphorylation events in 10 279 phosphoproteins from three public databases (Keshava Prasad et al, 2009; Dinkel et al, 2011; Hornbeck et al, 2012). We collected 10 900 missense single‐nucleotide variants (SNVs) from 793 samples of eight cancer types: serous ovarian adenocarcinoma (Cancer Genome Atlas Research Network, 2011), breast and colorectal cancer (Wood et al, 2007), pancreatic cancer (Jones et al, 2008), lung adenocarcinoma (Ding et al, 2008), glioblastoma multiforme (Cancer Genome Atlas Research Network, 2008; Parsons et al, 2008), chronic lymphocytic leukemia (Puente et al, 2011), and hepatocellular carcinoma (Totoki et al, 2011) (Figure 1D). To focus on phosphorylation‐associated mutations, we only considered SNVs that directly altered kinase domains or 15‐residue ‘phosphosites’ comprising a central phosphorylated S, T or Y residue and two 7‐residue flanking sequences. Overlapping phosphosites were merged into continuous regions, which covered ∼7% of the entire non‐redundant human proteome. This procedure highlighted 775 distinct genes containing 949 phosphorylation‐specific SNVs (pSNVs) in one or more cancer types (Figure 1E), covering 58% (459) of all studied cancer samples. This set includes 650 genes with phosphosite pSNVs, 189 genes with kinase domain pSNVs and 63 genes with both types of pSNVs (Supplementary Tables 1 and 2). We also compiled a kinase‐substrate network of 4823 interactions, 379 kinases, and 1879 substrates, involving about 11% (7778) of phosphosites (Supplementary Table 3).
Several global trends confirm the functional significance of phosphosite and kinase pSNVs in cancer. First, phosphosite variants are more frequent than expected, given the genome‐wide somatic mutation rate of studied cancer samples (949 observed, 751±27 expected, s.d., P=2.5 × 10−12, binomial test, Figure 1F). Second, kinase‐targeted phosphoproteins are enriched in mutations: 44% (284) of genes with phosphosite‐specific mutations are substrates of known kinases, while only 18% (118±10) are expected (P=1.0 × 10−54, Fisher's exact test). Consequently, 57% (2744) of kinase–substrate interactions are affected by pSNVs (Supplementary Figure 1). Third, pSNVs tend to affect topologically central and highly interacting genes in the kinase–substrate network (P=6.5 × 10−25 and P=7.0 × 10−32, Wilcoxon test, Supplementary Figure 2). While this could be due to ascertainment bias in network determination, centrality in our network likely indicates importance in cancer, as known cancer genes also show increased centrality and betweenness (P=2.3 × 10−13 and P=7.3 × 10−6, respectively). These results demonstrate the general extent of somatically altered phosphorylation signaling in tumor cells.
Focusing further on phosphosites, we observe 78 direct phosphosite mutations that replace the central S/T/Y residue with a non‐phosphorylatable residue (Supplementary Table 4). We also find 37 potentially phosphomimetic mutations introducing negatively charged aspartic and glutamic acid residues that physicochemically resemble residues with constitutive phosphorylation (Supplementary Table 5). Both lists are enriched in known cancer genes, suggesting that pSNVs disable tumor‐suppressing phosphorylation and activate signaling that promotes tumorigenesis (n=13, P=2.4 × 10−8 and P=3.3 × 10−12, respectively, Fisher's exact test). However, most pSNVs occur in phosphosite‐flanking regions (Figure 1F). These may enhance or disrupt sequence recognition and binding affinity of associated reader, writer or eraser proteins, and modify signaling systems while still maintaining the phosphorylation switch. Finally, genes with phosphosite pSNVs are enriched in known drug targets from the DrugBank database (n=109, P=2.5 × 10−11; Knox et al (2011)). Druggable genes with pSNVs cover nearly two‐thirds (284) of phospho‐mutated cancer samples and 36% of all cancer samples, indicating the potential for therapy development targeting affected genes.
ActiveDriver—a novel gene‐centric method to identify significantly mutated protein sites
We developed a novel statistical method, named ActiveDriver, to identify genes with significant pSNVs (see Materials and methods). This gene‐centric method is based on generalized linear regression and tests the following pair of hypotheses for a given gene and its phosphosite region. The null (expected) model states that the phosphosite region follows the same mutation rate as the gene sequence given its structured and disordered regions. The alternative model states that the phosphosite‐specific mutation rate is higher or lower than the gene‐wide mutation rate. The two models are compared with a likelihood ratio test that refutes the null hypothesis if a distinct mutation rate is required to explain pSNVs observed in the phosphosite region. We use a model‐based approach rather than a direct statistical test, as short phosphosites tend to involve low mutation counts and numerous considered sites would reduce statistical power due to multiple testing. To establish the significance of a gene in a cancer type, we multiply P‐values of all its significantly mutated phosphosites (P⩽0.05). A gene is deemed significant if at least one of its phosphosites displays unexpected mutation rates, while its non‐significant sites do not contribute to the final composite P‐value. Finally, we correct the composite P‐value for multiple testing with the Benjamini–Hochberg false‐discovery rate (FDR) and select genes whose P‐values exceed the significance of a certain threshold (FDR P⩽0.05).
ActiveDriver considers information on pSNV position within a phosphosite, protein structured and unstructured regions and cancer type specific mutation rate. First, the functional impact of each pSNV is determined by the number of adjacent phosphosites and their position relative to the mutation. Direct mutations of S/T/Y are likely to have stronger functional impact than immediate and proximal flanking mutations, and single pSNVs can affect multiple clustered phosphosites. We encode this information in three features of the alternative regression model: for every position s in the phosphosite region, we count (i) the number of phosphosites at s (zero or one), and the number of phosphosites (ii) within 1–2 residues from s, and (iii) within 3–7 residues from s. Second, post‐translational modifications are known to occur in unstructured (disordered) regions of proteins, and such regions are also believed to evolve more rapidly than structured regions. Therefore, we consider separate mutation rates for disordered and ordered protein regions, and encode this information as confounding factors in the null and alternative regression models. Third, cancer genomes of different types are biologically distinct and involve varying sample sizes (e.g., 4–186 samples, leukemia versus ovarian cancer), baseline mutation rates (e.g., 0.74–11 missense mutations per million amino acids, pancreatic versus lung cancer), and different mutation calling protocols. Cancer‐specific mutation rates are therefore compared in independent models, and corrected separately for multiple testing. Such data integration provides greater sensitivity than conventional statistical tests, as we show below.
ActiveDriver identifies known cancer genes with significantly mutated phosphosites
ActiveDriver analysis of pSNVs for nine separate cancer data sets resulted in 44 genes with significantly unexpected numbers of phosphosite mutations (FDR P⩽0.05, Figure 2A). We repeated the analysis with a merged collection of SNVs from all cancer types and found 14 additional phospho‐mutated genes.
The results are enriched in known cancer genes from the Cancer Gene Census and earlier review papers (n=15, P=1.1 × 10−11, Fisher's exact test), genes encoding kinase proteins (e.g., KDR, EGFR, ABL1, FLT4; n=9, P=3.8 × 10−5), and transcription factors (e.g., TP53, FOXO3, MLL, MET; n=15, P=3.0 × 10−4). The gene list also includes 13 drug targets (P=1.3 × 10−3), nine of which are known cancer genes (Knox et al, 2011). Ranked pathway enrichment analysis of ActiveDriver results using g:Profiler (Reimand et al, 2011) highlights relevant pathways and Gene Ontology (GO) categories, including cell motility, regulation of cell adhesion, regulation of epithelial cell proliferation, and blood vessel development (all FDR P⩽0.05, Supplementary Table 6). Thus, ActiveDriver‐identified genes are enriched in cancer hallmark processes.
Our predictions include 15 genes with well‐established roles in cancer biology. For instance, β‐catenin (CTNNB1) encodes a transcriptional co‐regulator and downstream target of the Wnt pathway involved in organism development (FDR P=9.8 × 10−4 from ActiveDriver). In the absence of Wnt signaling, CTNNB1 is phosphorylated by GSK3 kinase and continuously degraded. Wnt pathway activation blocks the phosphorylation of CTNNB1, leading to its accumulation in the nucleus and transcriptional activity (van Noort et al, 2002). Aberrant activity of CTNNB1 due to altered N‐terminal target phosphosites of GSK3 has been observed in colorectal cancer (Morin et al, 1997). In our data, all five SNVs in CTNNB1 are specifically linked to phosphorylation. ActiveDriver highlights the N‐terminal region of CTNNB1 that includes phosphosites S33, S37, T41 targeted by multiple kinases (CDK6, CHUK, CSNK1/2A/2B, GSK3A/3B, MAPK8, PRKACA; Figure 2B). These sites are affected by four pSNVs (two in lung and two in liver cancer), including three phosphorylation‐disabling mutations (T41A, S37F, S37F) and one potentially phosphomimetic mutation G34E. An additional pSNV in ovarian cancer (G555A) affects three phosphosites that are known targets of AKT1, AKT2 and PRKACA kinases. Consistent with previous observations (Morin et al, 1997), we propose that lung and liver cancer pSNVs in CTNNB1 disrupt its degradation and enable downstream transcriptional programs to benefit cancer progression. In this example, a small number of very specific pSNVs appear in multiple types of cancer and have a statistically significant pattern with a biologically meaningful interpretation. Our strategy is therefore useful for interpreting rare mutations.
About one‐third (21) of detected genes have phosphosite‐specific mutations in multiple cancer types (Figure 2A), potentially highlighting general cancer driver mechanisms. As integration approaches are less explored in cancer genomics studies, most of our additional findings from the merged data set analysis represent novel candidate cancer genes. Further, IRS1 (insulin receptor substrate 1) and GRM1 (glutamate receptor, metabotropic 1) are known drug targets (Knox et al, 2011) and therefore may be directly actionable for therapy development. KRAS is the only well‐recognized cancer gene in the merged analysis and is listed due to a less‐than‐expected number of pSNVs (FDR P=4.6 × 10−3 from ActiveDriver), indicating the role of phosphorylation signaling in its oncogenic activities. These results illustrate the utility of data integration from multiple cancer genomics projects.
ActiveDriver highlights phospho‐mutated candidate cancer genes FLNB, GRM1, POU2F1
Our findings also include genes whose cancer‐specific roles are less recognized. The highest‐ranking candidate cancer gene is filamin‐B (FLNB, FDR P=5.4 × 10−7 from ActiveDriver) that has been highlighted in the breast cancer genome project due to frequent mutations (Sjoblom et al, 2006). All four SNVs in breast cancer (A1565G, T703K, N663K, R566Q) alter phosphosite‐flanking regions, although evidence for phosphorylation events comes from large‐scale screens and no targeting kinases are currently known. FLNB is an intracellular signaling protein involved in organization of actin cytoskeleton as well as skeletal and neuronal development (Lu et al, 2007). In particular, knockdown of FLNB has been shown to inhibit VEGF‐induced angiogenesis (Del Valle‐Perez et al, 2010), a hallmark of tumor cells. Further, knockdown of filamin genes has been shown to reduce cell migration in cancer cell lines (Baldassarre et al, 2009), and germline variants observed in human developmental disorders have been linked to gain‐of‐function phenotypes manifested in increased F‐actin binding (Sawyer et al, 2009). We therefore speculate that the observed pSNVs may also act as gain‐of‐function mutations that enhance angiogenesis, invasion or metastasis of tumor cells.
The G‐protein‐coupled receptor GRM1 (glutamate receptor metabotropic 1) is identified in the merged data set analysis of pSNVs due to two flanking pSNVs in glioblastoma (R684C) and colorectal cancer (R696W) (FDR P=0.047 from ActiveDriver). The latter mutation directly flanks a phosphorylation site of protein kinase C α (PRKCA) at T695 that is involved in receptor desensitization in a feedback loop (Francesconi and Duvoisin, 2000). We propose that the pSNV disrupts inhibition of the receptor activity, leading to aberrant activation of tumorigenic processes such as growth, survival, and proliferation via the downstream phosphoinositide 3‐kinase (PI3K) pathway. Overexpression of GRM1 was shown to induce melanoma in mouse models (Pollock et al, 2003), and GRM1 mutations were linked to melanoma in a human genetic association study (Ortiz et al, 2007). The family of metabotropic glutamate receptors is also generally enriched in SNVs across all cancer samples (n=16 samples, P=3.2 × 10−3, Poisson exact test), suggesting the importance of GPCR signaling in tumor biology. GRM1 is potentially actionable, as it is the target of several drugs according to Drugbank (e.g., acamprosate, 4‐(1‐amino‐1‐carboxy‐ethyl)‐benzoic acid; Knox et al, 2011).
A final example is POU2F1 (FDR P=0.024 from ActiveDriver), a POU domain transcription factor and a cell‐cycle regulator that undergoes phosphorylation‐mediated inhibition of DNA‐binding activity during mitosis (Segil et al, 1991). POU2F1 is highlighted in our model for two pSNVs in glioblastoma (R296Q) and in breast cancer (S111F). The latter mutation directly modifies a phosphosite targeted by PRKDC kinase, involved in promoting cell survival in response to DNA damage (Schild‐Poulter et al, 2007).
The detection of many known cancer genes validates our approach and the additional highlighted genes, some known to be druggable, provide novel hypotheses for detailed, functional experiments.
ActiveDriver is complementary to frequency‐based methods of global mutation significance
Conventional cancer genomics analysis focuses on frequently mutated cancer genes instead of the long tail of genes with rare mutations. In contrast, our ‘gene‐centric’ model considers each individual gene and detects signaling sites whose mutations are unexpected given the gene‐wide mutation rate. We therefore find many genes that harbor infrequent, albeit highly specific mutations that are missed when considering mutation frequency alone (Figure 2C). In particular, our results include nine known cancer genes not found among the top 58 genes ranked by mutation frequency (median rank 319, Supplementary Figure 3).
State of the art global mutation assessment methods such as MutSig compute significance of gene mutations by comparing these with baseline mutation rates estimated from whole genomes or exomes (Banerji et al, 2012). To compare ActiveDriver with this general approach, we implemented a simple global strategy similar to MutSig using a standard binomial statistical test. This global strategy highlighted only the four most frequently mutated genes as statistically significant (TP53, KRAS, PTEN, EGFR; FDR P⩽0.05, Supplementary Figure 4). ActiveDriver identified many more, and also found seven genes with significant pSNVs that have less mutations than expected from the genome average of the corresponding cancer type (PRKCI, PHF3, KTN1, MLL, AKAP13, MET, CDK11A). Thus, genes with specific and significant phosphosite mutations would remain unseen in a global analyses.
We further re‐implemented ActiveDriver using binomial statistics in place of our disorder‐corrected regression model, with all other factors unchanged, to test the effectiveness of our model versus standard methods. This simplified strategy only found five highly mutated genes as statistically significant (EGFR, TP53, IDH1, KRAS, FLNB; FDR P⩽0.05), the first four of which are cancer genes (P=2.0 × 10−6, Fisher's exact test). As all these were also found with ActiveDriver, we conclude that modeling protein disorder and phosphosite position is important for estimating cancer gene significance. Our method is therefore more sensitive than standard approaches, as it highlights 11 additional cancer genes and many novel candidates with highly specific phosphosite mutations.
Phosphosite mutations of TP53 correlate with extended survival in ovarian cancer and glioblastoma
The tumor suppressor transcription factor TP53 is the top‐ranking gene in our pSNV analysis (167 pSNVs, FDR P=7.6 × 10−86 from ActiveDriver). It is an active phosphoprotein and a substrate of 43 kinases with 29 phosphosites in nine regions. ActiveDriver revealed a statistically significant mosaic of phosphosites and pSNVs (Figure 3A), including four hotspot regions of phosphosites that are enriched in pSNVs in eight cancer types.
To explore the functional and clinical significance of pSNVs, we studied the survival rates of corresponding glioblastoma and ovarian cancer patients. We found that phosphosite‐associated TP53 mutations significantly correlate to increased survival among ovarian cancer patients (log‐rank test P=7.2 × 10−3, Cox regression P=0.025, Figure 3B). The survival correlation is evident even in comparison to patients with wild‐type TP53, suggesting that such pSNVs might be beneficial for tumor suppression. A similar correlation between pSNVs and better prognosis is seen among glioblastoma patients with long‐term survival (>1 year), although its statistical significance is low due to small sample sizes (log‐rank test P=0.066, Cox regression P=0.17, Figure 3C). In agreement with these data, phosphorylation of T155, S183, S269, T284 by casein and aurora kinases has been associated with TP53 inhibition via post‐translational degradation and transcriptional repression (Bech‐Otschir et al, 2001; Wu et al, 2011). The highlighted pSNVs in 118 patients potentially inhibit phosphorylation of these sites, meaning that TP53 will no longer be degraded if mutated. Taken together, our data suggest a double‐negative mechanism in which phosphorylation‐mediated inhibition of TP53 is potentially disrupted by pSNVs, leading to reduced inhibition of TP53 tumor suppressor function and increased survival.
In contrast to the above mutation hotspots, phosphosites in TP53 termini appear as highly significant mutation deserts (three pSNVs observed, 67±8 expected, P=2.8 × 10−30 from ActiveDriver). The observed negative selection indicates the importance of these regions as tumorigenic signaling interfaces. The N‐terminus of TP53 is the interaction interface of its primary inhibitor MDM2, and TP53 phosphorylation of S15 and S20 inhibits MDM2 binding and leads to stabilization and activation of TP53 (Chehab et al, 1999). The absence of mutations in the region suggests that the sequence is required for successful docking of MDM2 and inhibition of apoptosis. This example demonstrates the utility of ActiveDriver in interpreting somatic mutations in known cancer genes.
Pathway analysis of phosphomutated genes reveals cancer hallmarks and predicts novel driver systems
Next, we performed a pathway analysis to find systems of functionally related genes with frequent pSNVs. We assumed that frequent recurrence of cancer mutations in the same biological system is unlikely unless the system is involved in cancer. We searched for pSNV‐enriched systems using GO categories (Ashburner et al, 2000), Reactome pathways (Matthews et al, 2009), and human protein complexes from the CORUM database (Ruepp et al, 2010). To reduce bias from extremely highly mutated genes, we excluded TP53 and EGFR, which together cover 39% (181) of the 460 phosphomutated tumor samples.
Our initial pathway analysis revealed nearly 2000 GO terms and pathways with statistically significant enrichment in mutated cancer samples (FDR P⩽0.05), many of which match known hallmarks of cancer (Hanahan and Weinberg, 2011) (Supplementary Figure 6). However, such broad themes are also expected in a pathway analysis of standard mutation frequency‐ranked gene lists. To identify functional categories specifically affected by phosphorylation mutations, we repeated the pathway enrichment analysis considering all SNVs and subtracted the list of 1590 categories detected in both SNV and pSNV analyses (all FDR P⩽0.05). This produced a final set of 400 phospho‐specific enriched categories that are not found by analyzing all SNVs (Supplementary Table 7).
The top 50 pSNV‐focused GO categories highlight multiple interesting functional themes (Figure 4A). The highest‐ranking categories are innate immune response (n=52 samples, FDR P=3.7 × 10−4, Poisson exact test), cytokines, and specific pathways for Toll signaling (n=20, FDR P=1.6 × 10−4, Figure 4B) and IκB‐NFκB cascade (n=36, FDR P=4.9 × 10−6), all of which highlight the importance of altered phosphorylation in immune signaling. Avoiding immune destruction and establishing tumor‐promoting inflammation are emerging hallmarks of cancer and important therapeutic target systems (Hanahan and Weinberg, 2011). While such pathways are generally expected to be enriched in cancer mutations, the majority of specific pSNVs are infrequent and therefore likely less understood. pSNVs appear in multiple cancer types and affect different components of the same system, suggesting that similar strategies could be applied for drug development. Mutations in some pathways may already be actionable due to existing drugs against known or candidate cancer genes (Figure 4C). For instance, 23 patients associated with the innate immune response category carry mutations in 17 druggable genes such as HCK, SRPK2, MAPK9, UBC and HLA‐A. Thus, pathway analysis of pSNVs may be useful in developing personalized drug treatments based on known drugs.
Our analysis also reveals 21 non‐overlapping protein complexes and 17 Reactome pathways with frequent pSNVs (FDR P⩽0.05, Figure 4C and D). The identified pathways are often related to cancer hallmarks such as cell‐cycle regulation, but not always. For example, the HCF1 complex involves six phospho‐mutated genes in ovarian, liver, and brain cancer patients: two transcriptional regulators (SP1, HCFC1), a histone methyltransferase (ASH2L) and three druggable heat shock proteins (HSP90AA1, HSP90AB1, HSPA8). HCF1 selectively modulates chromatin structure and promotes cell proliferation by transcriptional activation and repression (Wysocka et al, 2003). Mutations in this complex may therefore lead to aberrant expression of cell‐cycle genes and initiation or enhancement of malignant growth.
As another example, the ASF1 chaperone complex regulates chromatin assembly during DNA replication (Groth et al, 2005). The ASF1 complex is highlighted due to four component genes with five pSNVs: the tumor suppressor kinase CHEK2 and three H3 histones with pSNVs in tail regions (HIST1H3A, HIST1H3D, HIST1H3H). Interestingly, H3 histone phosphorylation at S10 has dual roles in chromatin condensation and transcriptional activation during different cell‐cycle phases (reviewed by Nowak and Corces, 2004), and we see two flanking pSNVs in this site (R9G, R3C). These mutations may also affect other modifications, in particular acetylation and methylation that act as regulatory switches of transcriptionally active chromatin structure. pSNVs in the ASF1 complex could therefore drive cancer using two mechanisms: by altering DNA replication to introduce further mutations in cancer genomes, and by modifying gene expression during cell cycle to increase proliferation.
In summary, functional analysis of pSNVs reveals known and putative cancer driver pathways and complexes, not apparent from standard global analysis of all somatic mutations. All identified complexes, functions and pathways are affected by mutations in multiple cancer types, indicating that our approach is useful for uncovering general mechanisms of tumor biology.
Network analysis of pSNVs reveals hierarchically organized cancer kinases
To identify additional signaling systems important in cancer, we analyzed pSNVs in the kinase–substrate network. The network contains information from many proteomics experiments and is therefore complementary to the known pathways and complexes analyzed above. Multi‐kinase signaling systems are enriched in mutations, as 74% of 725 kinase–kinase phosphorylation interactions are altered in cancer (P=1.9 × 10−25, Fisher's exact test). These are expected to include driver mutations, as alterations of inter‐regulator signaling likely affect multiple downstream processes.
To study this further, we constructed kinase‐centric network modules containing three types of proteins: a central kinase, its downstream substrates and upstream kinases. The modules for all interacting kinases were studied with pSNV enrichment tests to identify frequently phospho‐mutated signaling systems. This analysis revealed 59 kinase‐centric modules with surprisingly high pSNV frequency (FDR P⩽0.05; Figure 5A; Supplementary Table 8). Many central kinases are associated with cancer, validating our analysis method (n=15, P=1.4 × 10−11, Fisher's exact test). Further, many kinases are also known drug targets, suggesting that they are potential avenues for therapy development (n=23, P=2.7 × 10−10).
We grouped the modules according to frequency of pSNV mutations among upstream and downstream interaction partners and revealed a hierarchy of five distinct kinase classes. ‘Master kinases’ are not highly phospho‐mutated themselves, but have numerous pSNVs in downstream substrates (e.g., SRC, AKT1, ABL1). ‘Middle manager kinases’ include pSNVs in the gene of the central kinase as well as its upstream and downstream interaction partners (STK11, MET, ITK). ‘Local kinases’ mostly have pSNVs in genes of downstream substrates (CDK6, ERBB2), while ‘managed kinases’ are characterized by upstream pSNVs (FLT4, BCR). ‘Self‐employed’ signaling systems of KDR and MAPK6 mostly involve mutations in genes of central kinases. Thirteen modules involve pSNVs in feedback loops where the mutated protein is both upstream and downstream of the central kinase.
Analysis of kinase‐centric modules provides phosphorylation‐associated interpretation of recurrent mutations in well‐established cancer genes. For example, the majority (56%=29) of EGF receptor (EGFR) missense point mutations associate with phosphorylation and appear in a tissue‐specific, mutually exclusive pattern (Figure 5B). The kinase domain of EGFR is characterized by 15 lung cancer pSNVs (FDR P=4.9 × 10−8 from ActiveDriver), including the well‐studied L858R mutation that affects EGFR autophosphorylation and is used as a clinical marker for therapeutic outcome (Sharma et al, 2007). In contrast, 14 SNVs in glioblastoma are associated to EGFR phosphorylation sites, suggesting that the mutations play a role in post‐translational activation of this oncogene. Eleven pSNVs alter the residue A289 that directly flanks a novel putative phosphosite T290 (Ruse et al, 2008) in the extracellular domain of the protein (FDR P=8.5 × 10−16 from ActiveDriver, Figure 5C). While extracellular phosphorylation mechanisms are generally poorly understood, a recent study has characterized a novel family of secreted kinases with a role in human disease (Tagliabracci et al, 2012). Further study of the role of phosphorylation at this site may help explain the mechanism of highly recurrent glioblastoma mutations.
The master kinase PKC‐zeta controls a frequently phospho‐mutated tumor suppressor network
Protein kinase C zeta (PRKCZ) is the master kinase with the strongest enrichment of pSNVs among its interaction partners (n=30 samples, FDR P=1.2 × 10−6, Poisson exact test; Figure 5D). This kinase functions in the PI3K and MAPK pathways and is involved in multiple cellular functions, including cell cycle and proliferation, cell polarity, NFκB signaling and inflammation. While these functions relate to hallmark cancer pathways, the direct role of Protein Kinase C zeta in tumor biology is less established.
Our network analysis suggests that the kinase is involved in tumor‐inhibiting signaling systems, as it directly phosphorylates tumor suppressors PTEN and STK11 and may indirectly affect signaling of TP53 and LATS1 via STK11. Altogether, the PRKCZ‐STK11 network module involves 44 pSNVs in 19 genes and 7 cancer types. When considering the additional 157 pSNVs in TP53, every fourth tumor in our data set involves aberrant signaling in this system. Mutation‐ranked functional enrichment analysis of genes in the network reveals categories such as cell cycle (FDR P=5.0 × 10−5 from g:Profiler), cell differentiation (FDR P=1.0 × 10−10), and protein phosphorylation (FDR P=3.8 × 10−29). These functions are consistent with the proposed roles of PRKCZ. In particular, the master kinase network involves a number of frequently mutated downstream kinases that demonstrate the extent of altered signaling in tumor cells. The network is also enriched in specific cancer‐related signaling pathways such as WNT (FDR P=1.2 × 10−3), MAPK (FDR P=0.023), mTOR (FDR P=7.0 × 10−15), and NGF (FDR P=5.5 × 10−9). PRKCZ has 11 human paralogs and the significant enrichment of pSNVs and SNVs indicates the importance of PKC signaling in cancer (n=15 pSNVs, P=7.6 × 10−13 and n=27 SNVs, P=1.2 × 10−9, respectively). While PRKCZ is not a well‐known drug target, multiple inhibitors are available for its immediate upstream kinase PDPK1.
Heuristic network search in the kinase–substrate network identifies signaling modules associated with increased patient survival in ovarian cancer
To explore the clinical relevance of our findings, we studied correlation of pSNVs with patient survival data available in the ovarian cancer genome project (Cancer Genome Atlas Research Network, 2011). First, we repeated the pSNV enrichment analysis for pathways and networks separately for ovarian cancer, and interrogated the resulting 552 GO terms, pathways, and protein complexes for correlations with survival. This revealed seven significant GO categories, however, all are related to survival‐associated pSNVs in TP53 (Figure 3B), and no significant correlations were found after removing TP53 from the data set.
To discover modules in the kinase–substrate network that correlate with clinical outcome, we designed a local network search algorithm, called HyperModules, that extends concepts from earlier methods (Chuang et al, 2007; Reimand et al, 2008; Altmae et al, 2011). HyperModules starts from a single mutated ‘seed’ protein and searches its interaction neighborhood for paths of length two to find other proteins that correlate with a given clinical variable (e.g., survival). Paths are merged into larger modules if they improve the correlation, and the search stops once no further improvements are found. Module correlation with survival is assessed with an age‐weighted Cox proportional hazards regression model and a likelihood ratio test. We also compute the significance of our findings by evaluating the distribution of Cox P‐values expected from the network. Statistical significance of a module is assessed using a permutation test in which the search is repeated in 10 000 random neighborhoods of the seed node. Each random network precisely reflects the topology of the original seed‐centered network neighborhood, while node labels (protein names) along with associated pSNV sets are shuffled globally over the entire kinase–substrate network. This strategy is useful for two reasons. First, permutations maintain protein‐based survival correlations and break down signals originating from kinase–substrate interactions, therefore highlighting real modules where such interactions are important. Second, the strategy corrects for topological biases like hub proteins with large network neighborhoods, as such neighborhoods produce strong survival correlations even with permuted mutations and are therefore considered less significant. Finally, to test modules originating from a particular seed, we discard modules whose survival significance is similar to P‐values obtained from random networks (FDR P⩽0.05).
We searched for survival correlations across all 164 mutated proteins in the network and found 16 significant signaling modules (FDR P⩽0.05, permutation test, Supplementary Table 9). One of our top‐ranking modules associates with increased survival and describes mutations in eight patients who were all alive at the end of the ovarian cancer study (Figure 6A). This is highly significant according to the Cox survival regression model used in the search (P=4.3 × 10−5, Figure 6B). In addition, the permutation test shows that such a survival P‐value is unlikely to be found in randomly mutated networks (P=2.2 × 10−3, FDR P=0.015, Figure 6C). The module comprises eight mutually exclusive pSNVs in phosphosites and kinase domains of eight proteins (Figure 6D). Correlations with other clinical variables further support the module: alive patients and tumor‐free patients are enriched in the module (Fisher's exact P=6.8 × 10−4 and P=8.0 × 10−3, respectively), while subjects of additional chemotherapy are depleted (P=8.4 × 10−3, Figure 6E–G).
GO enrichment analysis links the survival module to immune response‐activating signal transduction and locomotion, among other terms (FDR P⩽10−4 from g:Profiler). The network seed is the hemopoietic cell kinase HCK with a highly specific mutation E410K in its kinase domain that potentially disables the adjacent autophosphorylation site at Y411, required for kinase activity (Porter et al, 2000). HCK is involved in immune signaling and cell proliferation in hematopoietic cells and is linked to cancer. High levels of HCK are associated with drug resistance in chronic myeloid leukemia, and its constitutively active isoforms induce solid tumors in mice (Poincloux et al, 2009; Pene‐Dumitrescu and Smithgall, 2010). These published findings, the likely inactivating mutation in HCK and the observed mutation pattern suggest that the pSNVs work to disable HCK signaling for the benefit of patient survival. The positive survival correlation is not surprising, as other cancer mutations with positive prognosis are known, for example NPM1 mutations in acute myeloid leukemia (Verhaak et al, 2005).
While the output of our network search algorithm is not directly applicable for predicting clinical outcome, as it is challenged by infrequent mutations and the small‐world property of interaction networks, it is useful as an exploratory tool that helps discover and interpret rare, specific mutations in signaling networks that are significantly correlated with clinical outcome.
Many cancer genes are discovered because of high mutation frequency in tumor samples. Our analysis considers more detailed information in the form of SNVs that specifically target experimentally determined phosphosites and kinase–substrate interactions. As a result, we are able to highlight potential cancer genes that would otherwise remain hidden in the long tail of rare mutations. Pathway and network analysis highlights similar patterns at higher levels of cellular organization. The observed enrichment of known cancer genes in our results validates the analysis, and lends confidence to less‐studied genes and systems that we predict to be important in tumor development.
A major challenge in genomics is the functional interpretation of mutations. Functional mutations affect important protein residues, but traditional mutation evaluation methods generally focus on gene and protein specific information, such as amino‐acid conservation and DNA sites of alternative splicing (Jordan et al, 2010). We can uncover additional information about functional mutations by considering protein sites related to molecular interactions. Protein interaction interfaces are often involved in cell signaling; therefore, significant mutations in these sites are likely to be functional and important in disease. A large class of protein sites in cell signaling are short linear motifs bound by peptide recognition domains (Pawson and Nash, 2003). The writers, readers, and erasers of the phosphorylation machinery recognize phosphorylated motifs, but many other protein domains perform similar functions using other types of sites (Pawson and Nash, 2003). For instance, SH3 domains recognize proline‐rich motifs and the histone code has its own set of reader, writer, and eraser domains for post‐translational acetylation and methylation. Further, transcription factors recognize regulatory sites at the DNA level. These data are now increasingly available for constructing binding site resolution molecular interaction networks of many human proteins (Chua et al, 2004; Dinkel et al, 2012; Reimand et al, 2012; Wang et al, 2012). Such networks allow functional interpretation of mutations with much higher levels of precision than currently possible. For instance, we can identify mutations that alter signaling networks by disrupting or creating binding sites. Our methods are a step in this direction and our future work will include a wider array of signaling domains and their binding motifs.
Our current analysis involves several limitations. First, we only analyzed missense point mutations that are the simplest to interpret, and excluded other mutation types. Considering the short‐read DNA sequencing techniques employed by cancer genomics projects, these types of mutations are also likely to be the most accurate. Our methods can be extended to include other mutation types in the future. Second, our analysis treats all phosphosites equally, but phosphorylation is only functional in a cell if the appropriate signaling machinery is properly co‐expressed and co‐localized. Also, our phosphorylation site data set is derived from a collection of publications, represents a mixture of different cell types, tissues and experimental conditions, and is not specific to the cancer types we study. Cancer‐specific phosphoproteomics experiments will hopefully be available in the future to address this limitation. We currently address this limitation by assigning a higher score to genes with phosphorylation sites that are mutated multiple times in different samples and also mutated in multiple sites within the same protein. We thus assume that the highest scoring proteins from our analysis are more likely to be cancer relevant due to the repeated and statistically significant mutation pattern. Finally, mutations in transcriptionally active genes are more likely to be functional than mutations in silent genes, and mutation calls combined with RNA sequencing of corresponding samples therefore reveal an ‘active’ set of mutations. Mutation filtering will increase the precision of our methods once these data become more widely available. Our models can also consider additional factors that indicate the importance of phosphorylation and other functional sites, such as site conservation (Tan et al, 2009), stoichiometry, and number of kinases targeting a site.
Protein phosphorylation machinery, in particular the protein kinase family, is an important drug development target, and several agents such as kinase inhibitors are routinely used in the clinic. Phosphorylation‐specific cancer mutations are therefore likely to highlight potential drug targets and may be useful as predictive markers of drug response, or to identify alternate agents in primary‐drug‐resistant tumors.
We present a wealth of hypotheses for cancer‐related discovery, and several novel analysis methods for interpreting cancer genomes. Our approaches will become more powerful as data accumulate from the expanding efforts to decipher cancer genomes.
Materials and methods
Experimentally determined phosphorylated residues and their flanking regions of ±seven residues (referred to as phosphosites) and phosphosite‐associated kinases were retrieved from three curated public databases: PhosphoSitePlus (Hornbeck et al, 2012), PhosphoELM (Dinkel et al, 2011), and Human Protein Reference Database (HPRD) (Keshava Prasad et al, 2009) (all downloaded on 30 November 2011). Phosphosites were mapped to high‐confidence protein sequences from the Consensus Coding Sequence (CCDS) database (Pruitt et al, 2009) (build 20110907, NCBI build 37.3). We mapped phosphopeptides to CCDS sequences using exact sequence matching to avoid discrepancies between protein isoforms, and discarded non‐matching peptides. Phosphosites with overlapping flanking sequences were merged into continuous regions. Kinase domains were retrieved from the HPRD database and mapped to protein sequences using a similar procedure. We used HGNC symbols provided by all three phosphosite databases, and retrieved all corresponding isoforms from the CCDS data set after removing 1380 sequences with non‐public status and 16 pseudoautosomal genes. We selected the longest isoform of every protein and discarded the remaining isoforms from further analysis.
Somatic mutations in cancer genomes
Somatic mutations from nine cancer genomics projects (Wood et al, 2007; Cancer Genome Atlas Research Network, 2008, 2011; Ding et al, 2008; Jones et al, 2008; Parsons et al, 2008; Puente et al, 2011; Totoki et al, 2011) were downloaded from the International Cancer Genome Consortium (ICGC) data portal (The International Cancer Genome Consortium (2010), version 4–6, downloaded on 30 November 2011). A subset of mutations matching the human genome build 36 were mapped to build 37 with the LiftOver software of the UCSC Genome Browser. We discarded all non‐coding mutations, silent mutations, multi‐nucleotide substitutions, insertions and deletions from all data sets, and retained only non‐synonymous, missense SNVs. Further, all mutations in a given gene and sample were discarded if they contained a mutation that created a premature stop codon or removed the existing stop codon.
Phosphorylation‐associated SNVs (pSNVs) comprised SNV modified phosphosites or kinase domains. Phosphosite‐associated pSNVs involved mutations that directly modified central, phosphorylated serine (S), threonine (T), and tyrosine (Y) residues (i.e., direct pSNVs), as well as mutations flanking the central S/T/Y within seven residues. Kinase‐associated pSNVs were mapped to domains from HPRD. We compared phosphosite‐associated pSNVs with single‐nucleotide polymorphisms (SNPs) published by the 1000 Genomes Project Consortium (2010) and found a small set of 12 loci that overlap, representing 1.5% of unique missense pSNVs.
Global analyses of phosphosite mutations
Global enrichment of SNVs in phosphosites was determined using Fisher's exact test, by considering the protein‐coding sequence length of all involved cancer samples as the background. The background was corrected for each cancer data set independently, as some studies sequenced all genes and some sequenced a focused set of genes. We considered all genes described in the publication and any additional genes indicated in the ICGC data sets.
Global phosphorylation network properties were evaluated with the non‐parametric Wilcoxon test, by comparing the degree distribution of mutated and non‐mutated proteins in the kinase–substrate network. To assess protein centrality, we compared the betweenness centrality distribution of mutated and non‐mutated proteins in the network with the IGraph R package (Csardi and Nepusz, 2006). An edge in the phosphorylation network was considered mutated if either the kinase or substrate participants had at least one mutation in the kinase domain or any phosphosite. The enrichment of mutated kinase–kinase edges in the phosphorylation network was determined with Fisher's exact test, by taking the collection of all kinase–substrate interactions as the statistical background.
The high‐confidence collection of 555 cancer genes was defined as the union of genes described in four review papers (Mitelman, 2000; Hahn and Weinberg, 2002; Futreal et al, 2004; Vogelstein and Kinzler, 2004), as collected in the databases of Cancer Genes (Higgins et al, 2007), and the Cancer Gene Census (Futreal et al, 2004) (downloaded on 15 December 2011). We also compared our list of phosphomutated genes with coding mutations from the Catalog Of Somatic Mutations In Cancer (COSMIC; Forbes et al (2010), downloaded on 5 April 2012) and found that 99% were expectedly present as COSMIC contains data from all published cancer genomics projects. The list of 1870 human transcription factors was published by Vaquerizas et al (2009). The list of 546 human kinases was compiled using kinase domain information from the HPRD database (Keshava Prasad et al, 2009) as well as kinase–substrate interactions from PhosphositePlus, PhosphoELM and HPRD. The set of 1658 genes corresponding to known approved, experimental, and illicit drug targets excluding nutraceuticals were retrieved from the DrugBank database (Knox et al (2011), downloaded on 16 October 2012). Paralogs for gene family analysis were retrieved from Ensembl 69. Statistical analyses with these lists were performed with the Fisher's exact test, by considering the final set of unique CCDS genes as the background. Protein disorder predictions were performed with DISOPRED2 (Ward et al, 2004), using a local installation of the software and the NCBI BLAST 2.26 NR data bank (downloaded on 21 June 2012).
ActiveDriver analysis of phosphosite mutations
We developed and applied a generalized linear regression model, called ActiveDriver, to find phosphosites whose mutations are unexpected given the background mutation rate. The model assumes that missense mutations in a sequence of a particular gene and cancer type follow the Poisson probability distribution where y⩾0, y∊N is the observed number of SNVs, and λ>0, λ∊R is the missense mutation rate of the protein sequence of the gene. The rate parameter λ corresponds to the expected number of mutations per residue E(Y)=λ, as well as its variance Var(Y)=E(Y2)−E(Y)2=λ, where Y=y1,…,yn is the vector of mutation counts per residue in the protein sequence of n residues.
In a data set of n samples, generalized linear regression models express the dependency between a response vector Y=y1,…,yn and k predictor vectors X1=x11,…,xn1;…;Xk=x1k,…,xnk, as where β=β1…βk is a vector of model coefficients and β0 is the model intercept, X=X1;…;Xk is the predictor matrix and XTβ is the dot product of the transposed predictor matrix and model coefficients, and g() is a link function for non‐linear transformation. Poisson regression expresses the dependency between response and predictors through the Poisson distribution, as where Y∼Po(λ) and the link function g()=ln() corresponds to the natural logarithm.
The likelihood of a Poisson regression model given data is computed as the product of Poisson probabilities of response values yi of samples indexed with i, given corresponding values xij of predictors indexed with j, as where Xi* is the vector of k predictor values xi1,…,xik for a given sample i. The log likelihood of the model is equivalent and more efficient in practice:
Maximum likelihood estimation (MLE) is used to find model coefficients that give the best agreement (i.e., smallest absolute log likelihood) between response and predictor values,
The factorial term yi! does not involve coefficients and can be discarded from the estimation. While no analytical solutions exist for the above equation, the Iteratively Reweighted Least Squares algorithm is used to find optimal coefficients and a corresponding MLE value
In our approach, we used Poisson regression to test whether a specific phosphosite region in a given gene involves a significantly different mutation rate than the gene on average. According to our null hypothesis, mutations across the whole protein sequence of the gene follow the Poisson distribution with the intercept coefficient reflecting background rate β0 linearly combined with a structure parameter X(s) and corresponding coefficient β(s). The null hypothesis is expressed as the following intercept‐only model in which X(s) is set to one if the sequence position corresponds to disordered protein sequence, and equals zero if the corresponding region is structured (non‐disordered). According to our alternative hypothesis, the mutations in the phosphosite region q are generated by rates λ that are significantly different from the baseline rate λ0 while considering protein disorder:
In addition, our alternative model may include information about phosphosite density if it significantly improves model fit. The information is expressed in three additional predictor variables X(d),X(v),X(w) that are set to zero in protein sequence positions outside the phosphosite region of interest, and otherwise encode relative phosphosite position within the region. Specifically, xi(d) equals one if position i of protein sequence encodes a phosphosite, and zero otherwise. The value xi(v) is set to express the number of nearby phosphosites within a flanking region of ±(1…2) residues around sequence position i. Similarly, the value xi(w) expresses the number of distant phosphosites within a region of ±(3…7) residues. These variables are added to the model one by one using a forward stepwise model selection procedure that evaluates extended models with the Akaike Information Criterion (AIC): where is the MLE value and ν is the number of degrees of freedom (number of model coefficients). At every step, adjacency‐encoding predictors are added to the model one by one, corresponding AIC values are computed, and the predictor that provides the greatest increase in AIC is added to the model. If no additional factors are sufficient for improvement, the original alternative hypothesis is used for significance estimation.
To compare the null h0 and alternative h1 models, we use MLE to determine model likelihoods at optimal coefficients, and compute the deviance statistic, as
A high deviance statistic indicates that the alternative model of phosphosite‐specific mutation rates fits the observed mutations considerably better than the null model of gene‐wide mutation rate. The deviance statistic approximately follows the χ2 distribution with degrees of freedom. The log‐likelihood ratio test is used to estimate the statistical significance of deviance and the null hypothesis is refuted if the P‐value of the likelihood ratio test is P⩽0.05.
The alternative hypotheses in ActiveDriver are tested separately for every phosphosite region, gene and cancer type. The cancer type‐specific composite P‐value for a given gene is computed as a product of significant P‐values (only including P⩽0.05) for all phosphosite regions. We subsequently correct composite P‐values with the Benjamini–Hochberg FDR separately for each cancer type. Genes with no phosphosites or no cancer SNVs are discarded prior to modeling and not included in the multiple testing procedures.
To identify proteins with significant kinase domain mutations, we implemented a version of ActiveDriver that considers a simplified alternative hypothesis for pSNV enrichment detection, as where xi,(d) is set to one if position i of protein sequence encodes a kinase domain and zero otherwise. All other factors of our method remain the same. We used the simplified model to analyze the subset of kinase proteins with SNVs, and found that only EGFR involves a significant enrichment of kinase domain‐specific pSNVs in lung cancer samples.
Comparison of ActiveDriver results with alternative approaches
Cancer genes are traditionally identified based on high mutation frequency in tumors. To compare our method to the traditional approach, we first compared the genes ranked by ActiveDriver to all genes with missense mutations ranked according to their mutation frequency (number of missense point mutations) in all studied cancer data sets.
Second, we compared gene mutation rates with global (exome‐wide) mutation rates using the binomial statistic. We computed independent gene‐based mutation rates for each cancer type using distinct background rates and only genes and cancer samples sequenced in corresponding projects. The two‐tailed binomial test was used to assess the number of observed missense mutations in a gene, given the total number of missense mutations and total sequence length (amino‐acid positions times number of relevant samples). For a particular cancer type, all sequenced genes were tested with the binomial statistic and corrected for multiple testing (FDR P⩽0.05).
Third, we designed a simplified gene‐centric binomial statistic to test phosphosite mutation rates in comparison to the Poisson regression model in ActiveDriver. Except for the statistical model as well as protein disorder and phosphosite density integration, all other aspects of the method remained the same. Similarly to the global method, the two‐tailed binomial test was used to evaluate number of missense mutations in a particular phosphosite region of a gene of interest, given the number of missense mutations and total protein sequence length. The final P‐value for the gene was computed as the product of significant P‐values from site‐specific tests (only including P⩽0.05), followed by multiple testing correction independently for each cancer type.
Pathway enrichment analysis of phosphosite mutations
Pathway enrichment analysis of phosphosite mutations involved testing a given gene list for enrichment of biological process, molecular function and cell component annotations from GO (Ashburner et al, 2000), curated human biological pathways from Reactome (Matthews et al, 2009), and mammalian protein complexes from the CORUM database (Ruepp et al, 2010). We also studied human disease genes from the Human Phenotype Ontology (Robinson and Mundlos, 2010), microRNA target genes from MicroCosm (Griffiths‐Jones et al, 2008) and transcription factor target genes from TRANSFAC (Matys et al, 2006); however, these data sets provided no significant categories after multiple testing correction. Each type of gene set was analyzed and corrected for multiple testing separately. All functional annotations except protein complexes were retrieved from g:Profiler (Reimand et al, 2011). We discarded small gene sets with less than three genes and general sets with >1000 genes, and did not use KEGG pathway information due to strong bias towards well‐annotated cancer genes (e.g., the ‘pathways in cancer’ pathway). Ordered functional enrichment analyses were carried out with the g:Profiler software.
Pathway enrichment analysis was conducted across all cancer types simultaneously. The enrichment P‐value was computed using a one‐sample one‐tailed Poisson exact test, using a null hypothesis of uniform background mutation rate across all genes λ0=m/n, where m and n reflect counts of all SNVs and genes, respectively. The P‐value of seeing mp or more mutations in a pathway of np genes, given the background gene mutation rate λ0, is computed as one minus the sum of all less extreme events:
Gene sets covering only a single mutated sample, or gene sets with mutations in a single gene were discarded prior to statistical testing. P‐values were corrected for multiple testing with Benjamini–Hochberg FDR and filtered for significant results (FDR P⩽0.05). We considered the set of genes with kinase domains or phosphosites as the statistical background, after excluding EGFR and TP53 because of their excessive mutation rate in phosphosites (181 samples) and in general (340 samples). The analysis was carried out on the full collection of SNVs to identify gene sets with significant sample coverage, and also with only pSNVs, finally keeping only terms uniquely significant in the pSNV analysis. The non‐specific SNV analysis used all unique genes in the CCDS data set as statistical background, while the pSNV analysis involved a more stringent set of genes with kinase domains or phosphosites, both excluding EGFR and TP53. The analysis focusing on all SNVs additionally excluded KRAS from significance tests, since it introduced a functional bias due to SNVs in 187 samples (KRAS involves no pSNVs and is highlighted in our model due to pSNV depletion). Gene sets from GO and Reactome were filtered to reduce redundancy, by keeping the most significant term among those with a common ancestor. For significant CORUM protein complexes that covered the exact same samples, only the complex with the strongest P‐value was retained in the analysis. A separate sequence of permutation tests was carried out for ovarian cancer samples to identify gene sets with clinical correlation.
Kinase‐specific subnetwork analysis
Experimentally determined kinase–phosphosite interactions were retrieved from three public databases PhosphoSitePlus (Hornbeck et al, 2012), Phospho ELM (Dinkel et al, 2011) and HPRD (Keshava Prasad et al, 2009). Kinase‐specific subnetworks include (i) a central kinase, (ii) all downstream substrates of the central kinase, and (iii) upstream kinases phosphorylating the central kinase. Cancer samples were considered to be mutated in subnetworks if they contained at least one mutation in any kinase domain or phosphosite in subnetwork genes. Gene sets covering only a single mutated sample, or gene sets with mutations in a single gene were discarded prior to statistical testing. For evaluating statistical significance of pSNV enrichment, we used a more stringent statistical background of all genes in the kinase–substrate network. We employed the Poisson tests described above, and identified kinases with more than the expected number of pSNV mutations (FDR P⩽0.05). As above, direct mutations of TP53 and EGFR where excluded from the analysis and calculation of background mutation rate. Hierarchical classes of kinases were defined as follows. Master regulators are kinases whose modules involved >20 samples with pSNVs (i.e., two‐fold median sample coverage over all significant modules) and at least two‐thirds of pSNVs occurring downstream of the central kinase. Local kinases, managed kinases and self‐employed kinases involved modules with at least two‐thirds of pSNVs in downstream targets, upstream targets, and central kinases, respectively. The middle manager class covered the remaining kinases.
Discovery of clinically correlated modules in the kinase–substrate network
Clinical annotation of ovarian cancer samples (Cancer Genome Atlas Research Network, 2011) was retrieved from the Cancer Genome Atlas (TCGA, downloaded on 21 December 2011). We only included the patients that were screened for SNVs as the test set, and excluded patients with missing values in the particular annotation type tested. To test patient survival, we used a regression model involving age, vital status, time of follow‐up and mutation status in the given module.
We developed a greedy algorithm to search the kinase–substrate network for signaling modules that cover pSNVs that are maximally informative of a given clinical outcome. We defined statistical objective functions to determine the best steps in the search, comparing (a) the clinical signal in the group of patients defined by a set of pSNVs in a given signaling module and (b) the clinical signal in the remaining patient cohort. For survival analysis, the objective function uses the Cox Proportional Hazards (PH) framework of generalized linear regression, as where λ0 reflects constant positive baseline risk, t corresponds to survival time, X is the matrix of predictor variables and β is the vector of model coefficients. As the null hypothesis, survival is modeled in a univariate regression model with patient age as the confounding predictor. The alternative hypothesis includes an additional indicator predictor that reflects the mutation status of each sample in the given module. The significance of survival difference between the module‐related group and the remaining cohort is estimated in the likelihood ratio test by comparing the fit and degrees of freedom of the null model and the alternative model.
The network search starts from a given gene in the kinase–substrate network as the ‘seed’ and constrains the search space to the neighborhood of distance d=2 from the seed. First, the algorithm extracts all paths of length two, each comprising the seed and two additional proteins. The set of paths is then filtered to reduce the search space, so that only paths linking to additional pSNV mutations are considered. Paths are then iteratively merged into modules containing two or more paths. At each merging iteration, all possible pairwise combinations of modules and/or paths are considered for merging, and the pair with the greatest improvement in clinical correlation according to Cox regression is merged into one module. Merging stops when no such improvement steps are available.
The statistical significance of identified modules in a given seed neighborhood is assessed using 10 000 permutations to evaluate the non‐random association between network topology, number of proteins included and corresponding pSNVs. The interaction topology of the neighborhood is retained during permutation, while node labels (protein names) are shuffled globally across the entire kinase–substrate network. This strategy preserves the gene‐based correlation with survival, while disrupting the correlation originating from multiple closely interacting proteins. The search algorithm is executed on the set of random neighborhoods with shuffled labels, and the resulting modules and clinical correlations (Cox P‐values) provide the null model for statistical significance estimation. The significance of each module is computed as the fraction of P‐values of randomly found modules that have an equivalent or better clinical correlation Cox P‐value. The P‐values of all modules originating from a single seed are then filtered with multiple testing correction (FDR P⩽0.05). To find all survival‐associated modules, we searched the network with every mutated protein as a seed.
To validate survival modules, we also compared related patients and mutations with other types of clinical information using Fisher's exact test, including vital status, neoplasm status, tumor stage, and additional chemotherapy. To form larger sample groups for statistical analysis, we simplified tumor stage classification by removing alphabetical subclasses, resulting in three stages (II, III, IV). Besides age‐adjusted Cox regression, modules were validated using alternative significance P‐values of survival correlation, namely likelihood ratio tests with unadjusted Cox regression and log‐rank tests. Clinical correlation tests for ovarian pSNV‐enriched GO categories, pathways, protein complexes and individual genes were performed in a similar manner, followed by multiple testing correction and filtering (FDR P⩽0.05). Survival correlation of TP53 pSNVs in ovarian cancer was identified using Cox regression and log‐rank tests. A separate set of tests was used for the subset of glioblastoma patients with long‐term survival (follow‐up time >1 year).
R Source code of ActiveDriver is available at the website www.baderlab.org/Software/ActiveDriver.
We thank Shirley Hui, Leopold Parts, Michael D Taylor and Liis Uusküla for critical comments on the manuscript; Ian Clarke, Peter Dirks, Mona Meyer, Jason Moffat, Robert Rottapel, Andrea Uetrecht and all members of the Bader lab for useful discussions; and Ruth Isserlin for providing drug target gene sets. This work was supported by the Canadian Institutes of Health Research grant MOP‐84324 and the National Resource for Network Biology (NRNB) under award numbers P41 RR031228 and GM103504. Computational resources for large‐scale simulations were provided by the High Performance Computing Centre at the University of Tartu, Estonia.
Author contributions: JR designed the study, analyzed the data, implemented the algorithms, and wrote the first manuscript. GDB designed and supervised the study. Both authors wrote and approved the final manuscript.
Conflict of Interest
The authors declare that they have no conflict of interest.
Supplementary Online Material
Supplementary Figures 1–5 [msb201268-sup-0001.pdf]
Supplementary Tables 1–9 [msb201268-sup-0002.zip]
This is an open‐access article distributed under the terms of the Creative Commons Attribution License, which permits distribution, and reproduction in any medium, provided the original author and source are credited. This license does not permit commercial exploitation without specific permission.
- Copyright © 2013 EMBO and Macmillan Publishers Limited