A cancer-type-specific model should also be able to distinguish the relevant cancer type among driver mutations in human tumors. We therefore used a literature-curated database (OncoKB; Chakravarty et al., 2017) to annotate oncogenic mutations in an independent cohort of 10,000 patients whose tumors were sequenced on a targeted gene panel (MSK-IMPACT; Zehir et al., 2017) (STAR Methods). We compared performance on four cancer types (breast invasive ductal carcinoma [BRCA], glioblastoma multiforme [GBM], high-grade serous ovarian cancer [OV], and colon adenocarcinoma [COAD]), because these overlapped with cancer types found in the TCGA and because CHASMplus, CHASM, and CanDrA had cancer-specific models for them. CHASMplus had a significantly higher auROC than all other methods for each of the cancer types (p < 0.05; DeLong test; Figure 2B; Table S2). In general, neither CanDrA nor CHASM showed consistent improvements over ParsSNP and REVEL.
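As a minimal sketch of the metric used in this comparison, the auROC of a method can be computed as the fraction of (driver, non-driver) mutation pairs in which the driver receives the higher score, with ties counted as one half. The function name `auroc` and the toy scores below are illustrative assumptions, not data or code from the study, and the DeLong test used to compare auROCs between methods is not implemented here.

```python
def auroc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    labels: 1 for a mutation annotated as oncogenic, 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # driver ranked above non-driver
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical scores for six mutations from two hypothetical methods.
toy_labels = [1, 1, 1, 0, 0, 0]
method_a = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
method_b = [0.6, 0.3, 0.4, 0.5, 0.7, 0.1]
auc_a = auroc(method_a, toy_labels)
auc_b = auroc(method_b, toy_labels)
```

The pairwise form is quadratic in the number of mutations, so a rank-based implementation would be preferable for a cohort of this size; it is used here only because it makes the definition explicit.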
Last, we reasoned that distinguishing cancer-type specificity of driver mutations within the same gene would be an even harder task to accomplish. It has been previously documented that lung adenocarcinoma (LUAD) missense mutations in EGFR appear predominantly in its kinase domain, while GBM missense mutations appear in its extracellular domain (Brennan et al., 2013; Ji et al., 2006; Paez et al., 2004; Porta-Pardo et al., 2017). We therefore scored TCGA missense mutations in the gene EGFR from LUAD patients and from GBM patients with CHASMplus, CanDrA, and CHASM (Figure 2C). The CHASMplus GBM model correctly scores the missense mutations from GBM patients significantly higher than those from LUAD patients (p = 0.004; two-sided t test), and vice versa for the CHASMplus LUAD model (p = 0.003; two-sided t test). In contrast, the CHASM GBM model and the CHASM LUAD model both score the mutations from LUAD patients higher than those from GBM patients (p = 1e-5 and 5e-5, respectively, two-sided t test). CanDrA does not have a LUAD model, but its GBM model scores mutations from LUAD patients higher than those from GBM patients (p = 0.0002, two-sided t test), which is significant in the wrong direction. Both REVEL and ParsSNP showed no significant differences in scores between GBM and LUAD (p = 0.99 and 0.53, respectively, two-sided t test).
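The direction of such a comparison can be illustrated with a short sketch that computes a two-sample t statistic for scores from two groups of mutations. The text does not state whether equal variances were assumed, so the unequal-variance (Welch) form below is an assumption, as are the function name `welch_t` and the toy scores. A positive statistic means the first group scored higher on average; obtaining the two-sided p value additionally requires the t distribution, which a statistics library would provide.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for two samples.

    Assumes unequal variances (Welch); this choice is an illustration only.
    """
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two scores")
    va = statistics.variance(a) / na   # squared standard error, group a
    vb = statistics.variance(b) / nb   # squared standard error, group b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical scores from a GBM-specific model; not data from the study.
gbm_scores = [0.8, 0.7, 0.9, 0.6]    # EGFR mutations from GBM patients
luad_scores = [0.3, 0.4, 0.2, 0.5]   # EGFR mutations from LUAD patients
t_stat, dof = welch_t(gbm_scores, luad_scores)
```

With these toy values the statistic is positive, which is the direction expected of a GBM-specific model; a model that is "significant in the wrong direction" would instead give a large negative statistic.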
In summary, several lines of evidence suggest that CHASMplus, relative to other methods, has a substantial advantage in distinguishing between driver and passenger missense mutations, specifically by cancer type.
CHASMplus Improves Pan-cancer Identification of Driver Missense Mutations
In contrast to cancer-type-specific approaches, many excellent methods already exist for pan-cancer analysis, in which mutations from many cancer types are aggregated (Cancer Genome Atlas Research et al., 2013). This approach is useful because some cancer-driver mutations do occur in many cancer types. Aggregation increases the power to detect these mutations, particularly when they occur at low frequency in many cancer types.
We sought to evaluate whether CHASMplus would also perform well in a pan-cancer analysis, where mutations from all cancer types were modeled together. Because of the greater breadth of relevant methods, we were able to conduct a much broader comparison, and a larger number of published benchmarks were available. We compared CHASMplus to 12 methods that span different computational approaches, including those that predict protein functional damage or pathogenicity, and selected meta-predictors (aggregates of multiple methods), such as M-CAP (Jagadeesh et al., 2016), REVEL (Ioannidis et al., 2016), and ParsSNP (Kumar et al., 2016), based on performance in a recent comparative study (Ghosh et al., 2017). We compared these methods on 5 benchmarks, which fall under three broad categories: in vitro experiments, a high-throughput in vivo screen, and curation from published literature. Each of these categories has weaknesses, but, in aggregate, they span multiple scales of evaluation and types of supporting evidence (Figure 2D). For example, several benchmarks were limited to one or a few well-established driver genes, while others were exome-wide but lacked experimental support. A range of