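Tumour and healthy ICGC RNA-seq data included in the PCAWG consortium5 were aligned to the human reference genome (GRCh37.p13) using two read aligners: STAR58 (v.2.4.0i, two-pass), performed at MSKCC and ETH Zürich, and TopHat2 (ref. 59; v.2.0.12). Both tools used Gencode (release 19)60 as the reference gene annotation. For the STAR two-pass alignment, an initial alignment run was performed on each sample to generate a list of splice junctions derived from the RNA-seq data. These junctions were then used to build an augmented index of the reference genome for each sample. In the second pass, the augmented index was used for a more sensitive alignment. Alignment parameters were fixed to the values reported in https://github.com/ICGC-TCGA-PanCancer/pcawg3-rnaseq-align-star. The TopHat2 alignment strategy also followed the two-pass principle, but was performed in a single alignment step with the respective parameter set. For the TopHat2 alignments, the iRAP analysis suite61 was used. The full set of parameters is available together with the alignment code at https://hub.docker.com/r/nunofonseca/irap_pcawg/. For both aligners, the resulting files in BAM format were sorted by alignment position and are available for download from the GDC portal (https://portal.gdc.cancer.gov/) and the ICGC Data Portal (https://dcc.icgc.org/). Individual accession numbers and download links can be found in the PCAWG data release table: http://pancancer.info/data_releases/may2016/release_may2016.v1.4.tsv. Cancer-type abbreviations are listed in Supplementary Table 23. Histology was derived from an older version of the annotation of the PCAWG pathology and clinical correlates working group. The assignment of donors to histologies used in this study can be found at https://dcc.icgc.org/releases/PCAWG/transcriptome/metadata.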
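Quality control of all datasets was performed at three main levels: (1) assessment of the initial raw data using FastQC62 (v.0.11.3) (Supplementary Fig. 4); (2) assessment of the aligned data (percentage of mapped and unmapped reads for both alignment approaches); and (3) quantification (by correlating the expression values produced by the STAR- and TopHat2-based expression pipelines) (Supplementary Fig. 2). In total, we defined six quality-control criteria to assess the quality of the samples. We marked a sample as a candidate for exclusion if: (1) 3 of the 5 main FastQC measures (base-wise quality, k-mer overrepresentation, guanine–cytosine content, content of N bases and sequence quality) did not pass; (2) more than 50% of reads were unmapped or fewer than 1 million reads could be mapped in total using the STAR pipeline; (3) more than 50% of reads were unmapped or fewer than 1 million reads could be mapped in total using the TopHat2 pipeline; (4) we measured a degradation score63 greater than 10; (5) the fragment count in the aligned sample (averaged over STAR and TopHat2) was <5 million; and (6) the correlation between the expression counts of both pipelines was <0.95. If a sample did not pass one of these six criteria, it was marked as problematic and placed on a greylist. If more than two criteria were not passed, we excluded the sample.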
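The six criteria can be read as a simple decision rule. The following is a minimal sketch rather than the pipeline's actual code: the field names are hypothetical, "3 of the 5 FastQC measures" is read as "at least 3", and a sample failing exactly two criteria is kept on the greylist, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    # All field names are illustrative, not taken from the PCAWG pipeline.
    fastqc_failed: int            # how many of the 5 main FastQC measures failed
    star_unmapped_frac: float     # fraction of reads left unmapped by STAR
    star_mapped_reads: int
    tophat_unmapped_frac: float   # fraction of reads left unmapped by TopHat2
    tophat_mapped_reads: int
    degradation_score: float
    mean_fragments: float         # fragment count averaged over both aligners
    pipeline_correlation: float   # STAR versus TopHat2 expression counts


def failed_criteria(s: SampleQC) -> int:
    """Count how many of the six quality-control criteria a sample fails."""
    checks = [
        s.fastqc_failed >= 3,     # assumption: "3 of 5" means at least 3
        s.star_unmapped_frac > 0.5 or s.star_mapped_reads < 1_000_000,
        s.tophat_unmapped_frac > 0.5 or s.tophat_mapped_reads < 1_000_000,
        s.degradation_score > 10,
        s.mean_fragments < 5_000_000,
        s.pipeline_correlation < 0.95,
    ]
    return sum(checks)


def classify(s: SampleQC) -> str:
    """Return 'pass', 'greylist' or 'exclude' for one sample."""
    n = failed_criteria(s)
    if n == 0:
        return "pass"
    if n > 2:
        return "exclude"
    return "greylist"  # assumption: one or two failed criteria
```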
A subset of 722 libraries from the projects ESAD-UK, OV-AU, PACA-AU and STAD-US was identified as technical replicates generated from the same sample aliquot. These libraries were integrated post-alignment for both the STAR and the TopHat2 pipelines using samtools64 into combined alignment files. Further analysis was based on these files. Read counts of the individual libraries were integrated to a sample-level count by adding the read counts of the technical replicates.
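Summing replicate counts to the sample level amounts to a grouped addition. A minimal sketch, assuming per-library counts are held in dictionaries keyed by gene and that a library-to-sample mapping is available (both data structures are illustrative):

```python
from collections import defaultdict


def merge_replicates(library_counts, library_to_sample):
    """Add the read counts of technical replicates into sample-level counts.

    library_counts: {library_id: {gene_id: read_count}}
    library_to_sample: {library_id: sample_id}
    """
    sample_counts = defaultdict(lambda: defaultdict(int))
    for library, counts in library_counts.items():
        sample = library_to_sample[library]
        for gene, count in counts.items():
            sample_counts[sample][gene] += count
    # Convert nested defaultdicts to plain dictionaries for the caller.
    return {sample: dict(genes) for sample, genes in sample_counts.items()}
```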
Initially, a total of 2,217 RNA-seq libraries were fully processed by the pipeline. Quality-control filtering and integration of technical replicates (722 libraries) gave a final number of 1,359 fully processed RNA-seq sample aliquots from 1,188 donors.
For a panel of RNA-seq data from a variety of healthy tissues, data from 3,274 samples from GTEx (phs000424.v4.p1) were used and analysed with the same pipeline as PCAWG data for quantifying gene expression. A list of GTEx identifiers is provided at https://dcc.icgc.org/releases/PCAWG/transcriptome/metadata.
STAR and TopHat2 alignments were used as input for HTSeq65 (v.0.6.1p1) to produce gene expression counts. Gencode v.19 (ref. 60) was used as the gene annotation reference. Quantification on a per-transcript level was performed with Kallisto66 (v.0.42.1). This implementation is available as a Docker container at https://hub.docker.com/r/nunofonseca/irap_pcawg. The implementations of the STAR and TopHat2 quantification are available as Docker containers at https://github.com/ICGC-TCGA-PanCancer/pcawg3-rnaseq-align-star and https://hub.docker.com/r/nunofonseca/irap_pcawg/, respectively. Quantification of consensus expression was performed by taking the average expression based on STAR and TopHat2 alignments. Gene counts were normalized by adjusting the counts to FPKM67 as well as to FPKM with upper quartile normalization (FPKM-UQ), in which the total read count in the FPKM definition is replaced by the upper quartile of the read count distribution multiplied by the total number of protein-coding genes.
The FPKM and FPKM-UQ calculations were as follows. $\mathrm{FPKM} = (C \times 10^{9})/(N L)$, in which $N$ denotes the total fragment count to protein-coding genes, $L$ denotes the length of the gene and $C$ denotes the fragment count. $\mathrm{FPKM\text{-}UQ} = (C \times 10^{9})/(U L G)$, in which $U$ denotes the upper quartile of fragment counts to protein-coding genes on autosomes unequal to zero, and $G$ denotes the number of protein-coding genes on autosomes.
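The two normalizations can be written directly from these definitions. This is a sketch, not the pipeline code: the function names are illustrative, and the quartile convention (Python's "inclusive" method) is an assumption, because the text does not say how the upper quartile was interpolated.

```python
import statistics


def fpkm(count, length, total_fragments):
    """FPKM = C * 10^9 / (N * L), with N the total protein-coding fragment count."""
    return count * 1e9 / (total_fragments * length)


def upper_quartile_nonzero(counts):
    """Upper quartile U of the non-zero fragment counts.

    Assumption: 'inclusive' quantile interpolation; needs >= 2 non-zero counts.
    """
    nonzero = [c for c in counts if c > 0]
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]


def fpkm_uq(count, length, upper_quartile, n_genes):
    """FPKM-UQ = C * 10^9 / (U * L * G), with G the number of genes considered."""
    return count * 1e9 / (upper_quartile * length * n_genes)
```

In use, `counts` passed to `upper_quartile_nonzero` would hold only autosomal protein-coding genes, and `n_genes` would be the number of such genes.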
The t-distributed stochastic neighbour embedding (t-SNE) plots in Supplementary Figs. 5 and 6 were produced using the Rtsne package68 (with a perplexity value of 3) based on the Pearson correlation of the aggregated expression (log-transformed after adding 1) of the 1,500 most variable genes. FPKM expression values per gene were aggregated (median) by tissue (GTEx) and study (PCAWG). The coefficient of variation for each gene was also computed per tissue (GTEx) and study (PCAWG) to determine the 1,500 most variable genes. Purity values were previously described69.
The t-SNE plot in Extended Data Fig. 17c is based on all exon-skipping events in protein-coding genes confirmed by SplAdder70. Each event was quantified in both the PCAWG and GTEx cohorts. All events for which more than 1% of percent-spliced-in (PSI) values were missing across the concatenated PCAWG and GTEx samples were removed. The remaining missing values were imputed as the mean over the non-missing samples. The centred data were then visualized using the TSNE class from the scikit-learn toolkit71 with a perplexity value of 100, random state 0 and an initialization with PCA.
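The preprocessing before the embedding (missingness filter, mean imputation, centring) can be sketched as follows. The data layout (one list of PSI values per event, `None` for missing) and centring each event on its own mean are assumptions made for illustration; the embedding step itself is omitted.

```python
def prepare_psi(events, max_missing_frac=0.01):
    """Filter, impute and centre PSI values.

    events: list of events; each event is a list of PSI values across samples,
    with None marking a missing value.
    Returns the retained events, mean-imputed and centred per event.
    """
    prepared = []
    for row in events:
        observed = [v for v in row if v is not None]
        missing_frac = 1 - len(observed) / len(row)
        if not observed or missing_frac > max_missing_frac:
            continue  # drop events with too many missing PSI values
        mean = sum(observed) / len(observed)
        # Imputing with the observed mean leaves the event mean unchanged,
        # so imputed entries become exactly 0 after centring.
        prepared.append([(mean if v is None else v) - mean for v in row])
    return prepared
```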
To associate genetic variation with gene expression, we analysed whole-genome sequencing (WGS) of the 1,188 donors with matched whitelisted RNA-seq data from the PCAWG cohort. Germline genotypes, SNV calls and segmented allele-specific SCNA calls were previously reported5. We matched 1,188 tumour RNA-seq IDs5 to WGS whitelist tumour IDs (synapse entry syn10389164). For patients with multiple WGS IDs (2 out of 1,188) or RNA-seq aliquot IDs (17 out of 1,188), we resolved the matching by pairing samples with the same ‘tumor_wgs_submitter_specimen_id’ (Supplementary Table 1). The 1,188 patients are spread across 27 types of cancer and 29 project codes and include 899 carcinomas; 34 patients are metastatic and 13 recurrent with the remaining patients being primary tumours (Supplementary Table 1).
We used the data of these 1,188 patients for performing somatic and germline eQTL mapping, ASE analysis and association studies between gene expression and mutational signatures.
Gene expression values (measured in FPKM; https://dcc.icgc.org/releases/PCAWG/transcriptome/gene_expression) from consensus expression quantification as described above were used for this analysis.
Genes with FPKM ≥ 0.1 in at least 1% of the patients (12 patients) were retained, resulting in 47,730 genes. Only 18,898 protein-coding genes (according to the ‘gene_type’ biotype reported in Gencode v.19 (ref. 60)) were used for the subsequent QTL analyses. The log2-transformed expression values (FPKM + 1) were subjected to peer analysis72 to account for hidden covariates (syn7850427; https://dcc.icgc.org/releases/PCAWG/transcriptome/eQTL/phenotype). To balance the number of covariates, statistical power and available sample sizes per cancer type, we followed the GTEx protocol and estimated 15, 30 and 35 hidden covariates to be used depending on sample size73 (n < 150, 150 ≤ n < 250, n ≥ 250). Peer residuals were then rank-standardized across patients. The FPKM cut-off values and peer correction were also applied to the subset of 899 patients with carcinoma, yielding 18,837 protein-coding genes after filtering. Furthermore, we used ordinary least-squares regression to correlate each of the 35 peer factors with per-sample covariates, including cancer project codes, gender, tumour purity, somatic burden and several sequence metrics (Supplementary Notes), to understand the proportion of variance explained by known biological and technical covariates.
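The expression filter and the sample-size-dependent number of hidden covariates reduce to a few threshold rules. A minimal sketch with illustrative function names; the hidden-factor estimation itself is not reproduced here.

```python
import math


def min_patients(n_patients, percent=1):
    """Smallest patient count that reaches `percent` of the cohort (rounded up)."""
    return (n_patients * percent + 99) // 100  # integer ceiling, no float error


def keep_gene(fpkm_values, min_fpkm=0.1, percent=1):
    """Retain a gene if FPKM >= min_fpkm in at least `percent` of the patients."""
    expressed = sum(1 for v in fpkm_values if v >= min_fpkm)
    return expressed >= min_patients(len(fpkm_values), percent)


def log2_fpkm(value):
    """log2(FPKM + 1) transform applied before hidden-covariate estimation."""
    return math.log2(value + 1)


def n_hidden_covariates(n_samples):
    """Number of hidden covariates by sample size: 15, 30 or 35."""
    if n_samples < 150:
        return 15
    if n_samples < 250:
        return 30
    return 35
```

With 1,188 patients, `min_patients` gives 12, and with the 899 patients with carcinoma it gives 9, matching the patient counts quoted in the text.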
In all linear models, we accounted for known confounding factors by modelling them as fixed effects. In all association studies, we accounted for sex, project code (describing cancer type and country of origin) and per-gene copy-number status (Supplementary Table 1 for the list of per patient covariates; syn7253568 and syn7253569 for sex and project codes; syn9661460 for per gene copy number). Per-gene copy-number alterations were derived as the average copy number across all copy-number aberrations called within the annotated gene boundaries based on syn8042988.
The somatic eQTL, ASE and mutational signature analyses also accounted for total somatic mutation burden (number of SNVs and short insertions and deletions (indels)) and sample purity (Supplementary Table 1). Purity was estimated based on copy-number segmentation. In addition, the somatic eQTL and ASE analyses accounted for local SNV burden calculated in a 1-Mb window from the gene coordinates (https://dcc.icgc.org/api/v1/download?fn=/PCAWG/transcriptome/eQTL/covariates/pergene.somatic.snv.cis.burden.1188.wl.donors.tsv.gz).
The germline eQTL analysis also modelled the population structure as a random effect. The population structure was assessed by a kinship matrix that was calculated based on every twentieth germline variant, processed as described below (see ‘Germline eQTL variants’). The kinship matrix was then calculated as an empirical patient-by-patient covariance matrix.
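An empirical patient-by-patient covariance matrix from thinned, standardized genotypes can be sketched as below. The input layout and the choice of which variant starts the thinning (here the first) are assumptions; the actual analysis used the limix package.

```python
def kinship(standardized_variants, step=20):
    """Empirical patient-by-patient covariance from every `step`-th variant.

    standardized_variants: list of variants; each variant is a list holding one
    standardized genotype value (mean 0, s.d. 1) per patient.
    """
    subset = standardized_variants[::step]  # assumption: start at the first variant
    n_patients = len(subset[0])
    k = [[0.0] * n_patients for _ in range(n_patients)]
    for variant in subset:
        for i in range(n_patients):
            for j in range(i, n_patients):
                k[i][j] += variant[i] * variant[j] / len(subset)
    # Fill the lower triangle so that the matrix is symmetric.
    for i in range(n_patients):
        for j in range(i):
            k[i][j] = k[j][i]
    return k
```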
Different covariates were accounted for per-analysis method (Supplementary Table 1). The project code describes cancer type and country-of-origin. Somatic burden is the total number of SNVs and indels. Purity was estimated based on copy-number segmentation. Local somatic burden is the number of SNVs in a 1-Mb window around the gene coordinates. Local copy number was defined as the average copy-number state across all SCNAs called within the annotated gene boundaries.
We performed GO74,75 and Reactome pathway20,21 enrichment with the Bioconductor packages biomaRt76,77, clusterProfiler78 and ReactomePA79 (FDR ≤ 10%). The number of genes used as background set is described per analysis method.
PCAWG variant calls v.0.15 were downloaded from GNOS and processed following the PCAWG-8 protocol: (1) VCF files were indexed and merged using bcftools80. (2) All variants were filtered for the ‘PASS’ flag. (3) All variants were filtered for quality larger than 20. (4) Only bi-allelic sites were considered.
HDF5 files for each 100-kb chunk of the VCF files were generated, assuming additivity that was numerically encoded as 0, 1 or 2 for homozygous reference, heterozygous or homozygous alternative state, respectively. For indels, we encoded the absence or presence of the variant as 0 or 1, respectively. Each variant was normalized to mean 0 and standard deviation 1. Missing variants were mean-imputed. To create our eQTL release set v.1.0, the resulting HDF5 files were subsequently merged into a global HDF5 file and all variants that met either of the following conditions were removed: (1) minor allele frequency ≤ 1%; or (2) missing values ≥ 5%.
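For additively encoded SNVs, the filtering, imputation and standardization steps can be sketched as follows. The function names are illustrative, the use of the population standard deviation is an assumption, and the 0/1 encoding of indels is not covered.

```python
import statistics


def passes_filters(dosages, min_maf=0.01, max_missing=0.05):
    """Keep a variant only if MAF > 1% and fewer than 5% of values are missing.

    dosages: 0/1/2 alternative-allele counts per patient, None if missing.
    """
    observed = [d for d in dosages if d is not None]
    missing_frac = 1 - len(observed) / len(dosages)
    if not observed or missing_frac >= max_missing:
        return False
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1 - alt_freq) > min_maf


def standardize_variant(dosages):
    """Mean-impute missing values, then scale to mean 0 and s.d. 1.

    Returns None for a variant without variation across patients.
    """
    observed = [d for d in dosages if d is not None]
    mean = sum(observed) / len(observed)
    filled = [mean if d is None else d for d in dosages]
    sd = statistics.pstdev(filled)  # assumption: population standard deviation
    if sd == 0:
        return None
    return [(d - mean) / sd for d in filled]
```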
In the germline eQTL analyses, we used the processed gene expression dataset from 1,178 patients for which germline variant calls (eQTL release set v.1.0, see ‘Germline eQTL variants’) were available. Linear mixed models were used to model the correlation between germline variants (within 100 kb of gene boundaries) and gene expression values (see ‘Gene expression filtering’) using the limix package81. Known covariates were modelled as fixed effects and population structure as random effect (see ‘Covariates’).
A two-step approach was used to adjust for multiple testing. First, for each gene, we adjusted for the number of independent tests estimated based on local linkage disequilibrium82. Second, we performed a global correction across the lead variants, that is, the most significant SNPs, per eQTL. Germline eGenes were defined as genes with an eQTL with global FDR ≤ 5%.
The GTEx comparative eQTL analysis was based on the eQTL maps v.6p (ref. 10). We mapped the positions and alleles of our PCAWG-specific eQTL to the eQTL in all GTEx tissues. To determine whether a lead eQTL variant is replicated in a given GTEx tissue, we followed the previously described strategy10. For each eGene, we considered the eQTL lead variant and assessed the replicability of the signal in the GTEx cohort based on marginal association statistics using 42 GTEx tissues without cell lines (P < 0.00024 = 0.01/42, that is, corrected for the 42 GTEx tissues). If the lead variant did not replicate or was not tested, we determined replication based on the variant with the smallest P value within the linkage disequilibrium block (r² ≥ 0.8, estimated based on the UK10K project) of the lead variant across 25 (or 42) tissue-matched GTEx analyses. If neither the lead variant nor any variant within the linkage disequilibrium block was tested, we determined replication based on the smallest P value of any variant within the 100-kb window tested within the GTEx cohort. We also derived less stringent sets of PCAWG-specific eGenes by allowing replication in up to 1, 5 or 10 GTEx tissues.
Using the R package qvalue (https://github.com/StoreyLab/qvalue, v.2.14.0), we generated π1 statistics comparing the lead variants of one histotype against their P value distribution in the other histotypes. Because π1 statistics are known to be confounded by sample size and number of eQTL found, we subsampled the eQTL lead variants to a randomly selected set of 100 variants. After 20 rounds of subsampling, we derived the same π1 statistics as mentioned earlier and reported the average.
For each lead variant, we generated a matching background set of 1,000 variants using SNPsnap83. Each variant (background and foreground) was intersected with the location of 25 Roadmap factors16 in 127 cell types. From this we derived fold changes and P values. The significance of differences in fold change between PCAWG-specific and unspecific eQTLs was assessed with a one-sided Wilcoxon rank-sum test.
Enrichment of Reactome pathways of PCAWG-specific eGenes was performed using the Bioconductor package ReactomePA79.
We used the set of consensus somatic SNV calls provided by PCAWG (syn7357330) based on three core caller pipelines and MuSE84. On average, we counted 22,144 somatic SNVs per patient, with different median numbers of SNVs per cancer type, ranging from 1,139 in thyroid adenocarcinoma to 72,804 SNVs in skin melanoma (Extended Data Fig. 5a). Owing to the low frequency of somatic SNVs across the cohort (Extended Data Fig. 5b), we collapsed the variants by genomic regions defined by gene annotations (Gencode v.19 (ref. 60)). Specifically, we generated a set of disjoint gene exons by collapsing overlapping exon annotations into single features using bedtools85. The set of disjoint introns was generated using bedtools by subtracting the collapsed exonic regions from the gene regions. To map local effects of somatic mutations in flanking features outside the gene body, we binned the surrounding regions (plus and minus 1 Mb from the gene boundaries) into 2-kb windows (flanking) overlapping by 1 kb.
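The flanking bins can be generated with a sliding window. This sketch assumes 0-based, half-open coordinates and ignores chromosome ends beyond clipping at position 0; neither convention is stated in the text.

```python
def flanking_windows(gene_start, gene_end, flank=1_000_000, size=2_000, step=1_000):
    """Return 2-kb windows overlapping by 1 kb on both sides of a gene.

    Coordinates are 0-based and half-open (assumption). Windows never extend
    into the gene body or beyond `flank` bases from its boundaries.
    """
    windows = []
    upstream_lo = max(0, gene_start - flank)
    for start in range(upstream_lo, gene_start - size + 1, step):
        windows.append((start, start + size))
    for start in range(gene_end, gene_end + flank - size + 1, step):
        windows.append((start, start + size))
    return windows
```

With the default parameters and a gene located at least 1 Mb from the chromosome start, this yields 999 windows on each side of the gene.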
We defined three different types of aggregated somatic burden to assess differences in power in detecting somatic eGenes and P value calibration. The burden in a genomic region was defined as (1) a binary value that indicates presence or absence of SNVs; (2) the aggregated burden as sum of SNVs; or as (3) weighted burden, that is, sum of variant allele frequencies of the SNVs (Supplementary Fig. 10a) to take into account their clonality (https://dcc.icgc.org/releases/PCAWG/transcriptome/eQTL/genotypes). We assessed calibration of all three analyses with Q–Q plots of nominal and permuted P values (permutation of the patients in the gene expression matrix) (Supplementary Fig. 10b–d). Moreover, for the linear regression analysis, genotypes were standardized across patients (to mean zero and standard deviation one) and standardized effect sizes are provided in Supplementary Table 5.
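For one patient and one genomic region, the three burden definitions can be computed from the variant allele frequencies of the SNVs that fall in the region. A minimal sketch with illustrative names:

```python
def region_burdens(vafs):
    """Compute the three burden types for one patient and one genomic region.

    vafs: variant allele frequencies of the SNVs the patient carries in the region.
    """
    return {
        "binary": 1 if vafs else 0,   # presence or absence of any SNV
        "count": len(vafs),           # number of SNVs
        "weighted": sum(vafs),        # sum of variant allele frequencies
    }
```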
Overall, somatic burden within flanking regions was the most prevalent type of burden tested per gene (Extended Data Fig. 6a). We found similar average relative mutation density per type of genomic region (flanking = 0.008 mutations per kb; introns = 0.007 mutations per kb; exons = 0.006 mutations per kb) (Extended Data Fig. 6b) and average recurrence of the same mutated region across the cohort was rather low (flanking = 1.4%; exons = 1.7%; introns = 4%) (Extended Data Fig. 6c).
Linear models were used to model the correlation between recurrent somatic burden and gene expression of up to 18,898 protein-coding genes, using the limix package81 (see ‘Gene expression filtering’). Gene expression was corrected for 35 hidden Peer factors. Known covariates were modelled as fixed effects (see ‘Covariates’). We considered only somatic burdens with frequency greater than 1%, including exonic and intronic burdens, as well as flanking burdens, within 1 Mb from gene boundaries.
The somatic eQTL analysis was performed on all 1,188 patients and on the subset of 899 patients with carcinoma (representing 20 of the 27 types of cancer) to replicate the analysis on a more homogeneous set of tumours. A cis window of 1 Mb from the gene boundaries was used to find mutated genomic intervals with a burden frequency ≥ 1% in the cohort (at least 12 patients in the full cohort and 9 patients in the carcinoma cohort). Together, 18,708 of the genes had at least one mutated interval at that frequency and were included in the analysis, and 1,049,102 regions showed a burden frequency ≥ 1%.
Bonferroni correction was applied to correct for multiple cis windows tested within the same gene. Then, Benjamini–Hochberg correction was applied to adjust the P values of the lead genomic regions across genes. Somatic eGenes were defined as genes with an eQTL at an FDR ≤ 5%.
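The two-step correction can be sketched as a per-gene Bonferroni adjustment of the lead P value followed by Benjamini–Hochberg adjustment across genes. The function names and the input layout are illustrative.

```python
def bonferroni_lead(pvalues):
    """Lead P value of one gene, adjusted for the number of regions tested."""
    return min(1.0, min(pvalues) * len(pvalues))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P value to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def somatic_egenes(per_gene_pvalues, fdr=0.05):
    """Genes whose Bonferroni-adjusted lead P value passes the FDR threshold.

    per_gene_pvalues: {gene_id: [P value of each tested region]}
    """
    genes = list(per_gene_pvalues)
    lead = [bonferroni_lead(per_gene_pvalues[g]) for g in genes]
    adjusted = benjamini_hochberg(lead)
    return [g for g, q in zip(genes, adjusted) if q <= fdr]
```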
We compared our 649 somatic eQTL set with three previous cancer studies86,87,88 to identify independent evidence of interaction between our eGenes and the associated cis-genomic regions with somatic burden. Studies were chosen if they provided lists of cancer regulatory elements linked to genes or regulatory elements with somatic mutations linked to gene expression deregulation in cancer. All three studies examined were based on TCGA cancers. For this, we checked perfect overlaps with both the somatic burden location and the eGene. Moreover, we looked at the overlap between somatic eQTL and 72,987 GeneHancer89 enhancer-to-gene interactions with at least two independent supporting methods (called ‘double-elite’), downloaded from the UCSC hg19 GeneHancer track90. We then compared this overlap with a set of nulls generated by 1,000 random permutations of the GeneHancer regulatory elements with nearby genes located within 1 Mb. We then retrieved an empirical P value of enrichment by counting the number of random nulls (N) showing a greater number of overlaps than those found between the somatic eQTL set and the GeneHancer set (P = (N + 1)/(1,000 + 1)).
To identify putative regulatory sites enriched for somatic eQTL, we retrieved functional annotations of the lead genomic flanking intervals of the somatic eQTL (556 intervals linked to 638 somatic eQTL). To this end, we mapped somatic eQTL to 25 Roadmap Epigenomics chromatin marks of 127 different cell types16 and ENCODE transcription-factor binding site annotations in 9 cell types (8 cancer cell lines and 1 embryonic stem-cell line91) (Supplementary Tables 6 and 7). We compared annotations in the significant set of eQTLs with a null distribution based on 1,000 random samplings of a matched set of genomic intervals. To define the matched sets of genomic intervals, we selected flanking genomic intervals from the whole set of tested genes that showed a similar distance from the gene start (exact distance ± 2 kb) and that matched the exact burden frequency of the corresponding interval in the significant associations. We then overlapped the 1,000 matched sets with Roadmap Epigenomics and ENCODE annotations. To avoid ambiguous overlaps (with multiple annotations), we retained only genomic intervals showing a minimum overlap of 10% of their length.
We retrieved an empirical P value of enrichment for each annotation by counting the number of randomly sampled flanking intervals (N) showing greater number of overlaps compared to the eQTL set (P = (N + 1)/(1,000 + 1)). Benjamini–Hochberg correction was applied to the empirical P values (over 25 marks in 127 cell lines for Roadmap Epigenomics annotations and over 149 transcription-factor-binding sites for 9 ENCODE cell lines). We then computed the fold change per annotation and cell line as a ratio of annotated lead flanking intervals and mean number of annotated matched random flanking intervals over the 1,000 samplings.
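The empirical P value and the fold change follow directly from the observed overlap count and the overlap counts of the matched random samplings. A minimal sketch with illustrative names; with 1,000 samplings the denominator becomes 1,001, as in the formula above.

```python
def empirical_p(observed, null_counts):
    """P = (N + 1) / (n_samplings + 1), with N the nulls exceeding the observation."""
    n_greater = sum(1 for c in null_counts if c > observed)
    return (n_greater + 1) / (len(null_counts) + 1)


def fold_change(observed, null_counts):
    """Ratio of the observed overlap count to the mean overlap count of the nulls."""
    mean_null = sum(null_counts) / len(null_counts)
    return observed / mean_null if mean_null > 0 else float("inf")
```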
Furthermore, we performed GO74,75 and Reactome pathway20,21 enrichment with the Bioconductor packages biomaRt76,77, clusterProfiler78 and ReactomePA79 (FDR ≤ 10%) and also looked at enrichment within high-confidence cancer testis genes previously described92, using 18,708 genes with at least one mutated interval as background.
Limix was used to perform variance decomposition using the same covariates as in the somatic variant analyses except for local copy-number state (see ‘Covariates’). The random effects were based on the following common germline variants and somatic burden (frequency > 1%) (see ‘Somatic calls and mutational burden’ for detailed description of burden): (1) cis-somatic intronic: weighted burden in introns; (2) cis-somatic exonic: weighted burden in exons; (3) cis-somatic flanking: weighted burden in 1-kb-overlapping regions of 2 kb within 1 Mb from gene boundaries; (4) somatic intergenic: weighted burden in 1-kb-overlapping regions of 2 kb outside the 1 Mb window; (5) cis-germline: germline variants within 100 kb from gene boundaries; (6) trans-germline: genome-wide population structure (see ‘Covariates’); and (7) local copy-number variation (see ‘Covariates’).
All data were mean-centred and standardized. For each of the random effects, a linear kernel was computed and used as covariance matrix. The resulting variance components were normalized to add up to one.
We obtained 39 mutational signatures from PCAWG-7 beta 2 release9 and used linear models to associate the mutational signatures with gene expression of up to 18,898 protein-coding genes across 1,159 patients while accounting for known covariates (see ‘Covariates’) (quality control) (Extended Data Fig. 10a–e). The 1,159 patients were a subset of the total 1,188 patients, for whom mutational signature profiles were available. Gene expression was corrected for 35 hidden peer factors (see ‘Gene expression filtering’).
We retained 18,888 genes that showed a minimum FPKM of 0.1 in at least 1% of the 1,159 patients (see ‘Gene expression filtering’). Signatures with zero variance and a prevalence below 1% were filtered, and we obtained 28 signatures. We applied linear models to associate expression of these genes with the signatures across all 1,159 patients, a subset of 877 patients with carcinoma or a subset of 891 European patients to assess consistency of the associations (Extended Data Fig. 10f, g).
Across all patients, we found 1,176 significantly associated genes after Benjamini–Hochberg correction (we used an FDR ≤ 10% for enrichment analyses, multiple testing was applied across all signature–gene pairs) (Supplementary Tables 19a–c). We performed gene enrichment analyses of the significant genes per signature (see ‘GO and Reactome pathway enrichment’) (here 18,831 background genes, multiple testing correction across all ontologies per signature FDR ≤ 10%) (Supplementary Table 19d). Whereas most signatures were associated with only few genes, 18 showed recurrent trans effects and affected expression of over 20 genes (Extended Data Fig. 11d, Supplementary Table 19e). We further found that the vast majority of genes (85.8%) were associated with only one signature (1,009 genes); 129 genes were associated with two, 32 with three, 5 with four and 1 with five signatures.
To assess how tissue-specific both mutational signatures and their associations with gene expression are, we analysed the occurrence of each signature in each of the types of cancer. We assessed the presence (at least one SNV of a signature in at least one patient with a specific cancer type) and mean prevalence (mean number of SNVs of a certain signature across all patients of a specific cancer type) of the signatures in the types of cancer (Extended Data Fig. 13c, d). We defined cancer-type-specific signatures to occur in up to four types of cancer (signatures 4, 7, 9, 12, 16, 38 and 39) and common signatures to be missing in up to five types of cancer (signatures 2, 13 and 18). For each of these signatures, we performed cancer-type-specific analyses, that is, we assessed the association between the respective signature and gene expression in just the patients who are of a cancer type that shows mutations of the respective signature (Extended Data Fig. 13c, left heat map). We then correlated the P values of these cancer-type-specific analyses with the P values of the analysis across all patients and calculated the Pearson correlation coefficients (Supplementary Fig. 24a–e). We show that the correlation between cancer-type-specific and whole-cohort P values is dependent on the sample size of the respective analysis (r2 = 0.671) (Supplementary Fig. 1f).
We further performed PCA on the signatures across both patients (PCA on signature-specific SNVs per patient) and genes (PCA on adjusted P values of signature–gene expression associations) (Extended Data Fig. 11a, b).
To assess significance of the functional annotation of SNVs by mutational signatures, we also associated gene expression with the total number of SNVs and correlated the P values (−log10(P)) of the associations with the respective signature-specific P values. The absolute Pearson correlation coefficients remain below 0.1 (Supplementary Table 19f).
To establish causality of signature–gene expression associations, we included the germline eQTL into the analysis using linear mixed models; 197 of our 1,176 signature-associated genes were also germline eGenes. These 197 associations involved 26 of the 28 mutational signatures. We associated the lead variants of these eGenes with the rank-standardized signature SNVs across 2,507 patients. We used the subset of the 2,818 WGS patients for which mutational signature profiles and all known covariates were available. We accounted for the same fixed covariates as in the mutational signature–gene expression association studies and, in addition, for kinship as a random effect (see ‘Covariates’).
We then performed proportional colocalization analysis with Bayesian model averaging using the R package coloc93 to test whether gene expression and mutational signatures share common causal genetic variants in a given gene region. A proportional colocalization analysis tests the null hypothesis of colocalization by assuming that two phenotypes that share causal variants will have proportional regression coefficients for either phenotype with any variant selection in the vicinity of the causal variant. We applied the Bayesian model averaging approach, with each tested model consisting of a selection of two variants. The P values are then averaged over all models to generate posterior predictive P values93. We filtered variants so that no pair of variants showed r² > 0.95 and so that the marginal posterior probability of inclusion of each variant with one of the phenotypes was greater than 0.01. Nominal P values for rejecting the null hypothesis of colocalization are listed in Supplementary Table 19e.
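We then performed mediation analysis94,95 to assess the directionality of effects between germline eQTL, gene expression and mutational signatures. First, causal mediation analysis was applied to each triplet of eQTL lead variant, gene and mutational signature, using structural equation models with the R package lavaan96. We then used an R package (ref. 97) to assess the significance of mediation and to estimate the proportion of the mediated effect with nonparametric bootstrap confidence intervals (1,000 simulations).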
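To understand the precise effects of somatic variants in their genomic context, and for the subsequent allele-specific analyses, both germline and somatic variants were phased. To assemble phased germline genotypes, we used the Sanger 1000G callset6 and applied IMPUTE2 (ref. 98) to phase heterozygous germline variants. The IMPUTE2 output was corrected using the results of the Battenberg copy-number calling algorithm99 to ensure that there were no haplotype switches within regions of contiguous copy-number gain. The resulting phased germline genotypes were arranged such that haplotype 1 always corresponds to the amplified allele in regions of SCNA (major allele). We phased individual somatic variants to the nearest germline heterozygous site in cases in which both co-occurred on the same NGS read (about 10 million variants, 20% of all SNVs). For downstream analyses, we considered only SNVs that were phased to the corresponding germline variant by at least three reads (about 6 million of the 10 million SNVs).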
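All phased SNVs were annotated based on their genomic regions defined by gene annotations (upstream, downstream, promoter, 5′ UTR, intron, synonymous, missense, stop gained and 3′ UTR) and mapped using the Variant Effect Predictor (VEP) tool. Promoter variants were defined as variants within 1 kb upstream of the TSS. We included flanking regions by using the VEP ‘UpDownDistance’ plugin (with a maximum range parameter of 100 kb). We split the upstream and downstream variant categories from 10 to 100 kb into disjoint categories using 10-kb windows. We integrated the ‘splice donor’ and ‘splice acceptor’ variants into the ‘splice region’ variant category and mapped ‘stop retained’ variants to the ‘synonymous’ variant category. We summarized transcript-level annotations at the gene level to retrieve the expected functional effect of a variant on a given gene. We analysed the relationship between SNV variant allele frequencies and SCNAs to determine whether a variant occurred before (‘early’) or after (‘late’) the corresponding SCNA (PCAWG-11). We computed the weighted cis mutational burden by estimating the cancer cell fraction of each SNV and aggregating the SNVs into a total local burden weighted by their respective cancer cell fractions.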
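The positions of heterozygous germline variants were used together with the RNA-seq BAM files as input to the GATK ASEReadCounter101 algorithm for counting ASE reads. We considered reads with a minimum mapping quality of 20 and a minimum base quality of 10. Only heterozygous variants with a minimum coverage of 8 RNA-seq reads were considered for all further analyses.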
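Raw ASE read counts were post-processed as follows: (1) ASE sites were converted to BED files and intersected with the ENCODE 50-mer mappability track (wgEncodeCrgMapabilityAlign50mer.bigWig) to extract mappability scores for all sites. All sites with a mappability score not equal to 1 were removed. (2) All sites with an allelic read count less than or equal to 1 were removed, to prevent genotyping errors from affecting ASE quantification. (3) All sex chromosomes were discarded from further analysis. (4) We estimated the sequencing error for each patient as the sum of non-reference and non-alternative bases over the total number of bases. Using the estimated sequencing error probability, we assessed statistical monoallelic expression with a binomial test and applied Benjamini–Hochberg correction. All sites that were statistically monoallelic were removed. (5) For each ASE site, the copy-number state was retrieved from the Sanger copy-number consensus call set (PCAWG-11). Purity estimates for each patient were retrieved from the accompanying purity table.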
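To aggregate site-level ASE to gene-level read counts and to allow the directionality of effects to be estimated, we used the phased germline genotypes. Gene mapping was performed against Ensembl release 75 using the pyensembl Python library. We retrieved all genes at each ASE site and summed the reads to gene-level haplotype-specific read counts. We further averaged the haplotype-specific copy-number states over each gene to obtain mean haplotype-specific copy-number states, and computed the gene-level copy-number ratio as the ratio of major to total for these means. To allow a robust assessment of gene-level ASE, we considered only genes with at least 15 reads, yielding 4,379,378 gene–patient pairs for 1,120 patients and 17,009 unique genes, from a total of 12,441,502 accessible sites. AEI of each remaining gene was tested using a binomial test against an expected read ratio of 0.5 to derive nominal P values, and using a binomial test against the expected copy-number ratio modified by tumour purity to derive copy-number-corrected P values. Nominal and copy-number-corrected P values were adjusted separately for multiple testing using the Benjamini–Hochberg procedure. AEI was called significant at FDR ≤ 5%. We further annotated each gene with the number of ASE sites used for aggregation. For all downstream analyses, we considered only genes annotated as protein-coding (Ensembl biotype = ‘protein_coding’).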
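The two gene-level tests can be sketched with an exact two-sided binomial test. How tumour purity modifies the expected ratio is not spelled out in the text; the linear mixture with a balanced (0.5) contribution from non-tumour cells used in `expected_ratio` is an assumption made purely for illustration, as are all function names.

```python
from math import comb


def binom_two_sided(k, n, p):
    """Exact two-sided binomial P value.

    Sums the probabilities of all outcomes that are no more likely than k.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-7)  # tolerance for floating-point ties
    return min(1.0, sum(x for x in pmf if x <= threshold))


def expected_ratio(cn_ratio, purity):
    """Expected haplotype 1 read fraction (assumption: linear purity mixture).

    cn_ratio: major-to-total copy-number ratio of the gene (0.5 to 1).
    purity: tumour purity (0 to 1); non-tumour cells are assumed balanced.
    """
    return purity * cn_ratio + (1 - purity) * 0.5


def aei_pvalues(hap1_reads, hap2_reads, cn_ratio, purity, min_reads=15):
    """Return (nominal, copy-number-corrected) P values, or None if too few reads."""
    n = hap1_reads + hap2_reads
    if n < min_reads:
        return None
    nominal = binom_two_sided(hap1_reads, n, 0.5)
    corrected = binom_two_sided(hap1_reads, n, expected_ratio(cn_ratio, purity))
    return nominal, corrected
```

In this sketch a gene with 18 of 20 reads on haplotype 1 is imbalanced under the nominal test, but not once a major-to-total copy-number ratio of 0.9 in a pure tumour is taken into account.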
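Across all 4,379,378 gene–patient pairs, we trained multivariate linear models using (i) logistic regression against a binary indicator of the absence or presence of AEI in a gene, or (ii) standard linear regression against the phased ASE ratio of the gene, to assess the directionality of regulatory changes. For (i), haplotype-specific mutations were summed to a total burden per category, whereas for (ii) we used the difference in burden between haplotypes 1 and 2. The consistent phasing of somatic variants and ASE sites ensured that model coefficients retained their direction irrespective of the arbitrary assignment of haplotypes as 1 or 2. The model covariates were: (1) copy-number ratio of the gene locus (0.5 ≤ x ≤ 1); (2) sample purity (0 < x < 1); (3) natural logarithm of total gene length (x > 0); (4) natural logarithm of the length of the canonical transcript (x > 0); (5) heterozygosity of the lead eQTL variant (x = 1 if heterozygous, x = 0 if homozygous); and (6) all mutational burden categories determined by the VEP annotation (upstream in 10-kb windows, downstream in 10-kb windows, promoter, 5′ UTR, intron, synonymous, missense, stop gained and 3′ UTR; x ≥ 0 for the logistic model and x, the signed burden difference, for the directed model).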
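To compare the global effects and the relative contributions of SCNAs, germline eQTL, and coding and non-coding SNVs, we trained a simplified logistic model in which all coding and all non-coding variants were accumulated into separate categories, and report standardized effect sizes (Fig. 1e).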
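Cancer gene enrichment was performed on the COSMIC census using Fisher's exact test and gene set enrichment analysis, as previously described. For the enrichment, a mean score per gene was computed over the whole cohort, and only genes that recurred at least five times in the cohort were retained, yielding a total of 16,078 genes.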
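We computed the recurrence of ASE genes in each tumour type. To examine the chromosomal distribution of ASE genes, we computed the mean recurrence of all genes in each 200-gene window, moving in steps of 10 genes, and then subtracted the mean ASE occurrence in each tumour type to obtain peaks of ASE surplus across all chromosomes. The recurrence of copy-number-altered genes was computed in a similar manner.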
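We estimated the promoter activity of 70,937 promoters in 20,738 genes using RNA-seq data and the Gencode (release 19) annotation. We grouped transcripts with overlapping first exons under the assumption that they are regulated by the same promoter103. TSSs located within internal exons or overlapping with splice acceptor sites were removed from this analysis, because the activity of these promoters is difficult to estimate from RNA-seq data28. Promoter activity can be estimated using exon usage29, split reads28 or isoform-based estimates30. Here, we used the isoform-based approach to quantify promoter activity. We quantified the expression of each transcript from the RNA-seq data using Kallisto66 and calculated the sum of the expression of the transcripts initiated at each promoter to obtain an estimate of promoter activity. To obtain the relative activity of each promoter, we normalized the activity of each promoter by the overall expression of the gene. We divided the promoters of each gene into three categories on the basis of their average pan-cancer promoter activity. Promoters with <1 FPKM average activity are called inactive promoters, and the most active promoter of each gene is called the major promoter. The remaining active promoters of the gene are called minor promoters.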
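The isoform-based estimate can be sketched for a single gene as follows. The data structures and function names are illustrative, and taking the sum of promoter activities as the overall expression of the gene is an assumption.

```python
def promoter_activities(transcript_expr, transcript_to_promoter):
    """Sum the expression of the transcripts initiated at each promoter.

    transcript_expr: {transcript_id: expression} for the transcripts of one gene.
    transcript_to_promoter: {transcript_id: promoter_id}
    """
    activity = {}
    for transcript, expr in transcript_expr.items():
        promoter = transcript_to_promoter[transcript]
        activity[promoter] = activity.get(promoter, 0.0) + expr
    return activity


def relative_activities(activity):
    """Normalize promoter activities by the overall expression of the gene.

    Assumption: gene expression is the sum over its promoters' activities.
    """
    total = sum(activity.values())
    return {p: (a / total if total > 0 else 0.0) for p, a in activity.items()}


def classify_promoters(mean_activity, inactive_below=1.0):
    """Label the promoters of one gene as 'major', 'minor' or 'inactive'.

    mean_activity: {promoter_id: average activity across the cohort, in FPKM}
    """
    active = {p: a for p, a in mean_activity.items() if a >= inactive_below}
    labels = {p: "inactive" for p in mean_activity if p not in active}
    if active:
        major = max(active, key=active.get)
        for p in active:
            labels[p] = "major" if p == major else "minor"
    return labels
```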
The association between promoter activities and promoter mutation burden was estimated using the same framework as the somatic eQTL analysis. We examined associations for the promoters of expressed multi-promoter genes with a burden frequency ≥ 1% in the cohort (at least 12 patients in the full cohort). The weighted burden of the region 1-kb upstream of the TSS—that is, the sum of variant allele frequencies of the SNVs for each gene—was used as the genotype for the promoters of the respective genes. We used linear models to study the associations between the recurrent somatic burden and the promoter activity (both for the relative activity and the log2-transformed absolute activity). Similar to the somatic eQTL analysis, the known covariates and the 35 hidden peer factors were provided as cofactors to the linear models. We adjusted the P values using Benjamini–Hochberg correction method and looked for associations with FDR ≤ 5%.
We used the alignments based on the STAR pipeline to collect and quantify alternative splicing events with SplAdder70. The software has been run with its default parameters with confidence level 3. We generated individual splicing graphs for each RNA-seq sample for both tumour samples as well as matched healthy samples (when available). All graphs were then integrated into a merged graph to comprehensively reflect all splice junctions observed in all samples together. On the basis of this combined graph, SplAdder was used to extract alternative splicing events of the following types: alternative 3′ splice site, alternative 5′ splice site, cassette exon, intron retention, mutually exclusive exons, coordinated exon skip (see supplementary figure 3 in ref. 70). Each identified event was then quantified in all samples by counting split alignments for each splice junction in any previously identified event and the average read coverage of each exonic segment involved in the event was determined. We then computed a PSI value for each event that was then used for further analysis. We further generated different subsets of events, filtered at different levels of confidence, in which confidence is defined by the SplAdder confidence level (generally 2), the number of aligned reads supporting each event, the number of samples that were found to support the event by SplAdder, and the number of samples that passed the minimum aligned read threshold.
We assessed the significance of mutational enrichment for 5′ and 3′ splice sites, and branch-point104,105 intronic regions using a permutation-based approach. Impactful mutations were defined as mutations overlapping exons and introns involved in cassette exon events, in which the PSI-derived z-score was ≥ 3 or ≤ −3. For each intronic site, we compared the frequency of observed impactful mutations against frequencies of randomly sampled intronic regions (number of iterations = 1,000). For exonic sites, the null distribution was established from randomly sampled exonic sites. Randomly sampled sites were within a 100-bp window around the 5′ and 3′ splice site. For branch-point regions, sampled sites were within a 50-bp window around the branch-point sequence. The P value was computed as the proportion of randomly sampled frequencies greater than or equal to the observed frequency.
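The permutation scheme can be sketched as follows. This is a simplified illustration, not the analysis code: the function names, the definition of a site's mutation frequency (mutated positions divided by site length) and the expression of the P value as a proportion of iterations are assumptions made for the example.

```python
import random


def null_frequencies(mutation_positions, window, site_len,
                     n_iter=1000, seed=0):
    """Mutation frequencies of randomly placed sites inside a window.

    window is an inclusive (start, end) coordinate pair; each random site
    covers site_len consecutive positions that lie fully inside the window.
    """
    rng = random.Random(seed)
    lo, hi = window
    mutated = set(mutation_positions)
    freqs = []
    for _ in range(n_iter):
        start = rng.randint(lo, hi - site_len + 1)
        hits = sum(1 for pos in range(start, start + site_len)
                   if pos in mutated)
        freqs.append(hits / site_len)
    return freqs


def empirical_p_value(observed, null):
    """Proportion of null frequencies at least as large as the observed one."""
    n_extreme = sum(1 for f in null if f >= observed)
    return n_extreme / len(null)
```

With 1,000 iterations, the smallest non-zero P value that this estimate can return is 0.001.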
The SAVNet approach35 was designed to identify somatic variants associated with local aberrant splicing alterations from matched genome and transcriptome sequencing data. It uses permutations to calculate an FDR and restricts its focus to two classes of relationships between somatic mutations and splicing alterations: (1) splice site disruption, in which exon skipping, alternative 5′ or 3′ splice site usage, or intron retention is associated with a mutation in a splice site motif; and (2) splice site creation, in which alternative 5′ or 3′ splice sites are associated with mutations that create a novel splice motif (FDR ≤ 10%) (Extended Data Fig. 17e).
Gene fusions between any two genes were identified using two gene fusion detection pipelines: the FusionMap (v.2015-03-31) pipeline106 and the FusionCatcher (v.0.99.6a)/STAR-Fusion (v.0.8.0) pipeline107. ChimerDB 3.0 was used as a reference of previously reported gene fusions. The database contains 32,949 fusion genes split into three groups: (1) KB: 1,067 fusion genes manually curated from public resources of fusion genes with experimental evidence; (2) Pub: 2,770 fusion genes obtained from text mining of PubMed abstracts; and (3) Seq: an archive of 30,001 fusion gene candidates from deep-sequencing data. This set includes fusions found by re-analysing the RNA-seq data of the TCGA project, encompassing 4,569 patients from 23 types of cancer.
In brief, in the FusionCatcher/STAR-Fusion pipeline, FusionCatcher was applied to the raw reads of each aliquot with paired-end RNA-seq reads, together with the genome reference, using the ‘-U True; -V True’ runtime options. For each aliquot with single-end RNA-seq reads, STAR-Fusion was applied to the raw reads, with the same reference genome and gene models as FusionCatcher and with default settings. In parallel, FusionMap was applied to all unaligned reads from the PCAWG aligned TopHat2 RNA-seq BAM files for each aliquot to detect gene fusions, with the following non-default option values: MinimalHit = 4; OutputFusionReads = True; RnaMode = True; FileFormat = BAM.
To reduce the number of false-positive fusions, the two sets of fusions were filtered to exclude fusions based on the number of supporting junction reads, sequence homology, and occurrence in normal samples (from the GTEx and the PCAWG cohorts). To obtain a high-confidence consensus fusion call set from these two pipelines, a fusion had to meet at least one of the following conditions to be included in the final set: (i) be detected by both fusion detection tools in at least one sample; and/or (ii) be detected by one of the methods and have a matched structural variant in at least one sample. The consensus WGS-based somatic structural variants (v.1.6) were obtained from the PCAWG repository at https://dcc.icgc.org/releases/PCAWG.
For integration with matched structural variant evidence, a fusion was considered to match a structural variant if the absolute distance between the fusion break points and structural variant break points did not exceed 500 kb (the distance was considered infinite when the chromosomes of the fusion and structural variant break point differ). When there was no evidence for a direct structural variant fusion, the search was expanded to look for composite fusions. In this case, an exhaustive search was performed to look for two structural variants with break points close to the fusion break points and with an effective distance smaller than 250 kb.
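A minimal sketch of the direct matching rule is given below. The representation of a fusion and a structural variant as a pair of (chromosome, position) break points, the function names, and the choice to take the larger of the two break point distances under the better of the two possible pairings are assumptions made for illustration; the composite search over two structural variants is not covered.

```python
import math


def breakpoint_distance(a, b):
    """Absolute distance between two (chromosome, position) break points.

    The distance is infinite when the chromosomes differ, as in the text.
    """
    chrom_a, pos_a = a
    chrom_b, pos_b = b
    if chrom_a != chrom_b:
        return math.inf
    return abs(pos_a - pos_b)


def fusion_matches_sv(fusion, sv, max_dist=500_000):
    """Return True if both fusion break points lie within max_dist of the
    break points of a single structural variant (either pairing)."""
    f1, f2 = fusion
    s1, s2 = sv
    direct = max(breakpoint_distance(f1, s1), breakpoint_distance(f2, s2))
    swapped = max(breakpoint_distance(f1, s2), breakpoint_distance(f2, s1))
    return min(direct, swapped) <= max_dist
```

For example, a fusion joining chr1:1,000,000 and chr2:5,000,000 would match a structural variant with break points at chr2:5,100,000 and chr1:1,200,000 under these assumptions.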
Finally, 3,540 fusion events were included in the consensus fusion call set. Of these, 2,268 were detected by both FusionCatcher/STAR-Fusion and FusionMap (of which 1,821 had matched structural variant evidence) and 1,112 were detected by only one method and had matched structural variant evidence.
In total, approximately 36% of all detected fusion transcripts were predicted to be in-frame. Several UTR-mediated fusion transcripts preserve the complete coding sequence of one fusion partner. These include a known fusion, TBL1XR1-PIK3CA, in a breast tumour and a notable new example, CTBP2-CTNNB1, in a gastric tumour.
All fusions are available in Synapse: https://dcc.icgc.org/releases/PCAWG/transcriptome/fusion.
We used an RNA-editing event calling pipeline, which is an improved version of that previously published108. First, we summarized the base calls of pre-processed aligned RNA reads to the human reference in pileup format. Second, the initially identified editing sites were filtered by the following quality-aware steps: (1) The depth of the candidate editing site, base quality, mapping quality and the frequency of variation were taken into account in a basic filter: candidate variant sites were required to have base quality ≥ 20, mapping quality ≥ 50, mapped reads ≥ 4, variant-supporting reads ≥ 3, and mismatch frequency (variant-supporting reads/mapped reads) ≥ 0.1. (2) Statistical tests based on the binomial distribution B(n, p) were used to distinguish true variants from sequencing errors at every mismatch site109, in which p denotes the background mismatch rate of each sequenced transcriptome and n denotes the sequencing depth at this site. (3) Sites present in combined DNA SNP datasets (dbSNP v.138, 1000 Genomes SNP phase 3, human Dutch populations110, and BGI in-house data; combined datasets deposited at: ftp://ftp.genomics.org.cn/pub/icgc-pcawg3) were discarded. (4) Strand bias was estimated, and variants with strand bias were filtered out on the basis of a two-tailed Fisher’s exact test. (5) Position bias was estimated, and variants with position bias, such as sites found only at the 3′ end or at the 5′ end of a read, were filtered out. (6) Variant sites in simple repeat regions or homopolymer regions, or <5 bp from a splice site, were discarded. (7) To reduce false positives introduced by misalignment of reads to highly similar regions of the reference genome, we performed realignment filtering. Specifically, we extracted variant-supporting reads at candidate variant sites and realigned them against a combined reference (hg19 genome plus Ensembl transcript reference v.75) using BWA (v.0.5.9-r16). We retained a candidate variant site if at least 90% of its variant-supporting reads were realigned to this site.
Finally, all high-confidence RNA-editing sites were annotated using ANNOVAR111. (8) To remove the possibility of an RNA-editing variant being a somatic variant, the variant sites were positionally filtered against PCAWG WGS somatic variant calls. (9) The final two steps of filtering are designed to enrich for functional RNA-editing sites. First, we keep only events that occur more than two times in at least one cancer type. Second, we keep only events that occur in exonic regions with a predicted function of missense, nonsense or stop-loss. The final step of filtering within exonic regions with a specific predicted function induces the largest difference in observed frequencies of RNA-editing events between our analysis and the published one108. A comparative depiction of the frequencies of RNA-editing events identified in our analysis (Supplementary Table 24) and the previously published analysis108 is shown in Supplementary Fig. 23.
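The first two filtering steps can be illustrated with a short sketch. The function names and the per-site summary of base and mapping quality as single values are assumptions made for this example; the text does not state the significance threshold applied to the binomial test, so the sketch only returns the upper-tail probability.

```python
from math import comb


def passes_basic_filter(base_quality, mapping_quality,
                        mapped_reads, variant_reads):
    """Apply the thresholds of step (1) to one candidate site.

    Assumption: base and mapping quality are summarized as one value
    per site for the purpose of this sketch.
    """
    if mapped_reads < 4 or variant_reads < 3:
        return False
    if base_quality < 20 or mapping_quality < 50:
        return False
    return variant_reads / mapped_reads >= 0.1


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ B(n, p): the chance of seeing at least k
    mismatching reads at depth n from sequencing error alone, where p
    is the background mismatch rate."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(k, n + 1))
```

A small tail probability indicates that the observed number of variant-supporting reads is unlikely to arise from the background mismatch rate.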
To perform joint analysis across RNA and DNA alterations, each alteration type was condensed into a binary gene-centric format. Because alterations occur at many different scales (nucleotide, exonic, gene or transcript), to make them comparable we projected each alteration type onto the gene body. We summarized each alteration type by its presence or absence within a single gene, yielding a binary value per type for each gene-sample pair.
The events we included in this analysis were: RNA editing, non-synonymous variants, expression, splicing alterations, copy-number alterations, fusions and alternative promoters. Each alteration type was summarized differently owing to their inherent differences.
RNA-editing events and non-synonymous variants can occur several times within a single gene body, so these events were denoted as 1 if they occurred at least once within a gene–sample pair.
For copy number, to obtain a single numerical value per gene-sample pair, the copy-number alteration was averaged over the gene body. Because we do not have matched normal samples against which to compare, we instead consider outlying events within each histotype as significant. Thus, a value of 1 was given to average copy-number alterations larger than 6 or smaller than 1.
Similar to non-synonymous variants, multiple splice events can occur within a gene body. The event with the most extreme PSI value within the gene body is selected as the candidate event for the gene. The candidate’s PSI value for a gene is compared over all samples within a histotype and it is set to 1 (that is, significant) only if the absolute value of its z-score is larger than 6 and the standard deviation is larger than 0.01 within that histotype.
For expression outliers, we similarly calculated a z-score using the log-transformed upper-quartile normalized FPKM values with a pseudo-count of 1. All genes within a histotype with a standard deviation larger than zero and an absolute z-score larger than three were identified as outliers. Alternative promoter outliers were calculated based on relative promoter activity within each cancer type. To binarize the promoter activity, a z-score cut-off of two over the relative expression distribution within each cancer type was used.
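The z-score binarization for expression can be sketched for one gene within one histotype as follows. The function name, the use of a base-2 logarithm and the use of the sample standard deviation are assumptions made for this example, because the text does not specify them.

```python
import math
import statistics


def expression_outlier_flags(fpkm_uq, z_cutoff=3.0):
    """Return 1 for samples whose expression z-score exceeds the cut-off
    in absolute value, and 0 otherwise.

    fpkm_uq holds the upper-quartile normalized FPKM values of one gene
    across the samples of one histotype (at least two samples).
    """
    logged = [math.log2(v + 1.0) for v in fpkm_uq]  # pseudo-count of 1
    mean = statistics.fmean(logged)
    sd = statistics.stdev(logged)
    if sd == 0:
        # A gene with no variation cannot produce outliers.
        return [0] * len(logged)
    return [1 if abs((x - mean) / sd) > z_cutoff else 0 for x in logged]
```

The same pattern applies to the other z-score based alteration types by changing the input values and the cut-off (for example, a cut-off of two for relative promoter activity).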
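For ASE outliers, only genes with significant allelic imbalance (FDR ≤ 5% and allelic imbalance > 0.2, binomial test) were denoted as 1. All identified ASE events were further filtered to retain only genes that have not been identified as imprinted26.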
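In addition to the z-score filtering described above, we further filtered non-synonymous SNVs, RNA-editing events and splicing events so that they either induce a frameshift or affect a region that contains an HGMD variant112 of the ‘damaging’ category.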
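It must be noted that, in many cases, the calculated z-scores are not derived from a Gaussian distribution, and some events may therefore be missed or falsely included. By choosing very stringent z-score thresholds and functional filters, we aimed to minimize spurious outlier events.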
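For our pathway analysis, we used the TCGA pathway definitions to examine genes and pathways with several alterations at both the DNA and the RNA level.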
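Co-occurrence analysis was also performed on the binary gene table described above, but included only variants, expression outliers, alternative promoters, alternative splicing and fusions. SCNA and ASE were excluded because of the large number of expected co-occurrences. In this analysis, we used the genes carrying at least one alteration of a given alteration pair as the gene universe. For each alteration pair, we performed Fisher’s exact test on the numbers of donors with both alterations, with only one of the alterations, and with neither alteration in a set of cancer samples, to determine whether the two alterations were independent of each other. These tests were followed by Benjamini–Hochberg multiple-testing correction to obtain the FDR (or q value). To exclude potential false-positive associations caused by tissue-specific alterations, we performed the same analysis within each tumour type with at least 50 patients, and retained only those alteration pairs that were significantly associated in both the pan-cancer analysis and at least one cancer-type-specific analysis. Among the significantly associated alteration pairs, co-occurring pairs were those with an odds ratio greater than 1. Pathway enrichment and visualization21,114 were performed using the R package ReactomePA79. Circos plots were generated using the R package circlize115. Splicing-related genes were derived from the genes in ‘REACTOME_MRNA_SPLICING’ or ‘REACTOME_MRNA_SPLICING_MINOR_PATHWAY’ in the Molecular Signatures Database (MSigDB)116.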
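The statistical core of this co-occurrence analysis, a two-sided Fisher's exact test on a 2 × 2 donor table followed by Benjamini–Hochberg adjustment, can be sketched in plain Python as follows. The function names and the argument layout are assumptions made for this illustration; established statistical libraries provide equivalent routines.

```python
from math import comb


def fisher_exact_two_sided(both, only_a, only_b, neither):
    """Two-sided Fisher's exact test P value for a 2 x 2 table of donor
    counts: both alterations, only A, only B, neither."""
    row1 = both + only_a          # donors with alteration A
    col1 = both + only_b          # donors with alteration B
    n = both + only_a + only_b + neither
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x donors carrying both alterations.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo = max(0, col1 - (n - row1))
    hi = min(row1, col1)
    p_obs = prob(both)
    # Sum all tables that are at most as likely as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def odds_ratio(both, only_a, only_b, neither):
    """Sample odds ratio; values above 1 indicate co-occurrence."""
    if only_a * only_b == 0:
        return float("inf")
    return both * neither / (only_a * only_b)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In the setting described above, one P value would be computed per alteration pair, the whole vector adjusted together, and the odds ratio used to separate co-occurring pairs from the remaining significant associations.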
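Genes in which several types of RNA alteration arise through heterogeneous mechanisms were identified from the associations of cis variants with gene expression, ASE, fusions and splicing. For gene expression, genes associated with somatic eQTLs at FDR < 5% were selected. For ASE, the top 5% of genes ranked by the predicted contribution of somatic variants to ASE were selected. For fusions, all RNA fusions with structural variant support were selected. For splicing, genes with somatic mutations within 10 bp of annotated splice sites, or within 3 bp of branch points, and with an associated splicing change were selected. These associated splicing events were also required to have |z-score| ≥ 3 and a difference in percentage spliced in of at least 10% in the outlier event.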
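Recurrence analysis was performed on the binary gene tables of all nine alteration types. The recurrence analysis proceeded in three main steps: (1) Aggregation across all samples within each alteration type. This yields a sum for each gene–alteration pair. (2) Conversion of the counts into ranks within each alteration type. The smallest rank corresponds to the most frequently altered gene. Ranks were distributed evenly across ties. (3) To generate a single score for each gene, the second-smallest rank across alteration types was used as the score. To determine the score cut-off for significantly altered genes, a null distribution was generated by permutation. The samples within each gene–alteration pair were permuted; this was done 1,000 times across all genes and samples, and all observations were concatenated, resulting in 16.8 million permuted scores. P < 0.05 derived from the null distribution was defined as significant, which resulted in scores greater than or equal to 774 being considered significant.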
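The ranking and scoring steps can be sketched as follows. The function names and the input layout (one list of per-gene counts for each alteration type, all in the same gene order) are assumptions made for this example; tied counts receive the average of the ranks they span, which is one reading of distributing ranks evenly across ties. The permutation-based null distribution is not included.

```python
def average_ranks(counts):
    """Rank genes by descending count (rank 1 = most frequently altered);
    tied genes share the average of the ranks they occupy."""
    order = sorted(range(len(counts)), key=lambda i: -counts[i])
    ranks = [0.0] * len(counts)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and counts[order[end + 1]] == counts[order[pos]]):
            end += 1
        shared = (pos + end) / 2 + 1  # average of 1-based ranks pos+1..end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def recurrence_scores(count_table):
    """Score each gene by its second-smallest rank across alteration types.

    count_table maps an alteration type to a list of per-gene counts
    summed over samples; at least two alteration types are required.
    """
    rank_table = [average_ranks(c) for c in count_table.values()]
    n_genes = len(rank_table[0])
    return [sorted(r[g] for r in rank_table)[1] for g in range(n_genes)]
```

Using the second-smallest rank means that a gene needs to rank highly in at least two alteration types to obtain a low score, so a single very recurrent alteration type does not dominate.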
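WExT117 was used to test the significance of the mutual exclusivity of RNA and DNA alterations. As further evidence that CDK12 alterations may have a functional impact, we found evidence for the previously detected link55 between the presence of more than 10 tandem duplications larger than 100 kb in size and CDK12 somatic eQTL mutations (7 of 18 somatic eQTL carriers, compared with […] of 215 […]; […] test).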
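Unless otherwise stated, all common statistical tests were two-sided. No statistical methods were used to predetermine sample size. The experiments were not randomized, and investigators were not blinded to allocation during experiments and outcome assessment.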
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

