——标题:“TCGAbiolinks和ELMER的综合分析研讨会-分析”作者:“Tiago Chedraoui Silva, Simon Coetzee, Dennis Hazelett, Ben Berman, Houtan Noushmehr”日期:“' r Sys.Date() '”输出:html_document: self_contained: true number_sections:无主题:flatly highlight: tango mathjax: default toc: true toc_float: true toc_depth: 2 css: style.css参考文献:参考文献。bib vignette: > %\VignetteIndexEntry{整合分析车间与TCGAbiolinks和ELMER - analysis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}——' ' ' {r, echo =FALSE,隐藏=TRUE,消息=FALSE,警告=FALSE} devtools::load_all(".")我们还将介绍的另一个Bioconductor包被称为[ELMER](//www.andersvercelli.com/packages/ELMER/) [@yao2015inferring,@elmer2],它允许人们识别远端调控区域的DNA甲基化变化,并将这些特征与附近基因的表达相关联,以识别与癌症相关的转录靶点。对于这些与基因相关的远端探针,进行转录因子motif分析,然后进行转录因子表达分析,以推断上游调控因子。我们希望本次研讨会的参与者能够理解使用[TCGAbiolinks](//www.andersvercelli.com/packages/TCGAbiolinks/) + [ELMER](//www.andersvercelli.com/packages/ELMER/)进行的综合分析,并能够从数据采集过程执行到最终的结果解释。本课程假定用户对R有中等程度的熟悉,并对肿瘤生物学有基本的了解。下图突出显示了本节将介绍的工作流部分。[本节中涉及的部分工作流](figures/workflow_ELMER.png) ##加载所需的库' ' ' {r vignette, eval=TRUE, message=FALSE,warning =FALSE}库(ELMER)库(DT)库(dplyr)来自' r BiocStyle::Biocpkg("MultiAssayExperiment") '包的MultiAssayExperiment对象是' r BiocStyle::Biocpkg("ELMER") '的多个主要函数的输入。要创建它,您可以使用' createMAE '函数。但是,我们将不使用之前下载的数据,而是使用本包中提供的更多示例。 ```{r data, eval=TRUE, message=FALSE, warning = FALSE} library(MultiAssayExperiment) lusc.exp lusc.met ``` Also, we will need to filter the probes in the MAE to select only distal probes. ```{r distal probes, eval=TRUE, message=FALSE, warning = FALSE} distal.probes <- get.feature.probe(genome = "hg38", met.platform = "450K") ``` The `createMAE` function will receive both DNA methylation and gene expression objects. Also `filter.probes` argument will select only probes in the regions of the distal.probes object. ```{r mae, eval=TRUE, message=FALSE, warning = FALSE} library(MultiAssayExperiment) mae <- createMAE(exp = lusc.exp, met = lusc.met, save = TRUE, linearize.exp = TRUE, filter.probes = distal.probes, save.filename = "mae_bioc2017.rda", met.platform = "450K", genome = "hg38", TCGA = TRUE) mae colData(mae)[1:5,] %>% as.data.frame %>% datatable(options = list(scrollX = TRUE)) sampleMap(mae)[1:5,] %>% as.data.frame %>% datatable(options = list(scrollX = TRUE)) ``` # Analysis ## Identifying differentially methylated probes This step is to identify DNA methylation changes at distal enhancer probes which is carried out by function `get.diff.meth`. For each distal probe, the function first rank samples from group 1 and group 2 samples by their DNA methylation beta values. To identify hypomethylated probes, the function compared the lower control quintile (20\% of control samples with the lowest methylation) to the lower experiment quintile (20\% of experiment samples with the lowest methylation), using an unpaired one-tailed t-test. ![Source: Yao, Lijing, et al. "Inferring regulatory element landscapes and transcription factor networks from cancer methylomes." Genome biology 16.1 (2015): 105.](figures/paper_diff_meth.png)
主要的get.diff.meth参数
| |参数描述 | |------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | 数据| multiAssayExperiment的DNA甲基化和基因表达数据。参见' createMAE '函数。| | diff.dir |一个字符可以是“hypo”或“hyper”,表示不同的甲基化方向。它可以是“hypo”只选择低甲基化的探针;“hyper”只选择高甲基化探针| | minSubgroupFrac |一个从0到1的数字,指定来自第1组和第2组的极端样本的比例,用于识别差异DNA甲基化。默认值是0.2,因为我们通常希望能够检测出肿瘤中特定的(可能未知的)分子亚型;这些亚型通常只占样本的少数,为了统计能力的目的,选择20%作为下限。如果您使用预定义的组标签,例如处理过的复制与未处理的复制,请使用值1.0(***监督***模式)| | pvalue |一个数字指定显著P值(通过BH调整P值),用于选择显著低甲基化/超甲基化探针。默认为0.01 | |组。col |定义样本组的列。 You can view the available columns using: `colnames(MultiAssayExperiment::colData(data))`. | | group1 | A group from group.col. ELMER will run group1 vs group2. That means, if direction is hyper, get probes hypermethylated in group 1 compared to group 2. | | group2 | A group from group.col. ELMER will run group1 vs group2. That means, if direction is hyper, get probes hypermethylated in group 1 compared to group 2. | | sig.dif | A number specifies the smallest DNA methylation difference as a cutoff for selecting significant hypo/hyper-methylated probes. Default is 0.3. |
{r,eval=TRUE, message=FALSE, warning =FALSE, results = "hide"} sig.diff <- get.diff。冰毒(数据= mae,组。col = "definition", group1 = "原发实体瘤",group2 = "实体组织正常",minSubgroupFrac = 0.2, sig.dif = 0.3, diff.dir = "hypo", #搜索组1核的低甲基化探针= 1,dir。Out ="result", pvalue = 0.01)``` ```{r,eval=TRUE, message=FALSE, warning =FALSE} as.data.frame(sig.diff) %>% datatable(options = list(scrollX =TRUE)) # get.diff.meth自动保存输出文件。csv包含所有探测的统计信息。# getMethdiff.hypo.probes.significant.csv只包含有意义的探针,#与sig.diff相同dir(path = "result", pattern = "getMethdiff")识别假定的探针-基因对这一步是将甲基化改变的增强子探针与表达改变的靶基因联系起来,并报告所选探针的假定靶基因。这是由函数' get.pair '实现的。对于每个甲基化差异的远端探针,检测最近的10个上游基因和最近的10个下游基因,以确定探针甲基化与基因表达之间的相关性。因此,对每个探测执行了20个统计检验,如下所示。对于每个探针基因对,样本(所有实验样本和对照样本)被分为两组:M组,由上甲基化五分位数组成(增强子探针上甲基化最高的20%的样本),U组,由最低甲基化五分位数组成(甲基化最低的20%的样本)。 For each probe-gene pair tested, the raw p-value `Pr` was corrected for multiple hypothesis using a permutation approach. ![Source: Yao, Lijing, et al. "Inferring regulatory element landscapes and transcription factor networks from cancer methylomes." Genome biology 16.1 (2015): 105.](figures/paper_get.pairs.png)
主要得到的。对参数
| |参数描述 | |-------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | 数据| multiAssayExperiment DNA甲基化和基因表达数据。参见createMAE函数。| | nearGenes |可以是包含GetNearGenes函数输出的列表,也可以是包含' GetNearGenes '函数输出的rda文件路径。| | minSubgroupFrac |一个从0到1.0的数字,指定用于创建U组(未甲基化)和M组(甲基化)的样本百分比,用于将探针连接到基因。默认值是0.4(最低的五分位数样本将在U组中,最高的五分位数样本将在M组中)。| | permu。size |排列次数。缺省值是10000。| | pvalue |用于定义有效值对的原始p值截断值。默认值是0.05。 It will select the significant P value cutoff before calculating the empirical p-values. | | Pe | A number specify the empirical p-value cutoff for defining signficant pairs. Default is 0.001. | | group.col | A column defining the groups of the sample. You can view the available columns using: `colnames(MultiAssayExperiment::colData(data))`. | | group1 | A group from group.col. | | group2 | A group from group.col. | | filter.probes | Should filter probes by selecting only probes that have at least a certain number of samples below and above a certain cut-off. See `preAssociationProbeFiltering` function. | | filter.portion | A number specify the cut point to define binary methlation level for probe loci. Default is 0.3. When beta value is above 0.3, the probe is methylated and vice versa. For one probe, the percentage of methylated and unmethylated samples should be above filter.percentage value. Only used if filter.probes is TRUE. See preAssociationProbeFiltering function. | | filter.percentage | Minimum percentage of samples to be considered in methylated and unmethylated for the filter.portion option. Default 5%. Only used if filter.probes is TRUE. See preAssociationProbeFiltering function. |
{r, eval = TRUE, message = FALSE, warning = FALSE, results = "hide"} nearGenes <- GetNearGenes(data = mae, probes = sig.diff$probe, numFlankingGenes = 20, # 10个上游和10个下游基因cores = 1)Pair <- get。配对(数据= mae,组。col = "definition", group1 = "原发实体瘤",group2 = "实体组织正常",nearGenes = nearGenes, minSubgroupFrac = 0.4, # %的样本用于创建组U/M permu。Dir = "result/permu", permu。pvalue = 0.05, Pe = 0.01, #请设置为0.001,以获得显著结果过滤。probes = TRUE, #见preAssociationProbeFiltering函数filter。百分比= 0.05,过滤器。部分= 0.3,dir。Out = "result", cores = 1, label = "hypo")``` ```{r, eval = TRUE, message = FALSE, warning = FALSE}假设。get %>% datatable(options = list(scrollX = TRUE)) #Pair自动保存输出文件。# getpair . sub .all.pairs.statistic.csv包含所有探针-基因对的统计信息。# getpair . hypos .pairs.significant.csv只包含与hypos .pair相同的重要探测。dir(path = "result", pattern = "getPair")这一步是通过函数get. enrichment . Motif来识别一组探针中的富集Motif。 This function uses a pre-calculated data `Probes.motif` which was generated using HOMER with a $p-value \le 10^{–4}$ to scan a $\pm250bp$ region around each probe using HOmo sapiens COmprehensive MOdel COllection [http://hocomoco.autosome.ru/](HOCOMOCO) v10 [@kulakovskiy2016hocomoco] position weight matrices (PWMs). For each probe set tested (i.e. the list of gene-linked hypomethylated probes in a given group), a motif enrichment Odds Ratio and a 95% confidence interval were calculated using following formulas: $$ p= \frac{a}{a+b} $$ $$ P= \frac{c}{c+d} $$ $$ Odds\quad Ratio = \frac{\frac{p}{1-p}}{\frac{P}{1-P}} $$ $$ SD = \sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}} $$ $$ lower\quad boundary\quad of\quad 95\%\quad confidence\quad interval = \exp{(\ln{OR}-SD)} $$ where `a` is the number of probes within the selected probe set that contain one or more motif occurrences; `b` is the number of probes within the selected probe set that do not contain a motif occurrence; `c` and `d` are the same counts within the entire enhancer probe set. A probe set was considered significantly enriched for a particular motif if the 95% confidence interval of the Odds Ratio was greater than 1.1 (specified by option `lower.OR`, 1.1 is default), and the motif occurred at least 10 times (specified by option `min.incidence`. 10 is default) in the probe set. As described in the text, Odds Ratios were also used for ranking candidate motifs.
主要得到的。对参数
|参数|描述||-------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | 数据从createMAE |多分析实验功能。如果设置并探测。缺少主题/背景探测,这将用于正确获取其他两个参数。此参数不是必需的,您可以设置探测。母题和背景。手动探针。| | probes |一个向量列出探针的名称,用于定义计算motif富集OR和置信区间的探针集合。| |低。OR |比值比95%置信区间的最小下界。优势比95%置信区间下界高于数量的基序为显著富集基序(详见文献)。| |最小发生率|非负整数指定给定探针集中motif的最小发生率。 10 is default. |
{r,eval=TRUE, message=FALSE, warning =FALSE} #识别显著低甲基化探针的富集motif,其中#具有假定的靶基因。充实。Motif <- get. enrichment。motif(数据= mae,探针= Hypo。两美元的探针,dir。Out = "result", label = "hypo", min.incidence = 10, lower。OR = 1.1) names(富贵motif) #富贵motif head(富贵motif["P73_HUMAN.H10MO.A"]) ##给定集合中有TP53 motif的探针。# get. enrichment .motif自动保存输出文件。# getmotif . sub . enrichment .motifs.rda包含了丰富的motif和带有motif的探针。# getmotif . sub .motif.enrichment.csv包含丰富的motif的摘要。dir(path = "result", pattern = "getMotif")enrichment <- read.csv("result/getMotif.hypo.motif.enrichment.csv") motif.enrichment %>% datatable(options = list(scrollX = TRUE)) # motif enrichment figure will be automatically generated. dir(path = "result", pattern = "motif.enrichment.pdf") ``` ### Figure: hypo.all.quality.motif.enrichment.pdf母题富集图这一步是鉴定其表达与TF结合基序DNA甲基化相关的调控TF,该结合基序DNA甲基化由函数' get.TF '执行。对于每个被认为在特定探针集中富集的基元,它将在基元出现的$\pm250bp$内的所有远端增强子探针上的平均DNA甲基化与人类tf的表达进行比较。对每个motif-TF对进行统计检验,如下所示。样本(所有组样本)被分为两组:M组,由20%的样本在所有基序相邻探针上平均甲基化最高组成,U组,由20%的样本甲基化最低组成。对于每个候选motif-TF对,采用Mann-Whitney U检验检验M组整体基因表达大于或等于U组基因表达的原假设。所有tf按$-log_{10}(P_{r})$进行排序,排名在前5%以内的为上游调控因子候选。来源:姚,李静,等。“从癌症甲基化组推断调节元件景观和转录因子网络。”基因组生物学16.1(2015):105.(图/paper_get_pairs.png)
主要得到的。对参数
| |参数描述 | |--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | 数据| multiAssayExperiment DNA甲基化和基因表达数据。参见' createMAE '函数。| |富集。motif |包含get. enrichment .motif函数输出或XX路径的列表。包含get. enrichment .motif函数输出的Rda文件。| |组。col |定义样本组的列。您可以使用:' colnames(MultiAssayExperiment::colData(data)) '查看可用的列。| group1 | group.col中的组。| | group2 | group.col中的组。| | minSubgroupFrac |一个从0到1的数字,指定用于创建U(未甲基化)和M(甲基化)组的样本百分比,用于将探针连接到TF表达。 Default is 0.4 (lowest quintile of all samples will be in the U group and the highest quintile of all samples in the M group). |
{r,eval=TRUE, message=FALSE, warning =FALSE, results = "hide"} ##识别富集motif TF <- get的调控TF。tf (data = mae, group。col = "definition", group1 = "原发实体瘤",group2 = "实体组织正常",minSubgroupFrac = 0.4,富贵。主题=丰富。主题,dir。Out = "result", cores = 1, label = "hypo")``` ```{r,eval=TRUE, message=FALSE, warning =FALSE} TF %>% datatable(options = list(scrollX =TRUE)) # get。tf会自动保存输出文件。# gettf . hypos . tfs .with.motif.pvalue.rda包含了在富基序位点具有平均# DNA甲基化的所有TF的统计数据。# gettf . sub .significant. tfs .with.motif.summary.csv只包含重要探测。dir(path = "result", pattern = "getTF") #基于统计数据的TF排名图将自动生成。dir(path = "result/TFrankPlot_family/", pattern = "pdf")图:FOXJ3_HUMAN.H10MO.S.TFrankPlot.pdfTF排序图#会话信息{r sessioninfo, eval=TRUE} sessioninfo ()参考书目