---
title: E. Gene Set Enrichment
author: Martin Morgan (mtmorgan@fredhutch.org)
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{E. Gene Set Enrichment}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r style, echo=FALSE, results='asis'}
suppressPackageStartupMessages({
    library(edgeR)
    library(goseq)
    library(org.Hs.eg.db)
    library(GO.db)
})
```

# Motivation

Is expression of genes in a gene set associated with experimental
condition?

- E.g., are there unusually many up-regulated genes in the gene set?

Many methods; a recent review is Khatri et al., 2012.

- Over-representation analysis (ORA) -- are differentially expressed
  (DE) genes in the set more common than expected?
- Functional class scoring (FCS) -- summarize the statistic of DE of
  genes in a set, and compare to null
- Pathway topology (PT) -- include pathway knowledge in assessing DE
  of genes in a set

## What is a gene set?

**Any** _a priori_ classification of 'genes' into biologically
relevant groups

- Members of the same biochemical pathway
- Proteins expressed in identical cellular compartments
- Co-expressed under certain conditions
- Targets of the same regulatory elements
- On the same cytogenetic band
- ...

Sets do not need to be...

- exhaustive
- disjoint

## Collections of gene sets

Gene Ontology ([GO](http://geneontology.org)) Annotation (GOA)

- CC Cellular Components
- BP Biological Processes
- MF Molecular Function

Pathways

- [MSigDb](http://www.broadinstitute.org/gsea/msigdb/)
- [KEGG](http://genome.jp/kegg) (no longer freely available)
- [reactome](http://reactome.org)
- [PantherDB](http://pantherdb.org)
- ...

E.g., [MSigDb](http://www.broadinstitute.org/gsea/msigdb/)

- c1 Positional gene sets -- chromosome & cytogenetic band
- c2 Curated gene sets from online pathway databases, publications in
  PubMed, and knowledge of domain experts.
- c3 Motif gene sets based on conserved cis-regulatory motifs from a
  comparative analysis of the human, mouse, rat, and dog genomes.
- c4 Computational gene sets defined by mining large collections of
  cancer-oriented microarray data.
- c5 GO gene sets consist of genes annotated by the same GO terms.
- c6 Oncogenic signatures defined directly from microarray gene
  expression data from cancer gene perturbations.
- c7 Immunologic signatures defined directly from microarray gene
  expression data from immunologic studies.

# Statistical approaches

Initially based on a presentation by Simon Anders,
[CSAMA 2010](http://marray.economia.unimi.it/2009/material/lectures/L8_Gene_Set_Testing.pdf)

## Approach 1: hypergeometric tests

Steps

1. Classify each gene as 'differentially expressed' (DE) or not, e.g.,
   based on _P_ < 0.05
2. Are DE genes in the set more common than DE genes not in the set?

   |                               | In gene set: Yes | In gene set: No |
   |-------------------------------|------------------|-----------------|
   | Differentially expressed: Yes | k                | K               |
   | Differentially expressed: No  | n - k            | N - K           |

3. Fisher hypergeometric test, via `fisher.test()` or
   `r Biocpkg("GOstats")`

Notes

- Conditional hypergeometric to accommodate the GO DAG,
  `r Biocpkg("GOstats")`
- But: artificial division into two groups

## Approach 2: enrichment score

- Mootha et al., 2003; modified Subramanian et al., 2005.

Steps

- Sort genes by log fold change
- Calculate running sum: incremented when gene in set, decremented
  when not.
- Maximum of the running sum is the enrichment score ES; large ES
  means that genes in the set are toward the top of the list.
- Permuting subject labels for significance

## Approach 3: category $t$-test

E.g., Jiang & Gentleman, 2007; `r Biocpkg("Category")`

- Summarize $t$ (or other) statistic in each set
- Test for significance by permuting the subject labels
- Much more straightforward to implement

Expression in NEG vs BCR/ABL samples for genes in the 'ribosome' KEGG
pathway; `r Biocpkg("Category")` vignette.

## Competitive versus self-contained null hypothesis

Goeman & Bühlmann, 2007

- Competitive null: The genes in the gene set do not have stronger
  association with the subject condition than other genes. (Approach
  1, 2)
- Self-contained null: The genes in the gene set do not have any
  association with the subject condition. (Approach 3)
- Probably, self-contained null is closer to actual question of
  interest
- Permuting subjects (rather than genes) is appropriate

## Approach 4: linear models

E.g., Hummel et al., 2008, `r Biocpkg("GlobalAncova")`

- Colorectal tumors have good ('stage II') or bad ('stage III')
  prognosis. Do genes in the p53 pathway (_just one gene set!_) show
  different activity at the two stages?
- Linear model incorporates covariates -- sex of patient, location of
  tumor

`r Biocpkg("limma")`

- Majewski et al., 2010 `romer()` and Wu & Smyth 2012 `camera()` for
  enrichment (competitive null) linear models
- Wu et al., 2010: `roast()`, `mroast()` for self-contained null
  linear models

## Approach 5: pathway topology

E.g., Tarca et al., 2009, `r Biocpkg("SPIA")`

- Incorporate pathway topology (e.g., interactions between gene
  products) into significance testing
- Signaling Pathway Impact Analysis
- Combined evidence: pathway over-representation $P_{NDE}$; unusual
  signaling $P_{PERT}$ (equation 1 of Tarca et al.)

Evidence plot, colorectal cancer. Points: pathway gene
sets. Significant after Bonferroni (red) or FDR (blue) correction.

## Issues with sequence data?

- All else being equal, long genes receive more reads than short genes
- Per-gene $P$ values depend on gene size: more reads give more power
  to call a long gene DE

E.g., Young et al., 2010, `r Biocpkg("goseq")`

- Hypergeometric, weighted by gene size
- Substantial differences
- Better: read depth??

DE genes vs. transcript length. Points: bins of 300 genes.
Line: fitted probability weighting function.

## Approach 6: _de novo_ discovery

- So far: analogous to supervised machine learning, where pathways are
  known in advance
- What about unsupervised discovery?

Example: Langfelder & Horvath,
[WGCNA](http://labs.genetics.ucla.edu/horvath/coexpressionnetwork/rpackages/wgcna/)

- Weighted correlation network analysis
- Described in Langfelder & Horvath,
  [2008](http://www.biomedcentral.com/1471-2105/9/559)

## Representing gene sets in R

- Named `list()`, where names of the list are sets, and each element
  of the list is a vector of genes in the set.
- `data.frame()` of set name / gene name pairs
- `r Biocpkg("GSEABase")`

## Conclusions

Gene set enrichment classifications

- Khatri et al.: over-representation analysis; functional class
  scoring; pathway topology
- Goeman & Bühlmann: competitive vs. self-contained null

Selected _Bioconductor_ packages

| Approach          | Packages                                     |
|-------------------|----------------------------------------------|
| Hypergeometric    | `r Biocpkg("GOstats")`, `r Biocpkg("topGO")` |
| Enrichment        | `r Biocpkg("limma")``::romer()`              |
| Category $t$-test | `r Biocpkg("Category")`                      |
| Linear model      | `r Biocpkg("GlobalAncova")`, `r Biocpkg("GSEAlm")`, `r Biocpkg("limma")``::roast()` |
| Pathway topology  | `r Biocpkg("SPIA")`                          |
| Sequence-specific | `r Biocpkg("goseq")`                         |
| _de novo_         | `r CRANpkg("WGCNA")`                         |

# Practical

This practical is based on section 6 of the `r Biocpkg("goseq")`
[vignette](//www.andersvercelli.com/packages/devel/bioc/vignettes/goseq/inst/doc/goseq.pdf).

## 1-6 Experimental design, ..., Analysis of gene differential expression

This (relatively old) experiment examined the effects of androgen
stimulation on a human prostate cancer cell line, LNCaP (Li et al.,
[2008](http://dx.doi.org/10.1073/pnas.0807121105)). The experiment
used short (35bp) single-end reads from 4 control and 3 treated
lines. Reads were aligned to hg19 using Bowtie, and counted using
ENSEMBL 54 gene models.

Input the data to `r Biocpkg("edgeR")`'s `DGEList` data structure.

```{r prostate-edgeR-input}
library(edgeR)
## Count table shipped with goseq: first column is the gene identifier
path <- system.file(package="goseq", "extdata", "Li_sum.txt")
table.summary <- read.table(path, sep='\t', header=TRUE, stringsAsFactors=FALSE)
counts <- table.summary[,-1]
rownames(counts) <- table.summary[,1]
## 4 control samples followed by 3 treated samples
grp <- factor(rep(c("Control","Treated"), times=c(4,3)))
summarized <- DGEList(counts, lib.size=colSums(counts), group=grp)
```

Use a 'common' dispersion estimate, and compare the two groups using
an exact test.

```{r prostate-edgeR-de}
disp <- estimateCommonDisp(summarized)
tested <- exactTest(disp)
topTags(tested)
```

## 7. Comprehension

Start by extracting all P values, then correcting for multiple
comparison using `p.adjust()`. Classify the genes as differentially
expressed or not.

```{r prostate-edgeR-padj}
padj <- with(tested$table, {
    ## drop genes with a log fold change of exactly 0
    keep <- logFC != 0
    value <- p.adjust(PValue[keep], method="BH")
    setNames(value, rownames(tested)[keep])
})
## named logical vector: TRUE = differentially expressed
genes <- padj < 0.05
table(genes)
```

### Gene symbol to pathway

Under the hood, `r Biocpkg("goseq")` uses Bioconductor annotation
packages (in this case `r Biocannopkg("org.Hs.eg.db")` and
`r Biocannopkg("GO.db")`) to map from gene symbols to GO
pathways. Explore these packages through the `columns()` and
`select()` functions. Can you map from ENSEMBL gene identifiers (the
row names of `topTags()`) to GO pathways? What about 'drilling down'
on particular GO identifiers to discover the term's definition?

### Probability weighting function

Calculate the weighting for each gene. This looks up the gene lengths
in a pre-defined table (how could these be calculated using TxDb
packages? What challenges are associated with calculating these
'weights', based on the knowledge that genes typically consist of
several transcripts, each expressed differently?)

```{r prostate-edgeR-pwf}
pwf <- nullp(genes,"hg19","ensGene")
head(pwf)
```

### Over- and under-representation

Perform the main analysis.
This includes association of genes to GO pathways.

```{r prostate-goseq-wall}
GO.wall <- goseq(pwf, "hg19", "ensGene")
head(GO.wall)
```

### What if we'd ignored gene length?

Here we do the same operation, but ignore gene lengths.

```{r prostate-goseq-nobias}
GO.nobias <- goseq(pwf,"hg19","ensGene",method="Hypergeometric")
```

Compare the over-represented P-values for each set, under the
different methods.

```{r prostate-goseq-compare, fig.width=5, fig.height=5}
idx <- match(GO.nobias$category, GO.wall$category)
plot(log10(GO.nobias[, "over_represented_pvalue"]) ~
         log10(GO.wall[idx, "over_represented_pvalue"]),
     xlab="Wallenius", ylab="Hypergeometric",
     xlim=c(-5, 0), ylim=c(-5, 0))
abline(0, 1, col="red", lwd=2)
```

# References

- Khatri et al., 2012, PLoS Comp Biol 8.2: e1002375.
- Subramanian et al., 2005, PNAS 102.43: 15545-15550.
- Jiang & Gentleman, 2007, Bioinformatics Feb 1;23(3):306-13.
- Goeman & Bühlmann, 2007, Bioinformatics 23.8: 980-987.
- Hummel et al., 2008, Bioinformatics 24.1: 78-85.
- Wu & Smyth, 2012, Nucleic Acids Research 40, e133.
- Wu et al., 2010, Bioinformatics 26, 2176-2182.
- Majewski et al., 2010, Blood, published online 5 May 2010.
- Tarca et al., 2009, Bioinformatics 25.1: 75-82.
- Young et al., 2010, Genome Biology 11:R14.
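
# Appendix: a hypergeometric test sketch

The steps of Approach 1 can be sketched in base R. This is a minimal illustration, not part of the practical: the gene names, the single gene set `setA`, and the DE calls are all made up, and the gene set is stored as a named `list()` as described under 'Representing gene sets in R'. In a real analysis the DE calls would come from something like `padj < 0.05`.

```{r hypergeometric-sketch}
## Hypothetical toy data: a universe of 20 genes and one gene set
universe <- paste0("gene", 1:20)
gene_sets <- list(setA = paste0("gene", 1:6))

## Assumed DE calls (TRUE = differentially expressed)
de <- setNames(universe %in% paste0("gene", c(1, 2, 3, 4, 10, 15)), universe)
in_set <- universe %in% gene_sets$setA

## 2 x 2 table: rows = differentially expressed?, columns = in gene set?
tbl <- table(
    DE = factor(ifelse(de, "Yes", "No"), levels = c("Yes", "No")),
    InSet = factor(ifelse(in_set, "Yes", "No"), levels = c("Yes", "No"))
)
tbl

## One-sided Fisher test: are DE genes over-represented in the set?
ft <- fisher.test(tbl, alternative = "greater")
ft$p.value

## The same tail probability, written as a hypergeometric draw:
## sum(de) genes drawn from sum(in_set) 'in set' and sum(!in_set) others
k <- tbl["Yes", "Yes"]
p_hyper <- phyper(k - 1, sum(in_set), sum(!in_set), sum(de),
                  lower.tail = FALSE)
p_hyper
```

The one-sided alternative targets over-representation only; a two-sided test would also flag sets with unusually few DE genes. Because the comparison is against genes outside the set, this corresponds to the competitive null described above.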