---标题:“7.芯片-SEQ用于了解基因调节”作者:“马丁摩根(Martin.morgan@roswellpark.org)
罗斯威尔公园癌症研究所,布法罗,纽约
2015年10月5日 - 9“输出:生物科技:: html_document:toc:true:true toc_depth:2小插图:>%\ vignetteindexentry {7.芯片-SEQ用于了解基因调节}%\ vignetteengine {knitr :: Rarkndown} ---```{r style,echo = false,结果='asis'}生物焦质:: markdown()选项(width = 100,max.print = 1000)knitr :: opts_chunk $ set(eval = as.logical(sys.getenv(“knitr_eval”,“true”)),cache = as.logical(sys.getenv(“knitr_cache”,“true”)))`````{r setup,echo = false,消息= false,警告= false} suppresspackageStartUpMessages({库(CSAW)库(Edger)库(Genomicranges)库(ChipSeeker)库(GenseeSeker)库(TXDB.hsapiens.ucsc.hg19.knowngene)库(org.hs.eg.db)库(ClusterProfiler)})```本课程中的材料需要R版本3.2和Biocumon V9.2``` {r Configure-test} stopifnot(getRversion()> ='3.2'&& getRversion()<'3.3',Biocinstaller::Biocversion()==“3.2”)```#动机和工作流程键参考 - Kharchenko,Tolstorukov和Park([2008](http://www.ncbi.nlm.nih.gov/pmc/articles/pmc2597701/))。- LUN和SMYTH([2014](http://dx.doi.org/10.1093/nar/gku351))。## Chip-SEQ LUN,[BioC 2015](http://biocondudard.org/help/course-materials/2015/bioc2015/csaw_lab.html)![芯片-SEQ卡通](your_figures / lun chipseq-cartoon。PNG)Kharchenko等。([2008](http://www.ncbi.nlm.nih.gov/pmc/articles/pmc2597701/))。![芯片-SEQ概述](of_figures / chipseq_nbt-1508-f1.jpg) - 标签与测序读数;single-end read extension in 3' direction - Strand shift / cross-correlation - Defined (narrow, e.g., transcription factor binding sites) versus diffuse (e.g., histone marks) peaks ChIP-seq for differential binding - Designed experiment with replicate samples per treatment - Analysis using insights from microarrays / RNA-seq Novel statistical issues - Inferring peaks without 'data snooping' (using the same data twice, once to infer peaks, once to estimate differential binding) - Retaining power - Minimizing false discovery rate ## Work flow - Following Bailey et al., [2013](http://dx.doi.org/10.1371/journal.pcbi.1003326) ![](our_figures/ChIPSeq-workflow.png) Experimental design and execution - Single sample - ChIPed transcription factor and\ldots - Input (fragmented genomic DNA) or control (e.g., IP with non-specific antibody such as immunoglobulin G, IgG) - Designed experiments - Replication of TF / control pairs Sequencing & alignment - Sequencing depth rules of thumb: $>10M$ reads for narrow peaks, $>20M$ for broad peaks - Long & paired end useful but not essential -- alignment in ambiguous regions - Basic aligners generally adequate, e.g., no need to align splice junctions - Sims et al., [2014](http://dx.doi.org/10.1038/nrg3642) Peak calling - Very large number of peak calling programs; some specialized for e.g., narrow vs. broad peaks. - Commmonly used: [MACS](http://liulab.dfci.harvard.edu/MACS/), PeakSeq, CisGenome, ... - MACS: Model-based Analysis for ChIP-Seq, Liu et al., [2008](http://dx.doi.org/10.1186/gb-2008-9-9-r137) - Scale control tag counts to match ChIP counts - Center peaks by shifting $d/2$ - Model occurrence of a tag as a Poisson process - Look for fixed width sliding windows with exceess number of tag enrichment - Empirical FDR: Swap ChIP and control samples; FDR is \# control peaks / \# ChIP peaks - Output: BED file of called peaks Down-stream analysis - Annotation: what genes are my peaks near? - Differential representation: which peaks are over- or under-represented in treatment 1, compared to treatment 2? - Motif identification (peaks over known motifs?) and discovery - Integrative analysis, e.g., assoication of regulatory elements and expression ## Peak calling 'Known' ranges - Count tags in pre-defined ranges, e.g., promoter regions of known genes - Obvious limitations, e.g., regulatory elements not in specified ranges; specified range contains multiple regulatory elements with complementary behavior _de novo_ windows - Width: narrow peaks, 1bp; broad peaks, 150bp - Offset: 25-100bp; influencing computational burden _de novo_ peak calling - Third-party software (many available; [MACS](http://liulab.dfci.harvard.edu/MACS/) commonly used) - Various strategies for calling peaks -- Lun & Smyth, [Table 1](http://nar.oxfordjournals.org/content/42/11/e95/T1.expansion.html) - Call each sample independently; intersection or union of peaks across samples, ... - Call peaks from a pooled library - ... - Relevant slides [pdf](//www.andersvercelli.com/help/course-materials/2014/CSAMA2014/4_Thursday/lectures/ChIPSeq_slides.pdf) ## Peak calling across libraries -表格1:峰值呼叫策略的描述。每个策略都被赋予标识符,并且由运行MAC的模式,运行它的图书馆以及在库或组之间组合峰值的合并操作(如果有的话)。对于方法6,采用富集的每个方向上的峰的结合。
ID 模式 图书馆 手术
1 单样本 个人 联盟
2 单样本 个人 路口
3. 单样本 个人 至少2
4. 单样本 汇集在一起 联盟
5. 单样本 汇集在一起 路口
6. 两个样本 汇集在一起 联盟
7. 单样本 汇总了所有 -
- 如何选择?- LUN&SMYTH, - 在空假设下,I型错误率是统一的```{R null-p,null下的100,000 t-tests,n = 6 n < - 6;m < - 矩阵(rnorm(n * 100000),ncol = n)p < - genefilter :: rowtttess(m,factor(rep(1:2,每个= 3)))$ p.value stantile(p,c(.001,.01,.05))hist(p,breaks = 20)``` - [表2](http://nar.oxfordjournals.org/content/42/11/e95/t2.expansion.html):I型错误的后果 - 最佳策略:从池库呼叫峰值 -表2.:使用来自每个峰值呼叫策略的计数,观察到的I型错误率。显示了一系列指定错误阈值的错误率。所有值表示具有括号中所示的标准误差的10个模拟迭代的平均值。RA:参考分析使用10 000随机选择真正的峰值。
ID 错误率
0.01 0.05 0.1
ra. 0.010(0.000) 0.051(0.001) 0.100(0.002)
1 0.002(0.000) 0.019(0.001) 0.053(0.001)
2 0.003(0.000) 0.030(0.000) 0.073(0.001)
3. 0.006(0.000) 0.042(0.001) 0.092(0.001)
4. 0.033(0.001) 0.145(0.001) 0.261(0.002)
5. 0.000(0.000) 0.001(0.000) 0.005(0.000)
6. 0.088(0.006) 0.528(0.013) 0.893(0.006)
7. 0.010(0.000) 0.049(0.001) 0.098(0.001)
#实用:峰概要和注释(`r biocpkg(“chipseeker”)`)the [`chipseeker`vignette](http://biocidodde.org/packages/devel/bioc/vignettes/cipseeker/inst/doc/chipseeker。HTML)是一个优秀的资源,我们将以我们的实验室通往它的部分。````{rVignette,可以找到更详细的详细信息。这是以两种方式创新的:(1)它没有称之为“峰值”,而是基于对跨越整个基因组的_Windows_的分析;(2)它强调了跨样品芯片模式的比较,寻找治疗组之间的_百分比结合_。首先下载[csaw-data.rds] []和[csaw-normfacs.rds] [] ## 1 - 4:实验设计...对齐实验涉及胚胎干细胞之间的NFYA蛋白的结合谱的变化终端神经元。它是Tiwari等人提供的数据的子集。[2012](http://dx.doi.org/10.1038/ng.1036)可用作[gse25532](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE25532)。有两种ES(胚胎干细胞)和两种TN(末端神经元)重复。 Single-end FASTQ files were extracted from GEO, aligned using `r Biocpkg("Rsubread")`, and post-processed (sorted and indexed) using `r Biocpkg("Rsamtools")` with this script: ``` ## SystemRequirements: ascp; fastq-dump source("setup.R") library(SRAdb) library(Rsubread) library(Rsamtools) library(BiocParallel) sradb <- "SRAmetadb.sqlite" key <- "/app/aspera-connect/3.1.1/etc/asperaweb_id_dsa.openssh" cmd = sprintf("ascp -TT -l300m -i %s", key) source("setup.R") if (!file.exists(sradb)) getSRAdbFile() con = dbConnect(dbDriver("SQLite"), sradb) accs <- rownames(files)[!file.exists(files$sra)] for (acc in accs) sraFiles = ascpSRA(acc, con, cmd, fileType="sra", destDir=getwd()) sras <- files$sra[!file.exists(files$fastq)] bplapply(sras, function(sra) system(sprintf("fastq-dump --gzip %s", sra))) fastqs <- files$fastq[!file.exists(files$bam)] if (length(fastqs)) Rsubread::align("../mm10/mm10.Rsubread.index", fastqs, nthreads=parallel::detectCores() / 2L) bams <- files$bam[!file.exists(sprintf("%s.bai", files$bam))] bams_sorted <- sub(".BAM", ".sorted.bam", bams) sorted <- bpmapply(sortBam, bams, bams_sorted) ## oops! didn't mean to do the next line file.rename(sorted, names(sorted)) bplapply(sorted, indexBam) ``` This generates 4 BAM files, one per sample. The BAM files are about 2 GB each. The files are summarized by the following data frame: ```{r csaw-setup} files <- local({ acc <- c(es_1="SRR074398", es_2="SRR074399", tn_1="SRR074417", tn_2="SRR074418") data.frame(Treatment=sub("_.*", "", names(acc)), Replicate=sub(".*_", "", names(acc)), sra=sprintf("%s.sra", acc), fastq=sprintf("%s.fastq.gz", acc), bam=sprintf("%s.fastq.gz.subread.BAM", acc), row.names=acc, stringsAsFactors=FALSE) }) ``` ## 5: Reduction To process the data, I'll change to the directory where the BAM files are located at ```{r csaw-setwd, eval=FALSE} setwd("~/UseBioconductor-data/ChIPSeq/NFYA") ``` ...and then load the csaw library and count reads in overlapping windows. ```{r csaw-reduction, eval=FALSE} library(csaw) library(GenomicRanges) frag.len <- 110 system.time({ data <- windowCounts(files$bam, width=10, ext=frag.len) }) # 156 seconds acc <- sub(".fastq.*", "", data$bam.files) colData(data) <- cbind(files[acc,], colData(data)) ``` Load this data into your own _R_ session with the following command: ```{r load-csaw} data <- readRDS("csaw-data.Rds") ``` `data` is a `SummarizedExperiment`, so explore it a bit... ## 6: Analysis **Filtering** (vignette Chapter 3) Start by filtering low-count windows. There are likely to be many of these (how many?). Is there a rational way to choose the filtering threshold? ```{r csaw-filter} library(edgeR) # for aveLogCPM() keep <- aveLogCPM(assay(data)) >= -1 data <- data[keep,] ``` **Normalization (composition bias)** (vignette Chapter 4) csaw uses binned counts in normalization. The bins are large relative to the ChIP peaks, on the assumption that the bins primarily represent non-differentially bound regions. The sample bin counts are normalized using the `r Biocpkg("edgeR")` TMM (trimmed median of M values) method seen in the RNASeq differential expression lab. Explore vignette chapter 4 for more on normalization (this is a useful resource when seeking to develop normalization methods for other protocols!). This requires access to the big data, so we don't run the following code ```{r csaw-normalize, eval=FALSE} system.time({ binned <- windowCounts(files$bam, bin=TRUE, width=10000) }) #139 second normfacs <- normalize(binned) ``` ...but instead load the summary into our session ```{r csaw-normacs-load} normfacs <- readRDS("csaw-normfacs.Rds") ``` **Experimental design and Differential binding** (vignette Chapter 5) Differential binding will be assessed using `r Biocpkg("edgeR")`, where we need to specify the experimental design ```{r csaw-experimental-design} design <- model.matrix(~Treatment, colData(data)) ``` Apply a standard `r Biocpkg("edgeR")` work flow to identify differentially bound regions. Creatively explore the results. ```{r csaw-de} y <- asDGEList(data, norm.factors=normfacs) y <- estimateDisp(y, design) fit <- glmQLFit(y, design, robust=TRUE) results <- glmQLFTest(fit, contrast=c(0, 1)) head(results$table) ``` **Multiple testing** (vignette Chapter 6) The challenge is that FDR across all detected differentially bound _regions_ is what one is interested in, but what is immediately available is the FDR across differentially bound _windows_; region will often consist of multiple overlapping windows. As a first step, we'll take a 'quick and dirty' approach to identifying regions by merging 'high-abundance' windows that are within, e.g., 1kb of one another ```{r csaw-merge-windows} merged <- mergeWindows(rowRanges(data), tol=1000L) ``` Combine test results across windows within regions. Several strategies are explored in section 6.5 of the vignette. ```{r csaw-combine-merged-tests} tabcom <- combineTests(merged$id, results$table) head(tabcom) ``` Section 6.6 of the vignette discusses approaches to identifying the 'best' windows within regions. Finally, create a `GRangesList` that associated with two result tables and the genomic ranges over which the results were calculated. ```{r csaw-grangeslist} gr <- rowRanges(data) mcols(gr) <- as(results$table, "DataFrame") grl <- split(gr, merged$id) mcols(grl) <- as(tabcom, "DataFrame") ``` # Resources Acknowledgements - Core (Seattle): Sonali Arora, Marc Carlson, Nate Hayden, Jim Hester, Valerie Obenchain, Hervé Pagès, Paul Shannon, Dan Tenenbaum. - The research reported in this presentation was supported by the National Cancer Institute and the National Human Genome Research Institute of the National Institutes of Health under Award numbers U24CA180996 and U41HG004059, and the National Science Foundation under Award number 1247813. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation. ## `sessionInfo()` ```{r sessionInfo} sessionInfo() ``` [csaw-data.Rds]: https://raw.githubusercontent.com/Bioconductor/BiocUruguay2015/master/vignettes/csaw-data.Rds [csaw-normfacs.Rds]: https://raw.githubusercontent.com/Bioconductor/BiocUruguay2015/master/vignettes/csaw-normfacs.Rds