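```{r setup, echo=FALSE, results="asis"}
library(LearnBioconductor)
## this material was prepared against Bioconductor 3.0
stopifnot(BiocInstaller::biocVersion() == "3.0")
## results="asis" lets BiocStyle::markdown() emit its style sheet directly
BiocStyle::markdown()
knitr::opts_chunk$set(tidy=FALSE)
```

# Introduction to Bioconductor

Martin Morgan
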
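October 29, 2014

## Bioconductor

Analysis and comprehension of high-throughput genomic data

- Statistical analysis: large data, technological artifacts, designed experiments
- High-throughput sequencing: RNASeq, ChIPSeq, variants, copy number, ...
- Microarrays: expression, SNP, ...
- Flow cytometry, proteomics, images, ...

Packages, vignettes, work flows

- 934 packages
- Discover and navigate via [biocViews][]
- Package 'landing page'
    - Title, author / maintainer, short description, citation, installation instructions, ..., download statistics
- All user-visible functions have help pages, most with runnable examples
- 'Vignettes' are an important feature of Bioconductor -- narrative documents illustrating how to use the package, with integrated code
- 'Release' (every six months) and 'devel' branches
- [Support site](https://support.bioconductor.org); [videos](https://www.youtube.com/user/bioconductor), [recent courses](//www.andersvercelli.com/help/course-materials/)

Objects

- Represent complex data types
- Foster interoperability
- S4 object system
    - Introspection: `getClass()`, `showMethods(..., where=search())`, `selectMethod()`
    - 'accessors' and other documented functions / methods for manipulation, rather than direct access to the object structure
- Interactive help
    - `method?"substr,<tab>"` to select help on methods, `class?D<tab>` for help on classes

```{r Biostrings, message=FALSE}
suppressPackageStartupMessages({
    library(Biostrings)
})
data(phiX174Phage)  # sample data, see ?phiX174Phage
## NOTE: the lines defining 'm' and 'polymorphic' did not survive in this
## copy; the two definitions below are an assumed reconstruction
m <- consensusMatrix(phiX174Phage)[1:4,]   # nucleotide x position counts
polymorphic <- which(colSums(m != 0) > 1)  # positions with > 1 observed base
m[, polymorphic]
```

```{r showMethods, eval=FALSE}
showMethods(class=class(phiX174Phage), where=search())
```

Genomic ranges

- Chromosome (`seqnames`), start, end, and optionally strand
- Coordinates
    - 1-based
    - Closed -- start and end coordinates _included_ in range
    - Left-most -- start is always to the left of end, regardless of strand

Why genomic ranges?

- 'Annotation'
    - Simple ranges: exons, promoters, transcription factor binding sites, CpG islands, ...
    - Lists of ranges: gene models (exons-within-transcripts)
- 'Data'
    - Reads themselves, or derived data
    - Simple ranges: ChIP-seq peaks, SNPs, ungapped reads, ...
    - Lists of ranges: gapped alignments, paired-end reads, ...

Data objects

- `r Biocpkg("GenomicRanges")`::_GRanges_
    - `seqnames()`
    - `start()`, `end()`, `width()`
    - `strand()`
    - `mcols()`: 'metadata' associated with each range, stored as a `DataFrame`
    - Many very useful operations defined on ranges (later)
- `r Biocpkg("GenomicRanges")`::_GRangesList_
    - List-like (e.g., `length()`, `names()`, `[`, `[[`)
    - Each list element is a _GRanges_
    - Metadata at list and element-list levels
    - Very easy (fast) to `unlist()` and `relist()`.
- `r Biocpkg("GenomicAlignments")`::_GAlignments_, _GAlignmentsList_, _GAlignmentPairs_; `r Biocpkg("VariantAnnotation")`::_VCF_, _VRanges_
    - _GRanges_-like objects with more specialized roles

Example: _GRanges_

```{r eg-granges}
## 'Annotation' package; more later...
suppressPackageStartupMessages({
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
})
promoters <- promoters(TxDb.Hsapiens.UCSC.hg19.knownGene)
## 'GRanges' with 2 metadata columns
promoters
head(table(seqnames(promoters)))
table(strand(promoters))
seqinfo(promoters)
## vector-like access
promoters[seqnames(promoters) %in% c("chr1", "chr2")]
## metadata
mcols(promoters)
length(unique(promoters$tx_name))
```

Example: _GRangesList_

```{r eg-GRangesList}
## NOTE: the chunk header and the start of this call were truncated in this
## copy; exonsBy(..., "tx", use.names=TRUE) is assumed from the variable name
## and from the transcript-level questions asked below
exByTx <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx", use.names=TRUE)
## list-like subsetting
exByTx[1:10]                            # also logical, character, ...
exByTx[["uc001aaa.3"]]                  # also numeric
## accessors return typed-List, e.g., IntegerList
width(exByTx)
log10(width(exByTx))
## 'easy' to ask basic questions, e.g., ...
hist(unlist(log10(width(exByTx))))      # widths of exons
exByTx[which.max(max(width(exByTx)))]   # transcript with largest exon
exByTx[which.max(elementLengths(exByTx))] # transcript with most exons
```

There are many neat range-based operations (more later)!

![Range Operations](our_figures/RangeOperations.png)

Some detail

- _GRanges_ and friends use data structures defined in `r Biocpkg("S4Vectors")`, `r Biocpkg("IRanges")`
- These data structures can handle relatively large data easily, e.g., 1-10 million ranges
- Basic concepts are built on _R_'s vector and list; _List_ instances are implemented to be efficient when there are long lists of a few elements each.
- Takes a little getting used to, but very powerful

### Integrated containers

What is an experiment?

- 'Assays'
    - Regions-of-interest x samples
    - E.g., read counts, expression values
- Regions-of-interest
    - Microarrays: probeset or gene identifiers
    - Sequencing: genomic ranges
- Samples
    - Experimental information, covariates
- Overall experimental description

Why integrate?

- Avoid errors when manipulating data
- Case study: [reproducible research]()

Data objects

- `r Biocpkg("Biobase")`::_ExpressionSet_
    - Assays (`exprs()`): matrix of expression values
    - Regions-of-interest (`featureData(); fData()`): probeset or gene identifiers
    - Samples (`phenoData(); pData()`): `data.frame` of relevant information
    - Experiment data (`experimentData()`): instance of class `MIAME`.
- `r Biocpkg("GenomicRanges")`::_SummarizedExperiment_
    - Assays (`assay(), assays()`): arbitrary matrix-like object
    - Regions-of-interest (`rowData()`): `GRanges` or `GRangesList`; use `GRangesList` with names and 0-length elements to represent assays without ranges.
    - Samples (`colData()`): `DataFrame` of relevant information.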
    - Experiment data (`exptData()`): `List` of arbitrary information.

![SummarizedExperiment](our_figures/SummarizedExperiment.png)

Example: `ExpressionSet` (see vignettes in `r Biocpkg("Biobase")`).

```{r eg-ExpressionSet}
suppressPackageStartupMessages({
    library(ALL)
})
data(ALL)
ALL
## 'Phenotype' (sample) and 'feature' data
head(pData(ALL))
head(featureNames(ALL))
## access to pData columns; matrix-like subsetting; exprs()
ALL[, ALL$sex %in% "M"]
range(exprs(ALL))
## 30% 'most variable' features (c.f., genefilter::varFilter)
iqr <- apply(exprs(ALL), 1, IQR)
ALL[iqr > quantile(iqr, 0.7), ]
```

Example: `SummarizedExperiment` (see vignettes in `r Biocpkg("GenomicRanges")`).

```{r eg-SummarizedExperiment}
suppressPackageStartupMessages({
    library(airway)
})
data(airway)
airway
## column and row data
colData(airway)
rowData(airway)
## access colData; matrix-like subsetting; assay() / assays()
airway[, airway$dex %in% "trt"]
head(assay(airway))
assays(airway)
## library size
colSums(assay(airway))
hist(rowMeans(log10(assay(airway))))
```

## Lab

### GC content

1. Calculate the GC content of human chr1 in the hg19 build, excluding
   regions where the sequence is "N". You will need to

    1. Load the `r Biocannopkg("BSgenome.Hsapiens.UCSC.hg19")` package.
    2. Extract, using `[[`, chromosome 1 ("chr1").
    3. Use `alphabetFrequency()` to calculate the count or frequency of
       the nucleotides in chr1.
    4. Use standard _R_ functions to calculate the GC content.

    ```{r gc-reference}
    library(BSgenome.Hsapiens.UCSC.hg19)
    chr1seq <- BSgenome.Hsapiens.UCSC.hg19[["chr1"]]
    chr1alf <- alphabetFrequency(chr1seq)
    ## restrict to A, C, G, T so that "N" is excluded from the denominator
    chr1gc <- sum(chr1alf[c("G", "C")]) / sum(chr1alf[c("A", "C", "G", "T")])
    ```

2. Calculate the GC content of the 'exome' (approximately, all genic
   regions) on chr1. You will need to

    1. Load the `r Biocannopkg("TxDb.Hsapiens.UCSC.hg19.knownGene")` package.
    2. Use `genes()` to extract the genic regions of all genes, then
       subset to chromosome 1.
    3. Use `getSeq,BSgenome-method` to retrieve the sequences of the
       chromosome 1 genes from the BSgenome object.
    4. Use `alphabetFrequency()` (with the argument `collapse=TRUE` --
       why?) and standard _R_ operations to extract the GC content of
       the genes.

    ```{r gc-exons-1}
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    genes <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
    genes1 <- genes[seqnames(genes) %in% "chr1"]
    seq1 <- getSeq(BSgenome.Hsapiens.UCSC.hg19, genes1)
    ## collapse=TRUE sums nucleotide counts over all genes
    alf1 <- alphabetFrequency(seq1, collapse=TRUE)
    gc1 <- sum(alf1[c("G", "C")]) / sum(alf1[c("A", "C", "G", "T")])
    ```

    How does the GC content just calculated compare with the average
    of the GC content of each gene? Use `alphabetFrequency()` again,
    but with `collapse=FALSE`, and adjust the calculation of GC
    content to act on a matrix, rather than a vector. Why are these
    numbers different?

    ```{r gc-exons-2}
    ## one row of nucleotide counts per gene
    alf2 <- alphabetFrequency(seq1, collapse=FALSE)
    gc2 <- rowSums(alf2[, c("G", "C")]) / rowSums(alf2[,c("A", "C", "G", "T")])
    ```

3. Plot a histogram of per-gene GC content, annotating with
   information about chromosome and exome GC content. Use base
   graphics `hist()`, `abline()`, `plot(density(...))`,
   `plot(ecdf(...))`, etc. (one example is below). If this is too
   easy, prepare a short presentation for the class illustrating how
   to visualize this type of information using another _R_ graphics
   package, e.g., `r CRANpkg("ggplot2")`, `r CRANpkg("ggvis")`, or
   `r CRANpkg("lattice")`.

    ```{r gc-density}
    plot(density(gc2))
    ## red: whole-chromosome GC content; blue: pooled genic GC content
    abline(v=c(chr1gc, gc1), col=c("red", "blue"), lwd=2)
    ```

### Integrated containers

This exercise illustrates how integrated containers can be used to
effectively manage data; it does _NOT_ represent a suitable way to
analyze RNASeq differential expression data.

1. Load the `r Biocpkg("airway")` package and `airway` data
   set. Explore it a little, e.g., determining its dimensions (number
   of regions of interest and samples), the information describing
   samples, and the range of values in the `count` assay. The data are
   from an RNA-seq experiment. The `colData()` describe treatment
   groups and other information. The `assay()` is the (raw) number of
   short reads overlapping each region of interest, in each
   sample. The solution to this exercise is summarized above.

2. Create a subset of the data set that contains only the 30% most
   variable (using IQR as a metric) observations. Plot the
   distribution of asinh-transformed (a log-like transformation,
   except near 0) row mean counts.

    ```{r airway-plot}
    iqr <- apply(assay(airway), 1, IQR)
    airway1 <- airway[iqr > quantile(iqr, 0.7),]
    plot(density(rowMeans(asinh(assay(airway1)))))
    ```

3. Use the `r Biocpkg("genefilter")` package `rowttests` function
   (consult its help page!) to compare asinh-transformed read counts
   between the two `dex` treatment groups for each row. Explore the
   result in various ways, e.g., finding the 'most' differentially
   expressed genes, the genes with largest (absolute) difference
   between treatment groups, adding adjusted _P_ values (via
   `p.adjust()`, in the _stats_ package), etc. Can you obtain the read
   counts for each treatment group, for the most differentially
   expressed gene?

    ```{r airway-rowttest}
    suppressPackageStartupMessages({
        library(genefilter)
    })
    ttest <- rowttests(asinh(assay(airway1)), airway1$dex)
    ttest$p.adj <- p.adjust(ttest$p.value, method="BH")
    ttest[head(order(ttest$p.adj)),]
    ## read counts of the top-ranked gene, split by treatment group
    split(assay(airway1)[order(ttest$p.adj)[1], ], airway1$dex)
    ```

4. Add the statistics of differential expression to the `airway1`
   _SummarizedExperiment_. Confirm that the statistics have been
   added.

    ```{r airway-merge}
    mcols(rowData(airway1)) <- ttest
    head(mcols(airway1))
    ```

# Resources

- [Web site][Bioconductor] -- install, learn, use, develop _R_ / _Bioconductor_ packages
- [Support](http://support.bioconductor.org) -- seek help and guidance
- [biocViews](//www.andersvercelli.com/packages/release/BiocViews.html) -- discover packages
- Package landing pages, e.g., [GenomicRanges](//www.andersvercelli.com/packages/release/bioc/html/GenomicRanges.html), including title, description, authors, installation instructions, vignettes (e.g., GenomicRanges '[How To](//www.andersvercelli.com/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf)'), etc.
- [Course](//www.andersvercelli.com/help/course-materials/) and other [help](//www.andersvercelli.com/help/) material (e.g., videos, EdX course, community blogs, ...)

Publications (General _Bioconductor_)

- Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, et al. (2013) Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8): e1003118. doi: [10.1371/journal.pcbi.1003118][GRanges.bib]

Other

- Lawrence, M. 2014. Software for Enabling Genomic Data Analysis. Bioc2014 conference [slides][Lawrence.bioc2014.bib].

[R]: http://r-project.org
[Bioconductor]: //www.andersvercelli.com
[GRanges.bib]: http://dx.doi.org/10.1371/journal.pcbi.1003118
[Scalable.bib]: http://arxiv.org/abs/1409.2864
[Lawrence.bioc2014.bib]: //www.andersvercelli.com/help/course-materials/2014/BioC2014/Lawrence_Talk.pdf

[AnnotationData]: //www.andersvercelli.com/packages/release/BiocViews.html#___AnnotationData
[AnnotationDbi]: //www.andersvercelli.com/packages/release/bioc/html/AnnotationDbi.html
[AnnotationHub]: //www.andersvercelli.com/packages/release/bioc/html/AnnotationHub.html
[BSgenome.Hsapiens.UCSC.hg19]: //www.andersvercelli.com/packages/release/data/annotation/html/BSgenome.Hsapiens.UCSC.hg19.html
[BSgenome]: //www.andersvercelli.com/packages/release/bioc/html/BSgenome.html
[BiocParallel]: //www.andersvercelli.com/packages/release/bioc/html/BiocParallel.html
[Biostrings]: //www.andersvercelli.com/packages/release/bioc/html/Biostrings.html
[Bsgenome.Hsapiens.UCSC.hg19]: //www.andersvercelli.com/packages/release/data/annotation/html/Bsgenome.Hsapiens.UCSC.hg19.html
[CNTools]: //www.andersvercelli.com/packages/release/bioc/html/CNTools.html
[ChIPQC]: //www.andersvercelli.com/packages/release/bioc/html/ChIPQC.html
[ChIPpeakAnno]: //www.andersvercelli.com/packages/release/bioc/html/ChIPpeakAnno.html
[DESeq2]: //www.andersvercelli.com/packages/release/bioc/html/DESeq2.html
[DiffBind]: //www.andersvercelli.com/packages/release/bioc/html/DiffBind.html
[GenomicAlignments]: //www.andersvercelli.com/packages/release/bioc/html/GenomicAlignments.html
[GenomicFiles]: //www.andersvercelli.com/packages/release/bioc/html/GenomicFiles.html
[GenomicRanges]: //www.andersvercelli.com/packages/release/bioc/html/GenomicRanges.html
[Homo.sapiens]: //www.andersvercelli.com/packages/release/data/annotation/html/Homo.sapiens.html
[IRanges]: //www.andersvercelli.com/packages/release/bioc/html/IRanges.html
[KEGGREST]: //www.andersvercelli.com/packages/release/bioc/html/KEGGREST.html
[PSICQUIC]: //www.andersvercelli.com/packages/release/bioc/html/PSICQUIC.html
[Rsamtools]: //www.andersvercelli.com/packages/release/bioc/html/Rsamtools.html
[Rsubread]: //www.andersvercelli.com/packages/release/bioc/html/Rsubread.html
[ShortRead]: //www.andersvercelli.com/packages/release/bioc/html/ShortRead.html
[SomaticSignatures]: //www.andersvercelli.com/packages/release/bioc/html/SomaticSignatures.html
[TxDb.Hsapiens.UCSC.hg19.knownGene]: //www.andersvercelli.com/packages/release/data/annotation/html/TxDb.Hsapiens.UCSC.hg19.knownGene.html
[VariantAnnotation]: //www.andersvercelli.com/packages/release/bioc/html/VariantAnnotation.html
[VariantFiltering]: //www.andersvercelli.com/packages/release/bioc/html/VariantFiltering.html
[VariantTools]: //www.andersvercelli.com/packages/release/bioc/html/VariantTools.html
[biocViews]: //www.andersvercelli.com/packages/release/BiocViews.html#___Software
[biomaRt]: //www.andersvercelli.com/packages/release/bioc/html/biomaRt.html
[cn.mops]: //www.andersvercelli.com/packages/release/bioc/html/cn.mops.html
[edgeR]: //www.andersvercelli.com/packages/release/bioc/html/edgeR.html
[ensemblVEP]: //www.andersvercelli.com/packages/release/bioc/html/ensemblVEP.html
[h5vc]: //www.andersvercelli.com/packages/release/bioc/html/h5vc.html
[limma]: //www.andersvercelli.com/packages/release/bioc/html/limma.html
[metagenomeSeq]: //www.andersvercelli.com/packages/release/bioc/html/metagenomeSeq.html
[org.Hs.eg.db]: //www.andersvercelli.com/packages/release/data/annotation/html/org.Hs.eg.db.html
[org.Sc.sgd.db]: //www.andersvercelli.com/packages/release/data/annotation/html/org.Sc.sgd.db.html
[phyloseq]: //www.andersvercelli.com/packages/release/bioc/html/phyloseq.html
[rtracklayer]: //www.andersvercelli.com/packages/release/bioc/html/rtracklayer.html
[snpStats]: //www.andersvercelli.com/packages/release/bioc/html/snpStats.html
[Gviz]: //www.andersvercelli.com/packages/release/bioc/html/Gviz.html
[epivizr]: //www.andersvercelli.com/packages/release/bioc/html/epivizr.html
[ggbio]: //www.andersvercelli.com/packages/release/bioc/html/ggbio.html
[OmicCircos]: //www.andersvercelli.com/packages/release/bioc/html/OmicCircos.html
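
A note on the GC content lab question "Why are these numbers different?": the following is a minimal base _R_ sketch, not part of the original lab, that may help when discussing it. It uses made-up nucleotide counts for three hypothetical genes (`geneA`, `geneB`, `geneC`, and all variable names, are illustrative assumptions), laid out like the matrix returned by `alphabetFrequency(..., collapse=FALSE)`. It compares the pooled GC content (counts summed over genes first) with the unweighted mean of per-gene GC content.

```r
## Toy nucleotide counts for three hypothetical genes (made-up numbers),
## one row per gene and one column per nucleotide
alf <- rbind(
    geneA = c(A = 10,  C = 40,  G = 40,  T = 10),   # short, GC-rich
    geneB = c(A = 300, C = 200, G = 200, T = 300),  # long, GC-poor
    geneC = c(A = 25,  C = 25,  G = 25,  T = 25)
)

## pooled GC content: sum counts over genes first (like collapse=TRUE)
pooled <- colSums(alf)
gc_pooled <- sum(pooled[c("G", "C")]) / sum(pooled)

## per-gene GC content, then an unweighted mean across genes
gc_per_gene <- rowSums(alf[, c("G", "C")]) / rowSums(alf)
gc_mean <- mean(gc_per_gene)

## weighting each gene by its length recovers the pooled value
gc_weighted <- weighted.mean(gc_per_gene, rowSums(alf))

c(pooled = gc_pooled, mean = gc_mean, weighted = gc_weighted)
```

With these toy counts the unweighted mean is larger than the pooled value, because the short, GC-rich gene counts as much as the long, GC-poor one; the pooled calculation implicitly weights each gene by its length.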