---标题:“B.2 - _Bioconductor_构建块”作者:马丁摩根 日期:“2016年5月16日 - 2016年5月”“输出:Biocstyle :: html_document:toc:true toc_depth:2小插图:>%\ vignetteindexentry {b.2 - vignetteengine}%\ vignetteengine {knitr :: Rarkmdown} -- ```{R样式,Echo = False,结果='ASIS'}选项(宽度= 100)KNITR :: OPTS_CHUNK $ SET(eval = AS.LOGICY(SYS.GETENV(“KNITR_EVAL”,“TRUE”)),缓存= as.logical(sys.geteNv(“knitr_cache”,“true”)))`````{r setup,echo = false} suppresspackageStartupMessages({库(摘要化)库(GenomicalIgnments)库(AnnotationHub)库(BioMart)库(Airway)库(org.hs.eg.db)库(txdb.hsapiens.ucsc.hg19.knowngene)库(rnaseqdata.hnrnpc.bam.chr14)})```#核心基础架构##_genomicranges_![Alt](our_figures / granges.png)###范围操作![alt范围代数](our_figures / srangeoperations.png)范围 - 讽刺 - `start()`/`ex()`/`wide()`/`wide()` - list-like - `length()`,subset等 - 'metadata',`mcols()` - granges - 'seqnames'(染色体),'strand' - `seqinfo`,包括NG`Seqlevels`和`seqlengths`内部方法 - 与同一对象中的其他范围无关 - Granges Variants Strand-Aware - `shift()`,`窄()`,`flank()`,`,`促销者()`,`resize()`,`tryict()`,`trim()` - 请参阅“?”范围内 - 方法“”范围内“ - 依赖于同一对象中的其他范围 - ”范围“()`,`dress()`,`空白()`,`disjoin()` - `coverage()`(!) - 请参阅`?“范围内 - 方法”之间的范围方法 - 两个(或更多))范围对象 - “sockoverlaps()`,`,...,`,`%超过%`,在%`中为%`,bater%``Union()`,`intersect()`,`setdiff()`,`penion()`,`petersect()`,`psetdiff()`示例```{r ranges,message = false}库(基因组)GR < - GRANGES(“A”,讽刺(C(10,20,22),宽度= 5),“+”)换档(GR,1)#范围内范围(GR)#间距减少(GR)#范围间SNPS < - GRANGES(“A”,讽刺(C(11,17,24),宽度= 1))Findoverlaps(SNPS,GR)#范围间套件(范围(GR),GR)#'内含子'```````##先见[b.1 _biociductor_的简介_] ## _genomicalign_表示对齐读取的表示。见下面的练习。##注释资源 - _biocidodder_提供了广泛的访问“注释”资源(请参阅[AnnotationData] [] BiocViews层次结构); some interesting examples to explore during this lab include: - [biomaRt][], [PSICQUIC][], [KEGGREST][] and other packages for querying on-line resources; each of these have informative vignettes. - [AnnotationDbi][] is a cornerstone of the [Annotation Data][AnnotationData] packages provided by Bioconductor. - **org** packages (e.g., [org.Hs.eg.db][]) contain maps between different gene identifiers, e.g., ENTREZ and SYMBOL. The basic interface to these packages is described on the help page `?select` - **TxDb** packages (e.g., [TxDb.Hsapiens.UCSC.hg19.knownGene][]) contain gene models (exon coordinates, exon / transcript relationships, etc) derived from common sources such as the hg19 knownGene track of the UCSC genome browser. These packages can be queried, e.g., as described on the `?exonsBy` page to retrieve all exons grouped by gene or transcript. - **BSgenome** packages (e.g., [BSgenome.Hsapiens.UCSC.hg19][]) contain whole genomes of model organisms. - [VariantAnnotation][] and [ensemblVEP][] provide access to sequence annotation facilities, e.g., to identify coding variants; see the [Introduction to VariantAnnotation][] vignette for a brief introduction. - Take a quick look at the [annotation work flow](//www.andersvercelli.com/help/workflows/annotation/annotation/) on the Bioconductor web site. Static packages - _org.\*_: identifier mappings - `select()`, `columns()`, `keys()` - `mapIds()` ```{r} library(org.Hs.eg.db) org <- org.Hs.eg.db select(org, "BRCA1", c("ENSEMBL", "GENENAME"), "SYMBOL") ``` - _TxDb.\*_: gene models - `exons()`, `transcripts()`, `genes()`, `promoters()`, ... - `exonsBy()`, `transcriptsBy()` - `select()`, etc. ```{r} library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene promoters(txdb) ``` Web-based resources, e.g., [biomaRt][], [PSICQUIC][], [GEOquery][], ... Genome-scale resources via [AnnotationHub][] ```{r} library(AnnotationHub) hub = AnnotationHub() hub query(hub, c("ensembl", "81.gtf")) hub[["AH48004"]] ``` ## _SummarizedExperiment_ ![](our_figures/SE_Description.png) - 'feature' x 'sample' `assays()` - `colData()` data frame for desciption of samples - `rowRanges()` _GRanges_ / _GRangeList_ or data frame for description of features - `exptData()` to describe the entire object ```{r SummarizedExperiment} library(SummarizedExperiment) library(airway) data(airway) airway colData(airway) airway[, airway$dex %in% "trt"] chr14 <- as(seqinfo(airway), "GRanges")["14"] airway[airway %over% chr14,] ``` # Exercises ## _GenomicAlignments_ The [RNAseqData.HNRNPC.bam.chr14][] package is an example of an experiment data package. It contains a subset of BAM files used in a gene knock-down experiment, as described in `?RNAseqData.HNRNPC.bam.chr14`. Load the package and get the path to the BAM files. ```{r} library(RNAseqData.HNRNPC.bam.chr14) fls = RNAseqData.HNRNPC.bam.chr14_BAMFILES basename(fls) ``` Create `BamFileList()`, basically telling R that these are paths to BAM files rather than, say, text files from a spreadsheet. ```{r} library(GenomicAlignments) bfls = BamFileList(fls) bfl = bfls[[1]] ``` Input and explore the aligments. See `?readGAlignments` and `?GAlignments` for details on how to manipulate these objects. ```{r} ga = readGAlignments(bfl) ga table(strand(ga)) ``` Many of the reads have cigar "72M". What does this mean? Can you create a subset of reads that do not have this cigar? Interpret some of the non-72M cigars. Any hint about what these cigars represent? ```{r} tail(sort(table(cigar(ga)))) ga[cigar(ga) != "72M"] ``` Use the function `summarizeJunctions()` to identify genomic regions that are spanned by reads with complicated cigars. Can you use the argument `with.revmap=TRUE` to extract the reads supporting a particular (e.g., first) junction? ```{r} summarizeJunctions(ga) junctions <- summarizeJunctions(ga, with.revmap=TRUE) ga[ junctions$revmap[[1]] ] ``` It is possible to do other actions on BAM files, e.g., calculating the 'coverage' (reads overlapping each base). ```{r} coverage(bfl)$chr14 ``` ## Annotation and _GenomicFeatures_ Load the org package for _Homo sapiens_. ```{r} library(org.Hs.eg.db) ``` Use `select()` to annotate the HNRNPC gene symbol with its Entrez identifier and less formal gene name. Create a map between SYMBOL and ENTREZID using `mapIds()`. ```{r} select(org.Hs.eg.db, "HNRNPC", c("ENTREZID", "GENENAME"), "SYMBOL") sym2eg <- mapIds(org.Hs.eg.db, "HNRNPC", "ENTREZID", "SYMBOL") ``` Load the TxDb package for the UCSC hg19 knownGene track ```{r} library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene ``` Extract coordinates of genes, and of exons grouped by gene for the HNRNPC gene. ```{r} gns <- genes(txdb) exonsBy(txdb, "gene")[sym2eg] ``` Use the gene coordinates to query the BAM file for a specific genomic region; see `?ScanBamParam()` for other ways of restricting data input. ```{r} library(Rsamtools) param <- ScanBamParam(which=gns[sym2eg]) readGAlignments(bfl, param=param) ``` ## _SummarizedExperiment_ The [airway][] experiment data package summarizes an RNA-seq experiment investigating human smooth-muscle airway cell lines treated with dexamethasone. Load the library and data set. ```{r} library(airway) data(airway) airway ``` `airway` is an example of the _SummarizedExperiment_ class. Explore its `assay()` (the matrix of counts of reads overlapping genomic regions of interest in each sample), `colData()` (a description of each sample), and `rowRanges()` (a description of each region of interest; here each region is an ENSEMBL gene). ```{r} x <- assay(airway) class(x) dim(x) head(x) colData(airway) rowRanges(airway) ``` The row names are Ensembl gene identifiers. Use `mapIds()` to map from these to gene symbols. ```{r} symid <- mapIds(org.Hs.eg.db, rownames(airway), "SYMBOL", "ENSEMBL") ``` Add the gene symbols to the summarized experiment object. ```{r} mcols(rowRanges(airway))$symid <- symid ``` It's easy to subset a _SummarizedExperiment_ on rows, columns and assays, e.g., retaining just those samples in the `trt` level of the `dex` factor. Accessing elements of the column data is common, so there is a short-cut. ```{r} cidx <- colData(airway)$dex %in% "trt" airway[, cidx] ## shortcut airway[, airway$dex %in% "trt"] ``` It's also easy to perform range-based operations on `SummarizedExperiment` objects, e.g., querying for range of chromosome 14 and then subsetting to contain only genes on this chromosome. Range operations on rows are very common, so there are shortcuts here, too. ```{r} chr14 <- as(seqinfo(rowRanges(airway)), "GRanges")["14"] ridx <- rowRanges(airway) %over% chr14 airway[ridx,] ## shortcut chr14 <- as(seqinfo(airway), "GRanges")["14"] airway[airway %over% chr14,] ``` Use the `assay()` and `rowSums()` function to remove all rows from the `airway` object that have 0 reads overlapping all samples. Summarize the library size (column sums of `assay()`) and plot a histogram of the distribution of reads per feature of interest. ## _AnnotationHub_ The [Roadmap Epigenomics Project][] generated genome-wide maps of regulatory marks across a number of cell lines. Retrieve the Epigenome Roadmap table from [AnnotationHub][]... ```{r} library(AnnotationHub) hub <- AnnotationHub() query(hub, c("epigenome", "metadata")) meta <- hub[["AH41830"]] ``` Explore the metadata to identify a cell line of interest to you; see also the [metadata][] spreadsheet version of the data made available by the Epigenome roadmap project. ```{r} table(meta$ANATOMY) meta[meta$ANATOMY == "LIVER",] ``` Use the 'EID' to query for and retrieve the 'mnemonic' file summarizing chromatin state ```{r} query(hub, c("E118", "mnemonic")) E118 <- hub[["AH46971"]] E118 ``` Explore the object, e.g., tabulating the different chromatin state classifications (in the `name` column). Subset the object to return, e.g., just those regions marked as 'Heterochromatin' ```{r} table(E118$name) E118[E118$name %in% "Heterochromatin"] ``` Can you, using a TxDb package and the `genes()` and `subsetByOverlaps()` functions, determine how many genes overlap heterochromatic states, or the genes `nearest()` each enhancer? ## _biomaRt_ Visit the [biomart website][] and figure out how to browse data to retreive, e.g., genes on chromosmes 21 and 22. You'll need to browse to the ensembl mart, _Homo spaiens_ data set, establish filters for chromosomes 21 and 22, and then specify that you'd like the Ensembl gene id attribute returned. Now do the same process in [biomaRt][]: ```{r biomart, eval=FALSE} library(biomaRt) head(listMarts(), 3) ## list marts head(listDatasets(useMart("ensembl")), 3) ## mart datasets ensembl <- ## fully specified mart useMart("ensembl", dataset = "hsapiens_gene_ensembl") head(listFilters(ensembl), 3) ## filters myFilter <- "chromosome_name" substr(filterOptions(myFilter, ensembl), 1, 50) ## return values myValues <- c("21", "22") head(listAttributes(ensembl), 3) ## attributes myAttributes <- c("ensembl_gene_id","chromosome_name") ## assemble and query the mart res <- getBM(attributes = myAttributes, filters = myFilter, values = myValues, mart = ensembl) ``` [B.1 Introduction to _Bioconductor_]: ./B1_Bioconductor_Intro.html [Roadmap Epigenomics Project]: http://egg2.wustl.edu/roadmap/web_portal/ [metadata]: https://docs.google.com/spreadsheets/d/1yikGx4MsO9Ei36b64yOy9Vb6oPC5IBGlFbYEt-N6gOM/edit#gid=15 [biomart website]: http://biomart.org [Introduction to VariantAnnotation]: //www.andersvercelli.com/packages/release/bioc/vignettes/ShortRead/inst/doc/Overview.pdf [AnnotationDbi]: //www.andersvercelli.com/packages/AnnotationDbi [AnnotationHub]: //www.andersvercelli.com/packages/AnnotationHub [BSgenome.Hsapiens.UCSC.hg19]: //www.andersvercelli.com/packages/BSgenome.Hsapiens.UCSC.hg19 [BSgenome]: //www.andersvercelli.com/packages/BSgenome [Biostrings]: //www.andersvercelli.com/packages/Biostrings [GenomicAlignments]: //www.andersvercelli.com/packages/GenomicAlignments [GenomicFeatures]: //www.andersvercelli.com/packages/GenomicFeatures [GenomicRanges]: //www.andersvercelli.com/packages/GenomicRanges [KEGGREST]: //www.andersvercelli.com/packages/KEGGREST [PSICQUIC]: //www.andersvercelli.com/packages/PSICQUIC [RNAseqData.HNRNPC.bam.chr14]: //www.andersvercelli.com/packages/RNAseqData.HNRNPC.bam.chr14 [Rsamtools]: //www.andersvercelli.com/packages/Rsamtools [ShortRead]: //www.andersvercelli.com/packages/ShortRead [TxDb.Hsapiens.UCSC.hg19.knownGene]: //www.andersvercelli.com/packages/TxDb.Hsapiens.UCSC.hg19.knownGene [VariantAnnotation]: //www.andersvercelli.com/packages/VariantAnnotation [airway]: //www.andersvercelli.com/packages/airway [biomaRt]: //www.andersvercelli.com/packages/biomaRt [GEOquery]: //www.andersvercelli.com/packages/GEOquery [ensemblVEP]: //www.andersvercelli.com/packages/ensemblVEP [org.Hs.eg.db]: //www.andersvercelli.com/packages/org.Hs.eg.db [rtracklayer]: //www.andersvercelli.com/packages/rtracklayer [AnnotationData]: //www.andersvercelli.com/packages/release/BiocViews.html#___AnnotationData