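---
title: "Introduction to Bioconductor annotation resources"
author: "James W. MacDonald"
date: June 26 | BioC 2016
output:
  ioslides_presentation:
    fig_retina: null
    css: style.css
    widescreen: false
vignette: >
  %\VignetteIndexEntry{Introduction to Bioconductor annotation resources}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Goals for this workshop

- Learn about various annotation package types

- Learn the basics of querying these resources

- Discuss annotations in regard to Bioc data structures

- Get in some practice

```{r setup, include = FALSE}
library(BiocAnno2016)
library(hugene20sttranscriptcluster.db)
library(EnsDb.Mmusculus.v79)
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(Homo.sapiens)
library(BSgenome)
library(BSgenome.Hsapiens.UCSC.hg19)
library(AnnotationHub)
```

## What do we mean by annotation?

Map a known ID to other functional or positional information

## Specific goal

We have data and statistics, and we want to add other useful
information

The end result might be as simple as a data.frame or HTML table, or as
complex as a `RangedSummarizedExperiment`

## Data containers

## ExpressionSet

```{r}
load(system.file("data/eset.Rdata", package = "BiocAnno2016"))
eset
```

## ExpressionSet (continued)

```{r}
head(exprs(eset))
head(pData(phenoData(eset)))
```

## ExpressionSet (continued)

```{r}
head(pData(featureData(eset)))
```

## BioC containers vs basic structures

### Pros

* Validity checking

* Subsetting

* Function dispatch

* Automatic behaviors

### Cons

* Difficult to create

* Cumbersome to extract data by hand

* Useful only within R

## Annotation sources

```{r, results = "asis", echo = FALSE}
df <- data.frame("Package type" = c("ChipDb", "OrgDb", "TxDb/EnsDb", "OrganismDb",
                                    "BSgenome", "Others", "AnnotationHub", "biomaRt"),
                 Example = c("hugene20sttranscriptcluster.db", "org.Hs.eg.db",
                             "TxDb.Hsapiens.UCSC.hg19.knownGene; EnsDb.Hsapiens.v75",
                             "Homo.sapiens", "BSgenome.Hsapiens.UCSC.hg19",
                             "GO.db; KEGG.db", "Online resources", "Online resources"),
                 check.names = FALSE)
knitr::kable(df)
```

## Interacting with AnnoDb packages

The main function is `select`:

select(*annopkg*, *keys*, *columns*, *keytype*)

Where

* annopkg is the annotation package

* keys are the IDs that we **know**

* columns are the values we **want**

* keytype is the type of key used
    + if the keytype is the **central** key, it can remain unspecified

## Simple example

Say we have analyzed data from an Affymetrix Human Gene ST 2.0 array
and want to know what the genes are. For the purposes of this lab, we
just select some IDs at random.

```{r}
library(hugene20sttranscriptcluster.db)
set.seed(12345)
## pick five probeset IDs at random
ids <- featureNames(eset)[sample(1:25000, 5)]
ids
select(hugene20sttranscriptcluster.db, ids, "SYMBOL")
```

## Questions!

How do you know what the central keys are?
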
* If it's a ChipDb, the central keys are the manufacturer's probe IDs

* It's sometimes in the name - org.Hs.eg.db, where 'eg' means Entrez
  Gene ID

* You can see examples using e.g., head(keys(*annopkg*)), and infer
  from that

* But note that it's never necessary to know the central key, as long
  as you specify the keytype

## More questions!

What keytypes or columns are available for a given annotation package?

```{r}
keytypes(hugene20sttranscriptcluster.db)
columns(hugene20sttranscriptcluster.db)
```

## Another example

There is one issue with `select`, however: a key that maps to more
than one value is returned as several rows.

```{r}
ids <- c('16737401','16657436' ,'16678303')
select(hugene20sttranscriptcluster.db, ids, c("SYMBOL","MAP"))
```

## The `mapIds` function

An alternative to `select` is `mapIds`, which gives control of
duplicates

* Same arguments as `select` with slight differences

    - The columns argument can only specify one column

    - The keytype argument **must** be specified

    - An additional argument, `multiVals`, is used to control duplicates

```{r}
mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID")
```

## Choices for multiVals

Default is `first`, where we just choose the first of the
duplicates. Other choices are `list`, `CharacterList`, `filter`,
`asNA` or a user-specified function.

```{r}
mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "list")
```

## Choices for multiVals (continued)

```{r}
mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "CharacterList")
mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "filter")
mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "asNA")
```

## ChipDb/OrgDb questions

Using either the hugene20sttranscriptcluster.db or org.Hs.eg.db package,

* What gene symbol corresponds to Entrez Gene ID 1000?

* What is the Ensembl Gene ID for PPARG?

* What is the UniProt ID for GAPDH?

* How many of the probesets from the ExpressionSet (eset) we loaded map to a single gene?
  How many don't map to a gene at all?

## TxDb packages

TxDb packages contain positional information; the contents can be
inferred from the package name

**TxDb.Species.Source.Build.Table**

* TxDb.Hsapiens.UCSC.hg19.knownGene

    - *Homo sapiens*

    - UCSC genome browser

    - hg19 (their version of GRCh37)

    - knownGene table

TxDb.Dmelanogaster.UCSC.dm3.ensGene

TxDb.Athaliana.BioMart.plantsmart22

## EnsDb packages

EnsDb packages are similar to TxDb packages, but based on Ensembl
mappings

EnsDb.Hsapiens.v79

EnsDb.Mmusculus.v79

EnsDb.Rnorvegicus.v79

## Transcript packages

As with ChipDb and OrgDb packages, `select` and `mapIds` can be used
to make queries

```{r}
library(EnsDb.Hsapiens.v79)  ## needed for the EnsDb query below
select(TxDb.Hsapiens.UCSC.hg19.knownGene, c("1","10"),
       c("TXNAME","TXCHROM","TXSTART","TXEND"), "GENEID")
select(EnsDb.Hsapiens.v79, c("1", "10"),
       c("GENEID","GENENAME","SEQNAME","GENESEQSTART","GENESEQEND"), "ENTREZID")
```

But this is not how one normally uses them...

## GRanges

The normal use case for transcript packages is to extract positional
information into a `GRanges` or `GRangesList` object. An example is
the genomic position of all genes:

```{r}
gns <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
gns
```

## GRangesList

Or the genomic position of all transcripts **by** gene:

```{r}
txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
txs
```

## Other accessors

* Positional information can be extracted for `transcripts`, `genes`, coding
  sequences (`cds`), `promoters` and `exons`.

* Positional information can be extracted for most of the above, grouped
  by a second element. For example, our `transcriptsBy` call was all
  transcripts, grouped by gene.

* More detail on these \*Ranges objects is beyond the scope of this
  workshop, but why we want them is not.

## Why \*Ranges objects

The main rationale for \*Ranges objects is to allow us to easily
select and subset data based on genomic position information. This is
really powerful!

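As a rough sketch of what position-based subsetting looks like, the chunk below builds a small made-up `GRanges` and keeps only the ranges that overlap a query region. The range names and coordinates are invented for illustration and are not real annotation; `subsetByOverlaps` and `%over%` are assumed to come from the GenomicRanges package and its dependencies.

```{r}
library(GenomicRanges)

## Toy ranges: names and coordinates are hypothetical
toy <- GRanges(seqnames = c("chr1", "chr1", "chr2"),
               ranges = IRanges(start = c(100, 500, 100),
                                end = c(200, 600, 200)),
               strand = c("+", "-", "+"))
names(toy) <- c("geneA", "geneB", "geneC")

## A query region on chr1 (strand left unspecified, so it matches either strand)
region <- GRanges("chr1", IRanges(150, 550))

## Keep only the ranges that overlap the region; both forms select the same ranges
hits <- subsetByOverlaps(toy, region)
hits2 <- toy[toy %over% region]
names(hits)
```

Only the two chr1 ranges are kept, because the chr2 range cannot overlap a chr1 query regardless of its coordinates.
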
`GRanges` and `GRangesList`s act like data.frames and lists, and can
be subsetted using the `[` function. As a really artificial example:

```{r}
## transcripts (grouped by gene) that overlap the first two genes
txs[txs %over% gns[1:2,]]
```

## \*Ranges use cases

* Gene expression changes near differentially methylated CpG islands

* Closest genes to a set of interesting SNPs

* Genes near DNase I hypersensitivity clusters

* Number of CpGs measured over Gene X by Chip Y

## SummarizedExperiment objects

SummarizedExperiment objects are like ExpressionSets, but the row-wise
annotations are GRanges, so you can subset by genomic locations.

## TxDb exercises

* How many transcripts does PPARG have, according to UCSC?

* Does Ensembl agree?

* How many genes are between 2858473 and 3271812 on chr2 in the hg19
  genome?

## OrganismDb packages

All previous accessors work; `select`, `mapIds`, `transcripts`, etc.

```{r}
library(Homo.sapiens)
```

## OrganismDb packages

* Calls to TxDb accessors include a 'columns' argument

```{r}
head(genes(Homo.sapiens, columns = c("ENTREZID","ALIAS","UNIPROT")), 4)
```

## OrganismDb exercises

* Get all the GO terms for BRCA1

* What gene does the UCSC transcript ID uc002fai.3 map to?

* How many other transcripts does that gene have?

* Get all the transcripts from the hg19 genome build, along with their
  Ensembl gene ID, UCSC transcript ID and gene symbol

## BSgenome packages

BSgenome packages contain sequence information for a given
species/build. There are many such packages - you can get a listing
using `available.genomes`

```{r}
library(BSgenome)
head(available.genomes())
```

## BSgenome packages

We can load and inspect a BSgenome package

```{r}
library(BSgenome.Hsapiens.UCSC.hg19)
Hsapiens
```

## BSgenome packages

The main accessor is `getSeq`, and you can get data by sequence (e.g.,
entire chromosome or unplaced scaffold), or by passing in a GRanges
object, to get just a region.

```{r}
getSeq(Hsapiens, "chr1")
getSeq(Hsapiens, gns["5467",])
```

The Biostrings package contains most of the code for dealing with
these `*StringSet` objects - please see the Biostrings vignettes and
help pages for more information.

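To give a flavour of the Biostrings functions that can be applied to `getSeq` output, here is a minimal sketch using a hand-made `DNAStringSet`. The sequences and their names are made up and simply stand in for real `getSeq` results, so the chunk needs only the Biostrings package.

```{r}
library(Biostrings)

## Toy sequences standing in for getSeq() output (hypothetical)
seqs <- DNAStringSet(c(seqA = "AACCGG", seqB = "ATATAT"))

## Length of each sequence
width(seqs)

## Reverse complement of each sequence
rc <- reverseComplement(seqs)
as.character(rc)

## Fraction of G or C in each sequence (one row per sequence)
gc <- letterFrequency(seqs, "GC", as.prob = TRUE)
gc
```

The same calls should work unchanged on the `DNAStringSet` returned by `getSeq` when it is given a `GRanges`.
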
## BSgenome exercises

* Get the sequences for all transcripts of the TP53 gene

## AnnotationHub

AnnotationHub is a package that allows us to query and download many
different annotation objects, without having to explicitly install
them.

```{r, include = FALSE}
library(AnnotationHub)
hub <- AnnotationHub()
```

```{r}
library(AnnotationHub)
hub <- AnnotationHub()
hub
```

## Querying AnnotationHub

Finding the 'right' resource on AnnotationHub is like using Google - a
well posed query is necessary to find what you are after. Useful
queries are based on

* Data provider

* Data class

* Species

* Data source

```{r}
names(mcols(hub))
```

## AnnotationHub Data providers

```{r}
unique(hub$dataprovider)
```

## AnnotationHub Data classes

```{r}
unique(hub$rdataclass)
```

## AnnotationHub Species

```{r}
head(unique(hub$species))
length(unique(hub$species))
```

## AnnotationHub Data sources

```{r}
unique(hub$sourcetype)
```

## AnnotationHub query

```{r}
qry <- query(hub, c("granges","homo sapiens","ensembl"))
qry
```

## AnnotationHub query

```{r}
qry$sourceurl
```

## Selecting AnnotationHub resource

```{r, message = FALSE}
whatIwant <- qry[["AH50377"]]
```

We can use these data as they are, or convert to a TxDb format:

```{r}
GRCh38TxDb <- makeTxDbFromGRanges(whatIwant)
GRCh38TxDb
```

## AnnotationHub exercises

* How many resources are on AnnotationHub for Atlantic salmon (Salmo
  salar)?

* Get the most recent Ensembl build for domesticated dog (Canis
  familiaris) and make a TxDb

## biomaRt

The biomaRt package allows queries to an Ensembl Biomart server. We
can see the choices of servers that we can use:

```{r}
library(biomaRt)
listMarts()
```

## biomaRt data sets

And we can then check for the available data sets on a particular
server.

```{r}
mart <- useMart("ENSEMBL_MART_ENSEMBL")
head(listDatasets(mart))
```

## biomaRt queries

After setting up a `mart` object pointing to the server and data set
that we care about, we can make queries. We first set up the `mart`
object.

```{r}
mart <- useMart("ENSEMBL_MART_ENSEMBL","hsapiens_gene_ensembl")
```

Queries are of the form

getBM(*attributes*, *filters*, *values*, *mart*)

where

* attributes are the things we **want**

* filters are the *types of* IDs we **have**

* values are the IDs we **have**

* mart is the `mart` object we set up

## biomaRt attributes and filters

Both attributes and filters have rather inscrutable names, but a
listing can be accessed using

```{r}
atrib <- listAttributes(mart)
filts <- listFilters(mart)
head(atrib)
head(filts)
```

## biomaRt query

A simple example query

```{r}
afyids <- c("1000_at","1001_at","1002_f_at","1007_s_at")
getBM(c("affy_hg_u95av2", "hgnc_symbol"), c("affy_hg_u95av2"), afyids, mart)
```

## biomaRt exercises

* Get the Ensembl gene IDs and HUGO symbol for Entrez Gene IDs 672,
  5468 and 7157

* What do you get if you query for the 'gene_exon' for GAPDH?