---标题:“4. R&Biocondure的中间实验室”作者:“Sonali Arora”输出:Biocstyle :: html_document:toc:true toc_depth:2 vignette:>%\ gignetteIndexentry {4。R&Beocumond的中间实验室}%\ vignetteengine {knitr :: Rarmardown} ---```{r style,echo = false,结果='asis'} biocstyle :: markdown()选项(宽度= 100,max。print = 1000)knitr :: Opts_chunk $ set(eval = as.logical(sys.getenv(“knitr_eval”,“true”)),cache = as.logical(sys.getenv(“knitr_cache”,“true”)),error = false)```作者:sonali arora(sarora@fredhutch.org.
日期:2015年7月20日至24日
本课程中的材料需要R版本3.2.1和Biocumon V9.2 ##中间实验室的Biocumon __Exercise 1__在Biocumon中找到包含从UCSC的ucsc为大鼠Norvegicus(组件RN5)生成的注释数据库,并将其加载并保存在a中变量称为“TXDB”。使用此对象,执行以下内容,找到此程序集中包含的所有基因,并将其保存在一个名为'Ratenes'中的变形。b)大鼠中含有多少个序列?(提示:?seqinfo)c)'Ratenes'含有脚手架 - 你如何将物体置于标准染色体中仅包含序列?b)我对基因'ACSL5'(Entrez Gene ID = 94340)感兴趣。这是否存在于'Ratenes'中?它的染色体坐标是什么?__Exercise 2__发现生物导体中的包裹,即用UCSC(RN5,2012)提供的Rattus Norvegicus(大鼠)的全基因组序列(RN5,2012年3月)a)加载包装并将其保存在一个名为'ratseq'b)的对象中是什么存储在此序列信息? c) get the dna sequence information for acsl5 and store it in 'acsl5_sequence' d) calculate the GC content from this sequnence. __Exercise 3__ In the 'ratGenes' object above, you get only the entrez gene ids, can you get the gene names for each gene ? __Exercise 4__ Get the annotation databases from NCBI for Homo sapiens (assembly build GRCh38.80), create a txdb object (similar to what we saw in question 3 above) and get the genes. ( Hint - involves starting from scratch with a GTF file) __Exercise 5__ The liftOver facilities developed in conjunction with the UCSC browser track infrastructure are available for transforming data in GRanges formats. We want to transform data from rn4 to the latest asembly rn6. a) The transformation to rn6 coordinates is defined by a chain file provided by UCSC. Get the chain file which contains transformations from rn5 to rn6. b) Perform the liftover after getting the chain file. ## Solutions __Answer 1__ ```{r txdb} suppressPackageStartupMessages({ library("TxDb.Rnorvegicus.UCSC.rn5.refGene") }) txdb <- TxDb.Rnorvegicus.UCSC.rn5.refGene ## find all genes ratGenes <- genes(txdb) ## list all sequences seqinfo(ratGenes) ## subset to contain only standard chromosomes ratGenes <- keepStandardChromosomes(ratGenes) ## find gene 'Acsl5' acsl5 <- ratGenes[which(mcols(ratGenes)$gene_id==94340),] acsl5 ``` __Answer 2__ ```{r bsgenome} suppressPackageStartupMessages({ library(BSgenome.Rnorvegicus.UCSC.rn5) }) ratSeq <- BSgenome.Rnorvegicus.UCSC.rn5 class(ratSeq) ## get the sequence acsl5_sequence <- getSeq(ratSeq, acsl5) ## calculate the GC content letterFrequency(acsl5_sequence, "GC", as.prob=TRUE) ``` __Answer 3__ ```{r select-rat} library("Rattus.norvegicus") ## get a mapping between all entrex id and gene names ratGeneNames <- select(Rattus.norvegicus, ratGenes$gene_id, columns=c("SYMBOL", 'GENEID'), keytype="GENEID") ## match the entrz id with result before subsetting idx <- match(ratGeneNames$GENEID, ratGenes$gene_id) ## add mactched result to GRanges mcols(ratGenes) <- ratGeneNames[idx,] ratGenes ``` __Answer 4__ Steps include a) Getting the GTF file from NCBI for a particular build of Homo sapiens that you're interested in. ( AnnotationHub is a package inside Bioconductor which automatically gets the file for you) b) create a txdb object from this GTF file (which is read in as a GRanges) c) extract genes from the txdb object as before. These steps are beneficial if you cannot find pre-packaged genome annotations for your favourite organism as a package inside Bioconductor. ```{r gtf-to-gr, eval=FALSE} library(AnnotationHub) ah = AnnotationHub() ## find the file gtf_humans <- query(ah , c("gtf", "Homo sapiens", "grch38","80")) gtf_humans ## download the file gtfFile <- ah[["AH47066"]] ## create a txdb library(GenomicFeatures) txdb <- makeTxDbFromGRanges(gtfFile) #may take some time.. txdb ## get the genes from the taxdb object humanGenes <- genes(txdb) ``` __Answer 5__ One way of getting a chain file would be to find the file in ucsc, download it and read it in using `rtracklayer::import.chain()`. An easier solution would be to find the files via `AnnotationHub` ```{r chainfile} ## a) get the chain file ## load the package and query the files to find the file we want library(AnnotationHub) ah = AnnotationHub() query(ah , c("rattus", "rn5", "rn6")) ## learn more about the file you want ah["AH14761"] ## download the file ratChain <- ah[["AH14761"]] ratChain ## b) perform the liftover library(rtracklayer) lft <- liftOver(acsl5, ratChain) lft ``` ## References - [Basic Introduction to GenomicRanges](//www.andersvercelli.com/packages/devel/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.pdf) - [Manipulating GRanges](//www.andersvercelli.com/packages/devel/bioc/vignettes/GenomeInfoDb/inst/doc/GenomeInfoDb.pdf) - [GenomicRangesHOWTOs (Advanced)](//www.andersvercelli.com/packages/devel/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesHOWTOs.pdf) - [Introduction to Bioconductor for Sequence Data](//www.andersvercelli.com/help/workflows/sequencing/) - [Working with BSgenome Packages](//www.andersvercelli.com/packages/devel/bioc/vignettes/BSgenome/inst/doc/GenomeSearching.pdf) - [Utilizing TxDb objects with GenomicFeatures](//www.andersvercelli.com/packages/devel/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.pdf) - [AnnotationHub Introduction](//www.andersvercelli.com/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub.html), [Case study](//www.andersvercelli.com/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub-HOWTO.html) - [liftOver workflow](//www.andersvercelli.com/help/workflows/liftOver/) ## What to not miss at BioC2015 ! If you liked this lab and want to learn more in this area, do not miss the following labs at BioC2015 - [_Bioconductor annotation resources_](https://github.com/mrjc42/BiocAnnotRes2015) by Marc Carlson, Sonali Arora. (Wednesday, Session 3, 1:00pm-2:45pm) - _Practical introduction to Bioconductor foundational data structures for high throughput sequencing analysis_ by Herve Pages, Michael Lawrence. (Wednesday, Session 3, 3:15pm-5:00pm) ## `sessionInfo()` ```{r sessionInfo} sessionInfo() ```