---
title: "1. Introduction to Bioconductor"
author: "Sonali Arora"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  % \VignetteIndexEntry{1. Introduction to Bioconductor}
  % \VignetteEngine{knitr::rmarkdown}
---

```{r echo=FALSE}
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")),
    error=FALSE)
```

Author: Sonali Arora (sarora@fredhutch.org)<br />
Date: 20-24 July, 2015<br />

The material in this course requires R version 3.2.1 and Bioconductor version 3.2

```{r echo=FALSE, message=FALSE}
suppressPackageStartupMessages({
    library(GenomicRanges)
    library(GenomicAlignments)
    library('RNAseqData.HNRNPC.bam.chr14')
    library(AnnotationHub)
    library("TxDb.Hsapiens.UCSC.hg19.knownGene")
    library(org.Hs.eg.db)
})
ah = AnnotationHub()
fa <- ah[["AH18522"]]
```

## What is Bioconductor

Analysis and comprehension of high-throughput genomic data

- Statistical analysis: large data, technological artifacts, designed
  experiments; rigorous
- Comprehension: biological context, visualization, reproducibility
- High-throughput
    + Sequencing: RNASeq, ChIPSeq, variants, copy number, ...
    + Microarrays: expression, SNP, ...
    + Flow cytometry, proteomics, images, ...

Packages, vignettes, work flows

- 'Release' (every six months) and 'devel' branches
- Choose from 1045 packages
    + Browse them via [biocViews](http://bioconductor.org/packages/devel/BiocViews.html#___Software)
    + Each package has a 'landing page' with title, author /
      maintainer, short description, citation, installation
      instructions, documentation, and download statistics
    + All user-visible functions have help pages, most with runnable
      examples
    + 'Vignettes' are an important feature in Bioconductor -- narrative
      documents illustrating how to use the package, with integrated
      code

Helpful Links

- Documentation: [Previous course material](http://bioconductor.org/help/course-materials/),
  [Workflows](http://bioconductor.org/help/workflows/),
  [Videos](https://www.youtube.com/user/bioconductor),
  [Developers](http://bioconductor.org/developers/)
- Ask questions: [Support site](https://support.bioconductor.org)
- Connect with us: [Twitter](https://twitter.com/bioconductor),
  [Newsletter](http://bioconductor.org/help/newsletters/2015_July/)

## Overall Workflow

A typical workflow consists of the following steps.

- Experimental design
- Wet-lab preparation
- High-throughput sequencing
    + Output: FASTQ files of reads and their quality scores
- Alignment
    + Many different aligners, some specialized for different purposes
    + Output: BAM files of aligned reads
- Summary
    + e.g., _count_ of reads overlapping regions of interest (e.g.,
      genes)
- Statistical analysis
- Comprehension

![Alt Sequencing Ecosystem](our_figures/SequencingEcosystem.png)

## Where does Bioconductor fit in

### Infrastructure

One of the biggest strengths of Bioconductor is the set of classes
defined to make simple tasks easy and streamlined.

#### GenomicRanges objects

- Represent **annotations** -- genes, variants, regulatory elements,
  copy number regions, ...
- Represent **data** -- aligned reads, ChIP peaks, called variants,
  ...

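
The two roles listed above can be sketched with a toy example. This is a minimal illustration, assuming the GenomicRanges package is installed; the coordinates, the `gene_id` values, and the object names (`genes`, `reads`, `n_reads`) are invented here and do not come from the course data.

```{r granges-sketch}
library(GenomicRanges)

## hypothetical annotation: two 'genes' on chr14 (invented coordinates)
genes <- GRanges(
    seqnames = "chr14",
    ranges = IRanges(start = c(100, 500), end = c(300, 800)),
    strand = c("+", "-"),
    gene_id = c("geneA", "geneB"))

## hypothetical data: three unstranded 'aligned reads', each 50 bases wide
reads <- GRanges("chr14", IRanges(start = c(120, 290, 900), width = 50))

## annotation and data share one representation, so they can be
## compared directly; unstranded ("*") reads match either strand
n_reads <- countOverlaps(genes, reads)
n_reads
```

Because both objects are _GRanges_, the same call works unchanged when `reads` is replaced by real alignments or ChIP peaks.
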
![Alt Genomic Ranges](our_figures/GRanges.png)

Many biologically interesting questions represent operations on ranges

- Count overlaps between aligned reads and known genes --
  `GenomicAlignments::summarizeOverlaps()`
- Genes nearest to regulatory regions -- `GenomicRanges::nearest()`,
  `r Biocpkg("ChIPseeker")`
- Called variants relevant to clinical phenotypes --
  `r Biocpkg("VariantFiltering")`

_GRanges_ Algebra

- Intra-range methods
    - Independent of other ranges in the same object
    - GRanges variants strand-aware
    - `shift()`, `narrow()`, `flank()`, `promoters()`, `resize()`,
      `restrict()`, `trim()`
    - See `?"intra-range-methods"`
- Inter-range methods
    - Depend on other ranges in the same object
    - `range()`, `reduce()`, `gaps()`, `disjoin()`
    - `coverage()` (!)
    - See `?"inter-range-methods"`
- Between-range methods
    - Functions of two (or more) range objects
    - `findOverlaps()`, `countOverlaps()`, ..., `%over%`, `%within%`,
      `%outside%`; `union()`, `intersect()`, `setdiff()`, `punion()`,
      `pintersect()`, `psetdiff()`

#### SummarizedExperiment

The SummarizedExperiment class is a matrix-like container where rows
represent ranges of interest (as a 'GRanges' or 'GRangesList' object)
and columns represent samples (with sample data summarized as a
'DataFrame').

![Alt Ranges Algebra](our_figures/SummarizedExperiment.png)

### Reading in Various file formats using R/Bioconductor

![Alt Ranges Algebra](our_figures/FilesToPackages.png)

__Example - Reading in BAM files__

The `r Biocpkg("GenomicAlignments")` package is used to input reads
aligned to a reference genome. In the next example, we read a BAM file
and extract the reads supporting an apparent exon splice junction
spanning position 19653773 of chromosome 14.

The package `r Biocexptpkg("RNAseqData.HNRNPC.bam.chr14")` contains 8
BAM files, listed in `RNAseqData.HNRNPC.bam.chr14_BAMFILES`. We will
use only the first BAM file.
We will load the software packages and the data package, construct a
_GRanges_ with our region of interest, and use `summarizeJunctions()`
to find reads in our region of interest.

```{r}
## 1. load software packages
library(GenomicRanges)
library(GenomicAlignments)

## 2. load sample data
library('RNAseqData.HNRNPC.bam.chr14')
bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[[1]], asMates=TRUE)

## 3. define our 'region of interest'
roi <- GRanges("chr14", IRanges(19653773, width=1))

## 4. alignments, junctions, overlapping our roi
paln <- readGAlignmentsList(bf)
j <- summarizeJunctions(paln, with.revmap=TRUE)
j_overlap <- j[j %over% roi]

## 5. supporting reads
paln[j_overlap$revmap[[1]]]
```

### Annotations

#### __AnnotationHub: Bioconductor Package to Manage & Download files__

`r Biocpkg("AnnotationHub")` is a web client for browsing and
downloading biological files from databases such as UCSC and NCBI.
With this package the user retrieves a file directly, without having
to work out where it is located on the provider's site, download it by
hand, and manage multiple files on the local machine.

```{r ahdemo, eval=FALSE}
library(AnnotationHub)
ah = AnnotationHub()
```

```{r ah2}
## data is available from the following sources
unique(ah$dataprovider)

## following types of files can be retrieved from the hub
unique(ah$sourcetype)

## We will download all _Homo sapiens_ cDNA sequences from the FASTA file
## 'Homo_sapiens.GRCh38.cdna.all.fa' from Ensembl using
## `r Biocpkg("AnnotationHub")`.
ah2 <- query(ah, c("fasta", "homo sapiens", "Ensembl"))
fa <- ah2[["AH18522"]]
fa
```

![Alt Annotation Packages](our_figures/AnotationPackages.png)

#### __TxDb objects__

- Curated annotation resources -- http://bioconductor.org/packages/biocViews
- Underlying SQLite database -- `dbfile(txdb)`
- Make your own: `GenomicFeatures::makeTxDbFrom*()`
- Accessing gene models
    - `exons()`, `transcripts()`, `genes()`, `cds()` (coding sequence)
    - `promoters()` & friends
    - `exonsBy()` & friends -- exons by gene, transcript, ...
    - 'select' interface: `keytypes()`, `columns()`, `keys()`,
      `select()`, `mapIds()`

```{r gene-model-discovery}
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb
methods(class=class(txdb))
genes(txdb)
```

#### __OrgDb objects__

- Curated resources, underlying SQLite database, like `TxDb`
- 'select' interface: `keytypes()`, `columns()`, `keys()`, `select()`,
  `mapIds()`
    - Vector of keys, desired columns
    - Specification of key type

```{r select}
library(org.Hs.eg.db)
select(org.Hs.eg.db, c("BRCA1", "PTEN"), c("ENTREZID", "GENENAME"), "SYMBOL")
keytypes(org.Hs.eg.db)
columns(org.Hs.eg.db)
```

#### Other internet resources

- [biomaRt](http://biomart.org) Ensembl and other annotations
- [PSICQUIC](https://code.google.com/p/psicquic) Protein interactions
- [uniprot.ws](http://uniprot.org) Protein annotations
- [KEGGREST](http://www.genome.jp/kegg) KEGG pathways
- [SRAdb](http://www.ncbi.nlm.nih.gov/sra) Sequencing experiments
- [rtracklayer](http://genome.ucsc.edu) UCSC genome tracks
- [GEOquery](http://www.ncbi.nlm.nih.gov/geo/) Array and other data
- [ArrayExpress](http://www.ebi.ac.uk/arrayexpress/) Array and other data
- ...

### Downstream Statistical Analysis

_Bioconductor_ packages are organized by
[biocViews](http://bioconductor.org/packages/devel/BiocViews.html#___Software).
One can answer a number of
[Biological Questions](http://bioconductor.org/packages/devel/BiocViews.html#___BiologicalQuestion)
using various packages. Some of the entries under
[Sequencing](http://bioconductor.org/packages/biocViews.html#__Sequencing)
and other terms, and representative packages, include:

* [RNASeq](http://bioconductor.org/packages/biocViews.html#__RNASeq),
  e.g., `r Biocpkg("edgeR")`, `r Biocpkg("DESeq2")`,
  `r Biocpkg("derfinder")`, and `r Biocpkg("QuasR")`.

* [ChIPSeq](http://bioconductor.org/packages/biocViews.html#__ChIPSeq),
  e.g., `r Biocpkg("DiffBind")`, `r Biocpkg("csaw")`,
  `r Biocpkg("ChIPseeker")`, `r Biocpkg("ChIPQC")`.

* [SNPs](http://bioconductor.org/packages/biocViews.html#__SNP) and
  other variants, e.g., `r Biocpkg("VariantAnnotation")`,
  `r Biocpkg("VariantFiltering")`, `r Biocpkg("h5vc")`.

* [CopyNumberVariation](http://bioconductor.org/packages/biocViews.html#__CopyNumberVariation),
  e.g., `r Biocpkg("DNAcopy")`, `r Biocpkg("crlmm")`,
  `r Biocpkg("fastseg")`.

* [Microbiome](http://bioconductor.org/packages/biocViews.html#__Microbiome)
  and metagenome sequencing, e.g., `r Biocpkg("metagenomeSeq")`,
  `r Biocpkg("phyloseq")`, `r Biocpkg("DirichletMultinomial")`.

## `sessionInfo()`

```{r sessionInfo}
sessionInfo()
```