——标题:“3。基因组规模数据和注释的基因组范围"作者:"Martin Morgan (martin.morgan@roswellpark.org)
罗斯威尔公园癌症研究所,布法罗,纽约
5 - 9 October, 2015" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{3。% \VignetteEngine{r::rmarkdown}——' ' ' {r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set(eval=as.logical(Sys. print=1000);采用“KNITR_EVAL”,“真正的”)),缓存= as.logical (Sys。采用“KNITR_CACHE”,“真正的”)))``` ```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE} suppressPackageStartupMessages({library(genome ranges) library(GenomicAlignments)})本课程中的材料要求Rversion 3.2和Bioconductor version 3.2 {R configure-test} stopifnot(getRversion() >= '3.2' && getRversion() < '3.3', BiocInstaller::biocVersion() == "3.2")本节重点介绍类、方法和包,目标是学习导航帮助系统和交互式发现设施。# #动机序列分析专业——大数据需要处理,那么内存和时间的方式,具体算法开发了独特的特征序列数据额外的考虑——重用现有的、经过测试的代码更容易比重复的做,不易出错。—当包共享相似的数据结构时,包之间的互操作性更容易。解决方案:使用定义良好的_classes_来表示复杂的数据;_methods_操作类来执行有用的函数。类和方法被放在一起并作为_packages_分发,这样我们都可以从其他人的辛苦工作和经过测试的代码中获益。 # Core packages
v v BSgenome v GenomicFeatures VariantAnnotation | | | rtracklayer | v GenomicAlignments | | v v SummarizedExperiment Rsamtools ShortRead | | | | v v v v GenomicRanges Biostrings | | v v GenomeInfoDb (XVector) | | v v IRanges | v (S4Vectors)
#核心课程##案例研究:{r IRanges}库(IRanges) ir <- IRanges(start=c(10,20,30), width=5) ir ' ' '有许多有趣的操作可以在range上执行,例如,' flanages_ () ' ' ' {r IRanges - flanages_ ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' 'IRanges类是类层次结构的一部分。要查看这一点,请向R请求' ir '的类,以及' IRanges '类的类定义。注意' IRanges '扩展了' Ranges '类。Show Now try into ' ?(如果不使用_RStudio_,输入' ?”旁边, “在哪里 `手段按Tab键询问选项卡完成)。您可以看到有关`侧面的帮助页面在几个不同的课程上运行。选择完成```{r iranges-flanc-method,eval = false}?“侧翼,范围 - 方法”```,并验证您在描述与`irangess实例相关的方法的页面上。探索其他基于范围的操作。[Genomicranges] []封装扩展了范围的概念,包括与序列分析中的范围应用相关的特征,特别是将范围与序列名称(例如,染色体)和股线相关联的能力。根据我们的“讽刺”的实例,创建一个`granges`实例,如下所示```{r granges}库(基因组)gr < - granges(c(“chr1”,“chr1”,“Chr2”),IR,Strand= C(“+”,“ - ”,“+”))Gr```侧翼序列的概念在生物学中具有更细微的含义。特别是我们可能期望“+”股线上的侧翼序列在范围内,但在负线上会遵循它。确认“侧翼”适用于“格兰人”对象具有这种行为。````{R granges-flank}侧翼(gr,3)```探索什么课程`granges`延伸,在应用于`granges`对象时,找出书面行为的帮助页面,似乎可能是使用基因组范围合作的许多有用的方法; we can discover some of these from the command line, indicating that the methods should be on the current `search()` path ```{r granges-methods} methods(class="GRanges") ``` Notice that the available `flank()` methods have been augmented by the methods defined in the _GenomicRanges_ package, including those that are relevant (via inheritance) to the _GRanges_ class. ```{r granges-flank-method} grep("flank", methods(class="GRanges"), value=TRUE) ``` Verify that the help page documents the behavior we just observed. ```{r granges-flank-method-help, eval=FALSE} ?"flank,GenomicRanges-method" ``` Use `help()` to list the help pages in the `GenomicRanges` package, and `vignettes()` to view and access available vignettes; these are also available in the Rstudio 'Help' tab. ```{r granges-man-and-vignettes, eval=FALSE} help(package="GenomicRanges") vignette(package="GenomicRanges") vignette(package="GenomicRanges", "GenomicRangesHOWTOs") ``` ## _GenomicRanges_ ### The `GRanges` and `GRangesList` classes Aside: 'TxDb' packages provide an R representation of gene models ```{r txdb} library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene ``` `exons()`: _GRanges_ ```{r txdb-exons} exons(txdb) ``` ![Alt Genomic Ranges](our_figures/GRanges.png) `exonsBy()`: _GRangesList_ ```{r txdb-exonsby} exonsBy(txdb, "tx") ``` ![Alt Genomic Ranges List](our_figures/GRangesList.png) _GRanges_ / _GRangesList_ are incredibly useful - Represent **annotations** -- genes, variants, regulatory elements, copy number regions, ... - Represent **data** -- aligned reads, ChIP peaks, called variants, ... ### Algebra of genomic ranges Many biologically interesting questions represent operations on ranges - Count overlaps between aligned reads and known genes -- `GenomicRanges::summarizeOverlaps()` - Genes nearest to regulatory regions -- `GenomicRanges::nearest()`, [ChIPseeker][] - Called variants relevant to clinical phenotypes -- [VariantFiltering][] _GRanges_ Algebra - Intra-range methods - Independent of other ranges in the same object - GRanges variants strand-aware - `shift()`, `narrow()`, `flank()`, `promoters()`, `resize()`, `restrict()`, `trim()` - See `?"intra-range-methods"` - Inter-range methods - Depends on other ranges in the same object - `range()`, `reduce()`, `gaps()`, `disjoin()` - `coverage()` (!) - see `?"inter-range-methods"` - Between-range methods - Functions of two (or more) range objects - `findOverlaps()`, `countOverlaps()`, ..., `%over%`, `%within%`, `%outside%`; `union()`, `intersect()`, `setdiff()`, `punion()`, `pintersect()`, `psetdiff()` ![Alt Ranges Algebra](our_figures/RangeOperations.png) ## _Biostrings_ (DNA or amino acid sequences) Classes - XString, XStringSet, e.g., DNAString (genomes), DNAStringSet (reads) Methods -- - [Cheat sheat](//www.andersvercelli.com/packages/release/bioc/vignettes/Biostrings/inst/doc/BiostringsQuickOverview.pdf) - Manipulation, e.g., `reverseComplement()` - Summary, e.g., `letterFrequency()` - Matching, e.g., `matchPDict()`, `matchPWM()` Related packages - [BSgenome][] - Whole-genome representations - Model and custom - [ShortRead][] - FASTQ files Example - Whole-genome sequences are distrubuted by ENSEMBL, NCBI, and others as FASTA files; model organism whole genome sequences are packaged into more user-friendly `BSgenome` packages. The following calculates GC content across chr14. ```{r BSgenome-require, message=FALSE} library(BSgenome.Hsapiens.UCSC.hg19) chr14_range = GRanges("chr14", IRanges(1, seqlengths(Hsapiens)["chr14"])) chr14_dna <- getSeq(Hsapiens, chr14_range) letterFrequency(chr14_dna, "GC", as.prob=TRUE) ``` ## _GenomicAlignments_ (Aligned reads) Classes -- GenomicRanges-like behaivor - GAlignments, GAlignmentPairs, GAlignmentsList Methods - `readGAlignments()`, `readGAlignmentsList()` - Easy to restrict input, iterate in chunks - `summarizeOverlaps()` Example - Find reads supporting the junction identified above, at position 19653707 + 66M = 19653773 of chromosome 14 ```{r bam-require} library(GenomicRanges) library(GenomicAlignments) library(Rsamtools) ## our 'region of interest' roi <- GRanges("chr14", IRanges(19653773, width=1)) ## sample data library('RNAseqData.HNRNPC.bam.chr14') bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[[1]], asMates=TRUE) ## alignments, junctions, overlapping our roi paln <- readGAlignmentsList(bf) j <- summarizeJunctions(paln, with.revmap=TRUE) j_overlap <- j[j %over% roi] ## supporting reads paln[j_overlap$revmap[[1]]] ``` ## _VariantAnnotation_ (Called variants) Classes -- GenomicRanges-like behavior - VCF -- 'wide' - VRanges -- 'tall' Functions and methods - I/O and filtering: `readVcf()`, `readGeno()`, `readInfo()`, `readGT()`, `writeVcf()`, `filterVcf()` - Annotation: `locateVariants()` (variants overlapping ranges), `predictCoding()`, `summarizeVariants()` - SNPs: `genotypeToSnpMatrix()`, `snpSummary()` Example - Read variants from a VCF file, and annotate with respect to a known gene model ```{r vcf, message=FALSE} ## input variants library(VariantAnnotation) fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl, "hg19") seqlevels(vcf) <- "chr22" ## known gene model library(TxDb.Hsapiens.UCSC.hg19.knownGene) coding <- locateVariants(rowRanges(vcf), TxDb.Hsapiens.UCSC.hg19.knownGene, CodingVariants()) head(coding) ``` Related packages - [ensemblVEP][] - Forward variants to Ensembl Variant Effect Predictor - [VariantTools][], [h5vc][] - Call variants Reference - Obenchain, V, Lawrence, M, Carey, V, Gogarten, S, Shannon, P, and Morgan, M. VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants. Bioinformatics, first published online March 28, 2014 [doi:10.1093/bioinformatics/btu168](http://bioinformatics.oxfordjournals.org/content/early/2014/04/21/bioinformatics.btu168) ## _rtracklayer_ (Genome annotations) - `import()`: BED, GTF, WIG, 2bit, etc - `export()`: GRanges to BED, GTF, WIG, ... - Access UCSC genome browser ## _SummarizedExperiment_ - Integrate experimental data with sample, feature, and experiment-wide annotations - Matrix where rows are indexed by genomic ranges, columns by a DataFrame. ![Alt SummarizedExperiment](our_figures/SE_Description.png) Functions and methods - Accessors: `assay()` / `assays()`, `rowData()` / `rowRanges()`, `colData()`, `metadata()` - Range-based operations, especially `subsetByOverlaps()` # Input & representation of standard file formats ## BAM files of aligned reads -- `GenomicAlignments` Recall: overall workflow 1. Experimental design 2. Wet-lab preparation 3. High-throughput sequencing 4. Alignment - Whole genome, vs. transcriptome 5. Summary 6. Statistical analysis 7. Comprehension BAM files of aligned reads - Header @HD VN:1.0 SO:coordinate @SQ SN:chr1 LN:249250621 @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516 ... @SQ SN:chrY LN:59373566 @PG ID:TopHat VN:2.0.8b CL:/home/hpages/tophat-2.0.8b.Linux_x86_64/tophat --mate-inner-dist 150 --solexa-quals --max-multihits 5 --no-discordant --no-mixed --coverage-search --microexon-search --library-type fr-unstranded --num-threads 2 --output-dir tophat2_out/ERR127306 /home/hpages/bowtie2-2.1.0/indexes/hg19 fastq/ERR127306_1.fastq fastq/ERR127306_2.fastq - Alignments - ID, flag, alignment and mate ERR127306.7941162 403 chr14 19653689 3 72M = 19652348 -1413 ... ERR127306.22648137 145 chr14 19653692 1 72M = 19650044 -3720 ... - Sequence and quality ... GAATTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCC *'%%%%%#&&%''#'&%%%)&&%%$%%'%%'&*****$))$)'')'%)))&)%%%%$'%%%%&"))'')%)) ... TTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAG '**)****)*'*&*********('&)****&***(**')))())%)))&)))*')&***********)**** - Tags ... AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU NH:i:2 CC:Z:chr22 CP:i:16189276 HI:i:0 ... AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU NH:i:3 CC:Z:= CP:i:19921600 HI:i:0 - Typically, sorted (by position) and indexed ('.bai' files) [GenomicAlignments][] - Use an example BAM file (`fl` could be the path to your own BAM file) ```{r genomicalignments} ## example BAM data library(RNAseqData.HNRNPC.bam.chr14) ## one BAM file fl <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1] ## Let R know that this is a BAM file, not just a character vector library(Rsamtools) bfl <- BamFile(fl) ``` - Input the data into R ```{r readgalignments} aln <- readGAlignments(bfl) aln ``` - `readGAlignmentPairs()` / `readGAlignmentsList()` if paired-end data - Lots of things to do, including all the _GRanges_ / _GRangesList_ operations ```{r galignments-methods} methods(class=class(aln)) ``` - **Caveat emptor**: BAM files are large. Normally you will _restrict_ the input to particular genomic ranges, or _iterate_ through the BAM file. Key _Bioconductor_ functions (e.g., `GenomicAlignments::summarizeOverlaps()` do this data management step for you. See next section! ## Other formats and packages ![Alt Files and the Bioconductor packages that input them](our_figures/FilesToPackages.png) # Resources Acknowledgements - Core (Seattle): Sonali Arora, Marc Carlson, Nate Hayden, Jim Hester, Valerie Obenchain, Hervé Pagès, Paul Shannon, Dan Tenenbaum. - The research reported in this presentation was supported by the National Cancer Institute and the National Human Genome Research Institute of the National Institutes of Health under Award numbers U24CA180996 and U41HG004059, and the National Science Foundation under Award number 1247813. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation. ## `sessionInfo()` ```{r sessionInfo} sessionInfo() ``` [AnnotationDbi]: //www.andersvercelli.com/packages/AnnotationDbi [BSgenome]: //www.andersvercelli.com/packages/BSgenome [BiocParallel]: //www.andersvercelli.com/packages/BiocParallel [Biostrings]: //www.andersvercelli.com/packages/Biostrings [CNTools]: //www.andersvercelli.com/packages/CNTools [ChIPQC]: //www.andersvercelli.com/packages/ChIPQC [ChIPpeakAnno]: //www.andersvercelli.com/packages/ChIPpeakAnno [DESeq2]: //www.andersvercelli.com/packages/DESeq2 [DiffBind]: //www.andersvercelli.com/packages/DiffBind [GenomicAlignments]: //www.andersvercelli.com/packages/GenomicAlignments [GenomicRanges]: //www.andersvercelli.com/packages/GenomicRanges [IRanges]: //www.andersvercelli.com/packages/IRanges [KEGGREST]: //www.andersvercelli.com/packages/KEGGREST [PSICQUIC]: //www.andersvercelli.com/packages/PSICQUIC [rtracklayer]: //www.andersvercelli.com/packages/rtracklayer [Rsamtools]: //www.andersvercelli.com/packages/Rsamtools [ShortRead]: //www.andersvercelli.com/packages/ShortRead [VariantAnnotation]: //www.andersvercelli.com/packages/VariantAnnotation [VariantFiltering]: //www.andersvercelli.com/packages/VariantFiltering [VariantTools]: //www.andersvercelli.com/packages/VariantTools [biomaRt]: //www.andersvercelli.com/packages/biomaRt [cn.mops]: //www.andersvercelli.com/packages/cn.mops [h5vc]: //www.andersvercelli.com/packages/h5vc [edgeR]: //www.andersvercelli.com/packages/edgeR [ensemblVEP]: //www.andersvercelli.com/packages/ensemblVEP [limma]: //www.andersvercelli.com/packages/limma [metagenomeSeq]: //www.andersvercelli.com/packages/metagenomeSeq [phyloseq]: //www.andersvercelli.com/packages/phyloseq [snpStats]: //www.andersvercelli.com/packages/snpStats [org.Hs.eg.db]: //www.andersvercelli.com/packages/org.Hs.eg.db [TxDb.Hsapiens.UCSC.hg19.knownGene]: //www.andersvercelli.com/packages/TxDb.Hsapiens.UCSC.hg19.knownGene [BSgenome.Hsapiens.UCSC.hg19]: //www.andersvercelli.com/packages/BSgenome.Hsapiens.UCSC.hg19