- title: "Workshop: W2 -序列,对齐和大数据"输出:BiocStyle::html_document: toc: true vignette: > % \VignetteIndexEntry{Workshop:W3 - sequence, Alignments, and Large Data} % \VignetteEngine{knitr::rmarkdown}——' ' ' {r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set(eval=as.logical(Sys. Data)) $set(eval=as.logical(Sys. Data)采用“KNITR_EVAL”,“真正的”)),缓存= as.logical (Sys。采用“KNITR_CACHE”,“真正的”)))``` ```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE} suppressPackageStartupMessages({library(Biostrings) library(GenomicRanges) library(GenomicFiles) library(BiocParallel) library(TxDb.Hsapiens.UCSC.hg19.knownGene)})作者:马丁·摩根(mtmorgan@fredhutch.org)
日期:2015年9月7日
回到[Workshop Outline](Developer-Meeting-Workshop.html)
本文档材料要求_R_ version 3.2和_Bioconductor_ version 3.1 ' ' ' {r configuration -test} stopifnot(getRversion() >= '3.2' && getRversion() < '3.3', BiocInstaller::biocVersion() >= "3.1")本节重点介绍类、方法和包,目的是学习如何使用帮助系统和交互式发现工具。# #动机序列分析专业——大数据需要处理,那么内存和时间的方式,具体算法开发了独特的特征序列数据额外的考虑——重用现有的、经过测试的代码更容易比重复的做,不易出错。—当包共享相似的数据结构时,更容易实现包之间的互操作性。解决方案:使用定义良好的_classes_来表示复杂数据;_methods_操作类来执行有用的函数。类和方法被放在一起,以_packages的形式分发,这样我们都可以从其他人的辛勤工作和测试代码中受益。#核心包
v v BSgenome v GenomicFeatures VariantAnnotation | | | rtracklayer | v GenomicAlignments | | v v SummarizedExperiment Rsamtools ShortRead | | | | v v v v GenomicRanges Biostrings | | v v GenomeInfoDb (XVector) | | v v IRanges | v (S4Vectors)
#核心课程##案例研究:{r IRanges} library(IRanges) ir <- IRanges(start=c(10, 20, 30), width=5) ir ' '区间有很多有趣的操作,例如' side() '标识相邻的区间' ' {r IRanges - side} side (ir, 3)' ' ' ' IRanges '类是类层次结构的一部分。要查看这个,请向R请求' ir '的类,以及' IRanges '类' '的类定义。' ' '注意' IRanges '扩展了' Ranges '类。现在试着进入?旁边' (' ?”旁边, " '如果没有使用_RStudio, where ' `手段按Tab键询问选项卡完成)。您可以看到有关`侧面的帮助页面在几个不同的课程上运行。选择完成```{r iranges-flanc-method,eval = false}?“侧翼,范围 - 方法”```,并验证您在描述与`irangess实例相关的方法的页面上。探索其他基于范围的操作。[Genomicranges] []封装扩展了范围的概念,包括与序列分析中的范围应用相关的特征,特别是将范围与序列名称(例如,染色体)和股线相关联的能力。根据我们的“讽刺”的实例,创建一个`granges`实例,如下所示```{r granges}库(基因组)gr < - granges(c(“chr1”,“chr1”,“Chr2”),IR,Strand= C(“+”,“ - ”,“+”))Gr```侧翼序列的概念在生物学中具有更细微的含义。特别是我们可能期望“+”股线上的侧翼序列在范围内,但在负线上会遵循它。确认“侧翼”适用于“格兰人”对象具有这种行为。``````{r granges-flank}侧翼(gr,3)```探索什么类别展示了什么课程,发现在应用于`granges`对象时记录“侧翼”行为的帮助页面,并验证帮助页面文件我们刚刚观察到的行为。 ```{r granges-class} class(gr) getClass(class(gr)) ``` ```{r granges-flank-method, eval=FALSE} ?"flank,GenomicRanges-method" ``` Notice that the available `flank()` methods have been augmented by the methods defined in the _GenomicRanges_ package. It seems like there might be a number of helpful methods available for working with genomic ranges; we can discover some of these from the command line, indicating that the methods should be on the current `search()` path ```{r granges-methods} methods(class="GRanges") ``` Use `help()` to list the help pages in the `GenomicRanges` package, and `vignettes()` to view and access available vignettes; these are also available in the Rstudio 'Help' tab. ```{r granges-man-and-vignettes, eval=FALSE} help(package="GenomicRanges") vignette(package="GenomicRanges") vignette(package="GenomicRanges", "GenomicRangesHOWTOs") ``` ## _GenomicRanges_ ### The `GRanges` and `GRangesList` classes Aside: 'TxDb' packages provide an R representation of gene models ```{r txdb} library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene ``` `exons()`: _GRanges_ ```{r txdb-exons} exons(txdb) ``` ![Alt Genomic Ranges](our_figures/GRanges.png) `exonsBy()`: _GRangesList_ ```{r txdb-exonsby} exonsBy(txdb, "tx") ``` ![Alt Genomic Ranges List](our_figures/GRangesList.png) _GRanges_ / _GRangesList_ are incredibly useful - Represent **annotations** -- genes, variants, regulatory elements, copy number regions, ... - Represent **data** -- aligned reads, ChIP peaks, called variants, ... ### Algebra of genomic ranges Many biologically interesting questions represent operations on ranges - Count overlaps between aligned reads and known genes -- `GenomicRanges::summarizeOverlaps()` - Genes nearest to regulatory regions -- `GenomicRanges::nearest()`, [ChIPseeker][] - Called variants relevant to clinical phenotypes -- [VariantFiltering][] _GRanges_ Algebra - Intra-range methods - Independent of other ranges in the same object - GRanges variants strand-aware - `shift()`, `narrow()`, `flank()`, `promoters()`, `resize()`, `restrict()`, `trim()` - See `?"intra-range-methods"` - Inter-range methods - Depends on other ranges in the same object - `range()`, `reduce()`, `gaps()`, `disjoin()` - `coverage()` (!) - see `?"inter-range-methods"` - Between-range methods - Functions of two (or more) range objects - `findOverlaps()`, `countOverlaps()`, ..., `%over%`, `%within%`, `%outside%`; `union()`, `intersect()`, `setdiff()`, `punion()`, `pintersect()`, `psetdiff()` ![Alt Ranges Algebra](our_figures/RangeOperations.png) ## _Biostrings_ (DNA or amino acid sequences) Classes - XString, XStringSet, e.g., DNAString (genomes), DNAStringSet (reads) Methods -- - [Cheat sheat](//www.andersvercelli.com/packages/release/bioc/vignettes/Biostrings/inst/doc/BiostringsQuickOverview.pdf) - Manipulation, e.g., `reverseComplement()` - Summary, e.g., `letterFrequency()` - Matching, e.g., `matchPDict()`, `matchPWM()` Related packages - [BSgenome][] - Whole-genome representations - Model and custom - [ShortRead][] - FASTQ files Example - Whole-genome sequences are distrubuted by ENSEMBL, NCBI, and others as FASTA files; model organism whole genome sequences are packaged into more user-friendly `BSgenome` packages. The following calculates GC content across chr14. ```{r BSgenome-require, message=FALSE} library(BSgenome.Hsapiens.UCSC.hg19) chr14_range = GRanges("chr14", IRanges(1, seqlengths(Hsapiens)["chr14"])) chr14_dna <- getSeq(Hsapiens, chr14_range) letterFrequency(chr14_dna, "GC", as.prob=TRUE) ``` ## _GenomicAlignments_ (Aligned reads) Classes -- GenomicRanges-like behaivor - GAlignments, GAlignmentPairs, GAlignmentsList Methods - `readGAlignments()`, `readGAlignmentsList()` - Easy to restrict input, iterate in chunks - `summarizeOverlaps()` Example - Find reads supporting the junction identified above, at position 19653707 + 66M = 19653773 of chromosome 14 ```{r bam-require} library(GenomicRanges) library(GenomicAlignments) library(Rsamtools) ## our 'region of interest' roi <- GRanges("chr14", IRanges(19653773, width=1)) ## sample data library('RNAseqData.HNRNPC.bam.chr14') bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[[1]], asMates=TRUE) ## alignments, junctions, overlapping our roi paln <- readGAlignmentsList(bf) j <- summarizeJunctions(paln, with.revmap=TRUE) j_overlap <- j[j %over% roi] ## supporting reads paln[j_overlap$revmap[[1]]] ``` ## _VariantAnnotation_ (Called variants) Classes -- GenomicRanges-like behavior - VCF -- 'wide' - VRanges -- 'tall' Functions and methods - I/O and filtering: `readVcf()`, `readGeno()`, `readInfo()`, `readGT()`, `writeVcf()`, `filterVcf()` - Annotation: `locateVariants()` (variants overlapping ranges), `predictCoding()`, `summarizeVariants()` - SNPs: `genotypeToSnpMatrix()`, `snpSummary()` Example - Read variants from a VCF file, and annotate with respect to a known gene model ```{r vcf, message=FALSE} ## input variants library(VariantAnnotation) fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl, "hg19") seqlevels(vcf) <- "chr22" ## known gene model library(TxDb.Hsapiens.UCSC.hg19.knownGene) coding <- locateVariants(rowRanges(vcf), TxDb.Hsapiens.UCSC.hg19.knownGene, CodingVariants()) head(coding) ``` Related packages - [ensemblVEP][] - Forward variants to Ensembl Variant Effect Predictor - [VariantTools][], [h5vc][] - Call variants Reference - Obenchain, V, Lawrence, M, Carey, V, Gogarten, S, Shannon, P, and Morgan, M. VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants. Bioinformatics, first published online March 28, 2014 [doi:10.1093/bioinformatics/btu168](http://bioinformatics.oxfordjournals.org/content/early/2014/04/21/bioinformatics.btu168) ## _rtracklayer_ (Genome annotations) - `import()`: BED, GTF, WIG, 2bit, etc - `export()`: GRanges to BED, GTF, WIG, ... - Access UCSC genome browser ## _SummarizedExperiment_ - Integrate experimental data with sample, feature, and experiment-wide annotations - Matrix where rows are indexed by genomic ranges, columns by a DataFrame. ![Alt SummarizedExperiment](our_figures/SE_Description.png) Functions and methods - Accessors: `assay()` / `assays()`, `rowData()` / `rowRanges()`, `colData()`, `metadata()` - Range-based operations, especially `subsetByOverlaps()` # Input & representation of standard file formats ## BAM files of aligned reads -- `GenomicAlignments` Recall: overall workflow 1. Experimental design 2. Wet-lab preparation 3. High-throughput sequencing 4. Alignment - Whole genome, vs. transcriptome 5. Summary 6. Statistical analysis 7. Comprehension BAM files of aligned reads - Header @HD VN:1.0 SO:coordinate @SQ SN:chr1 LN:249250621 @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516 ... @SQ SN:chrY LN:59373566 @PG ID:TopHat VN:2.0.8b CL:/home/hpages/tophat-2.0.8b.Linux_x86_64/tophat --mate-inner-dist 150 --solexa-quals --max-multihits 5 --no-discordant --no-mixed --coverage-search --microexon-search --library-type fr-unstranded --num-threads 2 --output-dir tophat2_out/ERR127306 /home/hpages/bowtie2-2.1.0/indexes/hg19 fastq/ERR127306_1.fastq fastq/ERR127306_2.fastq - Alignments - ID, flag, alignment and mate ERR127306.7941162 403 chr14 19653689 3 72M = 19652348 -1413 ... ERR127306.22648137 145 chr14 19653692 1 72M = 19650044 -3720 ... - Sequence and quality ... GAATTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCC *'%%%%%#&&%''#'&%%%)&&%%$%%'%%'&*****$))$)'')'%)))&)%%%%$'%%%%&"))'')%)) ... TTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAG '**)****)*'*&*********('&)****&***(**')))())%)))&)))*')&***********)**** - Tags ... AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU NH:i:2 CC:Z:chr22 CP:i:16189276 HI:i:0 ... AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU NH:i:3 CC:Z:= CP:i:19921600 HI:i:0 - Typically, sorted (by position) and indexed ('.bai' files) [GenomicAlignments][] - Use an example BAM file (`fl` could be the path to your own BAM file) ```{r genomicalignments} ## example BAM data library(RNAseqData.HNRNPC.bam.chr14) ## one BAM file fl <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1] ## Let R know that this is a BAM file, not just a character vector library(Rsamtools) bfl <- BamFile(fl) ``` - Input the data into R ```{r readgalignments} aln <- readGAlignments(bfl) aln ``` - `readGAlignmentPairs()` / `readGAlignmentsList()` if paired-end data - Lots of things to do, including all the _GRanges_ / _GRangesList_ operations ```{r galignments-methods} methods(class=class(aln)) ``` - **Caveat emptor**: BAM files are large. Normally you will _restrict_ the input to particular genomic ranges, or _iterate_ through the BAM file. Key _Bioconductor_ functions (e.g., `GenomicAlignments::summarizeOverlaps()` do this data management step for you. See next section! ## Other formats and packages ![Alt Files and the Bioconductor packages that input them](our_figures/FilesToPackages.png) ## Large data -- `BiocParallel`, `GenomicFiles` ### Restriction - Input only the data necessary, e.g., `ScanBamParam()` - `which`: genomic ranges of interest - `what`: 'columns' of BAM file, e.g., 'seq', 'flag' ### Iteration - Read entire file, but in chunks - Chunk size small enough to fit easily in memory, - Chunk size large enough to benefit from _R_'s vectorized operations -- 10k to 1M records at at time - e.g., `BamFile(..., yieldSize=100000)` Iterative programming model - _yield_ a chunk of data - _map_ input data to convenient representation, often summarizing input to simplified form - E.g., Aligned read coordinates to counts overlapping regions of interest - E.g., Aligned read sequenced to GC content - _reduce_ across mapped chunks - Use `GenomicFiles::reduceByYield()` ```{r iteration} library(GenomicFiles) yield <- function(bfl) { ## input a chunk of alignments library(GenomicAlignments) readGAlignments(bfl, param=ScanBamParam(what="seq")) } map <- function(aln) { ## Count G or C nucleotides per read library(Biostrings) gc <- letterFrequency(mcols(aln)$seq, "GC") ## Summarize number of reads with 0, 1, ... G or C nucleotides tabulate(1 + gc, 73) # max. read length: 72 } reduce <- `+` ``` - Example ```{r iteration-doit} library(RNAseqData.HNRNPC.bam.chr14) fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES bf <- BamFile(fls[1], yieldSize=100000) gc <- reduceByYield(bf, yield, map, reduce) plot(gc, type="h", xlab="GC Content per Aligned Read", ylab="Number of Reads") ``` ### Parallel evaluation - Cores, computers, clusters, clouds - Generally, requires memory management techniques like restriction or iteration -- parallel processes competing for shared memory - Many problems are _embarassingly parallel_ -- `lapply()`-like -- especially in bioinformatics where parallel evaluation is across files - Example: GC content in several BAM files ```{r parallel-doit} library(BiocParallel) gc <- bplapply(BamFileList(fls), reduceByYield, yield, map, reduce) library(ggplot2) df <- stack(as.data.frame(lapply(gc, cumsum))) df$GC <- 0:72 ggplot(df, aes(x=GC, y=values)) + geom_line(aes(colour=ind)) + xlab("Number of GC Nucleotides per Read") + ylab("Number of Reads") ``` [biocViews]: //www.andersvercelli.com/packages/BiocViews.html#___Software [AnnotationData]: //www.andersvercelli.com/packages/BiocViews.html#___AnnotationData [aprof]: http://cran.r-project.org/web/packages/aprof/index.html [hexbin]: http://cran.r-project.org/web/packages/hexbin/index.html [lineprof]: https://github.com/hadley/lineprof [microbenchmark]: http://cran.r-project.org/web/packages/microbenchmark/index.html [AnnotationDbi]: //www.andersvercelli.com/packages/AnnotationDbi [BSgenome]: //www.andersvercelli.com/packages/BSgenome [BiocParallel]: //www.andersvercelli.com/packages/BiocParallel [Biostrings]: //www.andersvercelli.com/packages/Biostrings [CNTools]: //www.andersvercelli.com/packages/CNTools [ChIPQC]: //www.andersvercelli.com/packages/ChIPQC [ChIPpeakAnno]: //www.andersvercelli.com/packages/ChIPpeakAnno [DESeq2]: //www.andersvercelli.com/packages/DESeq2 [DiffBind]: //www.andersvercelli.com/packages/DiffBind [GenomicAlignments]: //www.andersvercelli.com/packages/GenomicAlignments [GenomicRanges]: //www.andersvercelli.com/packages/GenomicRanges [IRanges]: //www.andersvercelli.com/packages/IRanges [KEGGREST]: //www.andersvercelli.com/packages/KEGGREST [PSICQUIC]: //www.andersvercelli.com/packages/PSICQUIC [rtracklayer]: //www.andersvercelli.com/packages/rtracklayer [Rsamtools]: //www.andersvercelli.com/packages/Rsamtools [ShortRead]: //www.andersvercelli.com/packages/ShortRead [VariantAnnotation]: //www.andersvercelli.com/packages/VariantAnnotation [VariantFiltering]: //www.andersvercelli.com/packages/VariantFiltering [VariantTools]: //www.andersvercelli.com/packages/VariantTools [biomaRt]: //www.andersvercelli.com/packages/biomaRt [cn.mops]: //www.andersvercelli.com/packages/cn.mops [h5vc]: //www.andersvercelli.com/packages/h5vc [edgeR]: //www.andersvercelli.com/packages/edgeR [ensemblVEP]: //www.andersvercelli.com/packages/ensemblVEP [limma]: //www.andersvercelli.com/packages/limma [metagenomeSeq]: //www.andersvercelli.com/packages/metagenomeSeq [phyloseq]: //www.andersvercelli.com/packages/phyloseq [snpStats]: //www.andersvercelli.com/packages/snpStats [org.Hs.eg.db]: //www.andersvercelli.com/packages/org.Hs.eg.db [TxDb.Hsapiens.UCSC.hg19.knownGene]: //www.andersvercelli.com/packages/TxDb.Hsapiens.UCSC.hg19.knownGene [BSgenome.Hsapiens.UCSC.hg19]: //www.andersvercelli.com/packages/BSgenome.Hsapiens.UCSC.hg19