{r setup, echo=FALSE}库(LearnBioconductor) stopifnot(BiocInstaller::biocVersion() == "3.0")``` ```{r style, echo = FALSE, results = 'asis'}" " #与大数据打交道马丁·摩根
10月29日,2014 ##可伸缩计算高效的_R_代码-向量化!-重复使用其他人的工作- ' r Biocpkg("DESeq2") ', ' r Biocpkg(" genome ranges ") ', ' r Biocpkg("Biostrings") ',…‘r CRANpkg(“dplyr”)”,”r CRANpkg(“data.table”)“,”r CRANpkg(“Rcpp”)——有用的工具:“system.time()”,“Rprof()”、“r CRANpkg(“微基准测试”)——详细(宗罪)(//www.andersvercelli.com/help/course-materials/2014/CSAMA2014/1_Monday/labs/IntermediateR.html)高效的代码之前的课程。2.迭代-块明智- ' open() ',读取块(s), ' close() '。-例如,Rsamtools::BamFile()的' yieldSize '参数限制-限制到感兴趣的列和/或行-利用特定领域的格式,例如,BAM文件和“Rsamtools::ScanBamParam()”-使用数据库采样-迭代大数据,保留一个可管理的样本,例如,' ShortRead::FastqSampler() '通常情况下,' lapply() '之类的操作——单核机器上('easy'); clusters (more tedious); clouds ## Parallel evaluation in _Bioconductor_ - [BiocParallel][] -- `bplapply()` for `lapply()`-like functions, increasingly used by package developers to provide easy, standard way of gaining parallel evaluation. - [GenomicFiles][] -- Framework for working on groups of files, ranges, or ranges x files - Bioconductor [AMI](//www.andersvercelli.com/help/bioconductor-cloud-ami/) (Amazon Machine Instance) including pre-configured StarCluster. ## Lab ### Efficient code Write the following as a function. Use `system.time()` to explore how long this takes to execute as `n` increases from 100 to 10000. Use `identical()` and `r CRANpkg("microbenchmark")` to compare alternatives `f1()`, `f2()`, and `f3()` for both correctness and performance of these three different functions. What strategies are these functions using? ```{r benchmark} f0 <- function(n) { ## inefficient! ans <- numeric() for (i in seq_len(n)) ans <- c(ans, exp(i)) ans } f1 <- function(n) { ans <- numeric(n) for (i in seq_len(n)) ans[[i]] <- exp(i) ans } f2 <- function(n) sapply(seq_len(n), exp) f3 <- function(n) exp(seq_len(n)) ``` ### Sleeping serially and in parallel Go to sleep for 1 second, then return `i`. This takes 8 seconds. ```{r parallel-sleep} library(BiocParallel) fun <- function(i) { Sys.sleep(1) i } ## serial f0 <- function(n) lapply(seq_len(n), fun) ## parallel f1 <- function(n) bplapply(seq_len(n), fun) ``` ### Counting overlaps -- our own version Regions of interest, named like the chromosomes in the bam file. ```{r count-overlaps-roi, eval=FALSE} library(TxDb.Hsapiens.UCSC.hg19.knownGene) exByTx <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx") map0 <- read.delim("~/igv/genomes/hg19_alias.tab", header=FALSE, stringsAsFactors=FALSE) map <- setNames(map0$V1, map0$V2) seqlevels(exByTx, force=TRUE) <- map ``` A function to iterate through a bam file ```{r count-overlaps, eval=FALSE} count1 <- function(filename, roi) { ## Create and open BAM file bf <- BamFile(filename, yieldSize=1000000) open(bf) ## initialize variables n <- 0 # number of reads examined count <- integer(length(roi)) # running count of reads overlapping roi names(counts) <- names(roi) ## read in and count chunks of data, until done repeat { ## input aln <- readGAlignments(bf) # input next chunk if (length(aln) == 0) # stopping condition break n <- n + length(aln) # how are we doing? message(n) ## overlaps olaps <- findOverlaps(aln, roi, type="within", ignore.strand=TRUE) count <- count + tabulate(subjectHits(olaps), subjectLength(olaps)) } ## finish and return result close(bf) count } ``` In action ```{r count-overlaps-doit, eval=FALSE} filename <- "~/bam/SRR1039508_sorted.bam" count <- count1(filename, exByTx) ``` Parallelize ```{r count-overlaps-parallel, eval=FALSE} library(BiocParallel) ## all bam files filenames <- dir("~/bam", pattern="bam$", full=TRUE) names(filenames) <- sub("_sorted.bam", "", basename(filenames)) ## iterate counts <- bplapply(filenames, count1, exByTx) counts <- simplify2array(counts) head(counts) ``` ## Resources - Lawrence, M, and Morgan, M. 2014. Scalable Genomics with R and Bioconductor. Statistical Science 2014, Vol. 29, No. 2, 214-226. http://arxiv.org/abs/1409.2864v1 [BiocParallel]: //www.andersvercelli.com/packages/release/bioc/html/BiocParallel.html [GenomicFiles]: //www.andersvercelli.com/packages/release/bioc/html/GenomicFiles.html