---
title: "Managing and Analyzing Large Genomic Data"
output:
  BiocStyle::html_document:
    toc: true
    number_sections: false
vignette: >
  % \VignetteIndexEntry{Lab: Managing and Analyzing Large Genomic Data}
  % \VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r style, echo=FALSE, results='asis'}
BiocStyle::markdown()
```

```{r setup, echo=FALSE, results='hide'}
library(knitr)
opts_chunk$set(cache=TRUE, error=FALSE)
```

Original Authors: Valerie Obenchain, Martin Morgan
Presented by: Martin Morgan (martin.morgan@roswellpark.org)

Date: 25 June 2016

# Efficient R code

The goal of this section is to highlight practices for writing correct,
robust and efficient R code.

## Priorities

1. Correct: consistent with hand-worked examples (`identical()`,
   `all.equal()`)
2. Robust: supports realistic inputs, e.g., 0-length vectors, `NA` values,
   ... (see the robustness sketch at the end of this section)
3. Simple: easy to understand next month; easy to describe what it does to a
   colleague; easy to spot logical errors; easy to enhance.
4. Fast, or at least reasonable given the speed of modern computers.

## Strategies

1. Profile
    - _Look_ at the script to understand in general terms what it is doing.
    - _Step_ through the code to see how it is executed, and to gain an
      understanding of the speed of each line.
    - _Time_ evaluation of select lines or simple chunks of code with
      `system.time()` or the `r CRANpkg("microbenchmark")` package.
    - _Profile_ the code with a tool that indicates how much time is spent in
      each function call or line -- the built-in `Rprof()` function, or
      packages such as `r CRANpkg("lineprof")` or `r CRANpkg("aprof")` (a
      short profiling sketch appears at the end of this section)
2. Vectorize -- operate on vectors, rather than explicit loops

    ```{r vectorize}
    x <- 1:10
    log(x)     ## NOT for (i in seq_along(x)) x[i] <- log(x[i])
    ```
3. Pre-allocate memory, then fill in the result

    ```{r pre-allocate}
    result <- numeric(10)
    result[1] <- runif(1)
    for (i in 2:length(result))
        result[i] <- runif(1) * result[i - 1]
    result
    ```
4. Hoist common sub-expressions outside of repeated calculations, so that the
   sub-expression is only calculated once
    - Simple, e.g., 'hoist' constant multiplications from a `for` loop
    - Higher-level, e.g., use `lm.fit()` rather than repeatedly fitting the
      same design matrix.
5. Re-use existing, tested code
    - Efficient implementations of common operations -- `tabulate()`,
      `rowSums()` and friends, `%in%`, ...
    - Efficient domain-specific implementations, e.g.,
      `r Biocpkg("snpStats")` for GWAS linear models; `r Biocpkg("limma")`
      for microarray linear models; `r Biocpkg("edgeR")`,
      `r Biocpkg("DESeq2")` for negative binomial GLMs relevant to RNASeq.
    - Reuse others' work -- `r Biocpkg("DESeq2")`,
      `r Biocpkg("GenomicRanges")`, `r Biocpkg("Biostrings")`, ...,
      `r CRANpkg("dplyr")`, `r CRANpkg("data.table")`, `r CRANpkg("Rcpp")`

[Back to top]

Here's an obviously inefficient function:

```{r inefficient}
f0 <- function(n, a=2) {
    ## stopifnot(is.integer(n) && (length(n) == 1) &&
    ##           !is.na(n) && (n > 0))
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- a * log(i)
    result
}
```

Use `system.time()` to investigate how this algorithm scales with `n`,
focusing on elapsed time.

```{r system-time}
system.time(f0(10000))
n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")
```

Remember the current 'correct' value, and an approximate time

```{r correct-init}
n <- 10000
system.time(expected <- f0(n))
```

Revise the function to hoist the common multiplier, `a`, out of the loop.
Make sure the result of the 'optimization' and the original calculation are
the same. Use the `r CRANpkg("microbenchmark")` package to compare the two
versions

```{r hoist}
f1 <- function(n, a=2) {
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f1(n))

library(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)
```

Adopt a 'pre-allocate and fill' strategy

```{r preallocate-and-fill}
f2 <- function(n, a=2) {
    result <- numeric(n)
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f2(n))
microbenchmark(f0(n), f2(n), times=5)
```

Use an `*apply()` function to avoid having to explicitly pre-allocate, and
to make opportunities for vectorization more apparent.

```{r use-apply}
f3 <- function(n, a=2)
    a * sapply(seq_len(n), log)

identical(expected, f3(n))
microbenchmark(f0(n), f2(n), f3(n), times=10)
```

Now that the code is presented in a single line, it is apparent that it
could be easily vectorized. Seize the opportunity to vectorize it:

```{r use-vectorize}
f4 <- function(n, a=2)
    a * log(seq_len(n))

identical(expected, f4(n))
microbenchmark(f0(n), f3(n), f4(n), times=10)
```

`f4()` definitely seems to be the winner. How does it scale with `n`?
(Repeat several times)

```{r vectorized-scale}
n <- 10 ^ (5:8)                         # 100x larger than f0
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, log="xy", type="b")
```

Any explanations for the different pattern of response?

Lessons learned:

1. Vectorizing offers a huge improvement over iteration
2. Pre-allocate-and-fill is very helpful when explicit iteration is
   required.
3. `*apply()` functions help avoid the need for explicit pre-allocation and
   make opportunities for vectorization more apparent. This may come at a
   small performance cost, but is generally worth it
4. Hoisting common sub-expressions can be helpful for improving performance
   when explicit iteration is required.
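The profiling strategy listed earlier applies directly to this case study. A
minimal sketch using the built-in `Rprof()` sampling profiler on the slow
`f0()` (the temporary output file is an arbitrary choice for illustration):

```{r profile-f0, eval=FALSE}
Rprof(tf <- tempfile())      # start the sampling profiler
res <- f0(10000)
Rprof(NULL)                  # stop profiling
summaryRprof(tf)$by.self     # time attributed to each function call
```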
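The 'robust' priority applies here, too: `f4()` silently accepts inputs such
as `NA` or vectors of length greater than one. A minimal sketch of one way to
guard against these; the validation mirrors the checks commented out in
`f0()`, and `f4_robust` is a name invented for this illustration:

```{r robust-f4}
f4_robust <- function(n, a=2) {
    ## validate inputs before computing, as sketched in f0()'s
    ## commented-out stopifnot()
    stopifnot(is.numeric(n), length(n) == 1L, !is.na(n), n >= 0)
    a * log(seq_len(n))
}
identical(f4(10), f4_robust(10))
f4_robust(0)    # n = 0 returns a 0-length result rather than an error
```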
[Back to top]

# Iteration and restriction to manage memory

When data are too large to fit in memory, we can iterate through files in
chunks or subset the data by fields or genomic positions.

Iterate:

- Chunk-wise: `open()`, read chunk(s), `close()`.
- E.g., the `yieldSize` argument to `Rsamtools::BamFile()`
- Framework: `GenomicFiles::reduceByYield()`

Restrict (see the sketch at the end of this section):

- Limit to columns and/or rows of interest
- Exploit domain-specific formats
    - BAM files and `Rsamtools::ScanBamParam()`
    - BAM files and `Rsamtools::PileupParam()`
    - VCF files and `VariantAnnotation::ScanVcfParam()`
- Use a database

## Exercise: Counting overlaps

Iterate through files: `GenomicFiles::reduceByYield()`

1. Yield a chunk
2. Map from the input chunk to a possibly transformed representation
3. Reduce mapped chunks

```{r reduceByYield-setup}
suppressPackageStartupMessages({
    library(GenomicFiles)
    library(GenomicAlignments)
    library(Rsamtools)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
})

yield <-     # how to input the next chunk of data
    function(X, ...)
{
    readGAlignments(X)
}

map <-       # what to do to each chunk
    function(VALUE, ..., roi)
{
    olaps <- findOverlaps(VALUE, roi, type="within", ignore.strand=TRUE)
    count <- tabulate(subjectHits(olaps), subjectLength(olaps))
    setNames(count, names(roi))
}

reduce <- `+`    # how to combine mapped chunks
```

```{r yieldFactory}
yieldFactory <-  # return a function with local state
    function()
{
    n_records <- 0L
    function(X, ...) {
        aln <- readGAlignments(X)
        n_records <<- n_records + length(aln)
        message(n_records)
        aln
    }
}
```

Regions of interest, named like the chromosomes in the bam file.

```{r count-overlaps-roi, eval=FALSE}
exByTx <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx")

fl <- "/home/ubuntu/data/vobencha/LargeData/srarchive/hg19_alias.tab"
map0 <- read.delim(fl, header=FALSE, stringsAsFactors=FALSE)
seqlevels(exByTx, force=TRUE) <- setNames(map0$V1, map0$V2)
```

A function to iterate through the bam file.

```{r count-overlaps, eval=FALSE}
count1 <- function(filename, roi) {
    message(filename)
    ## create and open BAM file
    bf <- BamFile(filename, yieldSize=1000000)
    reduceByYield(bf, yieldFactory(), map, reduce, roi=roi)
}
```

In action

```{r count-overlaps-doit, eval=FALSE}
bam <- "/home/ubuntu/data/vobencha/LargeData/srarchive/SRR1039508_sorted.bam"
count <- count1(bam, exByTx)
```
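To see restriction on its own, here is a minimal sketch (not part of the
original exercise) that limits input from the same BAM file to a single
region and a single field, so only the needed data is read into _R_. It
assumes the BAM file has been indexed, and the sequence name is an assumption
that must match the BAM header (compare the alias file above):

```{r restrict-sketch, eval=FALSE}
## restrict 'rows' (genomic ranges) and 'columns' (fields) at input time
region <- GRanges("chr14", IRanges(1, 10000000))   # assumed seqname
p <- ScanBamParam(which=region, what="mapq")
gal <- readGAlignments(BamFile(bam), param=p)      # 'bam' from the chunk above
```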
[Back to top]

# File management

## File classes

| Type   | Example use           | Name                        | Package                          |
|--------|-----------------------|-----------------------------|----------------------------------|
| .bed   | Range annotations     | `BedFile()`                 | `r Biocpkg("rtracklayer")`       |
| .wig   | Coverage              | `WigFile()`, `BigWigFile()` | `r Biocpkg("rtracklayer")`       |
| .gtf   | Transcript models     | `GTFFile()`                 | `r Biocpkg("rtracklayer")`       |
|        |                       | `makeTxDbFromGFF()`         | `r Biocpkg("GenomicFeatures")`   |
| .2bit  | Genomic sequence      | `TwoBitFile()`              | `r Biocpkg("rtracklayer")`       |
| .fastq | Reads & qualities     | `FastqFile()`               | `r Biocpkg("ShortRead")`         |
| .bam   | Aligned reads         | `BamFile()`                 | `r Biocpkg("Rsamtools")`         |
| .tbx   | Indexed tab-delimited | `TabixFile()`               | `r Biocpkg("Rsamtools")`         |
| .vcf   | Variant calls         | `VcfFile()`                 | `r Biocpkg("VariantAnnotation")` |

```{r rtracklayer-file-classes}
## rtracklayer menagerie
suppressPackageStartupMessages(library(rtracklayer))
names(getClass("RTLFile")@subclasses)
```

Notes

- Not a consistent interface
- `open()`, `close()`, `import()` / `yield()` / `read*()`
- Some: selective import via index (e.g., `.bai`, bam index); selection
  ('columns'); restriction ('rows')

## Managing a collection of files

`*FileList()` classes

- `reduceByYield()` -- iterate through a single large file
- `bplapply()` (`r Biocpkg("BiocParallel")`) -- perform independent
  operations on several files, in parallel

`GenomicFiles()`

- 'rows' as genomic range restrictions, 'columns' as files
- Each row x column is a _map_ from file data to a useful representation in
  _R_
- `reduceByRange()`, `reduceByFile()`: collapse maps into a summary
  representation
- See the GenomicFiles vignette,
  [Figure 1](//www.andersvercelli.com/packages/devel/bioc/vignettes/GenomicFiles/inst/doc/GenomicFiles.pdf)

`VcfStack()`

- Common practice of splitting VCF files into one per chromosome.
- An easy way to treat the files as a single entity
- See `?VcfStack`

[Back to top]

# Parallel evaluation

A couple of caveats:

- Iteration / restriction techniques keep the memory requirements under
  control, while parallel evaluation distributes the computational load
  across nodes. Keep in mind that parallel computations are still restricted
  by the amount of memory available on each node.
- There is overhead in setting up and tearing down a cluster, and more so
  when computing in distributed memory. For small calculations, the parallel
  overhead may outweigh the benefits, with no gain in performance.
- Jobs that benefit the most from parallel execution are CPU-intensive and
  operate on data chunks that fit into memory.

## BiocParallel

`r Biocpkg("BiocParallel")` provides a standardized interface for parallel
evaluation and supports the major parallel computing styles: forks and
processes on a single computer, ad hoc clusters, batch schedulers and cloud
computing. By default, `r Biocpkg("BiocParallel")` chooses a parallel
back-end appropriate for the OS and is supported across Unix, Mac and
Windows.

General ideas:

- Use `bplapply()` instead of `lapply()`
- The argument `BPPARAM` influences how parallel evaluation occurs
    - `MulticoreParam()`: threads on a single (non-Windows) machine
    - `SnowParam()`: processes on the same or different machines
    - `BatchJobsParam()`: resource scheduler on a cluster

### Exercise: Sleeping serially and in parallel

This small example motivates the use of parallel execution and demonstrates
how `bplapply()` can be a drop-in replacement for `lapply()`.

Use `system.time()` to explore how long the execution takes as `n` increases
from 1 to 10. Use `identical()` and `r CRANpkg("microbenchmark")` to compare
the alternatives `f0()` and `f1()` for both correctness and performance (a
worked sketch follows the code). `fun` sleeps for 1 second, then returns
`i`.

```{r parallel-sleep}
library(BiocParallel)

fun <- function(i) {
    Sys.sleep(1)
    i
}

## serial
f0 <- function(n)
    lapply(seq_len(n), fun)

## parallel
f1 <- function(n)
    bplapply(seq_len(n), fun)
```
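One possible way to carry out the comparison just described; the timings are
approximate and depend on the number of workers, and `n <- 4` is an arbitrary
choice for illustration:

```{r parallel-sleep-compare, eval=FALSE}
n <- 4
system.time(res0 <- f0(n))    # roughly n seconds: serial
system.time(res1 <- f1(n))    # less, when several workers share the tasks
identical(res0, res1)         # same result either way
```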
### Exercise: Error handling and `BPREDO`

`r Biocpkg("BiocParallel")` 'catches and returns' errors along with
successful results. This exercise demonstrates how to access the
`traceback()` of a failed task and how to re-run the failed tasks with
`BPREDO`. Full details on error handling, logging and debugging are in the
[Errors, Logs and Debugging][] vignette.

```{r parallel-bpredo-param}
param <- MulticoreParam(workers = 3)
```

Call the `sqrt()` function across 'X'; the second element is a character and
will throw an error.

```{r parallel-bpredo-bplapply, error=TRUE}
X <- list(1, "2", 3)
res <- bplapply(X, sqrt, BPPARAM = param)
```

```{r parallel-bptry}
res <- bptry(bplapply(X, sqrt, BPPARAM=param))
res
```

Re-run the failed results by repeating the call to `bplapply()`, this time
with corrected input data and the partial results as `BPREDO`. Only the
failed values are re-run.

```{r parallel-bpredo}
X.redo <- list(1, 2, 3)
bplapply(X.redo, sqrt, BPREDO = res)
```

Alternatively, switch to a `SerialParam()` and debug the specific element
that caused the error.

```{r parallel-debug, eval=FALSE}
> fun = function(i) { browser(); sqrt(i) }
> bplapply(X, fun, BPREDO=res, BPPARAM=SerialParam())
resuming previous calculation ...
Called from: FUN(...)
Browse[1]> debug at #1: sqrt(i)
Browse[2]> i
[1] "2"
Browse[2]> i = 2
Browse[2]> c
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051
```

[Back to top]

### Exercise: Logging

`r Biocpkg("BiocParallel")` uses the
[futile.logger](http://cran.r-project.org/web/packages/futile.logger/index.html)
package for logging. The package has a flexible system for filtering
messages of varying severity thresholds, such as INFO, DEBUG, ERROR etc.
(For a list of all thresholds see the `?bpthreshold` man page.)
`r Biocpkg("BiocParallel")` captures messages written in `futile.logger`
format as well as messages written to stdout and stderr.

This function does some argument checking and has DEBUG-, WARN- and
INFO-level log messages.

```{r logging, eval=FALSE}
FUN <- function(i) {
    flog.debug(paste0("value of 'i': ", i))

    if (!length(i)) {
        warning("'i' is missing")
        NA
    } else if (!is(i, "numeric")) {
        flog.info("coercing to numeric")
        as.numeric(i)
    } else {
        i
    }
}
```

Turn logging on in the param and set the threshold to WARN.

```{r logging-WARN, eval=FALSE}
param <- SnowParam(3, log = TRUE, threshold = "WARN")
bplapply(list(1, "2", integer()), FUN, BPPARAM = param)
```

Lower the threshold to INFO and then DEBUG (i.e., use `bpthreshold<-`) to
see how messages are filtered on severity.

[Back to top]

### Exercise: Worker timeout

For long-running jobs or untested code it can be useful to set a time limit.
The _timeout_ field is the time, in seconds, allowed for each worker to
complete a task. If a task takes longer than _timeout_, the worker returns
an error.

_timeout_ can be set during param construction,

```{r timeout_constructor}
param <- SnowParam(timeout = 20)
param
```

or with the `bptimeout` setter:

```{r timeout_setter}
bptimeout(param) <- 2
param
```

Use this function to explore different _timeout_s over a numeric vector of
'X' values. 'X' values less than _timeout_ return successfully, while those
over _timeout_ return an error (a sketch follows).

```{r timeout_bplapply}
fun <- function(i) {
    Sys.sleep(i)
    i
}
```
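A minimal sketch of such an exploration, assuming the `param` with the
2-second timeout set above; `bptry()` collects the error instead of stopping
the computation:

```{r timeout-explore, eval=FALSE}
X <- c(1, 3)    # sleep times in seconds; 3 exceeds the 2-second timeout
res <- bptry(bplapply(X, fun, BPPARAM = param))
res             # first element succeeds, second is a timeout error
```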
[Back to top]

## `GenomicFiles::reduceByFile()`

The previous counting example used `GenomicFiles::reduceByYield()`, which
operates on a single file and implements a yield, map, reduce paradigm. In
this exercise we'll use `GenomicFiles::reduceByFile()`, which uses
`bplapply()` under the hood to operate on multiple files in parallel.

The primary arguments to `reduceByFile()` are a set of files and a set of
ranges. Files are sent to the workers and data subsets are extracted based
on the ranges. The bulk of the work is done in the _MAP_ function, and an
optional _REDUCE_ function combines the output on each worker.

```{r co-setup}
suppressPackageStartupMessages({
    library(BiocParallel)
    library(GenomicFiles)
    library(GenomicAlignments)
    library(Rsamtools)
})
```

On Unix or Mac, configure a `MulticoreParam()` with 4 workers. Turn on
logging and set a timeout of 60 seconds.

```{r co-param}
param <- MulticoreParam(4, log = TRUE, timeout = 60)
```

On Windows do the same with `SnowParam()`:

```{r co-param-snow, eval=FALSE}
param <- SnowParam(4, log = TRUE, timeout = 60)
```

Point to the collection of bam files.

```{r co-bams}
fls <- dir("/home/ubuntu/data/vobencha/LargeData/copynumber", ".bam$",
           full=TRUE)
names(fls) <- basename(fls)
bfl <- BamFileList(fls)
```

Defining ranges (regions of interest) restricts the amount of data on the
workers and keeps memory requirements under control. We'll use a set of
ranges on the Major Histocompatibility Complex locus on chromosome 6.

```{r co-GRanges}
ranges <- GRanges("chr6", IRanges(c(28477797, 29527797, 32448354),
                                  c(29477797, 30527797, 33448354)))
```

The _MAP_ function reads in records and counts overlaps. `readGAlignments()`
reads in bam records that overlap with any portion of the ranges defined in
the _scanBamParam_ (i.e., they could be overlapping the start or the end).
Once we've got the records in _R_, we want to count only those that fall
'within' the ranges.

```{r co-map, eval=FALSE}
MAP <- function(range, file, ...) {
    library(GenomicAlignments)          ## readGAlignments(), ScanBamParam()
    param = ScanBamParam(which=range)   ## restriction
    gal = readGAlignments(file, param=param)
    ## log messages
    flog.info(paste0("file: ", basename(file)))
    flog.debug(paste0("records: ", length(gal)))
    ## overlaps
    olaps <- findOverlaps(gal, range, type="within", ignore.strand=TRUE)
    tabulate(subjectHits(olaps), subjectLength(olaps))
}
```

Count ...

```{r co-doit, eval=FALSE}
cts <- reduceByFile(ranges, fls, MAP, BPPARAM = param)
```

The result is a list the same length as the number of files.

```{r co-length, eval=FALSE}
length(cts)
```

Each list element is the length of the number of ranges.

```{r co-elementlengths, eval=FALSE}
elementLengths(cts)
```

Tables of counts for each range are extracted with `[[`:

```{r co-tables, eval=FALSE}
cts[[1]]
```

[Back to top]

## Other resources

- [Bioconductor Amazon AMI](//www.andersvercelli.com/help/bioconductor-cloud-ami/)
    - Easily 'spin up' 10's of instances
    - Pre-configured with Bioconductor packages and StarCluster management
- `r Biocpkg("GoogleGenomics")` to interact with Google compute cloud and
  resources

[Back to top]

# Resources

- Lawrence, M., and Morgan, M. 2014. Scalable Genomics with R and
  Bioconductor. Statistical Science 2014, Vol. 29, No. 2, 214-226.
  http://arxiv.org/abs/1409.2864v1
- BiocParallel: //www.andersvercelli.com/packages/release/bioc/html/BiocParallel.html
- GenomicFiles: //www.andersvercelli.com/packages/release/bioc/html/GenomicFiles.html

[Back to top]

[Errors, Logs and Debugging]: //www.andersvercelli.com/packages/3.2/bioc/vignettes/BiocParallel/inst/doc/Errors_Logs_And_Debugging.pdf