Contents

Original Authors: Martin Morgan, Sonali Arora
Presenting Author: Martin Morgan
Date: 26 July, 2019

Efficient R code

The goal of this section is to highlight practices for writing correct, understandable, robust and efficient R code.

Priorities

  1. Correct: consistent with hand-worked examples (identical(), all.equal())
  2. Understandable: easy to understand next month; easy to describe what it does to a colleague; easy to spot logical errors; easy to enhance.
  3. Robust: supports realistic inputs, e.g., 0-length vectors, NA values, …
  4. Fast, or at least reasonable given the speed of modern computers.
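The distinction between identical() and all.equal() in priority 1 matters in practice: identical() demands exact equality, while all.equal() tolerates floating-point rounding. A minimal base-R sketch:

```r
## Correctness checks: identical() is exact; all.equal() allows a
## small numerical tolerance, appropriate for hand-worked examples.
x <- sqrt(2)^2
identical(x, 2)          # FALSE: floating-point result differs from 2
isTRUE(all.equal(x, 2))  # TRUE: equal within numerical tolerance
```

For computed numeric results, all.equal() is usually the right check; reserve identical() for integers, strings, and exact structural comparisons.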

Strategies

  1. Profile

    • Look at the script to understand in general terms what it is doing.
    • Step through the code to see how it is executed, and to gain an understanding of the speed of each line.
    • Time evaluation of selected lines or simple chunks of code with system.time() or the microbenchmark package.
    • Profile the code with a tool that indicates how much time is spent in each function call or line – the built-in Rprof() function, or packages such as lineprof or aprof.
  2. Vectorize – operate on vectors, rather than explicit loops

    x <- 1:10
    log(x)  ## NOT: for (i in seq_along(x)) x[i] <- log(x[i])
    ## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
    ## [8] 2.0794415 2.1972246 2.3025851
  3. Pre-allocate memory, then fill in the result

    result <- numeric(10)
    result[1] <- runif(1)
    for (i in 2:length(result))
        result[i] <- runif(1) * result[i - 1]
    result
    ## [1] 0.2112226484 0.0837410720 0.0375219451 0.0087721750 0.0070520504
    ## [6] 0.0052281030 0.0040075761 0.0020400718 0.0007054378 0.0002237723
  4. Hoist common sub-expressions outside of repeated calculations, so that the sub-expression is only calculated once

    • Simple, e.g., ‘hoist’ constant multiplications out of a for loop
    • Higher-level, e.g., use lm.fit() rather than repeatedly fitting the same design matrix.
  5. Re-use existing, tested code
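Strategy 4 (hoisting) can be illustrated with a small base-R sketch; the function names are illustrative. The only change is that the constant product a * b is computed once, outside the loop, instead of on every iteration:

```r
n <- 1e5; a <- 2; b <- 3

## before: the constant product a * b is recomputed every iteration
f_slow <- function() {
    r <- numeric(n)
    for (i in 1:n) r[i] <- a * b * i
    r
}

## after: hoist the constant sub-expression out of the loop
f_fast <- function() {
    ab <- a * b
    r <- numeric(n)
    for (i in 1:n) r[i] <- ab * i
    r
}

identical(f_slow(), f_fast())  # same answer, fewer multiplications
```

The gain from a single hoisted multiplication is modest; the technique pays off most when the hoisted expression is itself expensive (e.g., a repeated matrix decomposition).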

Back to top

Case Study: Pre-allocate and vectorize

Here’s an inefficient function:

f0 <- function(n, a=2) {
    ## stopifnot(is.integer(n) && (length(n) == 1) &&
    ##     !is.na(n) && (n > 0))
    result <- numeric()
    for (i in seq_len(n))
        result <- c(result, a * log(i))
    result
}

Use system.time() to investigate how this algorithm scales with n, focusing on elapsed time.

system.time(f0(10000))
##    user  system elapsed
##   0.168   0.012   0.180
n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")

Remember the current ‘correct’ value, and an approximate time:

n <- 10000
system.time(expected <- f0(n))
##    user  system elapsed
##   0.208   0.012   0.218
head(expected)
## [1] 0.000000 1.386294 2.197225 2.772589 3.218876 3.583519

Revise the function to hoist the common multiplier, a, out of the loop. Make sure the result of the ‘optimization’ and the original calculation are the same. Use the microbenchmark package to compare the two versions.

f1 <- function(n, a=2) {
    result <- numeric()
    for (i in seq_len(n))
        result <- c(result, log(i))
    a * result
}
identical(expected, f1(n))
## [1] TRUE
library(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)
## Unit: milliseconds
##   expr      min       lq     mean   median       uq      max neval
##  f0(n) 115.2943 123.1257 125.0825 124.2671 128.2014 134.5239     5
##  f1(n) 116.1748 118.6345 120.6672 119.3346 124.0238 125.1682     5

Adopt a ‘pre-allocate and fill’ strategy

f2 <- function(n, a=2) {
    result <- numeric(n)
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f2(n))
## [1] TRUE
microbenchmark(f0(n), f2(n), times=5)
## Unit: microseconds
##   expr        min         lq       mean     median         uq        max neval
##  f0(n) 113041.957 117630.568 138406.110 129078.581 151187.614 181091.832     5
##  f2(n)    934.442    938.751   1212.101   1144.617   1394.306   1648.391     5

Use an *apply() function to avoid having to explicitly pre-allocate, and to make opportunities for vectorization more apparent.

f3 <- function(n, a=2)
    a * sapply(seq_len(n), log)
identical(expected, f3(n))
## [1] TRUE
microbenchmark(f0(n), f2(n), f3(n), times=10)
## Unit: microseconds
##   expr        min         lq       mean      median         uq        max neval
##  f0(n) 114909.182 116023.724 119635.840 120344.3835 123610.775 123799.268    10
##  f2(n)    926.806    931.197   1117.099    956.8945   1316.960   1689.219    10
##  f3(n)   3330.438   3414.339   4259.659   3588.0740   5019.677   7848.792    10

Now that the code is presented in a single line, it is apparent that it could be easily vectorized. Seize the opportunity to vectorize it:

f4 <- function(n, a=2)
    a * log(seq_len(n))
identical(expected, f4(n))
## [1] TRUE
microbenchmark(f0(n), f3(n), f4(n), times=10)
## Unit: microseconds
##   expr        min         lq        mean     median         uq        max neval
##  f0(n) 113447.054 118092.762 129562.6032 121978.758 134142.406 173583.208    10
##  f3(n)   3310.615   3353.244   4198.5312   3502.275   5476.352   5836.909    10
##  f4(n)    225.965    229.301    385.5061    235.880    253.538   1693.834    10

f4() definitely seems to be the winner. How does it scale with n? (Repeat several times.)

n <- 10 ^ (5:8)  # 100x larger than f0
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, log="xy", type="b")

Any explanations for the different pattern of response?

Lessons learned:

  1. Vectorizing offers a huge improvement over iteration
  2. Pre-allocate-and-fill is very helpful when explicit iteration is required.
  3. *apply() functions help avoid the need for explicit pre-allocation and make opportunities for vectorization more apparent. This may come at a small performance cost, but is generally worth it.
  4. Hoisting common sub-expressions can be helpful for improving performance when explicit iteration is required.

Back to top

Iteration and restriction to manage memory

When data are too large to fit in memory, we can iterate through files in chunks or subset the data by fields or genomic positions.
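The chunk-wise idea can be sketched with base R alone, before introducing genomic file classes; the temporary file below is purely illustrative. A connection is read a fixed number of lines at a time, each chunk is processed, and only the running summary is kept in memory:

```r
## Chunk-wise iteration in base R: summarize a file without loading
## it all into memory. The file and its contents are illustrative.
tmp <- tempfile()
writeLines(as.character(1:10000), tmp)

con <- file(tmp, "r")
total <- 0
repeat {
    chunk <- readLines(con, n = 1000)        # yield the next chunk
    if (length(chunk) == 0)                  # end of file
        break
    total <- total + sum(as.numeric(chunk))  # map, then reduce
}
close(con)

total == sum(1:10000)  # same answer as an in-memory computation
```

GenomicFiles::reduceByYield(), used below, formalizes exactly this yield / map / reduce pattern for BAM and other genomic files.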

Iteration

Restriction

Exercise: Counting overlaps

Iterate through files: GenomicFiles::reduceByYield()

  1. Yield a chunk
  2. Map from the input chunk to a possibly transformed representation
  3. Reduce mapped chunks
suppressPackageStartupMessages({
    library(GenomicFiles)
    library(GenomicAlignments)
    library(Rsamtools)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    library(RNAseqData.HNRNPC.bam.chr14)
})
yield <-    # how to input the next chunk of data
    function(X, ...) {
        readGAlignments(X)
    }
map <-      # what to do to each chunk
    function(VALUE, ..., roi) {
        olaps <- findOverlaps(VALUE, roi, type="within", ignore.strand=TRUE)
        count <- tabulate(subjectHits(olaps), subjectLength(olaps))
        setNames(count, names(roi))
    }
reduce <- `+`  # how to combine mapped chunks

Improvement: “yield factory” to keep track of how many records input

yieldFactory <-  # return a function with local state
    function() {
        n_records <- 0L
        function(X, ...) {
            aln <- readGAlignments(X)
            n_records <<- n_records + length(aln)
            message(n_records)
            aln
        }
    }

Regions of interest, named like the chromosomes in the bam file.

exByTx <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx")

A function to iterate through a bam file.

count1 <- function(filename, roi) {
    message(filename)
    ## Create and open BAM file
    bf <- BamFile(filename, yieldSize=1000000)
    reduceByYield(bf, yieldFactory(), map, reduce, roi=roi)
}

In action

## path to example bam files
library(RNAseqData.HNRNPC.bam.chr14)
bam <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
count <- count1(bam[[1]], exByTx)
## /home/csama/R/x86_64-pc-linux-gnu-library/3.6/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127306_chr14.bam
## 800484
## 800484
hist(log10(count)) # drops 0's

Back to top

File management

File classes

Type    Example use             Name                      Package
.bed    Range annotations       BedFile()                 rtracklayer
.wig    Coverage                WigFile(), BigWigFile()   rtracklayer
.gtf    Transcript models       GTFFile()                 rtracklayer
                                makeTxDbFromGFF()         GenomicFeatures
.2bit   Genomic Sequence        TwoBitFile()              rtracklayer
.fastq  Reads & qualities       FastqFile()               ShortRead
.bam    Aligned reads           BamFile()                 Rsamtools
.tbx    Indexed tab-delimited   TabixFile()               Rsamtools
.vcf    Variant calls           VcfFile()                 VariantAnnotation
## rtracklayer menagerie
suppressPackageStartupMessages(library(rtracklayer))
names(getClass("RTLFile")@subclasses)
##  [1] "UCSCFile"          "GFFFile"           "BEDFile"
##  [4] "WIGFile"           "BigWigFile"        "ChainFile"
##  [7] "TwoBitFile"        "FastaFile"         "TabSeparatedFile"
## [10] "CompressedFile"    "GFF1File"          "GFF2File"
## [13] "GFF3File"          "BEDGraphFile"      "BED15File"
## [16] "BEDPEFile"         "NarrowPeakFile"    "BroadPeakFile"
## [19] "BWFile"            "2BitFile"          "GTFFile"
## [22] "GVFFile"           "GZFile"            "BGZFile"
## [25] "BZ2File"           "XZFile"

Notes

  • Not a consistent interface
  • open(), close(), import() / yield() / read*()
  • Some: selective import via index (e.g., .bai, bam index); selection (‘columns’); restriction (‘rows’)

Managing a collection of files

*FileList() classes

  • reduceByYield() – iterate through a single large file
  • bplapply() (BiocParallel) – perform independent operations on several files, in parallel

GenomicFiles()

  • ‘rows’ as genomic range restrictions, ‘columns’ as files
  • Each row x column is a map from file data to a useful representation in R
  • reduceByRange(), reduceByFile(): collapse maps into a summary representation
  • see the GenomicFiles vignette, Figure 1

VcfStack()

  • Common practice of splitting VCF files into one per chromosome.
  • Easy way to treat them as a single entity
  • see ?VcfStack

Back to top

Parallel evaluation

A couple of caveats:

Iteration / restriction techniques keep the memory requirements under control while parallel evaluation distributes computational load across nodes. Keep in mind that parallel computations are still restricted by the amount of memory available on each node.

There is overhead in setting up and tearing down a cluster and more so when computing in distributed memory. For small calculations, the parallel overhead may outweigh the benefits with no improvement in performance.

Jobs that benefit the most from parallel execution are CPU-intensive and operate on data chunks that fit into memory.

BiocParallel

BiocParallel provides a standardized interface for parallel evaluation and supports the major parallel computing styles: forks and processes on a single computer, ad hoc clusters, batch schedulers and cloud computing. By default, BiocParallel chooses a parallel back-end appropriate for the OS and is supported across Unix, Mac and Windows.

General ideas:

  • Use bplapply() instead of lapply()
  • Argument BPPARAM influences how parallel evaluation occurs

    • MulticoreParam(): threads on a single (non-Windows) machine
    • SnowParam(): processes on the same or different machines
    • BatchJobsParam(): resource scheduler on a cluster

Exercise: Sleeping serially and in parallel

This small example motivates the use of parallel execution and demonstrates how bplapply() can be a drop-in replacement for lapply().

Use system.time() to explore how long this takes to execute as n increases from 1 to 10. Use identical() and microbenchmark to compare the alternatives f0() and f1() for both correctness and performance.

fun sleeps for 1 second, then returns i.

library(BiocParallel)
fun <- function(i) {
    Sys.sleep(1)
    i
}
## serial
f0 <- function(n) lapply(seq_len(n), fun)
## parallel
f1 <- function(n) bplapply(seq_len(n), fun)
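A sketch of the serial half of the exercise (the parallel half needs a BiocParallel back-end, so only base R is used here): with n tasks of one second each, the serial version takes about n seconds, which is what parallel evaluation amortizes across workers.

```r
## Serial timing sketch: n one-second tasks take about n seconds.
fun <- function(i) {
    Sys.sleep(1)  # stand-in for real work
    i
}
f0 <- function(n) lapply(seq_len(n), fun)

t_serial <- system.time(res <- f0(2))[["elapsed"]]
t_serial                       # close to 2 seconds for n = 2
identical(res, as.list(1:2))   # results are returned in order
```

With bplapply() and k workers, the same job would be expected to take roughly ceiling(n / k) seconds, plus the cluster setup overhead noted above.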

Back to top

Exercise: error handling and BPREDO

BiocParallel “catches and returns” errors along with successful results. This exercise demonstrates how to access the traceback() of a failed task and how to re-run the failed tasks with ‘BPREDO’. Full details on error handling, logging and debugging are in the Errors, Logs and Debugging vignette.

param <- MulticoreParam(workers = 3) # Windows: SnowParam(workers=3)

Call the sqrt() function across ‘X’; the second element is a character and will throw an error.

X <- list(1, "2", 3)
res <- bplapply(X, sqrt, BPPARAM = param)
## Error: BiocParallel errors
##   element index: 2
##   first error: non-numeric argument to mathematical function

It’s also possible to catch the error and the partially evaluated result:

res <- bptry(bplapply(X, sqrt, BPPARAM=param))
res
## [[1]]
## [1] 1
##
## [[2]]
##
## traceback() available as 'attr(x, "traceback")'
##
## [[3]]
## [1] 1.732051

Re-run the failed elements by repeating the call to bplapply(), this time with the corrected input data and the partial results as ‘BPREDO’. Only the failed values are re-run.

X.redo <- list(1, 2, 3)
bplapply(X.redo, sqrt, BPREDO = res)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051

Alternatively, switch to a SerialParam() and debug the specific element that caused the error.

> fun = function(i) { browser(); sqrt(i) }
> bplapply(X, fun, BPREDO=res, BPPARAM=SerialParam())
resuming previous calculation ...
Called from: FUN(...)
Browse[1]>
debug at #1: sqrt(i)
Browse[2]> i
[1] "2"
Browse[2]> i = 2
Browse[2]> c
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051

Back to top

Exercise: logging

BiocParallel uses the futile.logger package for logging. The package has a flexible system for filtering messages of varying severity thresholds such as INFO, DEBUG, ERROR, etc. (For a list of all thresholds see the ?bpthreshold man page.) BiocParallel captures messages written in futile.logger format as well as messages written to stdout and stderr.

This function does some argument checking and has DEBUG, WARN and INFO-level log messages.

FUN <- function(i) {
    flog.debug(paste0("value of 'i': ", i))
    if (!length(i)) {
        flog.warn("'i' is missing")
        NA
    } else if (!is(i, "numeric")) {
        flog.info("coercing to numeric")
        as.numeric(i)
    } else {
        i
    }
}

Turn logging on in the param and set the threshold to WARN.

param <- SnowParam(3, log = TRUE, threshold = "WARN")
bplapply(list(1, "2", integer()), FUN, BPPARAM = param)

Lower the threshold to INFO and DEBUG (i.e., use bpthreshold<-) to see how messages are filtered on severity.

Back to top

Exercise: Worker timeout

For long running jobs or untested code it can be useful to set a time limit. The timeout field is the time, in seconds, allowed for each worker to complete a task. If a task takes longer than timeout the worker returns an error.

timeout can be set during param construction,

param <- SnowParam(timeout = 20)
param
## class: SnowParam
##   bpisup: FALSE; bpnworkers: 2; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: 20; bpprogressbar: FALSE
##   bpexportglobals: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: SOCK

or with the setter:

bptimeout(param) <- 2
param
## class: SnowParam
##   bpisup: FALSE; bpnworkers: 2; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: 2; bpprogressbar: FALSE
##   bpexportglobals: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: SOCK

Use this function to explore different timeouts over a numeric vector of ‘X’ values with bplapply(). ‘X’ values less than timeout return successfully while those greater than timeout return an error.

fun <- function(i) {
    Sys.sleep(i)
    i
}
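The idea behind a worker timeout can also be sketched in base R with setTimeLimit(); this is an analogue of the concept, not BiocParallel's implementation, and the helper with_timeout() is a name invented here for illustration:

```r
## Base-R analogue of a worker timeout: setTimeLimit() raises an
## error if evaluation exceeds the elapsed-time budget.
with_timeout <- function(expr, seconds) {
    setTimeLimit(elapsed = seconds, transient = TRUE)
    on.exit(setTimeLimit(elapsed = Inf))  # always clear the limit
    tryCatch(expr, error = function(e) "timed out")
}

fast <- with_timeout("ok", 5)                      # finishes in time
slow <- with_timeout({ while (TRUE) {} }, 1)       # exceeds the budget
fast
slow
```

In BiocParallel the analogous error is caught per task, so one slow worker produces a per-element error rather than aborting the whole bplapply() call.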

Back to top

Exercise: Counting overlaps in parallel

Distribute files over workers: GenomicFiles::reduceByFile()

The previous counting example used GenomicFiles::reduceByYield(), which operates on a single file and implements a yield, map, reduce paradigm. In this exercise we’ll use GenomicFiles::reduceByFile(), which uses bplapply() under the hood to operate on multiple files in parallel.

Primary arguments to reduceByFile() are a set of files and a set of ranges. Files are sent to the workers and data subsets are extracted based on the ranges. The bulk of the work is done in the MAP function and an optional REDUCE function combines the output on each worker.

suppressPackageStartupMessages({
    library(BiocParallel)
    library(GenomicFiles)
    library(GenomicAlignments)
    library(Rsamtools)
})

On Unix or Mac, configure a MulticoreParam() with 4 workers. Turn on logging and set a timeout of 60 seconds.

param <- MulticoreParam(4, log = TRUE, timeout = 60)

On Windows do the same with SnowParam():

param <- SnowParam(4, log = TRUE, timeout = 60)

Point to the collection of bam files.

library(RNAseqData.HNRNPC.bam.chr14)
fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
bfl <- BamFileList(fls)

Defining ranges (region of interest) restricts the amount of data on the workers and keeps memory requirements under control.

ranges <- GRanges("chr14",
    IRanges(c(28477797, 29527797, 32448354),
            c(29477797, 30527797, 33448354)))

The MAP function reads in records and counts overlaps. readGAlignments() reads in bam records that overlap with any portion of the ranges defined in the ScanBamParam (i.e., they could be overlapping the start or the end). Once we’ve got the records in R, we want to count only those that fall ‘within’ the ranges.

MAP <- function(range, file, ...) {
    library(GenomicAlignments)  ## readGAlignments(), ScanBamParam()
    param = ScanBamParam(which=range)  ## restriction
    gal = readGAlignments(file, param=param)
    ## overlaps
    olaps <- findOverlaps(gal, range, type="within", ignore.strand=TRUE)
    tabulate(subjectHits(olaps), subjectLength(olaps))
}

Count …

cts <- reduceByFile(ranges, fls, MAP, BPPARAM = param)
## ############### LOG OUTPUT ###############
## Task: 1
## Node: 1
## Timestamp: 2019-07-26 13:45:44
## Success: TRUE
##
## Task duration:
##    user  system elapsed
##   0.556   0.092   0.718
##
## Memory used:
##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  8007208 427.7   14721708 786.3  9351876 499.5
## Vcells 16688881 127.4   43009392 328.2 42958239 327.8
##
## Log messages:
## INFO [2019-07-26 13:45:44] loading futile.logger package
##
## stderr and stdout:
## ############### LOG OUTPUT ###############
## Task: 2
## Node: 2
## Timestamp: 2019-07-26 13:45:45
## Success: TRUE
##
## Task duration:
##    user  system elapsed
##   0.584   0.076   0.756
##
## Memory used:
##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  8007971 427.7   14721708 786.3  9351876 499.5
## Vcells 16690575 127.4   43009392 328.2 42958239 327.8
##
## Log messages:
## INFO [2019-07-26 13:45:44] loading futile.logger package
##
## stderr and stdout:
## ############### LOG OUTPUT ###############
## Task: 3
## Node: 3
## Timestamp: 2019-07-26 13:45:46
## Success: TRUE
##
## Task duration:
##    user  system elapsed
##   0.576   0.096   0.805
##
## Memory used:
##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  8008004 427.7   14721708 786.3  9351876 499.5
## Vcells 16690653 127.4   43009392 328.2 42958239 327.8
##
## Log messages:
## INFO [2019-07-26 13:45:44] loading futile.logger package
##
## stderr and stdout:
## ############### LOG OUTPUT ###############
## Task: 4
## Node: 4
## Timestamp: 2019-07-26 13:45:46
## Success: TRUE
##
## Task duration:
##    user  system elapsed
##   0.588   0.084   0.813
##
## Memory used:
##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  8008037 427.7   14721708 786.3  9351876 499.5
## Vcells 16690729 127.4   43009392 328.2 42958239 327.8
##
## Log messages:
## INFO [2019-07-26 13:45:44] loading futile.logger package
##
## stderr and stdout:

The result is a list the same length as the number of files.

length(cts)
## [1] 8

Each list element is the length of the number of ranges.

lengths(cts)
## ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304
##         3         3         3         3         3         3         3
## ERR127305
##         3

Tables of counts for each range are extracted with [[; use simplify2array() to get a matrix of counts.

cts[[1]]
## [[1]]
## [1] 429
##
## [[2]]
## [1] 22
##
## [[3]]
## [1] 1338
simplify2array(cts)
##      ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304
## [1,]       429       477       508       452       516       721       713
## [2,]        22        18        37        24        14        26        30
## [3,]      1338      1938      1618      1683      1988      2526      2372
##      ERR127305
## [1,]       544
## [2,]        20
## [3,]      1785

Back to top

Other resources

Back to top
