---
title: "20.1 - Working with large data"
author: "Martin Morgan"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{20.1 - Working with large data}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
knitr::opts_chunk$set(
    eval = as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache = as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))
)
```

Author: [Martin Morgan][]
Date: 26 July 2019

[Martin Morgan]: mailto:martin.morgan@roswellpark.org

# Motivating example

Attach the [TENxBrainData][] experiment data package

```{r, message = FALSE}
library(TENxBrainData)
```

Load a very large `SummarizedExperiment`

```{r}
tenx <- TENxBrainData()
tenx
assay(tenx)
```

Basic operations are fast

```{r}
log1p(assay(tenx))
```

Subset, and summarize the library size of 1000 cells

```{r}
tenx_subset <- tenx[, 1:1000]
lib_size <- colSums(assay(tenx_subset))
hist(log10(lib_size))
```

The data are sparse: more than 92% of the values are 0

```{r}
sum(assay(tenx_subset) == 0) / prod(dim(tenx_subset))
```

# Write or use efficient _R_ code - the most important step!

Avoid unnecessary copies

```{r}
n <- 50000

## a copy of `res1` is made at each iteration
set.seed(123)
system.time({
    res1 <- NULL
    for (i in 1:n)
        res1 <- c(res1, rnorm(1))
})

## pre-allocate
set.seed(123)
system.time({
    res2 <- numeric(n)
    for (i in 1:n)
        res2[i] <- rnorm(1)
})
identical(res1, res2)

## no need to think about allocation!
set.seed(123)
system.time({
    res3 <- sapply(1:n, function(i) rnorm(1))
})
identical(res1, res3)
```

_Vectorize_ your own scripts

```{r}
n <- 2000000

## iterate: n calls to `rnorm(1)`
set.seed(123)
system.time({
    res1 <- sapply(1:n, function(i) rnorm(1))
})

## vectorize: 1 call to `rnorm(n)`
set.seed(123)
system.time(res2 <- rnorm(n))
identical(res1, res2)
```

_Reuse_ others' efficient code

- E.g., `limma::lmFit()` to fit 10,000's of linear models very quickly (see the sketch below)

Examples in the lab this afternoon, and from Tuesday evening
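As an illustration, `lmFit()` fits one linear model per row of a matrix in a single vectorized call. A minimal sketch on simulated data (the matrix `m`, the factor `group`, and the dimensions are invented for illustration, not taken from the lab data):

```{r, message = FALSE}
library(limma)

## simulated data: 10,000 'genes' by 6 samples, in two groups of 3
m <- matrix(rnorm(60000), nrow = 10000)
group <- gl(2, 3)
design <- model.matrix(~ group)

## one call fits all 10,000 linear models...
system.time(fit <- lmFit(m, design))

## ...versus one call to lm() per row
system.time(fits <- apply(m, 1, function(y) lm(y ~ group)))
```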
# Use chunk-wise iteration

Example: nucleotide (GC) content of mapped reads

## An example without chunking

Working through example data; not really large...

```{r, message = FALSE}
library(RNAseqData.HNRNPC.bam.chr14)
fname <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
basename(fname)
Rsamtools::countBam(fname)
```

Input into a `GAlignments` object, including the `seq` (sequence) of each read.

```{r, message = FALSE}
library(GenomicAlignments)
param <- ScanBamParam(what = "seq")
galn <- readGAlignments(fname, param = param)
```

Write a function to determine the GC content of reads

```{r, message = FALSE}
library(Biostrings)
gc_content <- function(galn) {
    seq <- mcols(galn)[["seq"]]
    gc <- letterFrequency(seq, "GC", as.prob = TRUE)
    as.vector(gc)
}
```

Calculate and display GC content

```{r}
param <- ScanBamParam(what = "seq")
galn <- readGAlignments(fname, param = param)
res1 <- gc_content(galn)
hist(res1)
```

## The same example with chunking

Open the file for reading, specifying the 'yield' size

```{r}
bfl <- BamFile(fname, yieldSize = 100000)
open(bfl)
```

Repeatedly read chunks of data and calculate GC content

```{r}
res2 <- numeric()
repeat {
    message(".")
    galn <- readGAlignments(bfl, param = param)
    if (length(galn) == 0)
        break
    ## inefficient copy of res2, but only a few iterations...
    res2 <- c(res2, gc_content(galn))
}
```

Clean up and compare approaches

```{r}
close(bfl)
identical(res1, res2)
```

# Use (classical) parallel evaluation

Many down-sides

- More complicated code, e.g., to distribute data
- Relies on, and requires mastery of, supporting infrastructure

Maximum speed-up

- Proportional to the number of parallel computations
- In reality, less, because of
  - the cost of data movement from 'manager' to 'worker'
  - the additional overhead of orchestrating parallel computations

## [BiocParallel][]

```{r, echo = FALSE}
## force garbage collection before timing
xx <- gc(); xx <- gc()
```

```{r, message = FALSE}
fun <- function(i) {
    Sys.sleep(1) # a time-consuming calculation
    i            # and then the result
}
system.time({
    res1 <- lapply(1:10, fun)
})

library(BiocParallel)
system.time({
    res2 <- bplapply(1:10, fun)
})
identical(res1, res2)
```

- 'Forked' processes (non-Windows)
  - No need to distribute data from the main process to workers
- Independent processes
  - Classic clusters, e.g., _slurm_
- Coming: cloud-based solutions

A back-end is chosen by registering it, as sketched below.
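A minimal sketch of selecting among these back-ends with `BiocParallel::register()`; the worker counts and the _slurm_ configuration are illustrative placeholders, not part of the original lab:

```{r, eval = FALSE}
library(BiocParallel)

## forked processes (not available on Windows); workers share the
## manager's memory, so input data need not be copied
register(MulticoreParam(workers = 4))

## alternatives: independent processes (all platforms)...
## register(SnowParam(workers = 4))
## ...or a classic cluster, e.g., slurm, via the batchtools package
## register(BatchtoolsParam(workers = 16, cluster = "slurm"))

## subsequent calls use the registered back-end
res <- bplapply(1:4, sqrt)
```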
## [GenomicFiles][]

Parallel, chunk-wise iteration through genomic files. Set up:

```{r, message = FALSE}
library(GenomicFiles)
```

Define a `yield` function that provides a chunk of data for processing

```{r}
yield <- function(x) {
    param <- ScanBamParam(what = "seq")
    readGAlignments(x, param = param)
}
```

Define a `map` function that transforms the input data to the desired result

```{r}
map <- function(x) {
    seq <- mcols(x)[["seq"]]
    gc <- letterFrequency(seq, "GC", as.prob = TRUE)
    as.vector(gc)
}
```

Define a `reduce` function that combines two successive results

```{r}
## combine chunk-wise results by concatenation
reduce <- c
```

Perform the calculation, chunk-wise and in parallel

```{r}
library(RNAseqData.HNRNPC.bam.chr14)
fname <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
bfl <- BamFile(fname, yieldSize = 100000)
res <- reduceByYield(bfl, yield, map, reduce, parallel = TRUE)
hist(res)
```

[BiocParallel]: https://bioconductor.org/packages/BiocParallel
[GenomicFiles]: https://bioconductor.org/packages/GenomicFiles
[TENxBrainData]: https://bioconductor.org/packages/TENxBrainData

# Query specific values

# Provenance

```{r}
sessionInfo()
```