---
title: "20.1 - Working with large data"
author: "Martin Morgan"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{20.1 - Working with large data}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
knitr::opts_chunk$set(
    eval = as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache = as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))
)
```

Author: [Martin Morgan][]
Date: 26 July 2019

[Martin Morgan]: mailto:martin.morgan@roswellpark.org

# Motivating example

Attach the [TENxBrainData][] experiment data package

```{r, message = FALSE}
library(TENxBrainData)
```

Load a very large `SummarizedExperiment`

```{r}
tenx <- TENxBrainData()
tenx
assay(tenx)
```

Basic operations are fast

```{r}
log1p(assay(tenx))
```

Subset, and summarize the library size of 1000 cells

```{r}
tenx_subset <- tenx[, 1:1000]
lib_size <- colSums(assay(tenx_subset))
hist(log10(lib_size))
```

The data are sparse: more than 92% of the values are 0

```{r}
sum(assay(tenx_subset) == 0) / prod(dim(tenx_subset))
```

# Write or use efficient _R_ code - the most important step!

Avoid unnecessary copies

```{r}
n <- 50000

## a copy of `res1` is made at each iteration
set.seed(123)
system.time({
    res1 <- NULL
    for (i in 1:n)
        res1 <- c(res1, rnorm(1))
})

## pre-allocate
set.seed(123)
system.time({
    res2 <- numeric(n)
    for (i in 1:n)
        res2[i] <- rnorm(1)
})
identical(res1, res2)

## no need to think about allocation!
set.seed(123)
system.time({
    res3 <- sapply(1:n, function(i) rnorm(1))
})
identical(res1, res3)
```

_Vectorize_ your own scripts

```{r}
n <- 2000000

## iterate: n calls to `rnorm(1)`
set.seed(123)
system.time({
    res1 <- sapply(1:n, function(i) rnorm(1))
})

## vectorize: 1 call to `rnorm(n)`
set.seed(123)
system.time(res2 <- rnorm(n))
identical(res1, res2)
```

_Reuse_ others' efficient code

- E.g., `limma::lmFit()` to fit 10,000's of linear models very quickly (see the sketch below)

Examples in the lab this afternoon, and from Tuesday evening
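As an illustration, `lmFit()` fits one linear model per row of a matrix in a single vectorized call. A minimal sketch on simulated data (the matrix `m`, the factor `group`, and the dimensions are invented for illustration, not taken from the lab data):

```{r, message = FALSE}
library(limma)

## simulated data: 10,000 'genes' by 6 samples, in two groups of 3
m <- matrix(rnorm(60000), nrow = 10000)
group <- gl(2, 3)
design <- model.matrix(~ group)

## one call fits all 10,000 linear models...
system.time(fit <- lmFit(m, design))

## ...versus one call to lm() per row
system.time(fits <- apply(m, 1, function(y) lm(y ~ group)))
```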
# Use chunk-wise iteration

Example: nucleotide (GC) content of mapped reads

## An example without chunking

Working through example data; not really large...

```{r, message = FALSE}
library(RNAseqData.HNRNPC.bam.chr14)
fname <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
basename(fname)
Rsamtools::countBam(fname)
```

Input into a `GAlignments` object, including the `seq` (sequence) of each read.

```{r, message = FALSE}
library(GenomicAlignments)
param <- ScanBamParam(what = "seq")
galn <- readGAlignments(fname, param = param)
```

Write a function to determine the GC content of reads

```{r, message = FALSE}
library(Biostrings)
gc_content <- function(galn) {
    seq <- mcols(galn)[["seq"]]
    gc <- letterFrequency(seq, "GC", as.prob = TRUE)
    as.vector(gc)
}
```

Calculate and display GC content

```{r}
param <- ScanBamParam(what = "seq")
galn <- readGAlignments(fname, param = param)
res1 <- gc_content(galn)
hist(res1)
```

## The same example with chunking

Open the file for reading, specifying the 'yield' size

```{r}
bfl <- BamFile(fname, yieldSize = 100000)
open(bfl)
```

Repeatedly read chunks of data and calculate GC content

```{r}
res2 <- numeric()
repeat {
    message(".")
    galn <- readGAlignments(bfl, param = param)
    if (length(galn) == 0)
        break
    ## inefficient copy of res2, but only a few iterations...
    res2 <- c(res2, gc_content(galn))
}
```

Clean up and compare approaches

```{r}
close(bfl)
identical(res1, res2)
```

# Use (classical) parallel evaluation

Many down-sides

- More complicated code, e.g., to distribute data
- Relies on, and requires mastery of, supporting infrastructure

Maximum speed-up

- Proportional to the number of parallel computations
- In reality, less, because of
  - the cost of data movement from 'manager' to 'worker'
  - the additional overhead of orchestrating parallel computations

## [BiocParallel][]

```{r, echo = FALSE}
## force garbage collection before timing
xx <- gc(); xx <- gc()
```

```{r, message = FALSE}
fun <- function(i) {
    Sys.sleep(1) # a time-consuming calculation
    i            # and then the result
}
system.time({
    res1 <- lapply(1:10, fun)
})

library(BiocParallel)
system.time({
    res2 <- bplapply(1:10, fun)
})
identical(res1, res2)
```

- 'Forked' processes (non-Windows)
  - No need to distribute data from the main process to workers
- Independent processes
  - Classic clusters, e.g., _slurm_
- Coming: cloud-based solutions

A back-end is chosen by registering it, as sketched below.
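A minimal sketch of selecting among these back-ends with `BiocParallel::register()`; the worker counts and the _slurm_ configuration are illustrative placeholders, not part of the original lab:

```{r, eval = FALSE}
library(BiocParallel)

## forked processes (not available on Windows); workers share the
## manager's memory, so input data need not be copied
register(MulticoreParam(workers = 4))

## alternatives: independent processes (all platforms)...
## register(SnowParam(workers = 4))
## ...or a classic cluster, e.g., slurm, via the batchtools package
## register(BatchtoolsParam(workers = 16, cluster = "slurm"))

## subsequent calls use the registered back-end
res <- bplapply(1:4, sqrt)
```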
## [GenomicFiles][]

Parallel, chunk-wise iteration through genomic files. Set up:

```{r, message = FALSE}
library(GenomicFiles)
```

Define a `yield` function that provides a chunk of data for processing

```{r}
yield <- function(x) {
    param <- ScanBamParam(what = "seq")
    readGAlignments(x, param = param)
}
```

Define a `map` function that transforms the input data to the desired result

```{r}
map <- function(x) {
    seq <- mcols(x)[["seq"]]
    gc <- letterFrequency(seq, "GC", as.prob = TRUE)
    as.vector(gc)
}
```

Define a `reduce` function that combines two successive results

```{r}
## combine chunk-wise results by concatenation
reduce <- c
```

Perform the calculation, chunk-wise and in parallel

```{r}
library(RNAseqData.HNRNPC.bam.chr14)
fname <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
bfl <- BamFile(fname, yieldSize = 100000)
res <- reduceByYield(bfl, yield, map, reduce, parallel = TRUE)
hist(res)
```

[BiocParallel]: https://bioconductor.org/packages/BiocParallel
[GenomicFiles]: https://bioconductor.org/packages/GenomicFiles
[TENxBrainData]: https://bioconductor.org/packages/TENxBrainData

# Query specific values

# Provenance

```{r}
sessionInfo()
```