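Contents

1 Motivating example
2 Write or use efficient R code
3 Use chunk-wise iteration
- 3.1 An example without chunking
- 3.2 The same example with chunking
4 Use (classical) parallel evaluation
- 4.1 BiocParallel
- 4.2 GenomicFiles
5 Query specific values
6 Provenance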

Author: Martin Morgan
Date: 26 July 2019

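1 Motivating example

Attach the TENxBrainData experiment data package

library(TENxBrainData)

Load a very large SummarizedExperiment

tenx <- TENxBrainData()
## snapshotDate(): 2019-07-10
## see ?TENxBrainData and browseVignettes('TENxBrainData') for documentation
## downloading 0 resources
## loading from cache
tenx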

## class: SingleCellExperiment
## dim: 27998 1306127
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(2): Ensembl Symbol
## colnames(1306127): AAACCTGAGATAGGAG-1 AAACCTGAGCGGCTTC-1 ...
##   TTTGTCAGTTAAAGTG-133 TTTGTCATCTGAAAGA-133
## colData names(4): Barcode Sequence Library Mouse
## reducedDimNames(0):
## spikeNames(0):

assay(tenx)
## <27998 x 1306127> DelayedMatrix object of type "integer":
##          AAACCTGAGATAGGAG-1 ...
##     [1,]                  0   . 0
##     [2,]                  0   . 0
##     [3,]                  0   . 0
##     [4,]                  0   . 0
##     [5,]                  0   . 0
##      ...                  .   . .
## [27994,]                  0   . 0
## [27995,]                  1   . 0
## [27996,]                  0   . 0
## [27997,]                  0   . 0
## [27998,]                  0   . 0

Quickly perform basic operations

log1p(assay(tenx))
## <27998 x 1306127> DelayedMatrix object of type "double":
##          AAACCTGAGATAGGAG-1 ...
##     [1,]                  0   . 0
##     [2,]                  0   . 0
##     [3,]                  0   . 0
##     [4,]                  0   . 0
##     [5,]                  0   . 0
##      ...                  .   . .
## [27994,]          0.0000000   . 0
## [27995,]          0.6931472   . 0
## [27996,]          0.0000000   . 0
## [27997,]          0.0000000   . 0
## [27998,]          0.0000000   . 0

Subset and summarize 'library size' for the first 1000 cells

tenx_subset <- tenx[, 1:1000]
lib_size <- colSums(assay(tenx_subset))
hist(log10(lib_size))

Data is sparse, with more than 92% of matrix entries equal to 0

sum(assay(tenx_subset) == 0) / prod(dim(tenx_subset))

## [1] 0.9276541
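
The same two summaries can be tried without downloading the full data set. The following is a minimal sketch on simulated counts; the matrix `counts`, its dimensions, and the Poisson rate are made up for illustration and are not derived from TENxBrainData.

```r
## Simulated stand-in for a small count matrix: 200 genes x 50 cells.
## The rate 0.08 is an arbitrary choice that makes most entries zero.
set.seed(123)
counts <- matrix(rpois(200 * 50, lambda = 0.08), nrow = 200, ncol = 50)

## 'Library size' is the total count per cell (column)
lib_size <- colSums(counts)

## Fraction of entries equal to zero. mean() of a logical vector is the
## proportion of TRUE values, so this equals sum(x == 0) / prod(dim(x)).
frac_zero <- mean(counts == 0)
frac_zero
```

On an ordinary in-memory matrix both forms give the same number; `mean()` is simply shorter to write.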

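2 Write or use efficient R code

This is the most important step!

Avoid unnecessary copying

n <- 50000

## makes a copy of `res1` on each iteration
set.seed(123)
system.time({
    res1 <- NULL
    for (i in 1:n)
        res1 <- c(res1, rnorm(1))
})
##    user  system elapsed
##   7.098   2.908  10.013

## pre-allocation
set.seed(123)
system.time({
    res2 <- numeric(n)
    for (i in 1:n)
        res2[i] <- rnorm(1)
})
##    user  system elapsed
##   0.078   0.004   0.082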

identical(res1, res2)
## [1] TRUE

## no need to think about allocation!
set.seed(123)
system.time({
    res3 <- sapply(1:n, function(i) rnorm(1))
})
##    user  system elapsed
##   0.094   0.002   0.096

identical(res1, res3)
## [1] TRUE
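
A closely related option is `vapply()`, which also manages allocation but additionally checks that every result has the declared type and length. The sketch below is an illustration added here, not part of the original timings; it uses a smaller `n` so that it runs quickly, and the names `res_loop` and `res_vapply` are chosen for this example only.

```r
n <- 5000

## pre-allocate, then fill in place
set.seed(123)
res_loop <- numeric(n)
for (i in seq_len(n))
    res_loop[i] <- rnorm(1)

## vapply() allocates the result and requires each call to return
## exactly one numeric value, as declared by numeric(1)
set.seed(123)
res_vapply <- vapply(seq_len(n), function(i) rnorm(1), numeric(1))

identical(res_loop, res_vapply)
```

Because the return type is declared up front, `vapply()` fails early if a function unexpectedly returns something else, whereas `sapply()` would silently change the shape of its result.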

Vectorize your own scripts

n <- 2000000

## iteration: n calls to `rnorm(1)`
set.seed(123)
system.time({
    res1 <- sapply(1:n, function(i) rnorm(1))
})
##    user  system elapsed
##   5.678   0.693   6.374

## vectorize: 1 call to `rnorm(n)`
set.seed(123)
system.time(res2 <- rnorm(n))
##    user  system elapsed
##   0.137   0.000   0.136

identical(res1, res2)
## [1] TRUE
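
The same idea applies to functions you write yourself: replace a per-element loop with calls that operate on whole vectors. A minimal sketch, assuming a made-up task of clipping values to a range (the functions `clip_loop()` and `clip_vec()` are hypothetical names for this example):

```r
set.seed(123)
x <- rnorm(100000)

## iteration: one min()/max() call and one assignment per element
clip_loop <- function(x, lower, upper) {
    out <- numeric(length(x))
    for (i in seq_along(x))
        out[i] <- min(max(x[i], lower), upper)
    out
}

## vectorized: pmax() and pmin() process the whole vector in one call each
clip_vec <- function(x, lower, upper)
    pmin(pmax(x, lower), upper)

clipped_loop <- clip_loop(x, -1, 1)
clipped_vec <- clip_vec(x, -1, 1)
identical(clipped_loop, clipped_vec)
```

Wrapping each version in `system.time()` shows the difference on your own machine.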

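Reuse other people's efficient code.

E.g., limma::lmFit() to fit tens of thousands of linear models very quickly

Examples in this afternoon's lab, and from Tuesday evening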
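
Base R offers a small-scale illustration of why such code is fast: `lm()` accepts a matrix response and fits one model per column in a single call. This sketch is not limma and does not reproduce what `lmFit()` does internally; the data and all names (`Y`, `x`, `n_models`) are simulated for illustration.

```r
set.seed(123)
n_samples <- 20
n_models <- 1000
x <- rnorm(n_samples)
## one column of Y per model
Y <- matrix(rnorm(n_samples * n_models), nrow = n_samples)

## one call: the design matrix is decomposed once and reused for all columns
coef_all <- coef(lm(Y ~ x))          # 2 x n_models matrix

## loop: a separate lm() call, and decomposition, for every column
coef_loop <- sapply(seq_len(n_models), function(j) coef(lm(Y[, j] ~ x)))

all.equal(unname(coef_all), unname(coef_loop))
```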

3 Use chunk-wise iteration

Example: nucleotide frequency of mapped reads

3.1 An example without chunking

Working through example data; not really large...

library(RNAseqData.HNRNPC.bam.chr14)
fname <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
basename(fname)
## [1] "ERR127306_chr14.bam"

Rsamtools::countBam(fname)
##   space start end width                file records nucleotides
## 1    NA    NA  NA    NA ERR127306_chr14.bam  800484    57634848

Input into a GAlignments object, including the seq (sequence) of each read.

library(GenomicAlignments)
param <- ScanBamParam(what = "seq")
galn <- readGAlignments(fname, param = param)

Write a function to determine the GC content of reads

gc_content <- function(galn) {
    seq <- mcols(galn)[["seq"]]
    gc <- letterFrequency(seq, "GC", as.prob = TRUE)
    as.vector(gc)
}

Calculate and display GC content

param <- ScanBamParam(what = "seq")
galn <- readGAlignments(fname, param = param)
res1 <- gc_content(galn)
hist(res1)

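3.2 The same example with chunking

Open the file for reading, specifying the 'yield' size

bfl <- BamFile(fname, yieldSize = 100000)
open(bfl)

Repeatedly read chunks of data and calculate GC content

res2 <- numeric()
repeat {
    message(".")
    galn <- readGAlignments(bfl, param = param)
    if (length(galn) == 0)
        break
    ## inefficient copy of res2, but only a few iterations...
    res2 <- c(res2, gc_content(galn))
}
## .
## .
## .
## .
## .
## .
## .
## .
## .
## .

Clean up and compare approaches

close(bfl)
identical(res1, res2)
## [1] TRUE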
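
The open / read-until-empty / close pattern is not specific to BAM files. As a rough base R analogue, an open text connection also remembers its position between reads. The sketch below uses a temporary file of numbers as a stand-in for a large file; the file contents and the chunk size of 100 lines are arbitrary choices for illustration.

```r
## Stand-in for a large file: one number per line
path <- tempfile(fileext = ".txt")
writeLines(as.character(1:1050), path)

## All at once: the whole file is held in memory
total_all <- sum(as.numeric(readLines(path)))

## Chunk-wise: only one chunk is in memory at a time
con <- file(path, open = "r")
total_chunked <- 0
n_chunks <- 0
repeat {
    lines <- readLines(con, n = 100)    # at most 100 lines per chunk
    if (length(lines) == 0)             # nothing left to read
        break
    total_chunked <- total_chunked + sum(as.numeric(lines))
    n_chunks <- n_chunks + 1
}
close(con)
unlink(path)

identical(total_all, total_chunked)
```

Here the running total replaces the growing vector, so memory use stays bounded by the chunk size regardless of file size.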

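4 Use (classical) parallel evaluation

Many down-sides

More complicated code, e.g., to distribute data
Relies on, and requires mastery of, supporting infrastructure

Maximum speed-up

Proportional to the number of parallel computations
In reality, lower because of
- Cost of data movement from 'manager' to 'worker'
- Additional overhead of orchestrating parallel computations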
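
The trade-off in this list can be made concrete with a toy model. This is only a sketch: the function `speedup()`, its parameters, and the assumption that overhead grows linearly with the number of workers are all invented for illustration and do not describe any particular system.

```r
## Toy model of parallel speed-up (all names and values are hypothetical).
## work:                total compute time when run serially
## overhead_per_worker: assumed data-movement and coordination cost that
##                      is paid once for every worker
speedup <- function(workers, work = 100, overhead_per_worker = 0.5) {
    parallel_time <- work / workers + overhead_per_worker * workers
    work / parallel_time
}

workers <- c(1, 2, 4, 8, 16, 32)
round(speedup(workers), 2)
```

Under these assumptions the speed-up is always below the number of workers, and beyond some point adding workers makes things slower rather than faster.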

4.1 BiocParallel

fun <- function(i) {
    Sys.sleep(1)    # a time-consuming calculation
    i               # and then the result
}

system.time({
    res1 <- lapply(1:10, fun)
})
##    user  system elapsed
##   0.003   0.000  10.036

library(BiocParallel)
system.time({
    res2 <- bplapply(1:10, fun)
})
##    user  system elapsed
##   2.034   0.062   3.170

identical(res1, res2)
## [1] TRUE

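'Forked' processes (non-Windows)
- No need to distribute data from the main thread to workers
Independent processes
Classic clusters, e.g., sl
Coming: cloud-based solutions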
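
The difference between forked and independent processes can also be seen with the base R parallel package, which is used here only as an analogue so that the example has no Bioconductor dependency. It is a sketch, not a description of how BiocParallel is implemented; the worker count of 2 is an arbitrary choice.

```r
library(parallel)

fun <- function(i) {
    Sys.sleep(0.1)  # a (short) time-consuming calculation
    i               # and then the result
}

## Independent processes: available on all platforms. The function and its
## inputs are sent to each worker explicitly.
cl <- makeCluster(2)
res_cluster <- parLapply(cl, 1:4, fun)
stopCluster(cl)

## Forked processes: workers start as copies of the current session, so
## nothing needs to be exported. Forking is not available on Windows,
## where this falls back to a single core.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
res_fork <- mclapply(1:4, fun, mc.cores = cores)

identical(res_cluster, res_fork)
```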

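4.2 GenomicFiles

Parallel, chunk-wise iteration through genomic files. Set up:

library(GenomicFiles)

Define a yield function that provides a chunk of data for processing

yield <- function(x) {
    param <- ScanBamParam(what = "seq")
    readGAlignments(x, param = param)
}

Define a map function that transforms the input data to the desired result

map <- function(x) {
    seq <- mcols(x)[["seq"]]
    gc <- letterFrequency(seq, "GC", as.prob = TRUE)
    as.vector(gc)
}

Define a reduce function that combines two successive results

reduce <- c

Perform the calculation, chunk-wise and in parallel

library(RNAseqData.HNRNPC.bam.chr14)
fname <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
bfl <- BamFile(fname, yieldSize = 100000)
res <- reduceByYield(bfl, yield, map, reduce, parallel = TRUE)
hist(res)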
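
To make the roles of the three functions explicit, here is a serial base R sketch of the yield / map / reduce pattern. The helper `reduce_by_yield_sketch()` is hypothetical and is not the GenomicFiles implementation; it ignores parallel evaluation entirely and works on a plain text connection with made-up sequences rather than a BAM file.

```r
## Hypothetical serial helper illustrating the pattern only
reduce_by_yield_sketch <- function(con, yield, map, reduce) {
    result <- NULL
    repeat {
        chunk <- yield(con)             # next chunk of input
        if (length(chunk) == 0)         # input exhausted
            break
        value <- map(chunk)             # transform the chunk
        result <- if (is.null(result)) value else reduce(result, value)
    }
    result
}

## Made-up input: one short sequence per line
path <- tempfile(fileext = ".txt")
writeLines(c("ACGT", "GGCC", "ATAT", "GCGA", "TTTT"), path)

yield_lines <- function(con) readLines(con, n = 2)   # two lines per chunk
map_gc <- function(x) {
    ## fraction of G or C characters in each string
    nchar(gsub("[^GC]", "", x)) / nchar(x)
}

con <- file(path, open = "r")
res_sketch <- reduce_by_yield_sketch(con, yield_lines, map_gc, c)
close(con)
unlink(path)
res_sketch
```

Keeping the three steps as separate functions is what allows the map step to be handed to parallel workers while the yield step continues to read sequentially.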

5 Query specific values

6 Provenance

sessionInfo()

## R version 3.6.1 Patched (2019-07-16 r76845)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS:   /Users/ma38727/bin/R-3-6-branch/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-3-6-branch/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets
## [8] methods   base
##
## other attached packages:
##  [1] GenomicFiles_1.21.0                rtracklayer_1.45.2
##  [3] GenomicAlignments_1.21.4           Rsamtools_2.1.3
##  [5] Biostrings_2.53.2                  XVector_0.25.0
##  [7] RNAseqData.HNRNPC.bam.chr14_0.23.0 TENxBrainData_1.5.0
##  [9] HDF5Array_1.13.4                   rhdf5_2.29.0
## [11] SingleCellExperiment_1.7.0         SummarizedExperiment_1.15.5
## [13] DelayedArray_0.11.4                BiocParallel_1.19.0
## ...
## [19] IRanges_2.19.10                    S4Vectors_0.23.17
## [21] BiocGenerics_0.31.5                BiocStyle_2.13.2
##
## loaded via a namespace (and not attached):
##  [1] httr_1.4.0                    bit64_0.9-7
##  [3] AnnotationHub_2.17.6          shiny_1.3.2
##  [5] assertthat_0.2.1              askpass_1.1
##  [7] interactiveDisplayBase_1.23.0 BiocManager_1.30.4
##  [9] BiocFileCache_1.9.1           blob_1.2.0
## [11] BSgenome_1.53.0               GenomeInfoDbData_1.2.1
## [13] progress_1.2.2                yaml_2.2.0
## [15] pillar_1.4.2                  RSQLite_2.1.2
## [17] backports_1.1.4               lattice_0.20-38
## [19] glue_1.3.1                    digest_0.6.20
## [21] promises_1.0.1                htmltools_0.3.6
## [23] httpuv_1.5.1                  Matrix_1.2-17
## [25] XML_3.98-1.20                 pkgconfig_2.0.2
## [27] biomaRt_2.41.7                bookdown_0.12
## [29] zlibbioc_1.31.0               purrr_0.3.2
## [31] xtable_1.8-4                  later_0.8.0
## [33] openssl_1.4.1                 tibble_2.1.3
## [35] GenomicFeatures_1.37.4        magrittr_1.5
## [37] crayon_1.3.4                  mime_0.7
## [39] memoise_1.1.0                 evaluate_0.14
## [41] prettyunits_1.0.2             tools_3.6.1
## [43] hms_0.5.0                     stringr_1.4.0
## [45] Rhdf5lib_1.7.3                AnnotationDbi_1.47.0
## [47] compiler_3.6.1                rlang_0.4.0
## [49] grid_3.6.1                    RCurl_1.95-4.12
## [51] VariantAnnotation_1.31.3      rappdirs_0.3.1
## [53] bitops_1.0-6                  rmarkdown_1.14
## ...
## [63] zeallot_0.1.0                 stringi_1.4.3
## [65] Rcpp_1.0.1                    vctrs_0.2.0
## [67] dbplyr_1.4.2                  tidyselect_0.2.5
## [69] xfun_0.8

20.1 - Working with large data