---
title: "_R_ packages for communicating reproducible research"
author:
- name: Martin Morgan
  affiliation: Roswell Park Comprehensive Cancer Center
date: "26 December, 2019"
vignette: >
  %\VignetteIndexEntry{R packages for communicating reproducible research}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: true
---

Last modified: 31 May, 2019

```{r style, echo = FALSE, results = 'asis'}
knitr::opts_chunk$set(
    eval = as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache = as.logical(Sys.getenv("KNITR_CACHE", "TRUE")),
    collapse = TRUE
)
options(width = 75)
```

# Abstract

This tutorial is for anyone who wants to write an _R_ package. _R_ is
a fantastic language for developing new statistical approaches for the
analysis and comprehension of real-world data. _R_ packages provide a
way to capture a new approach in a reproducible, documented unit. An
_R_ package is surprisingly easy to create, and creating one has many
benefits.

In this tutorial we create an _R_ package. We start with a data set
and a script that transforms the data in a useful way; perhaps you
have your own data set and script? We replace the script with a
_function_, and place the function and data into an _R_
_package_. We then add documentation, so that our users (and our
future selves) understand what the function does and how it applies
to new data sets.

With an _R_ package in hand, we can tackle more advanced challenges:
_vignettes_ for rich narrative description of the package; _unit
tests_ to make our package more robust; and _version control_ to
document how we change the package. The final step in the development
of our package is to share it with others, through github, through
CRAN, or through domain-specific channels such as _Bioconductor_.

# Biological and statistical motivation

## Measuring gene expression on single cells: single-cell RNA seq

The 'central dogma' of molecular biology: genes encoded in DNA
(chromosomes) are transcribed to mRNA and then translated to proteins.

![](our_figures/2b597889d05bc601803a3b4d9ec5ccd5e7b8d3af.png)

- https://cdn.kastatic.org/ka-perseus-images/2b597889d05bc601803a3b4d9ec5ccd5e7b8d3af.png
  All Khan Academy content is available for free at www.khanacademy.org

RNA-sequencing (bulk RNA-seq)

- Isolate mRNA from a large sample of cells
- Reverse transcribe to cDNA, fragment, and sequence
- Align sequenced fragments to reference genome
- More aligned fragments are interpreted as higher gene expression.

![](our_figures/rna4.JPG.jpg)

- http://bio.lundberg.gu.se/courses/vt13/rnaseq.html

Single-cell RNA-seq

- Isolate individual cells
- Associate each cell with bar-coded beads
- Sequence bar-coded cDNA
- Most current methods lead to very sparse 'coverage'

![](our_figures/microfluidics.png)

- Hwang et al., 2018 https://doi.org/10.1038/s12276-018-0071-8

## Simulated data

Parameters:

```{r}
n_genes <- 20000
n_cells <- 100

## gamma-distributed gene means
rate <- .1
shape <- .1

## negative binomial counts
dispersion <- 0.1
```

A very rough simulation:

```{r}
set.seed(123)
gene_means <- rgamma(n_genes, shape = shape, rate = rate)
cell_size_factors <- 2 ^ rnorm(n_cells, sd = 0.5)
## expected count for each gene (row) in each cell (column)
cell_means <- outer(gene_means, cell_size_factors, `*`)
counts <- matrix(
    rnbinom(n_genes * n_cells, mu = cell_means, size = 1 / dispersion),
    nrow = n_genes, ncol = n_cells
)
```

Basic properties of the simulated data

```{r}
range(counts)

## proportion of zeros
mean(counts == 0)

## 'library size' -- reads mapped per cell
hist(colSums(counts), main = "Library Size")

## average expression per gene
hist(rowMeans(log1p(counts)), main = "log Gene Expression")
```

## Size factors

```{r}
## log(0) is -Inf, so a gene with a zero count in any cell has only
## non-finite centered values; filtered_median() ignores these
log_counts <- log(counts)
centered <- log_counts - rowMeans(log_counts)
filtered_median <- function(x) median(x[is.finite(x)])
size_factors <- exp(apply(centered, 2, filtered_median))
hist(size_factors)
range(size_factors)
median(size_factors)
```

# The basics

## _R_ packages

An _R_ package is a collection of files and directories on disk. A
complete package might have a structure like that illustrated below.

```
SCSimulate
  DESCRIPTION
  NAMESPACE
  R/
    simulate.R
    size_factors.R
  man/
    simulate.Rd
    size_factors.Rd
  vignettes/
    Using_this_package.Rmd
  tests/
    testthat.R
    testthat/
      test_simulate.R
      test_size_factor.R
```

DESCRIPTION

- Package name, title, description (like paper abstract)
- Authors and maintainer (contact author)
- License
- Other packages this package depends on ('dependencies')

    - `Depends`: Data structures or work flows required for use of
      this package.
    - `Imports`: Used inside the current package. For instance, we
      will use functions like `rgamma()`, `rnorm()`, and `rnbinom()`
      from the `stats` package.
    - `Suggests`: Used in examples or vignettes.

NAMESPACE

- Functions used by this package -- `import()`, `importFrom()`
- Functions this package makes available to users -- `export()`

R/

- Text files containing _R_ function definitions

man/

- Text files documenting functions

vignettes/

- 'Markdown' or other text documents describing use of the package.

tests/

- Functions used to provide 'unit' tests for the package.
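
The `DESCRIPTION` file is plain text in a `Field: value` layout that _R_ can parse with `read.dcf()`. The chunk below is a minimal sketch rather than part of the package: it writes a throwaway `DESCRIPTION` to a temporary directory and reads the fields back. The `pkg_dir` location and the field values are assumptions made for illustration only.

```r
## Write a throwaway DESCRIPTION; the values are illustrative placeholders
pkg_dir <- file.path(tempdir(), "SCSimulate")
dir.create(pkg_dir, showWarnings = FALSE)
writeLines(c(
    "Package: SCSimulate",
    "Version: 0.0.0.9000",
    "License: Artistic-2.0",
    "Imports: stats"
), file.path(pkg_dir, "DESCRIPTION"))

## read.dcf() returns a character matrix with one column per field
desc <- read.dcf(file.path(pkg_dir, "DESCRIPTION"))
colnames(desc)

## dependency fields are comma-separated lists of package names
imports <- strsplit(desc[1, "Imports"], ",[[:space:]]*")[[1]]
imports
```

Reading the file this way is a quick check that the fields are spelled and formatted as intended before building the package.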

## Package skeleton

Create a package

```{r, eval = FALSE}
devtools::create("SCSimulate")
## ✔ Creating 'SCSimulate/'
## ✔ Setting active project to '/Users/ma38727/b/github/BiocIntro/vignettes/SCSimulate'
## ✔ Creating 'R/'
## ✔ Writing 'DESCRIPTION'
## Package: SCSimulate
## Title: What the Package Does (One Line, Title Case)
## Version: 0.0.0.9000
## Authors@R (parsed):
##     * First Last [aut, cre] (YOUR-ORCID-ID)
## Description: What the package does (one paragraph).
## License: What license it uses
## Encoding: UTF-8
## LazyData: true
## ✔ Writing 'NAMESPACE'
## ✔ Setting active project to '<no active project>'
```

Edit the `DESCRIPTION` file to describe the package.

```
Package: SCSimulate
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R: c(
    person(
        given = "Martin", family = "Morgan",
        role = c("aut", "cre"),
        email = "martin.morgan@roswellpark.org",
        comment = c(ORCID = "YOUR-ORCID-ID")
    ),
    person(given = "Another", family = "Author", role = "aut")
    )
Description: Simulate single-cell RNA seq data using a gamma
    distribution of gene expression values, and a negative binomial
    model of counts per cell. The package also contains functions for
    preprocessing, including simple calculation of library scaling
    factors.
License: Artistic-2.0
Imports: stats
Encoding: UTF-8
LazyData: true
```

So far, our package looks like

```
SCSimulate
  DESCRIPTION
  NAMESPACE
  R/
```

## From script to function

Turn the part of the script describing the simulation into a function
`simulate()`. Use function arguments to capture default values.

```{r}
simulate <-
    function(n_genes = 20000, n_cells = 100, rate = 0.1, shape = 0.1,
             dispersion = 0.1)
{
    gene_means <- rgamma(n_genes, shape = shape, rate = rate)
    cell_size_factors <- 2 ^ rnorm(n_cells, sd = 0.5)
    cell_means <- outer(gene_means, cell_size_factors, `*`)
    matrix(
        rnbinom(n_genes * n_cells, mu = cell_means, size = 1 / dispersion),
        nrow = n_genes, ncol = n_cells
    )
}
```

Turn the size factor calculation part of the script into a function
`size_factors()`. The only argument to `size_factors()` is the count
matrix.

```{r}
.filtered_median <- function(x)
    median(x[is.finite(x)])

size_factors <- function(counts) {
    log_counts <- log(counts)
    centered <- log_counts - rowMeans(log_counts)
    exp(apply(centered, 2, .filtered_median))
}
```

Check that we haven't made any mistakes.

```{r}
set.seed(123)
counts <- simulate()
size_factors <- size_factors(counts)
range(size_factors)
median(size_factors)
```

## Add function definitions to the package

Place functions into files in the `R/` directory. Typically, name the
file after the function or group of functions in the file.

E.g., file: `R/simulate.R`

```{r}
simulate <-
    function(n_genes = 20000, n_cells = 100, rate = 0.1, shape = 0.1,
             dispersion = 0.1)
{
    gene_means <- rgamma(n_genes, shape = shape, rate = rate)
    cell_size_factors <- 2 ^ rnorm(n_cells, sd = 0.5)
    cell_means <- outer(gene_means, cell_size_factors, `*`)
    matrix(
        rnbinom(n_genes * n_cells, mu = cell_means, size = 1 / dispersion),
        nrow = n_genes, ncol = n_cells
    )
}
```

file: `R/size_factors.R`

```{r}
.filtered_median <- function(x)
    median(x[is.finite(x)])

size_factors <- function(counts) {
    log_counts <- log(counts)
    centered <- log_counts - rowMeans(log_counts)
    exp(apply(centered, 2, .filtered_median))
}
```

Our package now looks like

```
SCSimulate
  DESCRIPTION
  NAMESPACE
  R/
    simulate.R
    size_factors.R
```

## Document the functions

Use `roxygen2` for documentation by adding tagged lines starting with
`#'` immediately above each function. Common tags are illustrated
below.

- `@title` is a one-line description appearing at the top of a help
  page.

- `@description` provides a short description of the function,
  presented after the title. Use `@details` for more extensive
  description appearing after the 'Usage' section (generated based on
  the signature of the function after the `@export` tag) of a help
  page.

- Document parameter (`@param`) and return (`@return`) values
  carefully. The `@param` values are used to form the 'Arguments'
  section of the help page. The `@return` value appears in the 'Value'
  section of the help page.

- `@examples` are included in the 'Examples' section of the help page,
  and must be complete and syntactically correct _R_ code (examples
  are evaluated when a package is built and checked).

- Use `@importFrom` to indicate that a particular package provides
  specific functions used in the current package.

- For readability, wrap lines at 80 columns. Use indentation and
  spacing consistently and generously.

file: `R/simulate.R`

```{r}
#' @title Simulate single-cell data
#'
#' @description `simulate()` produces a genes x cells count matrix of
#'     simulated single-cell RNA-seq data. Gene expression is modelled
#'     using a gamma distribution. Counts are simulated using a
#'     negative binomial distribution.
#'
#' @param n_genes integer(1) the number of genes (rows) to simulate.
#'
#' @param n_cells integer(1) the number of cells (columns) to simulate.
#'
#' @param rate numeric(1) rate parameter of the `rgamma()` distribution.
#'
#' @param shape numeric(1) shape parameter of the `rgamma()` distribution.
#'
#' @param dispersion numeric(1) size (`1 / dispersion`) parameter of
#'     the `rnbinom()` distribution.
#'
#' @return `simulate()` returns a `n_genes x n_cells` matrix of
#'     simulated single-cell RNA-seq counts.
#'
#' @examples
#' counts <- simulate()
#' dim(counts)
#' mean(counts == 0)    # fraction of '0' cells
#' range(counts)
#'
#' @importFrom stats rgamma rnorm rnbinom
#'
#' @export
simulate <-
    function(n_genes = 20000, n_cells = 100, rate = 0.1, shape = 0.1,
             dispersion = 0.1)
{
    gene_means <- rgamma(n_genes, shape = shape, rate = rate)
    cell_size_factors <- 2 ^ rnorm(n_cells, sd = 0.5)
    cell_means <- outer(gene_means, cell_size_factors, `*`)
    matrix(
        rnbinom(n_genes * n_cells, mu = cell_means, size = 1 / dispersion),
        nrow = n_genes, ncol = n_cells
    )
}
```

file: `R/size_factors.R`

```{r}
#' @importFrom stats median
.filtered_median <- function(x)
    median(x[is.finite(x)])

#' @title Calculate geometric mean-centered median scaled cell scaling
#'     factors.
#'
#' @description `size_factors()` centers the log counts of each row
#'     (gene) by the row mean of the log counts. The finite centered
#'     values are then used to compute column-wise geometric median
#'     scaling factors.
#'
#' @param counts matrix() of gene x cell RNA-seq counts.
#'
#' @return `size_factors()` returns a `numeric(ncol(counts))` vector
#'     of scaling factors.
#'
#' @examples
#' set.seed(123)
#' counts <- simulate()
#' size_factors <- size_factors(counts)
#' median(size_factors)    # approximately 1
#'
#' @export
size_factors <- function(counts) {
    log_counts <- log(counts)
    centered <- log_counts - rowMeans(log_counts)
    exp(apply(centered, 2, .filtered_median))
}
```

## Update 'NAMESPACE' and 'man' pages

```{r}
devtools::document("SCSimulate")
```

- Updates the NAMESPACE file

    - Functions used by this package (`stats::rgamma()`,
      `stats::rnorm()`, `stats::rnbinom()`).
    - Indicates functions defined by this package and meant to be
      visible to the user (`simulate()` and `size_factors()`, but not
      `.filtered_median()`).

- Transforms the documentation introduced above into stand-alone files

    - E.g., `man/simulate.Rd`
    - A plain text file, but with LaTeX-style markup understood by
      _R_.

Our package now looks like

```
SCSimulate
  DESCRIPTION
  NAMESPACE
  R/
    simulate.R
    size_factors.R
  man/
    simulate.Rd
    size_factors.Rd
```

The NAMESPACE file has been updated to

```{r}
cat(readLines("SCSimulate/NAMESPACE"), sep="\n")
```

## Build & check

- Build (collate) package files into a 'tar' ball,
  `SCSimulate_0.0.0.9000.tar.gz`.
- Check that the tar ball is complete and correct.
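
The tar ball name combines the `Package` and `Version` fields of the `DESCRIPTION` file. As a small sketch of where `SCSimulate_0.0.0.9000.tar.gz` comes from, the hypothetical helper `tarball_name()` below (not part of the package or of `devtools`) derives the name from a throwaway `DESCRIPTION` written to a temporary file.

```r
## Hypothetical helper: derive the tar ball name from a DESCRIPTION file
tarball_name <- function(description_file) {
    desc <- read.dcf(description_file, fields = c("Package", "Version"))
    paste0(desc[1, "Package"], "_", desc[1, "Version"], ".tar.gz")
}

## A throwaway DESCRIPTION containing only the fields needed here
description_file <- tempfile()
writeLines(
    c("Package: SCSimulate", "Version: 0.0.0.9000"),
    description_file
)
tarball_name(description_file)
```

Changing the `Version` field therefore changes the name of the tar ball. In practice, `devtools::check()` performs the build and check steps together: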

```{r, eval = FALSE}
devtools::check("SCSimulate")
## Updating SCSimulate documentation
## Writing NAMESPACE
## Loading SCSimulate
## Writing NAMESPACE
## Writing size_factors.Rd
## ── Building ────────────────────────────────────────────────────── SCSimulate ──
## Setting env vars:
## ● CFLAGS    : -Wall -pedantic -fdiagnostics-color=always
## ● CXXFLAGS  : -Wall -pedantic -fdiagnostics-color=always
## ● CXX11FLAGS: -Wall -pedantic -fdiagnostics-color=always
## ────────────────────────────────────────────────────────────────────────────────
## ✔ checking for file ‘/Users/ma38727/a/github/BiocIntro/vignettes/SCSimulate/DESCRIPTION’
## ─ preparing ‘SCSimulate’:
## ✔ checking DESCRIPTION meta-information
## ─ checking for LF line-endings in source and make files and shell scripts
## ─ checking for empty or unneeded directories
## ─ building ‘SCSimulate_0.0.0.9000.tar.gz’
##
## ── Checking ────────────────────────────────────────────────────── SCSimulate ──
## Setting env vars:
## ● _R_CHECK_CRAN_INCOMING_USE_ASPELL_: TRUE
## ● _R_CHECK_CRAN_INCOMING_REMOTE_    : FALSE
## ● _R_CHECK_CRAN_INCOMING_           : FALSE
## ● _R_CHECK_FORCE_SUGGESTS_          : FALSE
## ── R CMD check ─────────────────────────────────────────────────────────────────
## Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
## ─ using log directory '/private/var/folders/yn/gmsh_22s2c55v816r6d51fx1tnyl61/T/Rtmp6S4exQ/SCSimulate.Rcheck'
## ─ using R Under development (unstable) (2019-12-01 r77489)
## ─ using platform: x86_64-apple-darwin17.7.0 (64-bit)
## ─ using session charset: UTF-8
## ─ using options '--no-manual --as-cran'
## ✔ checking for file 'SCSimulate/DESCRIPTION'
## ─ this is package 'SCSimulate' version '0.0.0.9000'
## ─ package encoding: UTF-8
## ✔ checking package namespace information
## ✔ checking package dependencies (3.4s)
## ✔ checking if this is a source package
## ✔ checking if there is a namespace
## ✔ checking for executable files
## ✔ checking for hidden files and directories
## ✔ checking for portable file names
## ✔ checking for sufficient/correct file permissions
## ✔ checking serialization versions
## ✔ checking whether package 'SCSimulate' can be installed (1.8s)
## ✔ checking installed package size
## ✔ checking package directory
## ✔ checking for future file timestamps (505ms)
## ✔ checking DESCRIPTION meta-information
## ✔ checking top-level files
## ✔ checking for left-over files
## ✔ checking index information
## ✔ checking package subdirectories
## ✔ checking R files for non-ASCII characters
## ✔ checking R files for syntax errors
## ✔ checking whether the package can be loaded
## ✔ checking whether the package can be loaded with stated dependencies
## ✔ checking whether the package can be unloaded cleanly
## ✔ checking whether the namespace can be loaded with stated dependencies
## ✔ checking whether the namespace can be unloaded cleanly
## ✔ checking loading without being on the library search path
## ✔ checking dependencies in R code
## ✔ checking S3 generic/method consistency (651ms)
## ✔ checking replacement functions
## ✔ checking foreign function calls
## ✔ checking R code for possible problems (2.1s)
## ✔ checking Rd files
## ✔ checking Rd metadata
## ✔ checking Rd line widths
## ✔ checking Rd cross-references
## ✔ checking for missing documentation entries
## ✔ checking for code/documentation mismatches (407ms)
## ✔ checking Rd \usage sections (792ms)
## ✔ checking Rd contents
## ✔ checking for unstated dependencies in examples
## ✔ checking examples (1.7s)
## ✔ checking for non-standard things in the check directory
## ✔ checking for detritus in the temp directory
##
##
## ── R CMD check results ────────────────────────────── SCSimulate 0.0.0.9000 ────
## Duration: 14.9s
##
## 0 errors ✔ | 0 warnings ✔ | 0 notes ✔
```

## Install

```{r, eval = FALSE}
devtools::install("SCSimulate")
## ✔ checking for file ‘/Users/ma38727/a/github/BiocIntro/vignettes/SCSimulate/DESCRIPTION’
## ─ preparing ‘SCSimulate’:
## ✔ checking DESCRIPTION meta-information
## ─ checking for LF line-endings in source and make files and shell scripts
## ─ checking for empty or unneeded directories
## ─ building ‘SCSimulate_0.0.0.9000.tar.gz’
##
## Running /Users/ma38727/bin/R-devel/bin/R CMD INSTALL \
##   /var/folders/yn/gmsh_22s2c55v816r6d51fx1tnyl61/T//Rtmp6S4exQ/SCSimulate_0.0.0.9000.tar.gz \
##   --install-tests
## * installing to library ‘/Users/ma38727/Library/R/4.0/Bioc/3.11/library’
## * installing *source* package ‘SCSimulate’ ...
## ** using staged installation
## ** R
## ** byte-compile and prepare package for lazy loading
## ** help
## *** installing help indices
## ** building package indices
## ** testing if installed package can be loaded from temporary location
## ** testing if installed package can be loaded from final location
## ** testing if installed package keeps a record of temporary installation path
## * DONE (SCSimulate)
```

## Use!

```{r}
library(SCSimulate)
```

```{r, eval = FALSE}
?simulate
?size_factors
```

```{r}
example(size_factors)
```

# Advanced challenges

## Data sets

## Communicate: vignettes

## Robust: tests & examples

## Mature: version control

## Dissemination

# `sessionInfo()`

```{r}
sessionInfo()
```