- 标题:“W1 - 介绍:介绍:高吞吐基因组数据的分析和理解”输出:Biocstyle :: HTML_Document:TOC:True Vignette:> VignetteIndexentry {Workshop:W1 - 简介:高吞吐量的分析和理解基因组数据}%\ vignetteengine {knitr :: Rarmardown} ---``` {r stywe,echo = false,结果='asis'}生物ocstyle :: markdown()选项(width = 100,max.print = 1000)KNITR :: OPTS_CHUNK $ SET(eval = AS.LOGICAL(SYS.GETENV(“knitr_eval”,“true”)),缓存= as.logical(sys.getenv(“knitr_cache”,“true”)))``````{r setup,echo = false,消息= false,警告= false} ## suppresspackageStartUpMessages({})````mtmorgan@fredhutch.org.
日期:2015年9月7日
返回[Workshop大纲](开发人员会议 - 工作坊.html)
本文档中的材料需要_r_版本3.2和_biocidodder_版本3.1``` {r configure-test} stopifnot(getRversion()> ='3.2'&& getRversion()<'3.3',Biocinstaller :: Biocversion()> =“3.1“)```#整体工作流程1.实验设计 - 保持简单!- 复制!- 避免或跟踪批量效果2.湿式实验室准备3.高通量排序 - 输出:读取的FASTQ文件及其质量分数4.对齐 - 许多不同的对齐器,一些专门用于不同的目的 - 输出:对齐读取的BAM文件5. Summary - e.g., _count_ of reads overlapping regions of interest (e.g., genes) 6. Statistical analysis 7. Comprehension ![Alt Sequencing Ecosystem](our_figures/SequencingEcosystem.png) Notes - de novo _Assembly_ outside the scope of _Bioconductor_ # Two simple _shiny_ apps - BAMSpector -- display gene models and underlying support across BAM (aligned read) files ```{r shiny-BAMSpector, eval=FALSE} app <- system.file(package="BiocAsiaPacific2015", "BAMSpector") shiny::runApp(app) ``` - MAPlotExplorer -- summarize two-group differential expression, including drill-down of individual genes. Based on [CSAMA 2015](http://github.com/CSAMA2015/materials/labs/4_Thursday/Interactive_data_visualization_with_Shiny/) lab by Andrzej Oles. ```{r shiny-MAPlotExplorer, eval=FALSE} app <- system.file(package="BiocAsiaPacific2015", "MAPlotExplorer") shiny::runApp(app) ``` # How _Bioconductor_ helps Annotation - Gene (e.g., exons-within-transcipts) models - Identifier mapping (e.g., 'HNRPC' to Entrez identifier used in the gene model) Standard (large) file input & manipulation, e.g., BAM files of aligned reads Statistical analysis - Differential expression - ChIPSeq - Variants - Flow cytometery - Proteomics - ... Key resources - web site: [http://biocondcutor.org](http://biocondcutor.org) - [biocViews](http://biocondcutor.org/pacakges) - Package 'landing pages', e.g., [edgeR](http://biocondcutor.org/packages/edgeR) - [Help](http://biocondcutor.org/help) - [Developer resources](http://biocondcutor.org/developers): new package submission, developer [mailing list](https://stat.ethz.ch/mailman/listinfo/bioc-devel), etc. - support: [https://support.bioconductor.org](https://support.bioconductor.org) # _R_ refresher ## Language and environment for statistical computing and graphics - Full-featured programming language - Interactive and *interpretted* -- convenient and forgiving - Coherent, extensive documentation - Statistical, e.g. `factor()`, `NA` - Extensible -- CRAN, Bioconductor, github, ... ## Vector, class, object - Efficient _vectorized_ calculations on 'atomic' vectors `logical`, `integer`, `numeric`, `complex`, `character`, `byte` - Atomic vectors are building blocks for more complicated _objects_ - `matrix` -- atomic vector with 'dim' attribute - `data.frame` -- list of equal length atomic vectors - Formal _classes_ represent complicated combinations of vectors, e.g., the return value of `lm()`, below ## Function, generic, method - Functions transform inputs to outputs, perhaps with side effects, e.g., `rnorm(1000)` - Argument matching first by name, then by position - Functions may define (some) arguments to have default values - _Generic_ functions dispatch to specific _methods_ based on class of argument(s), e.g., `print()`. - Methods are functions that implement specific generics, e.g., `print.factor`; methods are invoked _indirectly_, via the generic. - Many but not all functions able to manipulate a particular class are methods, e.g., `abline()` used below is a plain-old-funciton. ## Programming Iteration: - `lapply()` ```{r lapply-args} args(lapply) ``` - Meaning: for a vector `X` (typically a `list()`), apply a function `FUN` to each vector element, returning the result as a **l**ist. `...` are additional arguments to `FUN`. - `FUN` can be built-in, or a user-defined function ```{r lapply-eg} lst <- list(a=1:2, b=2:4) lapply(lst, log) # 'base' argument default; natural log lapply(lst, log, 10) # '10' is second argument to 'log()', i.e., log base 10 ``` - `sapply()` -- like `lapply()`, but simplify the result to a vector, matrix, or array, if possible. - `vapply()` -- like `sapply()`, but requires that the return type of `FUN` is specified; this can be safer -- an error when the result is of an unexpected type. - `mapply()` (also `Map()`) ```{r} args(mapply) ``` - `...` are one or more vectors, recycled to be of the same length. `FUN` is a function that takes as many arguments as there are components of `...`. `mapply` returns the result of applying `FUN` to the elements of the vectors in `...`. ```{r mapply-eg} mapply(seq, 1:3, 4:6, SIMPLIFY=FALSE) # seq(1, 4); seq(2, 5); seq(3, 6) ``` - `apply()` ```{r apply} args(apply) ``` - For a matrix or array `X`, apply `FUN` to each `MARGIN` (dimension, e.g., `MARGIN=1` means apply `FUN` to each row, `MARGIN=2` means apply `FUN` to each column) - Traditional iteration programming constructs `repeat {}`, `for () {}` - Almost always more error-prone, less efficient, and harder to understand than `lapply()` ! Conditional ```{r, eval=FALSE} if (test) { ## code if TEST == TRUE } else { ## code if TEST == FALSE } ``` Functions (see table below for a few favorites) - Easy to define your own functions ```{r myfun} fun <- function(x) { length(unique(x)) } ## list of length 5, each containsing a sample (with replacement) of letters lets <- replicate(5, sample(letters, 50, TRUE), simplify=FALSE) sapply(lets, fun) ``` ## Introspection & Help Introspection - General properties, e.g., `class()`, `str()` - Class-specific properties, e.g., `dim()` Help - `?"print"`: help on the generic print - `?"print.data.frame"`: help on print method for objects of class data.frame. - `help(package="GenomeInfoDb")` - `browseVignettes("GenomicRanges")` - `methods("plot")` - `methods(class="lm")` ## Examples _R_ vectors, vectorized operations, `data.frame()`, formulas, functions, objects, class and method discovery (introspection). ```{r} x <- rnorm(1000) # atomic vectors y <- x + rnorm(1000, sd=.5) df <- data.frame(x=x, y=y) # object of class 'data.frame' plot(y ~ x, df) # generic plot, method plot.formula fit <- lm(y ~x, df) # object of class 'lm' methods(class=class(fit)) # introspection anova(fit) plot(y ~ x, df) # methods(plot); ?plot.formula abline(fit, col="red", lwd=3, lty=2) # a function, not generic.method ``` Programming example -- group 1000 SYMBOLs into GO identifiers ```{r lapply-setup, echo=FALSE} fl <- system.file(package="BiocAsiaPacific2015", "extdata", "symgo.csv") ``` ```{r lapply-user-setup, eval=FALSE} ## example data fl <- file.choose() ## symgo.csv ``` ```{r lapply} symgo <- read.csv(fl, row.names=1, stringsAsFactors=FALSE) head(symgo) dim(symgo) length(unique(symgo$SYMBOL)) ## split-sapply go2sym <- split(symgo$SYMBOL, symgo$GO) len1 <- sapply(go2sym, length) # compare with lapply, vapply ## built-in functions for common actions len2 <- lengths(go2sym) identical(len1, len2) ## smarter built-in functions, e.g., omiting NAs len3 <- aggregate(SYMBOL ~ GO, symgo, length) head(len3) ## more fun with aggregate() head(aggregate(GO ~ SYMBOL, symgo, length)) head(aggregate(SYMBOL ~ GO, symgo, c)) ## your own function -- unique, lower-case identifiers uidfun <- function(x) { unique(tolower(x)) } head(aggregate(SYMBOL ~ GO , symgo, uidfun)) ## as an 'anonymous' function head(aggregate(SYMBOL ~ GO, symgo, function(x) { unique(tolower(x)) })) ``` ## Case study: ALL phenotypic data These case studies serve as refreshers on _R_ input and manipulation of data. Input a file that contains ALL (acute lymphoblastic leukemia) patient information ```{r echo=FALSE} fname <- system.file(package="BiocAsiaPacific2015", "extdata", "ALLphenoData.tsv") stopifnot(file.exists(fname)) pdata <- read.delim(fname) ``` ```{r echo=TRUE, eval=FALSE} fname <- file.choose() ## "ALLphenoData.tsv" stopifnot(file.exists(fname)) pdata <- read.delim(fname) ``` Check out the help page `?read.delim` for input options, and explore basic properties of the object you've created, for instance... ```{r ALL-properties} class(pdata) colnames(pdata) dim(pdata) head(pdata) summary(pdata$sex) summary(pdata$cyto.normal) ``` Remind yourselves about various ways to subset and access columns of a data.frame ```{r ALL-subset} pdata[1:5, 3:4] pdata[1:5, ] head(pdata[, 3:5]) tail(pdata[, 3:5], 3) head(pdata$age) head(pdata$sex) head(pdata[pdata$age > 21,]) ``` It seems from below that there are 17 females over 40 in the data set, but when sub-setting `pdata` to contain just those individuals 19 rows are selected. Why? What can we do to correct this? ```{r ALL-subset-NA} idx <- pdata$sex == "F" & pdata$age > 40 table(idx) dim(pdata[idx,]) ``` Use the `mol.biol` column to subset the data to contain just individuals with 'BCR/ABL' or 'NEG', e.g., ```{r ALL-BCR/ABL-subset} bcrabl <- pdata[pdata$mol.biol %in% c("BCR/ABL", "NEG"),] ``` The `mol.biol` column is a factor, and retains all levels even after subsetting. How might you drop the unused factor levels? ```{r ALL-BCR/ABL-drop-unused} bcrabl$mol.biol <- factor(bcrabl$mol.biol) ``` The `BT` column is a factor describing B- and T-cell subtypes ```{r ALL-BT} levels(bcrabl$BT) ``` How might one collapse B1, B2, ... to a single type B, and likewise for T1, T2, ..., so there are only two subtypes, B and T ```{r ALL-BT-recode} table(bcrabl$BT) levels(bcrabl$BT) <- substring(levels(bcrabl$BT), 1, 1) table(bcrabl$BT) ``` Use `xtabs()` (cross-tabulation) to count the number of samples with B- and T-cell types in each of the BCR/ABL and NEG groups ```{r ALL-BCR/ABL-BT} xtabs(~ BT + mol.biol, bcrabl) ``` Use `aggregate()` to calculate the average age of males and females in the BCR/ABL and NEG treatment groups. ```{r ALL-aggregate} aggregate(age ~ mol.biol + sex, bcrabl, mean) ``` Use `t.test()` to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using `boxplot()`. In both cases, use the `formula` interface. Consult the help page `?t.test` and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change? ```{r ALL-age} t.test(age ~ mol.biol, bcrabl) boxplot(age ~ mol.biol, bcrabl) ``` [biocViews]: //www.andersvercelli.com/packages/BiocViews.html#___Software [AnnotationData]: //www.andersvercelli.com/packages/BiocViews.html#___AnnotationData [aprof]: http://cran.r-project.org/web/packages/aprof/index.html [hexbin]: http://cran.r-project.org/web/packages/hexbin/index.html [lineprof]: https://github.com/hadley/lineprof [microbenchmark]: http://cran.r-project.org/web/packages/microbenchmark/index.html [AnnotationDbi]: //www.andersvercelli.com/packages/AnnotationDbi [BSgenome]: //www.andersvercelli.com/packages/BSgenome [Biostrings]: //www.andersvercelli.com/packages/Biostrings [CNTools]: //www.andersvercelli.com/packages/CNTools [ChIPQC]: //www.andersvercelli.com/packages/ChIPQC [ChIPpeakAnno]: //www.andersvercelli.com/packages/ChIPpeakAnno [DESeq2]: //www.andersvercelli.com/packages/DESeq2 [DiffBind]: //www.andersvercelli.com/packages/DiffBind [GenomicAlignments]: //www.andersvercelli.com/packages/GenomicAlignments [GenomicRanges]: //www.andersvercelli.com/packages/GenomicRanges [IRanges]: //www.andersvercelli.com/packages/IRanges [KEGGREST]: //www.andersvercelli.com/packages/KEGGREST [PSICQUIC]: //www.andersvercelli.com/packages/PSICQUIC [Rsamtools]: //www.andersvercelli.com/packages/Rsamtools [ShortRead]: //www.andersvercelli.com/packages/ShortRead [VariantAnnotation]: //www.andersvercelli.com/packages/VariantAnnotation [VariantFiltering]: //www.andersvercelli.com/packages/VariantFiltering [VariantTools]: //www.andersvercelli.com/packages/VariantTools [biomaRt]: //www.andersvercelli.com/packages/biomaRt [cn.mops]: //www.andersvercelli.com/packages/cn.mops [h5vc]: //www.andersvercelli.com/packages/h5vc [edgeR]: //www.andersvercelli.com/packages/edgeR [ensemblVEP]: //www.andersvercelli.com/packages/ensemblVEP [limma]: //www.andersvercelli.com/packages/limma [metagenomeSeq]: //www.andersvercelli.com/packages/metagenomeSeq [phyloseq]: //www.andersvercelli.com/packages/phyloseq [snpStats]: //www.andersvercelli.com/packages/snpStats [org.Hs.eg.db]: //www.andersvercelli.com/packages/org.Hs.eg.db [TxDb.Hsapiens.UCSC.hg19.knownGene]: //www.andersvercelli.com/packages/TxDb.Hsapiens.UCSC.hg19.knownGene [BSgenome.Hsapiens.UCSC.hg19]: //www.andersvercelli.com/packages/BSgenome.Hsapiens.UCSC.hg19