——标题:“6。补充2:RNA-Seq统计问题"作者:"Martin Morgan (martin.morgan@roswellpark.org)
罗斯威尔帕克癌症研究所,布法罗,纽约
5 - 9 October, 2015" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > % \VignetteIndexEntry{6。RNA-Seq Differential Expression} % \VignetteEngine{{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set(eval=as.logical(Sys. print);采用“KNITR_EVAL”,“真正的”)),缓存= as.logical (Sys。采用“KNITR_CACHE”,“真正的”)))``` ```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE} suppressPackageStartupMessages({library(DESeq2) library(limma)})本课程中的材料要求Rversion 3.2和Bioconductor version 3.2 {R configure-test} stopifnot(getRversion() >= '3.2' && getRversion() < '3.3', BiocInstaller::biocVersion() == "3.2")示例:t-test ' t.t test() ' - ' x ':单变量测量向量- ' y ':描述实验设计的因素- ' var. test() '=TRUE ':适用于没有附加信息的相对较小的实验?- ' formula ':替代表示形式,' y ~ x '。“‘{r sleep-t。} head(sleep) plot(extra ~ group, data = sleep) ## Traditional interface with(sleep, t.test(extra[group == 1], extra[group == 2]) ##公式接口t.test(extra ~ group, sleep) ## equal variance between groups . t.test(extra ~ group, sleep, var.equal=TRUE)' ' ' ' lm()”和“方差分析()”——“lm()”:适合_linear model_等。- anova():统计评估。 ```{r sleep-lm} ## linear model; compare to t.test(var.equal=TRUE) fit <- lm(extra ~ group, sleep) anova(fit) ``` - Under the hood: `formula`: translated into _model matrix_, used in `lm.fit()`. - With (implicit) intercept 1, last coefficient of model matrix reflects group effect - With intercept 0, _contrast_ between effects of coefficient 1 and coefficient 2 reflect group effect ```{r sleep-model.matrix} ## underlying model, used in `lm.fit()` model.matrix(extra ~ group, sleep) # last column indicates group effect model.matrix(extra ~ 0 + group, sleep) # contrast between columns ``` - Covariate -- fit base model containing only covariate, test improvement in fit when model includes factor of interest ```{r sleep-diff} fit0 <- lm(extra ~ ID, sleep) fit1 <- lm(extra ~ ID + group, sleep) anova(fit0, fit1) t.test(extra ~ group, sleep, var.equal=TRUE, paired=TRUE) ``` `genefilter::rowttests()` - t-tests for gene expression data - useful for exploratory analysis, but statistically sub-optimal - `x`: matrix of expression values - features x samples (reverse of how a 'statistician' would represent the data -- samples x features) - `fac`: factor of one or two levels describing experimental design Limitations - Assumes features are _independent_ - Ignores common experimental design - Ignores multiple testing Consequences - Poor estimate of between-group variance for each feature - Elevated false discovery rate # Common experimental designs - t-test: `count ~ factor`. Alternative: `count ~ 0 + factor` and contrasts - covariates: `count ~ covariate + factor` - Single factor, multiple levels (one-way ANOVA) -- statistical contrasts: specify model as `count ~ factor` or `count ~ 0 + factor` - Factorial designs -- main effects, `count ~ factor1 + factor2`; main effects and interactions, `count ~ factor1 * factor2`. Contrasts to ask specific questions - Paired designs: include ID as covariate (approximate, since ID is a random effect); `r Biocpkg("limma")` approach: `duplicateCorrelation()`