---
title: "A.3 -- Statistics and Graphics"
author: Martin Morgan
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  % \VignetteIndexEntry{A.3 -- Statistics and Graphics}
  % \VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
knitr::opts_chunk$set(
    eval = as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache = as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))
)
suppressPackageStartupMessages({
    library(tidyverse)
})
```

# Exploration, univariate and bivariate statistics and visualization

Input clean data, with `Sex` and `Year` as factors.

```{r ALL-choose, eval = FALSE}
path <- file.choose()    # look for BRFSS-subset.csv
```

```{r ALL-input}
stopifnot(file.exists(path))
library(tidyverse)
# Declare column types up front so Sex and Year are read as factors
col_types <- cols(
    Age = col_integer(),
    Weight = col_double(),
    Sex = col_factor(c("Female", "Male")),
    Height = col_double(),
    Year = col_factor(c("1990", "2010"))
)
brfss <- read_csv(path, col_types = col_types)
```

## Univariate: Weight of females in 1990 versus 2010

Filter the data to include females only, and use the base `plot()` function and the formula interface to visualize the relationship between `Weight` and `Year`.

```{r brfss-female-plot}
brfss %>% filter(Sex %in% "Female") %>% plot(Weight ~ Year, data = .)
```

Use `t.test()` to test whether the mean weight of females differs between the two years.

```{r brfss-female-t-test}
brfss %>% filter(Sex %in% "Female") %>% t.test(Weight ~ Year, data = .)
```

## Bivariate: Weight and height of males in 2010

Filter the data to contain 2010 males. Use `plot()` to visualize the relationship, and `lm()` to model it.

```{r brfss-male}
male2010 <- brfss %>% filter(Year %in% "2010", Sex %in% "Male")
male2010 %>% plot(Weight ~ Height, data = .)
fit <- male2010 %>% lm(Weight ~ Height, data = .)
fit
summary(fit)
```

Fit a multiple regression of `Weight` on both `Year` and `Height`, using males from both years.

```{r brfss-male-year-and-height}
male <- brfss %>% filter(Sex %in% "Male")
male %>% lm(Weight ~ Year + Height, data = .) %>% summary()
```

Is there an interaction between `Year` and `Height`?

```{r brfss-male-interaction}
male %>% lm(Weight ~ Year * Height, data = .) %>% summary()
```

Check out other things to do with a fitted model:

- `broom::tidy()`: P-value, etc., as data.frame
- `broom::augment()`: fitted values, residuals, etc.

```{r brfss-male-augment, warning=FALSE}
library(broom)
male %>% lm(Weight ~ Year + Height, data = .) %>% augment() %>% as_tibble()
```

## Visualization: [ggplot2][]

*gg*plot: "Grammar of Graphics"

- data: `ggplot()`
- *aes*thetics: `aes()`, 'x' and 'y' values, point colors, etc.
- *geom*etric summaries, layered
    - `geom_point()`: points
    - `geom_smooth()`: fitted line
    - `geom_*`: ...
- *facet* plots (e.g., `facet_grid()`) to create 'panels' based on factor levels, with shared axes.

Create a plot with data points

```{r male-geom_point, warning = FALSE}
ggplot(male, aes(x = Height, y = Weight)) + geom_point()
```

Capture the base plot and points, and explore different smoothed relationships, e.g., linear model, non-parametric smoother

```{r male-ggplot, warning = FALSE}
plt <- ggplot(male, aes(x = Height, y = Weight)) + geom_point()
plt + geom_smooth(method = "lm")
# default method: loess for < 1000 observations, otherwise a generalized additive model
plt + geom_smooth()
```

Use an `aes()`thetic to color smoothed lines based on `Year`, or `facet_grid()` to separate years.

```{r male-facet, warning = FALSE}
ggplot(male, aes(x = Weight)) + geom_density(aes(fill = Year), alpha = .2)
plt + geom_smooth(method = "lm", aes(color = Year))
plt + facet_grid( ~ Year ) + geom_smooth(method = "lm")
```

[ggplot2]: https://cran.r-project.org/package=ggplot2

# Multivariate analysis

This is a classic microarray experiment. Microarrays consist of 'probesets' that interrogate genes for their level of expression. In the experiment we're looking at, there are 12625 probesets measured on each of the 128 samples. The raw expression levels estimated by microarray assays require considerable pre-processing; the data we'll work with has already been pre-processed.

## Input and setup

Start by finding the expression data file on disk.
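
```{r ALL-choose-again, eval=FALSE}
path <- file.choose()          # look for ALL-expression.csv
stopifnot(file.exists(path))
```

The data is stored in comma-separated value format, with each probeset occupying a line, and the expression values for each sample in that probeset separated by commas. Input the data using `read_csv()`. The probeset identifiers are in the `Gene` column, and the sample identifiers are the remaining column names.

```{r ALL-input-exprs}
exprs <- read_csv(path)
```

We'll also input the data that describes each column

```{r ALL-phenoData.csv-clustering-student, eval=FALSE}
path <- file.choose()          # look for ALL-phenoData.csv
stopifnot(file.exists(path))
pdata <- read_csv(path)        # sample descriptions, used as `pdata` below
```

The expression data is presented in what is sometimes called 'wide' format; a different format is 'tall', where `Sample` and `Gene` together identify a single observed `Expression` value. Use `tidyr::gather()` to gather the columns of the wide format into two columns representing the tall format, excluding the `Gene` column from the gather operation.

```{r ALL-gather}
exprs <- exprs %>% gather("Sample", "Expression", -Gene)
```

Explore the data a little, e.g., a summary and histogram of the expression values, and a histogram of the average expression value of each gene.

```{r}
library(magrittr)              # provides the %$% exposition pipe
exprs %>% select(Expression) %>% summary()
exprs $ Expression %>% hist()
exprs %>% group_by(Gene) %>%
    summarize(AveExprs = mean(Expression)) %$% AveExprs %>%
    hist(breaks = 50)
```

For later use, derive a `B_or_T` factor from the first character of the `BT` column of `pdata`.

```{r B_or_T}
pdata <- pdata %>% mutate(B_or_T = factor(substr(BT, 1, 1)))
```

## Unsupervised machine learning -- multi-dimensional scaling

We'd like to reduce high-dimensional data to lower dimension for visualization. To do so, we need the `dist()`ance between samples. From `?dist`, the input can be a data.frame where rows represent `Sample` and columns represent `Expression` values. Use `spread()` to create appropriate data from `exprs`, ready to pass to `dist()`.

```{r spread}
input <- exprs %>% spread(Gene, Expression)
samples <- input $ Sample
input <- input %>% select(-Sample) %>% as.matrix
rownames(input) <- samples
```

Calculate distance between samples, and use that for MDS scaling

```{r cmdscale}
mds <- dist(input) %>% cmdscale()
```

The result is a matrix; make it 'tidy' by coercing to a tibble; add the Sample identifiers as a distinct column.

```{r mds-to-tibble}
colnames(mds) <- c("V1", "V2")   # name the two coordinates used in the plots below
mds <- mds %>% as_tibble() %>% mutate(Sample = rownames(mds))
```

Visualize the result

```{r}
ggplot(mds, aes(x = V1, y = V2)) + geom_point()
```

With the 'eye of faith', it seems like there are two groups of points.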
To explore this, join the MDS scaling with the phenotypic data

```{r join}
joined <- inner_join(mds, pdata, by = "Sample")
```

and use the `B_or_T` column as an aesthetic to color points

```{r mds-color}
ggplot(joined, aes(x = V1, y = V2)) + geom_point(aes(color = B_or_T))
```
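
The MDS workflow above depends on the course CSV files. To try the same steps without them, the following is a minimal sketch on simulated data, using only base R. Everything in it is an assumption made for illustration, not part of the ALL data set: the 20 samples, the 50 genes, the shift of 3 units applied to the second group, and the names `pdata_sim` and `joined_sim`.

```r
set.seed(123)

# Simulated stand-in for the expression matrix: 20 samples (rows) x 50 genes
# (columns); samples in the hypothetical "T" group are shifted up in every gene.
n_per_group <- 10
n_genes <- 50
group <- rep(c("B", "T"), each = n_per_group)
input <- matrix(rnorm(2 * n_per_group * n_genes), nrow = 2 * n_per_group)
input[group == "T", ] <- input[group == "T", ] + 3
rownames(input) <- paste0("S", seq_len(nrow(input)))

# Classical MDS on the Euclidean distances between samples
mds <- cmdscale(dist(input))    # 20 x 2 matrix; row names are the sample ids

# Tidy the coordinates and join them to a simulated phenotype table by Sample
mds_df <- data.frame(Sample = rownames(mds), V1 = mds[, 1], V2 = mds[, 2])
pdata_sim <- data.frame(Sample = rownames(input), B_or_T = factor(group))
joined_sim <- merge(mds_df, pdata_sim, by = "Sample")

# Mean position of each group along the first MDS coordinate
tapply(joined_sim$V1, joined_sim$B_or_T, mean)

# Base-graphics version of the colored scatter plot
plot(V2 ~ V1, data = joined_sim, col = as.integer(B_or_T), pch = 19)
```

Because the simulated groups differ in every gene, they should separate along the first coordinate. With real data the separation is usually less clean, which is why coloring by a known phenotype is a useful check.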