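Contents

1 Exploration, univariate and bivariate statistics and visualization
2 Multivariate analysis

1 Exploration, univariate and bivariate statistics and visualization

Input the clean data, with Sex and Year as factors.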

path <- file.choose()    # look for BRFSS-subset.csv

stopifnot(file.exists(path))
library(tidyverse)
# declare column types up front so Sex and Year are read as factors
col_types <- cols(
    Age = col_integer(),
    Weight = col_double(),
    Sex = col_factor(c("Female", "Male")),
    Height = col_double(),
    Year = col_factor(c("1990", "2010"))
)
brfss <- read_csv(path, col_types = col_types)
brfss

## # A tibble: 20,000 x 5
##      Age   Weight    Sex Height   Year
##    <int>    <dbl> <fctr>  <dbl> <fctr>
##  1    31 48.98798 Female 157.48   1990
##  2    57 81.64663 Female 157.48   1990
##  3    43 80.28585   Male 177.80   1990
##  4    72 70.30682   Male 170.18   1990
##  5    31 49.89516 Female 154.94   1990
##  6    58 54.43108 Female 154.94   1990
##  7    45 69.85323   Male 172.72   1990
##  8    37 68.03886   Male 180.34   1990
##  9    33 65.77089 Female 170.18   1990
## 10    75 70.76041 Female 152.40   1990
## # ... with 19,990 more rows

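1.1 Univariate: `t.test()` of the weight of women in 1990 and 2010

Filter the data to include only females, and use the base plot() function with the formula interface to visualize the relationship between Weight and Year. Because Year is a factor, this draws one box plot of Weight per year.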

brfss %>% filter(Sex %in% "Female") %>% plot(Weight ~ Year, data = .)

Use t.test() to test the hypothesis that women's weight was the same in 2010 as in 1990.

brfss %>% filter(Sex %in% "Female") %>% t.test(Weight ~ Year, data = .)

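## 
##  Welch Two Sample t-test
## 
## data:  Weight by Year
## t = -27.133, df = 11079, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.723607 -7.548102
## sample estimates:
## mean in group 1990 mean in group 2010 
##           64.81838           72.95424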
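The object returned by t.test() can also be inspected piece by piece rather than only printed. The following is a minimal sketch in base R; because the BRFSS file is not bundled here, it uses simulated weights (the means, standard deviations and the name `female` are illustrative assumptions, not the course data).

```r
# Simulated stand-in for the female BRFSS rows; values are illustrative only.
set.seed(123)
female <- data.frame(
    Year = factor(rep(c("1990", "2010"), each = 500)),
    Weight = c(rnorm(500, mean = 65, sd = 13), rnorm(500, mean = 73, sd = 18))
)

# Formula interface: response on the left, two-level grouping factor on the right.
result <- t.test(Weight ~ Year, data = female)

# The return value is a list of class "htest"; pull out components by name.
result$statistic   # t statistic
result$parameter   # Welch-adjusted degrees of freedom
result$p.value     # two-sided p-value
result$conf.int    # 95% CI for mean(1990) - mean(2010)
result$estimate    # the two group means
```

The confidence interval is for the first factor level minus the second, which is why the interval in the BRFSS output is negative: mean weight was lower in 1990 than in 2010.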

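1.2 Bivariate: weight and height in 2010

Filter the data to include only males in 2010. Use plot() to visualize the relationship, then lm() to model it.

male2010 <- brfss %>% filter(Year %in% "2010", Sex %in% "Male")
male2010 %>% plot(Weight ~ Height, data = .)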

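fit <- male2010 %>% lm(Weight ~ Height, data = .)
fit

## 
## Call:
## lm(formula = Weight ~ Height, data = .)
## 
## Coefficients:
## (Intercept)       Height  
##    -86.8747       0.9873

summary(fit)

## 
## Call:
## lm(formula = Weight ~ Height, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.867 -11.349  -2.677   8.263 180.227 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -86.87470    6.67835  -13.01   <2e-16 ***
## Height        0.98727    0.03748   26.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.88 on 3617 degrees of freedom
##   (60 observations deleted due to missingness)
## Multiple R-squared:  0.1609, Adjusted R-squared:  0.1607 
## F-statistic: 693.8 on 1 and 3617 DF,  p-value: < 2.2e-16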
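To read the fitted line concretely: the Height coefficient says that predicted weight rises by about 0.99 kg for each additional cm of height. The sketch below only redoes that arithmetic with the two coefficients printed by summary(fit); the heights chosen are arbitrary illustrations.

```r
# Coefficients copied from the summary(fit) output.
intercept <- -86.87470   # kg
slope <- 0.98727         # kg per cm of height

# Predicted weight (kg) at a few arbitrary heights (cm).
height <- c(160, 170, 180, 190)
predicted <- intercept + slope * height
data.frame(Height = height, Predicted_Weight = round(predicted, 2))

# With the fitted model object itself, the same numbers would come from
#   predict(fit, newdata = data.frame(Height = height))
```

For example, the line predicts roughly 90.8 kg at 180 cm. The low R-squared (0.16) is a reminder that individual weights scatter widely around these predictions.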

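Multiple regression: weight and height, accounting for the difference between years

male <- brfss %>% filter(Sex %in% "Male")
male %>% lm(Weight ~ Year + Height, data = .) %>% summary()

## 
## Call:
## lm(formula = Weight ~ Year + Height, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.037  -9.032  -1.546   6.846 181.022 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -80.56568    3.88047  -20.76   <2e-16 ***
## Height        0.90764    0.02174   41.75   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.44 on 7854 degrees of freedom
##   (104 observations deleted due to missingness)
## Multiple R-squared:  0.2262, Adjusted R-squared:  0.2261 
## F-statistic:  1148 on 2 and 7854 DF,  p-value: < 2.2e-16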

Is there an interaction between Year and Height?

male %>% lm(Weight ~ Year * Height, data = .) %>% summary()

## 
## Call:
## lm(formula = Weight ~ Year * Height, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.867  -9.080  -1.731   6.796 180.227 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -68.49361    5.27146 -12.993  < 2e-16 ***
## Height            0.83990    0.02955  28.421  < 2e-16 ***
## Year2010:Height   0.14737    0.04359   3.381 0.000726 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.43 on 7853 degrees of freedom
##   (104 observations deleted due to missingness)
## Multiple R-squared:  0.2274, Adjusted R-squared:  0.2271 
## F-statistic: 770.3 on 3 and 7853 DF,  p-value: < 2.2e-16
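In the interaction model the Height coefficient (0.83990) is the slope for 1990, and Year2010:Height (0.14737) is the additional slope in 2010; their sum, 0.98727, matches the slope of the 2010-only fit in section 1.2. One common way to ask whether the interaction term is worth keeping is an F test of the two nested models with anova(). The sketch below illustrates this on simulated data (the slopes, noise level and object names are assumptions chosen for illustration, not the BRFSS values).

```r
set.seed(42)
n <- 400
sim <- data.frame(
    Year = factor(rep(c("1990", "2010"), each = n / 2)),
    Height = rnorm(n, mean = 178, sd = 7)
)
# Build in a steeper height slope for 2010 (1.00 vs 0.85 kg/cm); illustrative values.
slope <- ifelse(sim$Year == "2010", 1.00, 0.85)
sim$Weight <- -70 + slope * sim$Height + rnorm(n, sd = 14)

additive_fit <- lm(Weight ~ Year + Height, data = sim)   # parallel lines
interact_fit <- lm(Weight ~ Year * Height, data = sim)   # separate slopes

# Year * Height expands to Year + Height + Year:Height.
names(coef(interact_fit))

# F test of the nested models: does adding Year:Height improve the fit?
comparison <- anova(additive_fit, interact_fit)
comparison
```

The second row of the anova() table reports one extra degree of freedom (the Year:Height term) together with its F statistic and p-value.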

Explore other operations on fitted models:

broom::tidy(): p-values, etc., as a data.frame

broom::augment(): fitted values, residuals, etc.

library(broom)
male %>% lm(Weight ~ Year + Height, data = .) %>% augment() %>% as_tibble()

## # A tibble: 7,857 x 11
##    .rownames   Weight   Year Height  .fitted   .se.fit     .resid
##        <chr>    <dbl> <fctr>  <dbl>    <dbl>     <dbl>      <dbl>
##  1         1 80.28585   1990 177.80 80.81238 0.2219842  -0.526530
##  2         2 70.30682   1990 170.18 73.89618 0.2823537  -3.589360
##  3         3 69.85323   1990 172.72 76.20158 0.2519471  -6.348353
##  4         4 68.03886   1990 180.34 83.11778 0.2265455 -15.078925
##  5         5 88.45051   1990 180.34 83.11778 0.2265455   5.332732
## # ... with 7,847 more rows, and 4 more variables: .hat <dbl>,
## #   .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
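To see where those columns come from, the sketch below assembles rough base-R counterparts of the tidy() and augment() tables by hand. It uses the built-in mtcars data as a stand-in for the BRFSS model, and it is only an approximation of what broom returns, not broom's implementation.

```r
fit <- lm(mpg ~ wt, data = mtcars)   # built-in data, stand-in for the BRFSS model

# Roughly what broom::tidy() reports: one row per coefficient.
coef_table <- as.data.frame(coef(summary(fit)))
coef_table

# Roughly what broom::augment() adds: one row per observation.
per_obs <- data.frame(
    mtcars[, c("mpg", "wt")],
    .fitted = fitted(fit),              # predicted response
    .resid = residuals(fit),            # observed minus fitted
    .hat = hatvalues(fit),              # leverage
    .cooksd = cooks.distance(fit),      # influence
    .std.resid = rstandard(fit)         # standardized residuals
)
head(per_obs)
```

Having these quantities as ordinary columns is what makes the fitted model easy to pass on to filter(), ggplot() and other data-frame tools.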

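1.3 Visualization: ggplot2

ggplot2: the "Grammar of Graphics"

Data: ggplot()
Aesthetics: aes(), 'x' and 'y' values, point colors, etc.
Geometric summaries, layered
- geom_point(): points
- geom_smooth(): fitted line
- geom_*: ...
Facet plots (e.g., facet_grid()) to create 'panels' based on factor levels, with shared axes.

Create a plot with data points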

ggplot(male, aes(x=Height, y = Weight)) + geom_point()

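Capture the base plot with points, and explore different smoothed relationships, e.g., a linear model or a non-parametric smoother

plt <- ggplot(male, aes(x = Height, y = Weight)) + geom_point()
plt + geom_smooth(method = "lm")

plt + geom_smooth()    # default: generalized additive model

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Use an aes()thetic to color the smoothed lines by Year, or facet_grid() to separate the years.

ggplot(male, aes(x = Weight)) + geom_density(aes(fill = Year), alpha = .2)

plt + geom_smooth(method = "lm", aes(color = Year))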

plt + facet_grid(~ Year) + geom_smooth(method = "lm")

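2 Multivariate analysis

This is a classic microarray experiment. Microarrays consist of 'probes' that interrogate genes for their level of expression. In the experiment we're looking at, 12625 probes were measured on each of 128 samples. The raw expression levels estimated by microarray assays require considerable pre-processing; the data we'll work with has already been pre-processed.

2.1 Input and setup

Start by finding the expression data file on disk.

path <- file.choose()    # look for ALL-expression.csv
stopifnot(file.exists(path))

The data is stored in 'comma-separated value' format, with each line corresponding to a probe, and the expression values of each sample for that probe separated by commas. Input the data using read_csv(). The probe identifiers are in the first column (Gene); each remaining column holds one sample.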

exprs <- read_csv(path)

## cols(
##   .default = col_double(),
##   Gene = col_character()
## )

## See spec(...) for full column specifications.

We'll also input the data that describes each column

path <- file.choose()    # look for ALL-phenoData.csv
stopifnot(file.exists(path))

pdata <- read_csv(path)

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   age = col_integer(),
##   `t(4;11)` = col_logical(),
##   `t(9;22)` = col_logical(),
##   cyto.normal = col_logical(),
##   ccr = col_logical(),
##   relapse = col_logical(),
##   transplant = col_logical()
## )

## See spec(...) for full column specifications.

pdata

## # A tibble: 128 x 22
##    Sample   cod diagnosis   sex   age    BT remission    CR   date.cr
##     <chr> <chr>     <chr> <chr> <int> <chr>     <chr> <chr>     <chr>
##  1  01005  1005 5/21/1997     M    53    B2        CR    CR  8/6/1997
##  2  01010  1010 3/29/2000     M    19    B2        CR    CR 6/27/2000
##  3  03002  3002 6/24/1998     F    52    B4        CR    CR 8/17/1998
##  4  04006  4006 7/17/1997     M    38    B1        CR    CR  9/8/1997
##  5  04007  4007 7/22/1997     M    57    B2        CR    CR 9/17/1997
##  6  04008  4008 7/30/1997     M    17    B1        CR    CR 9/27/1997
##  9  06002  6002 3/19/1997     M    15    B2        CR    CR  6/9/1997
## 10  08001  8001 1/15/1997     M    40    B2        CR    CR 3/26/1997
## # ... with 118 more rows, and 13 more variables: `t(4;11)` <lgl>,
## #   `t(9;22)` <lgl>, cyto.normal <lgl>, citog <chr>, mol.biol <chr>,
## #   `fusion protein` <chr>, mdr <chr>, kinet <chr>, ccr <lgl>,
## #   relapse <lgl>, transplant <lgl>, f.u <chr>, `date last seen` <chr>

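2.2 Cleaning and exploration

The expression data is sometimes said to be in 'wide' format; a different format is 'tall', where Sample and Gene group the single observation Expression. Use tidyr::gather() to gather the columns of the wide format into two columns representing the tall format, excluding the Gene column from the gather operation.

exprs <- exprs %>% gather("Sample", "Expression", -Gene)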
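The reshaping is easier to picture on a table small enough to print in full. The sketch below applies the same gather() call to a made-up two-probe, three-sample table (the probe names, sample names and values are invented for illustration); it also shows pivot_longer(), the newer tidyr function for the same operation, and assumes a tidyr version that provides it.

```r
library(tidyr)

# Toy 'wide' table: one row per probe, one column per sample (made-up values).
wide <- data.frame(
    Gene = c("probe_a", "probe_b"),
    S1 = c(7.1, 5.2),
    S2 = c(6.9, 5.0),
    S3 = c(7.4, 5.5)
)

# Gather every column except Gene into Sample / Expression pairs.
tall <- gather(wide, "Sample", "Expression", -Gene)
tall    # 2 probes x 3 samples = 6 rows, 3 columns

# pivot_longer() is the newer tidyr spelling of the same reshaping
# (same content, rows ordered by probe rather than by sample).
tall2 <- pivot_longer(wide, -Gene, names_to = "Sample", values_to = "Expression")
```

Applied to the real data, the same call turns the 12625 x 129 wide table into a three-column table with one row per probe-sample combination.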

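Explore the data a little, e.g., a summary and histogram of the expression values, and a histogram of the average expression value of each gene.

exprs %>% select(Expression) %>% summary()

##  Median : 5.469  
##  Mean   : 5.625  
##  3rd Qu.: 6.827  
##  Max.   :14.127

exprs$Expression %>% hist()

library(magrittr)    # provides the %$% exposition pipe used below
exprs %>% group_by(Gene) %>% summarize(AveExprs = mean(Expression)) %$% AveExprs %>% hist(breaks = 50)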

For subsequent analysis, we also want to simplify the 'B or T' cell type classification

pdata <- pdata %>% mutate(B_or_T = factor(substr(BT, 1, 1)))
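The simplification works because the BT codes start with the lineage letter, optionally followed by a stage digit, so keeping only the first character collapses the stages. A minimal sketch on a hand-written vector (the particular values are illustrative, chosen to resemble the BT column):

```r
# Illustrative BT codes: lineage letter plus an optional stage number.
BT <- c("B2", "B4", "B1", "T", "T2", "B")

# substr(x, start, stop) with start = stop = 1 keeps the first character only.
B_or_T <- factor(substr(BT, 1, 1))

levels(B_or_T)
table(B_or_T)
```

The result is a two-level factor, which is convenient later as a color aesthetic.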

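2.3 Unsupervised machine learning - multi-dimensional scaling

We'd like to reduce high-dimensional data to lower dimension for visualization. To do so, we need the dist()ance between samples. From ?dist, the input can be a data.frame where rows represent Sample and columns represent Expression values. Use spread() to create appropriate data from exprs, and pipe the result to dist()ance.

input <- exprs %>% spread(Gene, Expression)
samples <- input$Sample
input <- input %>% select(-Sample) %>% as.matrix
rownames(input) <- samples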

Calculate the distance between samples, and use that for MDS scaling

mds <- dist(input) %>% cmdscale()
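The two steps can be tried without the expression file. The sketch below is an illustration on simulated data, not the ALL experiment: it builds a samples-by-genes matrix in which one group of samples is shifted in a subset of genes (group labels, sizes and the shift are assumptions), then checks that the first MDS coordinate separates the groups.

```r
set.seed(1)
n_genes <- 200

# 20 samples in rows, genes in columns; made-up standard-normal 'expression'.
input <- matrix(rnorm(20 * n_genes), nrow = 20,
                dimnames = list(paste0("S", 1:20), NULL))
group <- rep(c("B", "T"), each = 10)

# Shift group "T" upward in the first 50 genes.
input[group == "T", 1:50] <- input[group == "T", 1:50] + 3

d <- dist(input)            # Euclidean distance between rows (samples)
mds <- cmdscale(d, k = 2)   # classical MDS: 20 x 2 matrix of coordinates

dim(mds)
tapply(mds[, 1], group, mean)   # group means on the first coordinate
```

dist() works on rows, which is why samples must be in rows; cmdscale() keeps the row names, so the coordinates can be matched back to sample identifiers.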

The result is a matrix; make it 'tidy' by coercing to a tibble; add the Sample identifiers as a distinct column.

mds <- mds %>% as_tibble() %>% mutate(Sample = rownames(mds))

Visualize the result

ggplot(mds, aes(x = V1, y = V2)) + geom_point()

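With the 'eye of faith', it seems like there are two groups of points. To explore this, join the MDS scaling with the phenotypic data

joined <- inner_join(mds, pdata)

## Joining, by = "Sample"

and use the B_or_T column as an aesthetic to color points

ggplot(joined, aes(x = V1, y = V2)) + geom_point(aes(color = B_or_T))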

A.3 - Statistics and Graphics

11 - 12 September 2017
