---
title: "A.2 -- Data input and manipulation"
author: "Martin Morgan"
date: "11 - 12 September 2017"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  % \VignetteIndexEntry{A.2 -- Data input and manipulation}
  % \VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
suppressPackageStartupMessages({
    library(tidyverse)
    library(magrittr)
})
```

We will adopt a particular approach to data input, wrangling, and
basic analysis known as the 'tidy' approach. Start by loading the
[tidyverse][] package. The `%$%` pipe used below is not attached by
`library(tidyverse)`, so also load [magrittr][].

```{r}
library(tidyverse)
library(magrittr)    # provides %$%
```

We will cover the following functions:

- Data input
    - `read_csv()`: input data from a comma-separated value file, as
      a (`data.frame`-like) `tibble`.
- Pipe
    - `%>%`: pipe data from a source to a function.
    - `%$%`: extract a column in a pipe.
    - `.`: refer to the incoming data.
- Data manipulation
    - `group_by()`: define groups of rows based on column values.
    - `summarize()`: apply functions to groups of data to produce a
      summary of the data.
    - `filter()`: filter rows to match criteria in columns.
    - `select()`: select columns for subsequent use.
    - `mutate()`: update or add columns of data.
- Other functions and concepts
    - `%in%`: identify elements of the left-hand vector that are
      elements of the set defined by the right-hand vector.
    - `t.test()`: perform a t-test.
    - `boxplot()`, `hist()`: basic visualization.
    - `~`: specify a formula describing the relationship between a
      dependent (left-hand side) variable and independent (right-hand
      side) variable(s).

[tidyverse]: https://cran.r-project.org/package=tidyverse
[magrittr]: https://cran.r-project.org/package=magrittr

# Exercise 1: BRFSS Survey Data

We will explore a subset of data collected by the CDC through its
extensive Behavioral Risk Factor Surveillance System ([BRFSS][])
telephone survey. Check out the link for more information.

1. Use `file.choose()` to find the path to the file
   'BRFSS-subset.csv'.

    ```{r file.choose, eval=FALSE}
    path <- file.choose()
    ```

2. Use `read_csv()` to input the data, assign it to the variable
   `brfss`, and display the first few rows.

    ```{r read.csv}
    brfss <- read_csv(path)
    brfss
    ```

3. From looking at the data...

    - How many individuals are in the sample?
    - What variables have been measured? Can you guess at the units
      used for, e.g., Weight and Height?

4. The tidyverse uses a 'pipe' `%>%` to send data from one command to
   another. There are a small number of key functions for manipulating
   data. We'll use `group_by()` to group the data by `Sex`, and then
   `summarize(N = n())` to count the number of observations in each
   group.

    ```{r brfss-sex}
    brfss %>% group_by(Sex) %>% summarize(N = n())
    ```

5. Use `group_by(Year, Sex)` and `summarize(N = n())` to summarize the
   number of individuals from each year and sex.

    ```{r brfss-sex-year}
    brfss %>% group_by(Year, Sex) %>% summarize(N = n())
    ```

6. Calculate the average age in each year and sex by adding the
   argument `Age = mean(Age, na.rm=TRUE)` to `summarize()`.

    ```{r brfss-mean-age}
    brfss %>% group_by(Year, Sex) %>%
        summarize(N = n(), Age = mean(Age, na.rm=TRUE))
    ```

7. `Year` is input as an integer vector, and `Sex` as a character
   vector. Actually, though, these are both factors. Use `mutate()`
   and `factor()` to update the type of these columns. Re-assign the
   updated tibble to `brfss`.

    ```{r brfss-mutate}
    brfss %>% mutate(Year = factor(Year), Sex = factor(Sex))
    brfss <- brfss %>% mutate(Year = factor(Year), Sex = factor(Sex))
    ```

8. There are several other pipes available (see also the [magrittr][]
   package). `%$%` extracts a column. Here we look at the `levels()`
   of the factors that we created.

    ```{r brfss-levels}
    brfss %$% Sex %>% levels()
    brfss %$% Year %>% levels()
    ```

9. It's usually better to 'clean' data as soon as possible. Visit the
   help page `?read_csv`, look at the `col_types =` argument, and the
   help pages `?cols` and `?col_factor`. Input the data in its correct
   format, with `Sex` and `Year` as factors.

    ```{r brfss-read-cols}
    col_types <- cols(
        Age = col_integer(),
        Weight = col_double(),
        Sex = col_factor(c("Female", "Male")),
        Height = col_double(),
        Year = col_factor(c("1990", "2010"))
    )
    brfss <- read_csv(path, col_types = col_types)
    brfss
    brfss %>% summary()
    ```

10. Use `filter()` to create a subset of the data consisting of only
    the 1990 observations (`Year` in the set that consists of the
    single element `1990`, `Year %in% 1990`). Optionally, save this to
    a new variable `brfss_1990`.

    ```{r filter}
    brfss %>% filter(Year %in% 1990)
    brfss_1990 <- brfss %>% filter(Year %in% 1990)
    ```

11. Pipe this subset to `t.test()` to ask whether Weight depends on
    Sex. The first argument to `t.test()` is a 'formula' describing
    the relation between dependent and independent variables; we use
    the formula `Weight ~ Sex`. The second argument to `t.test()` is
    the data set to use -- indicate the data from the pipe with
    `data = .`.

    ```{r t-test-1990}
    brfss %>% filter(Year %in% 1990) %>% t.test(Weight ~ Sex, data = .)
    ```

    What about differences between weights of males (or females) in
    1990 versus 2010?

12. Use `boxplot()` to plot the weights of the Male individuals by
    year. Can you transform weight, e.g., taking the square root,
    before plotting? Interpret the results. Do similar boxplots for
    the t-tests of the previous question.

    ```{r brfss-boxplot, fig.width=5, fig.height=5}
    brfss %>% filter(Sex %in% "Male") %>%
        boxplot(Weight ~ Year, data = .)
    brfss %>% filter(Sex %in% "Male") %>%
        mutate(SqrtWeight = sqrt(Weight)) %>%
        boxplot(SqrtWeight ~ Year, data = .)
    ```

13. Use `hist()` to plot a histogram of weights of the 1990 Female
    individuals. From `?hist`, the function is expecting a vector of
    values, so use `%$%` to select the `Weight` column and pipe to
    `hist()`.

    ```{r brfss-hist, fig.width=5, fig.height=5}
    brfss %>% filter(Year %in% "1990", Sex %in% "Female") %$% Weight %>%
        hist(main="1990 Female Weight")
    ```

[BRFSS]: http://www.cdc.gov/brfss/about/index.htm

# Exercise 2: ALL Phenotypic Data

This data comes from an (old) Acute Lymphoid Leukemia microarray data
set.

Choose the file that contains ALL (acute lymphoblastic leukemia)
patient information and input the data using `read_csv()`. Unlike base
R's `read.csv()` with `row.names=1`, `read_csv()` does not create row
names; the first column of the file is input as an ordinary column.

```{r ALL-choose, eval=FALSE}
path <- file.choose()    # look for ALL-phenoData.csv
```

```{r ALL-input}
stopifnot(file.exists(path))
pdata <- read_csv(path)
pdata
```

Use `select()` to select some columns, e.g., `mol.biol` and `BT`. Use
`filter()` to include only females over 40 years of age.

```{r ALL-select}
pdata %>% select(mol.biol, BT)
pdata %>% filter(sex %in% "F", age > 40)
```

Use the `mol.biol` column to filter the data to contain individuals in
the set `c("BCR/ABL", "NEG")` (i.e., with `mol.biol` equal to
`BCR/ABL` or `NEG`).

```{r ALL-BCR/ABL-subset}
bcrabl <- pdata %>% filter(mol.biol %in% c("BCR/ABL", "NEG"))
```

We'd like to tidy the data by mutating `mol.biol` to be a factor. We'd
also like to mutate the `BT` column (B- or T-cell subtypes) to be just
`B` or `T`, using `substr(BT, 1, 1)` (i.e., for each element of `BT`,
taking the substring that starts at letter 1 and goes to letter 1 --
the first letter).

```{r bcrabl-mutate}
bcrabl <- bcrabl %>% mutate(
    mol.biol = factor(mol.biol),
    B_or_T = factor(substr(BT, 1, 1))
)
```

How many bcrabl samples have B- and T-cell types in each of the
BCR/ABL and NEG groups?

```{r ALL-BCR/ABL-BT}
bcrabl %>% group_by(B_or_T, mol.biol) %>% summarize(N = n())
```

Calculate the average age of males and females in the BCR/ABL and NEG
treatment groups.

```{r ALL-aggregate}
bcrabl %>% group_by(sex, mol.biol) %>%
    summarize(age = mean(age, na.rm=TRUE))
```

Use `t.test()` to compare the age of individuals in the BCR/ABL versus
NEG groups; visualize the results using `boxplot()`. In both cases,
use the `formula` interface and `.` to refer to the incoming data
set. Consult the help page `?t.test` and re-do the test assuming that
variance of ages in the two groups is identical. What parts of the
test output change?

```{r ALL-age}
bcrabl %>% t.test(age ~ mol.biol, data = .)
bcrabl %>% boxplot(age ~ mol.biol, data = .)
```
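
As a hint for the equal-variance question, here is a minimal sketch of one way to compare the two forms of the test. It uses the `var.equal` argument documented on `?t.test`. The data frame `toy` and its values are made up for illustration and merely stand in for `bcrabl`; only base R is assumed.

```r
# Hypothetical stand-in for `bcrabl`; the ages below are invented.
toy <- data.frame(
    age = c(19, 31, 24, 47, 52, 38, 22, 29, 41, 55),
    mol.biol = factor(rep(c("BCR/ABL", "NEG"), each = 5))
)

# Default: Welch's test, which does not assume equal variances.
welch <- t.test(age ~ mol.biol, data = toy)

# var.equal = TRUE pools the variance of the two groups.
pooled <- t.test(age ~ mol.biol, data = toy, var.equal = TRUE)

# Compare the pieces of the output that can differ between the two.
welch$method
pooled$method
welch$parameter     # degrees of freedom, Welch approximation
pooled$parameter    # degrees of freedom, n1 + n2 - 2
welch$conf.int
pooled$conf.int
```

With the real data, the same comparison would be `bcrabl %>% t.test(age ~ mol.biol, data = ., var.equal = TRUE)`.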