---
title: "2. Working with Data: _SummarizedExperiment_"
author: "Martin Morgan (martin.morgan@roswellpark.org)
    Roswell Park Cancer Institute, Buffalo, NY
    5 - 9 October, 2015"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  % \VignetteIndexEntry{2. Working with Data: SummarizedExperiment}
  % \VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
options(width=100, max.print=1000)
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```

```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
suppressPackageStartupMessages({
    library(ALL)
    library(airway)
})
```

The material in this course requires R version 3.2 and Bioconductor version 3.2

```{r configure-test}
stopifnot(
    getRversion() >= '3.2' && getRversion() < '3.3',
    BiocInstaller::biocVersion() == "3.2"
)
```

Your boss has been working on Acute Lymphocytic Leukemia (ALL) for many years. One data set consists of microarray gene expression values for 12625 genes in 128 different samples. Your boss would like to analyze different subsets of the data, and has given you a couple of tab-delimited files. One file (_ALLphenoData.tsv_) describes the samples, the other (_ALLassay.tsv_) contains pre-processed gene expression data. You're supposed to come up with a way to create the subsets your boss asks for. You realize that you could read the data into Excel and manipulate it there, but you're concerned about being able to do reproducible research and you're nervous about the bookkeeping errors that always seem to come up. So you think you'll give _Bioconductor_ a try...

# Read the data into _R_

Download the [ALLphenoData.tsv][] and [ALLassay.tsv][] files to the current working directory, `getwd()`.

## Use `read.table()` to read _ALLphenoData.tsv_

```{r read.table}
fname <- "ALLphenoData.tsv"   ## use file.choose() to find the file
pdata <- read.table(fname)
```

Explore basic properties of the object you've created.

```{r}
dim(pdata)
head(pdata)
summary(pdata$sex)
summary(pdata$cyto.normal)
```

Remind yourselves about various ways to subset and access columns of a data.frame

```{r ALL-subset}
pdata[1:5, 3:4]
pdata[1:5, ]
head(pdata[, 3:5])
tail(pdata[, 3:5], 3)
head(pdata$age)
head(pdata$sex)
head(pdata[pdata$age > 21,])
```

## Use `read.table()` to read the expression values

```{r exprs}
fname <- "ALLassay.tsv"
exprs <- as.matrix(read.table(fname, check.names=FALSE))
```

Use `dim()` to figure out the number of rows and columns in the expression data. Use subscripts to look at the first few rows and columns `exprs[1:5, 1:5]`. What are the row names? Do the column names agree with the row names of the `pdata` object? What is the `range()` of the expression data? Can you create a histogram (hint: `hist()`) of the data? What is `plot(density(exprs))`? Can you use `plot()` and `lines()` to plot the density of each sample, in a single figure?

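For the last question, one possible approach is sketched below. It is a minimal illustration on a small simulated matrix rather than the real `exprs` object, so the names `sim`, `dens`, `xlim`, and `ylim` and all of the values are assumptions made for the example. It is shown as a plain `r` block, so it is displayed but not evaluated when the document is knit.

```r
## Simulated stand-in for the expression matrix: 200 "genes" x 6 "samples".
## The values are made up; substitute the real `exprs` matrix when available.
set.seed(123)
sim <- matrix(rnorm(200 * 6, mean = 6, sd = 1.5), nrow = 200,
              dimnames = list(paste0("gene", 1:200), paste0("sample", 1:6)))

## One density estimate per column (i.e., per sample)
dens <- lapply(seq_len(ncol(sim)), function(j) density(sim[, j]))

## Common axis limits so that every curve fits in the figure
xlim <- range(sapply(dens, function(d) range(d$x)))
ylim <- c(0, max(sapply(dens, function(d) max(d$y))))

## Set up the plot with the first sample, then add the others with lines()
plot(dens[[1]], xlim = xlim, ylim = ylim,
     main = "Per-sample densities", xlab = "Expression")
for (i in seq_along(dens)[-1])
    lines(dens[[i]], col = i)
```

Computing all densities first makes it easy to choose axis limits that accommodate every sample; calling `plot()` on the first sample alone might clip later curves.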

# Make a _SummarizedExperiment_ object

You could work with the matrix and data frame directly, but it is better to put these related parts of the data into a single object, a _SummarizedExperiment_. Load the appropriate _Bioconductor_ package

```{r SummarizedExperiment}
if (BiocInstaller::biocVersion() >= "3.2") {
    library(SummarizedExperiment)
} else {
    library(GenomicRanges)
}
```

and create a single _SummarizedExperiment_ object from the two parts of the data. Some _Bioconductor_ objects enhance the behavior of base _R_ objects; an example of this is `DataFrame()`

```{r make-SE}
se <- SummarizedExperiment(exprs, colData=DataFrame(pdata))
```

Explore the object, noting that you can retrieve the original elements, and can subset in a coordinated fashion.

```{r se-ops}
head(colData(se))
assay(se)[1:5, 1:5]
se$sex %in% "M"
males <- se[,se$sex %in% "M"]
males
assay(males)[1:5, 1:5]
```

Use `vignette("SummarizedExperiment")` to read about other operations on _SummarizedExperiment_.

# Show off your skills

Quickly create the following subsets of data for your boss:

1. All women in the study.
2. All women over 40.
3. An object `bcrabl` containing individuals with `mol.biol` belonging either to "BCR/ABL" or "NEG".

Can you...?

1. Create a new column that simplifies the `BT` column (which lists different B- and T-cell subtypes) to contain just `B` or `T`, e.g., re-coding B, B1, B2, B3 and B4 to simply `B`, and likewise for `T`?
2. Use `aggregate()` to calculate the average age of males and females in the BCR/ABL and NEG treatment groups?
3. Use `t.test()` to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using `boxplot()`. In both cases, use the `formula` interface. Consult the help page `?t.test` and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change?

# Document your work

Summarize the exercises above in a simple script.
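
Such a script might start along the lines of the sketch below. It runs on a simulated data frame, not on the real sample annotation, so every value is made up; the object names (`pd`, `women`, `women40`, `avg_age`, `tt_welch`, `tt_equal`), the new column name `BT1`, and the code `"F"` for women are assumptions made for the example. It is shown as a plain `r` block, so it is displayed but not evaluated when the document is knit.

```r
## Simulated stand-in for the sample annotation; all values are made up.
## Column names follow the exercises; the code "F" for women is an assumption.
set.seed(42)
n <- 60
pd <- data.frame(
    sex = sample(c("F", "M"), n, replace = TRUE),
    age = sample(16:60, n, replace = TRUE),
    BT = sample(c("B", "B1", "B2", "B3", "B4", "T", "T1", "T2"), n,
                replace = TRUE),
    mol.biol = sample(c("BCR/ABL", "NEG", "ALL1/AF4"), n, replace = TRUE),
    stringsAsFactors = FALSE
)

## Subsets: women, women over 40, and the BCR/ABL or NEG groups.
## %in% treats NA as "no match", so missing values drop out of the subset.
women <- pd[pd$sex %in% "F", ]
women40 <- pd[pd$sex %in% "F" & !is.na(pd$age) & pd$age > 40, ]
bcrabl <- pd[pd$mol.biol %in% c("BCR/ABL", "NEG"), ]
bcrabl$mol.biol <- factor(bcrabl$mol.biol)   # only the two groups in use

## Simplify BT to its first letter, "B" or "T" (BT1 is a hypothetical name)
bcrabl$BT1 <- factor(substr(bcrabl$BT, 1, 1))

## Average age by sex and group, using the formula interface
avg_age <- aggregate(age ~ sex + mol.biol, data = bcrabl, FUN = mean)

## Welch t-test (the default), then the equal-variance version
tt_welch <- t.test(age ~ mol.biol, data = bcrabl)
tt_equal <- t.test(age ~ mol.biol, data = bcrabl, var.equal = TRUE)

boxplot(age ~ mol.biol, data = bcrabl, ylab = "Age")
```

With a _SummarizedExperiment_, the same logical expressions should carry over to column subsetting, in the style of `se[, se$sex %in% "M"]` used earlier. Comparing `tt_welch` and `tt_equal` shows where the equal-variance assumption enters: the degrees of freedom become the number of individuals minus two, and the test statistic, p-value, and confidence interval can shift accordingly.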
Can you figure out how to write a 'markdown' document that includes R code chunks, as well as text describing what you did, and figures and tables showing the results?

# Resources

Acknowledgements

- Core (Seattle): Sonali Arora, Marc Carlson, Nate Hayden, Jim Hester, Valerie Obenchain, Hervé Pagès, Paul Shannon, Dan Tenenbaum.

- The research reported in this presentation was supported by the National Cancer Institute and the National Human Genome Research Institute of the National Institutes of Health under Award numbers U24CA180996 and U41HG004059, and the National Science Foundation under Award number 1247813. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation.

## `sessionInfo()`

```{r sessionInfo}
sessionInfo()
```

[ALLphenoData.tsv]: https://raw.githubusercontent.com/Bioconductor/BiocUruguay2015/master/vignettes/ALLphenoData.tsv
[ALLassay.tsv]: https://raw.githubusercontent.com/Bioconductor/BiocUruguay2015/master/vignettes/ALLassay.tsv