```{r setup, echo=FALSE}
library(LearnBioconductor)
stopifnot(BiocInstaller::biocVersion() == "3.1")
```

```{r style, echo=FALSE, results='asis'}
BiocStyle::markdown()
knitr::opts_chunk$set(tidy=FALSE)
```

# Common Sequence Analysis Work Flows

Martin Morgan, Sonali Arora<br/>
February 3, 2015

## RNA-seq

See the [lecture notes](b02.1_rnaseq.html) and [lab](b02.1_rnaseqlab.html).

RNA-seq differential expression of known _genes_

- Simplest scenario
- Experimental design: simple, replicated; track covariates and be aware of batch effects
- Sequencing: moderate length and number of reads; single or paired-end (though probably paired-end).
- Alignment: basic splice-aware aligners, e.g., _Bowtie2_, _STAR_. Viable _Bioconductor_ approaches: `r Biocpkg("Rsubread")`, `r Biocpkg("Rbowtie")` (especially via the `r Biocpkg("QuasR")` package).
- Reduction: `GenomicAlignments::summarizeOverlaps()` or external tools, using gene models from `TxDb.*` packages or GFF / GTF files. End result: count matrix.
- Analysis: `r Biocpkg("DESeq2")`, `r Biocpkg("edgeR")`, and additional software.

RNA-seq differential expression of known and novel _transcripts_

- Popular non-_R_ work flow: _bowtie2_, _tophat_, _cufflinks_, _cuffdiff_.
- _Bioconductor_ options
    - `r Biocpkg("DEXSeq")`: differential _exon_ use.
    - `Rsubread::subjunc()` for aligning without requiring known gene models.
    - `r Biocpkg("cummeRbund")`: working with _cufflinks_ output.

Single-cell expression

- `r Biocpkg("monocle")`

## ChIP-seq

See my recent [slides](//www.andersvercelli.com/help/course-materials/2014/CSAMA2014/4_Thursday/lectures/ChIPSeq_slides.pdf) outlining ChIP-seq and relevant _Bioconductor_ software.

- Experimental design / wet lab: it is important to enrich genomic DNA effectively via ChIP; otherwise it is hard to distinguish signal peaks from background.
- Sequencing: a moderate length and number of single-end reads is adequate.
- Alignment: basic aligners are sufficient.
- Reduction
    - External software; many tools depending on application, e.g., _MACS_.
    - Product: BED and / or WIG files of called peaks
- Analysis & comprehension
    - `r Biocpkg("ChIPQC")` for quality control.
    - `r Biocpkg("rtracklayer")` to input BED and WIG files to standard _Bioconductor_ data structures.
    - `r Biocpkg("ChIPpeakAnno")`, `r Biocpkg("ChIPXpress")` for annotating peaks in relation to genes.
    - `r Biocpkg("DiffBind")` to assess differential representation of peaks in a designed experiment.
    - `r Biocpkg("AnnotationHub")` for accessing (some) consortium-level summary data.
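As a rough illustration of the `rtracklayer` step in the ChIP-seq work flow, the sketch below reads called peaks from a BED file into a `GRanges` object and counts overlaps with gene ranges. The peak coordinates, the temporary file, and the `genes` object with its `gene_id` column are made up for the example; a real analysis would start from peak-caller output and take gene models from an annotation package.

```r
library(rtracklayer)
library(GenomicRanges)

# Write a tiny BED file of hypothetical peaks (chrom, start, end, name, score)
bed <- tempfile(fileext = ".bed")
writeLines(c(
    "chr1\t100\t200\tpeak1\t50",
    "chr1\t500\t650\tpeak2\t80",
    "chr2\t300\t400\tpeak3\t20"
), bed)

# import() returns a GRanges; BED's 0-based starts are converted to 1-based
peaks <- import(bed, format = "BED")

# Hypothetical gene ranges, used here only to show the overlap step
genes <- GRanges("chr1", IRanges(c(150, 1000), width = 100),
                 gene_id = c("geneA", "geneB"))

# Number of genes overlapped by each peak
hits <- countOverlaps(peaks, genes)
```

Once peaks are a `GRanges`, the usual range operations (`findOverlaps()`, `nearest()`, `subsetByOverlaps()`) apply to them directly.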
## Copy Number

Experimental design

- Duplications or deletions larger than 1 kb
- Germ line (primarily diploid genome, homogeneous sample, integer copy numbers) or somatic variants?
- Tumor / normal pairs?

Assays

- aCGH, SNP, and other arrays (`r Biocpkg("CGHbase")`, `r Biocpkg("crlmm")`, `r Biocpkg("CopyNumber450K")`)
- Low or high-coverage; exome or whole-genome sequencing.

Reduction

- Bin and count. GC and other (e.g., exon length) correction. Easily and efficiently done with, e.g., `GenomicRanges::tileGenome()` and `r Biocpkg("GenomicFiles")`.
- Segment -- circular binary segmentation (often via `r Biocpkg("DNAcopy")`), HMM

Analysis & comprehension

- 45 packages tagged with "CopyNumberVariation" in [biocViews](//www.andersvercelli.com/packages/devel/BiocViews.html#___CopyNumberVariation); also terms "DNASeq", "ExomeSeq", "WholeGenome"
- Represent duplicated regions as genomic ranges; integrates very easily in _Bioconductor_ annotation work flows.

## Variants

See Michael Lawrence's variant calling with [VariantTools](//www.andersvercelli.com/help/course-materials/2014/BioC2014/Lawrence_Tutorial.pdf) and Val Obenchain's manipulation and annotation of called variants with [VariantAnnotation](//www.andersvercelli.com/help/workflows/variants/).

- Sequencing: requires high-quality reads with high per-nucleotide depth of coverage -- longer, paired-end sequencing.
- Alignment: requires effective aligners; _BWA_, _GMAP_, ...
    - `r Biocpkg("gmapR")` wraps the GMAP aligner in _R_.
- Reduction: typically to VCF files summarizing variants and / or population-level variation. _GATK_ and other non-_R_ tools commonly used.
    - `r Biocpkg("VariantTools")` includes facilities for calling variants.
    - `r Biocpkg("h5vc")` targets a different intermediate step: summarize base counts at each position in the genome; use this as a starting point for calling variants, and to evaluate false positives, etc.
- Analysis & comprehension
    - `r Biocpkg("VariantAnnotation")`, `r Biocpkg("ensemblVEP")` for querying / inputting VCF files, and for annotation of variants ("is this a coding variant?", etc.).
    - `r Biocpkg("SomaticSignatures")` for working with somatic signatures of single-nucleotide variants.

## Epigenomics

See the short [introduction](//www.andersvercelli.com/help/course-materials/2014/Epigenomics/MethylationArrays.html) and [lab](//www.andersvercelli.com/help/course-materials/2014/Epigenomics/MethylationArrays-lab.html) centered around Illumina 450k methylation arrays and the `r Biocpkg("minfi")` package.

- Analysis & comprehension: `r Biocpkg("bsseq")`, `r Biocpkg("BiSeq")` for processing and analysis; `r Biocpkg("bumphunter")` as a basic tool for identifying CpG features.

## Microbiome

- Experimental design: typically population-level surveys with moderate numbers (10s to 100s) of samples.
- Wet lab & sequencing: often target phylogenetically informative genes, requiring longer (overlapping) paired-end reads. Many existing studies used 454 technology, which has a different sequencing error model than Illumina (e.g., homopolymers are a common error, instead of trailing nucleotide quality deterioration).
- Reduction: pre-processing (e.g., knitting together overlapping paired-end reads) and taxonomic classification / placement in third-party software, e.g., _QIIME_, _pplacer_. End result: count table summarizing representation of distinct taxa in each sample.
    - `r Biocpkg("rRDP")` provides an _R_ / _Bioconductor_ interface to the RDP classifier.
- Analysis: _R_ / _Bioconductor_ and many insights from microarray / RNA-seq analysis are well suited to the count table, but common pipelines have re- or dis-invented the wheel.
    - `r Biocpkg("phyloseq")` provides very nice tools for general analysis.
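To make the "count table" end product of the microbiome reduction step concrete, here is a minimal base-_R_ sketch. The taxa, samples, and counts are invented for illustration, and the Shannon index is just one convenient summary; it is not necessarily what `phyloseq` or any particular pipeline computes by default.

```r
# Hypothetical count table: rows are taxa, columns are samples (made-up numbers)
counts <- matrix(
    c(10,  0,  5,
      30, 20,  5,
      60, 80, 90),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("taxonA", "taxonB", "taxonC"), c("s1", "s2", "s3"))
)

# Relative abundance within each sample (each column sums to 1)
relabund <- sweep(counts, 2, colSums(counts), "/")

# Shannon diversity per sample; taxa with zero counts contribute nothing
shannon <- apply(relabund, 2, function(p) {
    p <- p[p > 0]
    -sum(p * log(p))
})
```

A table like `counts`, together with sample and taxonomy annotations, is the kind of input that packages such as `phyloseq` build their analysis objects from.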