---
title: "Appendix: Gene Expression Using Salmon"
author: "Martin Morgan"
vignette: >
  %\VignetteIndexEntry{Appendix: Gene Expression Using Salmon}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
knitr::opts_chunk$set(
    eval = as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache = as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))
)
suppressPackageStartupMessages({
    library(Biostrings)
    library(ShortRead)
    library(rtracklayer)
    library(tximport)
    library(SummarizedExperiment)
    library(DESeq2)
    library(tidyverse)
})
```

# Introduction

This appendix introduces some of the 'upstream' steps required for a known-gene differential expression analysis, using an ultra-fast alignment work flow that quantifies data directly from FASTQ files, without intermediate BAM files. We follow the [salmon tutorial][], _R_-ified at later stages. We assume a Linux operating system with a folder `~/a/salmon/tutorial`, and `salmon` installed at `~/bin/salmon`.

```{r}
TUTORIAL <- "~/a/salmon/tutorial"
SALMON <- "~/bin/salmon"
```

[salmon tutorial]: https://combine-lab.github.io/salmon/getting_started/

# Download

The tutorial requires FASTQ files representing (paired-end) sequenced samples. We use the `wget` command line tool to retrieve them from the Sequence Read Archive (SRA).

```
#! /bin/bash
DIR=~/a/salmon/tutorial
SRA=ftp://ftp.sra.ebi.ac.uk/vol1/fastq
for i in `seq 25 40`; do
    mkdir -p ${DIR}/data/DRR0161${i};
    cd ${DIR}/data/DRR0161${i};
    wget -bqc ${SRA}/DRR016/DRR0161${i}/DRR0161${i}_1.fastq.gz;
    wget -bqc ${SRA}/DRR016/DRR0161${i}/DRR0161${i}_2.fastq.gz;
done
cd $DIR
```

We will align the sequences to a reference transcriptome, in this case from Ensembl for _Arabidopsis thaliana_. We download the sequence and index it.

```
#! /bin/bash
DIR=~/a/salmon/tutorial
ENSEMBL=ftp://ftp.ensemblgenomes.org/pub/plants/release-28
curl \
    ${ENSEMBL}/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz \
    -o ${DIR}/athal.fa.gz
salmon index -t ${DIR}/athal.fa.gz -i ${DIR}/athal_index
```

If we were interested, it is easy enough to read this into _R_.

```{r}
library(Biostrings)
readDNAStringSet(file.path(TUTORIAL, "athal.fa.gz"))
```

We also need to know how transcripts map to genes, since we will do a gene-level analysis. We download the annotation file (GFF3) matching the transcript database.

```
#! /bin/bash
DIR=~/a/salmon/tutorial
ENSEMBL=ftp://ftp.ensemblgenomes.org/pub/plants/release-28
curl \
    ${ENSEMBL}/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.28.gff3.gz \
    -o ${DIR}/Arabidopsis_thaliana.TAIR10.28.gff3.gz
```

# Quantification

The tutorial bash script invokes `salmon` on each sample, using the paired-end FASTQ files.

```
#! /bin/bash
DIR=/home/ubuntu/tutorial
SALMON=/home/ubuntu/bin/salmon
for fn in $DIR/data/DRR0161{25..40}; do
    sample=`basename ${fn}`
    echo "Processing sample ${sample}"
    ${SALMON} quant -i athal_index -l A \
        -1 ${fn}/${sample}_1.fastq.gz \
        -2 ${fn}/${sample}_2.fastq.gz \
        -p 8 -o quants/${sample}_quant
done
```

An _R_ version to quantify one sample is

```{r}
quantify = function(sample_file, index, output, salmon, doit = TRUE) {
    ## run `salmon quant` on one paired-end sample; with doit = FALSE,
    ## only display the command that would be run
    sample = basename(sample_file)
    samples = file.path(sample_file, paste0(sample, "_", 1:2, ".fastq.gz"))
    output = file.path(output, paste0(sample, "_quant"))
    args = c(
        "quant", "-i", index, "-l A",
        "-1", samples[1], "-2", samples[2],
        "-p", parallel::detectCores(),
        "-o", output
    )
    if (doit) {
        system2(salmon, args)
    } else {
        txt = paste(salmon, paste(args, collapse = " "))
        message(txt)
        invisible(0)
    }
}
```

Here we quantify all samples (set `doit = TRUE` to actually perform the counting step; as written, the commands are only displayed).

```{r, eval = FALSE}
index = file.path(TUTORIAL, "athal_index")
output = file.path(TUTORIAL, "quants")
data_dir = file.path(TUTORIAL, "data")
sample_files = normalizePath(dir(data_dir, full = TRUE))
args = list(index = index, salmon = SALMON, output = output, doit = FALSE)
Map(quantify, sample_files, MoreArgs = args)
```
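Before moving on, it can be worth a quick look at what `salmon` wrote for one sample. This is a sketch rather than part of the tutorial: it assumes the quantification above was actually run (`doit = TRUE`), and that each output directory contains a `quant.sf` file with salmon's usual columns (`Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`).

```{r, eval = FALSE}
## Sketch: peek at one sample's transcript-level estimates. Assumes the
## quantification above has been run, so quants/DRR016125_quant/quant.sf
## (and the corresponding files for the other samples) exist.
library(readr)
quant_file = file.path(TUTORIAL, "quants", "DRR016125_quant", "quant.sf")
quant = read_tsv(quant_file)
head(quant)
```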
# Analysis in _R_

We need to provide a mapping between the transcripts that the reads were aligned to and the genes that we will perform the analysis on.

## Transcript - gene map

We import the GFF annotation file, and extract the transcripts and their parents (genes) as a tibble.

```{r}
library(rtracklayer)
library(tidyverse)
file = file.path(TUTORIAL, "Arabidopsis_thaliana.TAIR10.28.gff3.gz")
gff = import(file)
tx2gene = tibble(
    txid = gff$transcript_id,
    geneid = as.character(gff$Parent)
) %>% na.omit()
```

## Input

To import the count data, we use the [tximport][] package. We find the relevant count files, and provide names to identify each file (these names are propagated by the import function to the column names of the output count matrices).

```{r}
library(tximport)
library(SummarizedExperiment)
files = dir(
    file.path(TUTORIAL, "quants"), pattern = ".sf",
    recursive = TRUE, full = TRUE
)
names(files) = sub("_quant", "", basename(dirname(files)))
counts = tximport(files, type = "salmon", tx2gene = tx2gene)
names(counts)
```

The result is a list with three identically dimensioned matrices and a fourth element describing how the other elements were determined. We put the three matrices into a SummarizedExperiment, to connect up with the [DESeq2][] work flow.

```{r}
library(SummarizedExperiment)
se = SummarizedExperiment(counts[-4])
```

## Experimental design

Next steps: add the experimental design as `colData()` on the `SummarizedExperiment`; a sketch follows below.

[tximport]: https://bioconductor.org/packages/tximport
[DESeq2]: https://bioconductor.org/packages/DESeq2
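As a sketch of that next step (not part of the original tutorial): the `condition` factor below is a placeholder, since the actual design of runs DRR016125 to DRR016140 is not described here; substitute the sample annotations from the SRA / ENA run metadata. Note that DESeq2 can be constructed directly from the tximport list (`counts`) rather than from `se`.

```{r, eval = FALSE}
## Sketch only: 'condition' is a placeholder, not the real experimental
## design; replace it with the sample annotations for these runs.
coldata = DataFrame(
    condition = factor(rep(c("A", "B"), length.out = ncol(se))),
    row.names = colnames(se)
)
colData(se) = coldata

## Hand the tximport output and the same sample table to DESeq2;
## DESeqDataSetFromTximport() carries the transcript-length information along.
library(DESeq2)
dds = DESeqDataSetFromTximport(counts, colData = coldata, design = ~ condition)
dds = DESeq(dds)
results(dds)
```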