useR! 2014
Authors: Martin Morgan (mtmorgan@fhcrc.org), Sonali Arora
Date: 30 June 2014
Input & manipulation: Biostrings
    >NM_078863_up_2000_chr2L_16764737_f chr2L:16764737-16766736
    gttggtggcccaccagtgccaaaatacacaagaagaagaaacagcatctt
    gacactaaaatgcaaaaattgctttgcgtcaatgactcaaaacgaaaatg
    ...
    >NM_001201794_up_2000_chr2L_8382455_f chr2L:8382455-8384454
    ttatttatgtaggcgcccccgcagcaaagcactaattccggg
Input & manipulation: ShortRead
readFastq(), FastqStreamer(), FastqSampler()
    @ERR127302.1703 HWI-EAS350_0441:1:1:1460:19184#0/1
    CCTGAGTGAAGCTGATCTTGATCTACGAAGAGATATCTTGATCGTCGAGGAGATGCTGACCTTGACCT
    +
    HHGHHGHHHHHHHHDGG>CE?=896=:
    @ERR127302.1704 HWI-EAS350_0441:1:1:1460:16861#0/1
    GCGGTATGCTGGAAGGTGCTCGAATGGAGAGAGCGCCAGCCCCGGCTGAGAGCCGCAGCCTCAGAGTCCGCCGCCC
    +
    DE?DD>ED4>EEE>DE8EEEDE8B EB<@3;########################
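The four-line record layout shown above (identifier, sequence, separator, quality) can be sketched in base R. This is only an illustration on two made-up records (`fastq_lines` is hypothetical) and assumes unwrapped, exactly four-line records; for real files, `readFastq()` or `FastqStreamer()` from ShortRead are the appropriate tools.

```r
## Hypothetical toy input: two FASTQ records, four lines each
fastq_lines <- c(
    "@read1 hypothetical", "ACGTACGT", "+", "IIIIHHGG",
    "@read2 hypothetical", "TTGCA",    "+", "IIHG#"
)

## One row per record, one column per line of the record
records <- matrix(
    fastq_lines, ncol = 4, byrow = TRUE,
    dimnames = list(NULL, c("id", "sequence", "separator", "quality"))
)

## A valid record has one quality character per base
stopifnot(all(nchar(records[, "sequence"]) == nchar(records[, "quality"])))

reads <- data.frame(
    id       = sub("^@", "", records[, "id"]),   # drop the leading '@'
    sequence = records[, "sequence"],
    quality  = records[, "quality"],
    stringsAsFactors = FALSE
)
reads
```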
Input & manipulation: 'low-level' Rsamtools, scanBam(), BamFile(); 'high-level' GenomicAlignments
Header
    @HD VN:1.0 SO:coordinate
    @SQ SN:chr1 LN:249250621
    @SQ SN:chr10 LN:135534747
    @SQ SN:chr11 LN:135006516
    ...
    @SQ SN:chrY LN:59373566
    @PG ID:TopHat VN:2.0.8b CL:/home/hpages/tophat-2.0.8b.Linux_x86_64/tophat --mate-inner-dist 150 --solexa-quals --max-multihits 5 --no-discordant --no-mixed --coverage-search --microexon-search --library-type fr-unstranded --num-threads 2 --output-dir tophat2_out/ERR127306 /home/hpages/bowtie2-2.1.0/indexes/hg19 fastq/ERR127306_1.fastq fastq/ERR127306_2.fastq
Alignments: ID, flag, alignment and mate
    ERR127306.7941162  403 chr14 19653689 3 72M        = 19652348 -1413 ...
    ERR127306.22648137 145 chr14 19653692 1 72M        = 19650044 -3720 ...
    ERR127306.933914   339 chr14 19653707 1 66M120N6M  = 19653686  -213 ...
    ERR127306.11052450  83 chr14 19653707 3 66M120N6M  = 19652348 -1551 ...
    ERR127306.24611331 147 chr14 19653708 1 65M120N7M  = 19653675  -225 ...
    ERR127306.2698854  419 chr14 19653717 0 56M120N16M = 19653935   290 ...
    ERR127306.2698854  163 chr14 19653717 0 56M120N16M = 19653935  2019 ...
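The second column of each alignment is a bitwise flag. As a minimal base-R sketch (the helper `decode_flag()` is illustrative, not part of Rsamtools; the bit meanings follow the SAM specification), the flags above can be unpacked like this:

```r
## SAM flag bits, as defined in the SAM specification
flag_bits <- c(
    paired         = 0x1,
    proper_pair    = 0x2,
    unmapped       = 0x4,
    mate_unmapped  = 0x8,
    read_reverse   = 0x10,
    mate_reverse   = 0x20,
    first_in_pair  = 0x40,
    second_in_pair = 0x80,
    secondary      = 0x100,
    qc_fail        = 0x200,
    duplicate      = 0x400
)

## Illustrative helper: names of all bits set in a single flag value
decode_flag <- function(flag) {
    is_set <- bitwAnd(as.integer(flag), as.integer(flag_bits)) != 0L
    names(flag_bits)[is_set]
}

decode_flag(83)    # flag of ERR127306.11052450 above
decode_flag(403)   # flag of ERR127306.7941162 above
```

In Rsamtools, the same information is usually requested through `scanBamFlag()` inside a `ScanBamParam()` rather than decoded by hand.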
Alignments: sequence and quality
    ... GAATTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCC *'%%%%%#&&%''#'&%%%)&&%%$%%'%%'&*****$))$)'')'%)))&)%%%%$'%%%%&"))'')%))
    ... TTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAG '**)****)*'*&*********('&)****&***(**')))())%)))&)))*')&***********)****
    ... TGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAGCAGCCTCTGGTTTCT '******&%)&)))&")')'')'*((******&)&'')'))$))'')&))$)**&&****************
    ... TGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAGCAGCCTCTGGTTTCT ##&&(#')$')'%&)%$#$%"%###&!%))'%%''%'))&))#)&%((%())))%)%)))%*********
    ... GAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAGCAGCCTCTGGTTTCTT )&$'$'$%!&&%&!'%'))%''&%'&))))''$""'%'%&%'#'%'"!'')#&)))))%$)%)&'"')))
    ... TTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAGCAGCCTCTGGTTTCTTCATGTGGCT ++++++++++++++++++++++++++++++++++++++*++++++**++++**+**''**+*+*'*)))*)#
    ... TTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAGCAGCCTCTGGTTTCTTCATGTGGCT ++++++++++++++++++++++++++++++++++++++*++++++**++++**+**''**+*+*'*)))*)#
Alignments: tags
    ... AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU NH:i:2 CC:Z:... CP:i:16189276 HI:i:0
    ... AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU NH:i:3 CC:Z:= CP:i:19921600 HI:i:0
    ... AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:4 MD:Z:72 YT:Z:UU XS:A:+ NH:i:3 CC:Z:= CP:i:19921465 HI:i:0
    ... AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:4 MD:Z:72 YT:Z:UU XS:A:+ NH:i:2 CC:Z:... CP:i:16189138 HI:i:0
    ... AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:5 MD:Z:72 YT:Z:UU XS:A:+ NH:i:3 CC:Z:= CP:i:19921464 HI:i:0
    ... AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:72 NM:i:0 XS:A:+ NH:i:5 CC:Z:= CP:i:19653717 HI:i:0
    ... AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:72 NM:i:0 XS:A:+ NH:i:5 CC:Z:= CP:i:19921455 HI:i:1
Input & manipulation: VariantAnnotation
readVcf(), readInfo(), readGeno(), optionally restricted with ScanVcfParam().
Header
    ##fileformat=VCFv4.2
    ##fileDate=20090805
    ##source=myImputationProgramV3.1
    ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
    ##contig=<...>
    ##phasing=partial
    ##INFO=<...>
    ##INFO=<...>
    ...
    ##FILTER=<...>
    ##FILTER=<...>
    ...
    ##FORMAT=<...>
    ##FORMAT=<...>
Location
    #CHROM POS     ID        REF ALT    QUAL FILTER ...
    20     14370   rs6054257 G   A      29   PASS   ...
    20     17330   .         T   A      3    q10    ...
    20     1110696 rs6040355 A   G,T    67   PASS   ...
    20     1230237 .         T   .      47   PASS   ...
    20     1234567 microsat1 GTC G,GTCT 50   PASS   ...
Variant INFO
    #CHROM POS     ... INFO                              ...
    20     14370   ... NS=3;DP=14;AF=0.5;DB;H2           ...
    20     17330   ... NS=3;DP=11;AF=0.017               ...
    20     1110696 ... NS=2;DP=10;AF=0.333,0.667;AA=T;DB ...
    20     1230237 ... NS=3;DP=13;AA=T                   ...
    20     1234567 ... NS=3;DP=9;AA=G                    ...
Genotype FORMAT and samples
    ... POS     ... FORMAT      NA00001        NA00002        NA00003
    ... 14370   ... GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
    ... 17330   ... GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
    ... 1110696 ... GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
    ... 1230237 ... GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
    ... 1234567 ... GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3
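Each sample column is interpreted through the FORMAT column: the colon-separated keys name the colon-separated values. A minimal base-R sketch of that pairing follows; `parse_genotype()` is a hypothetical helper for illustration, whereas `readGeno()` and `readVcf()` do this for whole files.

```r
## Illustrative helper: pair FORMAT keys with one sample's values
parse_genotype <- function(format, sample) {
    keys   <- strsplit(format, ":", fixed = TRUE)[[1]]
    values <- strsplit(sample, ":", fixed = TRUE)[[1]]
    ## trailing fields may be omitted in a VCF record; pad them with NA
    length(values) <- length(keys)
    setNames(values, keys)
}

gt <- parse_genotype("GT:GQ:DP:HQ", "0|0:48:1:51,51")
gt

## '|' separates phased alleles, '/' unphased alleles
grepl("|", gt[["GT"]], fixed = TRUE)

## a sample with the trailing HQ field dropped
parse_genotype("GT:GQ:DP:HQ", "0/0:41:3")
```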
Input: rtracklayer import()
GTF: gene model
Component coordinates
    7 protein_coding gene        27221129 27224842 . - . ...
    7 protein_coding transcript  27221134 27224835 . - . ...
    7 protein_coding exon        27224055 27224835 . - . ...
    7 protein_coding CDS         27224055 27224763 . - 0 ...
    7 protein_coding start_codon 27224761 27224763 . - 0 ...
    7 protein_coding exon        27221134 27222647 . - . ...
    7 protein_coding CDS         27222418 27222647 . - 2 ...
    7 protein_coding stop_codon  27222415 27222417 . - 0 ...
    7 protein_coding UTR         27224764 27224835 . - . ...
    7 protein_coding UTR         27221134 27222414 . - . ...
Annotations
    gene_id "ENSG00000005073"; gene_name "HOXA11"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
    ... transcript_id "ENST00000006015"; transcript_name "HOXA11-001"; transcript_source "ensembl_havana"; tag "CCDS"; ccds_id "CCDS5411";
    ... exon_number "1"; exon_id "ENSE00001147062";
    ... exon_number "1"; protein_id "ENSP00000006015";
    ... exon_number "1";
    ... exon_number "2"; exon_id "ENSE00002099557";
    ... exon_number "2"; protein_id "ENSP00000006015";
    ... exon_number "2";
    ...
    ...
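The annotation column is a semicolon-separated list of `key "value"` pairs. A minimal base-R sketch of pulling them apart is given here; `parse_gtf_attributes()` is a hypothetical helper and assumes every value is double-quoted, as in the lines above. In practice `rtracklayer::import()` returns these fields as metadata columns.

```r
## Illustrative helper: split a GTF attribute string into a named vector
parse_gtf_attributes <- function(x) {
    ## each match looks like: gene_id "ENSG00000005073"
    pairs  <- regmatches(x, gregexpr('[A-Za-z_]+ "[^"]*"', x))[[1]]
    keys   <- sub(' ".*$', "", pairs)
    values <- sub('^[^"]*"([^"]*)"$', "\\1", pairs)
    setNames(values, keys)
}

attrs <- parse_gtf_attributes(
    'gene_id "ENSG00000005073"; gene_name "HOXA11"; gene_source "ensembl_havana"; gene_biotype "protein_coding";'
)
attrs[["gene_name"]]
names(attrs)
```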
Ranges
start() / end() / width()
length(), subsetting, etc.; mcols()
Seqinfo, including seqlevels and seqlengths
Intra-range methods
shift(), narrow(), flank(), promoters(), resize(), restrict(), trim()
See ?"intra-range-methods"
Inter-range methods
range(), reduce(), gaps(), disjoin()
coverage() (!)
See ?"inter-range-methods"
Between-range methods
findOverlaps(), countOverlaps(), ..., %over%, %within%, %outside%; union(), intersect(), setdiff(), punion(), pintersect(), psetdiff()
Example
    require(GenomicRanges)
    gr <- GRanges("A", IRanges(c(10, 20, 22), width=5), "+")
    shift(gr, 1)                            # 1-based coordinates!

    ## GRanges with 3 ranges and 0 metadata columns:
    ##       seqnames    ranges strand
    ##          <Rle> <IRanges>  <Rle>
    ##   [1]        A  [11, 15]      +
    ##   [2]        A  [21, 25]      +
    ##   [3]        A  [23, 27]      +
    ##   ---
    ##   seqlengths:
    ##     A
    ##    NA
    range(gr)                               # inter-range

    ## GRanges with 1 range and 0 metadata columns:
    ##       seqnames    ranges strand
    ##          <Rle> <IRanges>  <Rle>
    ##   [1]        A  [10, 26]      +
    ##   ---
    ##   seqlengths:
    ##     A
    ##    NA
    reduce(gr)                              # inter-range

    ## GRanges with 2 ranges and 0 metadata columns:
    ##       seqnames    ranges strand
    ##          <Rle> <IRanges>  <Rle>
    ##   [1]        A  [10, 14]      +
    ##   [2]        A  [20, 26]      +
    ##   ---
    ##   seqlengths:
    ##     A
    ##    NA
    coverage(gr)

    ## RleList of length 1
    ## $A
    ## integer-Rle of length 26 with 6 runs
    ##   Lengths: 9 5 5 2 3 2
    ##   Values : 0 1 0 1 2 1
    setdiff(range(gr), gr)                  # 'introns'

    ## GRanges with 1 range and 0 metadata columns:
    ##       seqnames    ranges strand
    ##          <Rle> <IRanges>  <Rle>
    ##   [1]        A  [15, 19]      +
    ##   ---
    ##   seqlengths:
    ##     A
    ##    NA
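The example above covers intra- and inter-range methods; a between-range counterpart can be sketched the same way. Here `query` is a hypothetical second set of ranges introduced only for illustration, and `gr` is rebuilt so the block runs on its own.

```r
library(GenomicRanges)

## same ranges as in the example above: [10,14], [20,24], [22,26]
gr <- GRanges("A", IRanges(c(10, 20, 22), width=5), "+")

## hypothetical query ranges: [12,21] and [30,39]
query <- GRanges("A", IRanges(c(12, 30), width=10), "+")

## number of ranges in 'gr' overlapped by each query range;
## [12,21] touches [10,14] and [20,24], [30,39] touches nothing
n_hits <- countOverlaps(query, gr)
n_hits

## logical summary of the same question
is_hit <- query %over% gr
is_hit

## the individual (query, subject) pairs
hits <- findOverlaps(query, gr)
length(hits)
```

`countOverlaps()` and `%over%` are convenient when only a summary per query is needed; `findOverlaps()` keeps the pairing, which can then be used to index either object.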
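IRangesList, GRangesList
Many methods are *List-aware, but a common 'trick' is to apply a vectorized function to the unlisted representation, then relist the result:
    grl <- GRangesList(...)
    orig_gr <- unlist(grl)
    transformed_gr <- FUN(orig_gr)
    transformed_grl <- relist(transformed_gr, grl)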
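A concrete, runnable instance of the unlist / transform / relist pattern might look as follows. The two-element `grl` and the choice of `shift()` as the vectorized function are assumptions made purely for illustration.

```r
library(GenomicRanges)

## hypothetical GRangesList with two groups ('genes') of ranges
grl <- GRangesList(
    gene1 = GRanges("A", IRanges(c(10, 20), width=5), "+"),
    gene2 = GRanges("A", IRanges(40, width=5), "+")
)

orig_gr <- unlist(grl)                        # one flat GRanges of 3 ranges
transformed_gr <- shift(orig_gr, 1)           # a single vectorized call
transformed_grl <- relist(transformed_gr, grl)  # restore the grouping of 'grl'

transformed_grl
```

The function is called once on all ranges rather than once per list element, which is what makes the pattern efficient on lists with many elements.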
Reference
Classes
Methods
reverseComplement(), letterFrequency(), matchPDict(), matchPWM()
Related packages
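Example
Whole-genome sequences are distributed by Ensembl, NCBI, and others as FASTA files; model organism whole-genome sequences are packaged into more user-friendly BSgenome packages. The following calculates GC content across chr14.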
    require(BSgenome.Hsapiens.UCSC.hg19)
    ## assumed definition: a range spanning all of chr14
    chr14_range <- GRanges("chr14", IRanges(1, seqlengths(Hsapiens)["chr14"]))
    chr14_dna <- getSeq(Hsapiens, chr14_range)
    letterFrequency(chr14_dna, "GC", as.prob=TRUE)

    ##         G|C
    ## [1,] 0.3363
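The same call can be tried on a short sequence without downloading a genome package. This is a minimal sketch on a made-up seven-base string; `dna` is hypothetical.

```r
library(Biostrings)

## hypothetical toy sequence: one G and one C among seven bases
dna <- DNAString("GATTACA")

gc_count <- letterFrequency(dna, "GC")                # combined count of G and C
gc_prob  <- letterFrequency(dna, "GC", as.prob=TRUE)  # as a proportion

gc_count
gc_prob
```

Passing `"GC"` as a single string asks for the combined frequency of both letters, reported under the name `G|C` as in the chr14 output above.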
Classes: GenomicRanges-like behavior
Methods
readGAlignments(), readGAlignmentsList()
summarizeOverlaps()
Example
Find reads supporting the junction identified above, at position 19653707 + 66M = 19653773 of chromosome 14.
    require(GenomicRanges)
    require(GenomicAlignments)

    ## Loading required package: Rsamtools

    ## our 'region of interest'
    roi <- GRanges("chr14", IRanges(19653773, width=1))
    ## sample data
    require('RNAseqData.HNRNPC.bam.chr14')

    ## Loading required package: RNAseqData.HNRNPC.bam.chr14

    bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[[1]], asMates=TRUE)
    ## alignments, junctions, overlapping our roi
    paln <- readGAlignmentsList(bf)
    j <- summarizeJunctions(paln, with.revmap=TRUE)
    j_overlap <- j[j %over% roi]
    ## supporting reads
    paln[j_overlap$revmap[[1]]]
    ## GAlignmentsList of length 8:
    ## [[1]]
    ## GAlignments with 2 alignments and 0 metadata columns:
    ##       seqnames strand      cigar qwidth    start      end width njunc
    ##   [1]    chr14      -  66M120N6M     72 19653707 19653898   192     1
    ##   [2]    chr14      + 7M1270N65M     72 19652348 19653689  1342     1
    ##
    ## [[2]]
    ## GAlignments with 2 alignments and 0 metadata columns:
    ##       seqnames strand     cigar qwidth    start      end width njunc
    ##   [1]    chr14      - 66M120N6M     72 19653707 19653898   192     1
    ##   [2]    chr14      +       72M     72 19653686 19653757    72     0
    ##
    ## [[3]]
    ## GAlignments with 2 alignments and 0 metadata columns:
    ##       seqnames strand     cigar qwidth    start      end width njunc
    ##   [1]    chr14      +       72M     72 19653675 19653746    72     0
    ##   [2]    chr14      - 65M120N7M     72 19653708 19653899   192     1
    ##
    ## ...
    ## <5 more elements>
    ## ---
    ## seqlengths:
    ##       chr1     chr10 ...      chrY
    ##  249250621 135534747 ...  59373566
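The position used for the region of interest, and the `end` and `width` columns in the output, follow from the CIGAR strings. The base-R sketch below reproduces that arithmetic; `cigar_ops()` and `reference_width()` are hypothetical helpers written for this illustration (GenomicAlignments provides its own CIGAR utilities, such as `cigarWidthAlongReferenceSpace()`).

```r
## Illustrative helper: split a CIGAR string into operations and lengths
cigar_ops <- function(cigar) {
    lens <- as.integer(regmatches(cigar, gregexpr("[0-9]+", cigar))[[1]])
    ops  <- regmatches(cigar, gregexpr("[MIDNSHP=X]", cigar))[[1]]
    data.frame(op = ops, len = lens, stringsAsFactors = FALSE)
}

## Illustrative helper: bases spanned on the reference.
## M, D, N, = and X consume reference positions; I, S, H and P do not.
reference_width <- function(cigar) {
    x <- cigar_ops(cigar)
    sum(x$len[x$op %in% c("M", "D", "N", "=", "X")])
}

aln_start <- 19653707
ops <- cigar_ops("66M120N6M")

## first skipped (intronic) position: start plus the 66 matched bases
junction_start <- aln_start + ops$len[1]
junction_start

## width and end of the alignment on the reference
aln_width <- reference_width("66M120N6M")
aln_end   <- aln_start + aln_width - 1
c(width = aln_width, end = aln_end)
```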
Classes: GenomicRanges-like behavior
Functions and methods
readVcf(), readGeno(), readInfo(), readGT(), writeVcf(), filterVcf()
locateVariants() (variants overlapping ranges), predictCoding(), summarizeVariants()
genotypeToSnpMatrix(), snpSummary()
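Example
Read variants from a VCF file and annotate them with respect to a known gene model.
    ## input variants
    require(VariantAnnotation)
    fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
    vcf <- readVcf(fl, "hg19")
    seqlevels(vcf) <- "chr22"
    ## known gene model
    require(TxDb.Hsapiens.UCSC.hg19.knownGene)
    ## rowData() as in the 2014 release; later Bioconductor releases use rowRanges()
    coding <- locateVariants(rowData(vcf),
        TxDb.Hsapiens.UCSC.hg19.knownGene,
        CodingVariants())
    head(coding)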
    ## GRanges with 6 ranges and 7 metadata columns:
    ##       seqnames               ranges strand | LOCATION   QUERYID      TXID
    ##          <Rle>            <IRanges>  <Rle> | <factor> <integer> <integer>
    ##   [1]    chr22 [50301422, 50301422]      - |   coding        24     75253
    ##   [2]    chr22 [50301476, 50301476]      - |   coding        25     75253
    ##   [3]    chr22 [50301488, 50301488]      - |   coding        26     75253
    ##   [4]    chr22 [50301494, 50301494]      - |   coding        27     75253
    ##   [5]    chr22 [50301584, 50301584]      - |   coding        28     75253
    ##   [6]    chr22 [50302962, 50302962]      - |   coding        57     75253
    ##           CDSID      GENEID       PRECEDEID        FOLLOWID
    ##       <integer> <character> <CharacterList> <CharacterList>
    ##   [1]    218562       79087
    ##   [2]    218562       79087
    ##   [3]    218562       79087
    ##   [4]    218562       79087
    ##   [5]    218562       79087
    ##   [6]    218563       79087
    ##   ---
    ##   seqlengths:
    ##    chr22
    ##       NA
Related packages
Reference
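Restriction
ScanBamParam() restricts input to the desired data at specific genomic ranges.
Iteration
The yieldSize argument of BamFile(), or FastqStreamer(), allows iteration through large files.
Compression
The Rle (run-length encoding) class; lists such as GRangesList efficiently maintain the illusion that vector elements are grouped.
Parallel processing
Reference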
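The compression strategy listed above can be illustrated with base R's `rle()`, a plain analogue of the Bioconductor `Rle` class. This sketch recomputes, by hand, the per-position coverage of the three ranges from the earlier GenomicRanges example ([10,14], [20,24], [22,26]) and shows that 26 values compress to 6 runs.

```r
## per-position depth over positions 1..26, built range by range
range_starts <- c(10, 20, 22)
range_width  <- 5
depth <- integer(26)
for (s in range_starts) {
    idx <- s:(s + range_width - 1)
    depth[idx] <- depth[idx] + 1L
}

## run-length encoding: store each value once, with its run length
compressed <- rle(depth)
compressed$lengths
compressed$values

## the encoding is lossless
identical(inverse.rle(compressed), depth)
```

Genome-scale coverage vectors typically contain long runs of identical values, so storing runs instead of individual positions saves a great deal of memory.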
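The goal is to count the number of reads overlapping exons grouped into genes. This type of count data is the basic input for RNA-seq differential expression analysis, e.g., through DESeq2 and edgeR.
Identify the regions of interest. We use a 'TxDb' package with gene models already defined.
    require(TxDb.Hsapiens.UCSC.hg19.knownGene)
    exByGn <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
    ## only chromosome 14
    ## (force=TRUE as in the 2014 release; later releases use pruning.mode="coarse")
    seqlevels(exByGn, force=TRUE) = "chr14"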
Identify the sample BAM files.
    require(RNAseqData.HNRNPC.bam.chr14)
    length(RNAseqData.HNRNPC.bam.chr14_BAMFILES)

    ## [1] 8
Summarize overlaps, optionally in parallel.
    ## next 2 lines optional; non-Windows
    library(BiocParallel)
    register(MulticoreParam(workers=parallel::detectCores()))
    olaps <- summarizeOverlaps(exByGn, RNAseqData.HNRNPC.bam.chr14_BAMFILES)
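Explore our handiwork, e.g., library sizes (column sums), the relationship between gene length and number of mapped reads, etc.
    olaps

    ## class: SummarizedExperiment
    ## dim: 779 8
    ## exptData(0):
    ## assays(1): counts
    ## rownames(779): 10001 100113389 ... 9950 9985
    ## rowData metadata column names(0):
    ## colnames(8): ERR127306 ERR127307 ... ERR127304 ERR127305
    ## colData names(0):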
    head(assay(olaps))

    ##           ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303
    ## 10001           103       139       109       125       152       168
    ## 100113389         0         0         0         0         0         0
    ## 100113391         0         0         0         0         0         0
    ## 100124539         0         0         0         0         0         0
    ## 100126297         0         0         0         0         0         0
    ## 100126308         0         0         0         0         0         0
    ##           ERR127304 ERR127305
    ## 10001           181       150
    ## 100113389         0         0
    ## 100113391         0         0
    ## 100124539         0         0
    ## 100126297         0         0
    ## 100126308         0         0
    colSums(assay(olaps))                # library sizes

    ## ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304
    ##    340646    373268    371639    331518    313800    331135    331606
    ## ERR127305
    ##    329647
    plot(sum(width(olaps)), rowMeans(assay(olaps)), log="xy")

    ## Warning: 252 y values <= 0 omitted from logarithmic plot
As an advanced exercise, investigate the relationship between GC content and read count.
    require(BSgenome.Hsapiens.UCSC.hg19)
    sequences <- getSeq(BSgenome.Hsapiens.UCSC.hg19, rowData(olaps))
    gcPerExon <- letterFrequency(unlist(sequences), "GC")
    GC <- relist(as.vector(gcPerExon), sequences)
    gc_percent <- sum(GC) / sum(width(olaps))
    plot(gc_percent, rowMeans(assay(olaps)), log="y")

    ## Warning: 252 y values <= 0 omitted from logarithmic plot