表观基因组学2014
作者:马丁·摩根(mtmorgan@fhcrc.org)
日期:2014年8月24日
输入和操作:Biostrings
>NM_078863_up_2000_chr2L_16764737_f chr2L:16764737-16766736 gttggtggcccaccagtgccaaaatacacaagaaaacagcatctt gacactaaaatgcaaaaattgtttgcgtcaatgactcaaaacgaaaatatg…atgggtatcaagttgccccgtataaaaggcaagttaccggttgcacggt >NM_001201794_up_2000_chr2L_8382455_f chr2L:8382455-8384454 ttatatgtaggcgcccgttcccgcagccaaagcactcagaattccggg cgtgtagcgcaacgaccatacaaggcaatattttgcgcttagg…
输入和操作:ShortReadreadFastq ()
,FastqStreamer ()
,FastqSampler ()
@err127302.1703 hwi-eas350_0441:1:1:1460:19184 #0/1 cctgagtgaagctgatcttagagagagatagatcttgatcgtcgaggagatgctgaccttgacct + hhghghhhhhhhhdgg < gdgge@gdggd b8 ?? adad < be@ee8egdga3cb85 *,77@ bb0 > ce ?=896=: @err127302.1704 wi - eas350_041:1:1:14 60:16861#0/1 gcggtatgctggaaggtgctcgaatggagagcgccagcgccccggcgctgagccgccgccgccgccc + de ? dd > ed4 > eee > de8eeede8b ? eb <@3; ba79 ?, 881b ?@73;1?########################
例子
#生物序列数据(phiX174Phage) #样本数据,见?phiX174Phage phiX174Phage
##长度为6的DNAStringSet实例## width seq names ## [1] 586 GAGTTTTATCGCTTCCATGAC…ATTGGCGTATCCAACCTGCA Genbank ## [2] 586 GAGTTTTATCGCTTCCATGAC…ATTGGCGTATCCAACCTGCA RF70s ## [3] 586 GAGTTTTATCGCTTCCATGAC…Attggcgtatccaacctgca ss78 ## [4] 586 gagttttatcgcttccatgac…ATTGGCGTATCCAACCTGCA公牛## [5]586 GAGTTTTATCGCTTCCATGAC…Attggcgtatccaacctgca g97 ## [6] 586 gagttttatcgcttccatgac…ATTGGCGTATCCAACCTGCA NEB03
m <- consensusMatrix(phiX174Phage)[1:4,] #细胞核。x位置计数多态<- which(colsum (m != 0) > 1) m[,多态]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] ## a 4 5 4 3 0 0 5 2 0 ## c 0 0 0 0 5 1 0 0 5 ## g 2 1 2 3 0 0 0 1 4 0 ## t 0 0 0 0 1 5 0 0 1 1 5 0 0 1 1 0 0 1
showMethods(类类(phiX174Phage) =, =搜索())
装饰图案(包=“Biostrings”)
.添加另一个参数到装饰图案
函数查看“BiostringsQuickOverview”小插图。下面的代码加载一些示例数据,phiX174Phage基因组的6个版本作为DNAStringSet对象。
库(Biostrings)数据(phiX174Phage)
解释以下代码的功能及其工作原理
m <- consensusMatrix(phiX174Phage)[1:4,]多态性<- which(colsum (m != 0) > 1) mapply(substr,多态性,多态性,MoreArgs=list(x=phiX174Phage))
# #[1][2][3][4][5][6][7][8][9] # #基因库“G”“G”“”“”“C”“C”“A”“G”“C”# # RF70s“”“”“”“G”“C”“T”“A”“G”“C”# # SS78“”“”“”“G”“C”“T”“A”“G”“C”# #牛“G”“A”“G”“A”“C”“T”“”“”“T”# # G97“A”“A”“G”“A”“C”“T”“G”“A”“C”# # NEB03“”“”“”“G”“T”“T”“A”“G”“C”
输入和操作:“低级”Rsamtools,scanBam ()
,BamFile ()
;“高级”GenomicAlignments
头
@HD VN:1.0 SO:coordinate @SQ SN:chr1 LN:249250621 @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516…@SQ SN:chrY LN:59373566 @PG ID:TopHat VN:2.0.8b CL:/home/hpages/ TopHat -2.0.8b。Linux_x86_64/tophat——match -inner-dist 150——solexa-quals——max-multihits 5——no-discordant——no-mixed——covere -search——microexon-search——library-type fr- unxed——num-threads 2——output-dir tophat2_out/ERR127306 /home/hpages/ bowti_2 -2.1.0/indexes/hg19 fastq/ERR127306_1。fastq fastq / ERR127306_2.fastq
对齐:ID,标志,对齐和配偶
ERR127306.7941162 403 chr14 19653689 3 72M = 19652348 -1413…ERR127306.22648137 145 chr14 19653692 1 72M = 19650044 -3720…ERR127306.933914 339 chr14 19653707 1 66M120N6M = 19653686 -213…ERR127306.11052450 83 chr14 19653707 3 66M120N6M = 19652348 -1551…ERR127306.24611331 147 chr14 19653708 1 65M120N7M = 19653675 -225…ERR127306.2698854 419 chr14 19653717 0 56M120N16M = 19653935 290…ERR127306.2698854 163 chr14 19653717 0 56M120N16M = 19653935 2019…
对齐:顺序和质量
...GAATTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCC *'%%%%%#&&%''#'&%%%)&&%%$%%'%%'&*****$))$)'')'%)))&)%%%%$'%%%%&"))'')%)) ...TTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAG '**)****)*'*&*********('&)****&***(**')))())%)))&)))*')&***********)**** ...TGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAGCAGCCTCTGGTTTCT '******&%)&)))&")')'')'*((******&)&'')'))$))'')&))$)**&&**************** ...TGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAGCAGCCTCTGGTTTCT ##&&(#')$')'%&)%$#$%"%###&!%))'%%''%'))&))#)&%((%())))%)%)))%********* ...GAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAGCAGCCTCTGGTTTCTT )&$'$'$%!&&%&!'%'))%''&%'&))))''$""'%'%&%'#'%'"!'')#&)))))%$)%)&'"'))) ...TTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAGCAGCCTCTGGTTTCTTCATGTGGCT ++++++++++++++++++++++++++++++++++++++*++++++**++++**+**''**+*+*'*)))*)# ...TTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAGCAGCCTCTGGTTTCTTCATGTGGCT ++++++++++++++++++++++++++++++++++++++*++++++**++++**+**''**+*+*'*)))*)#
阵营:标签
...AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU NH:i:2 CC:Z:chr22 CP:i:16189276 HI:i:0…AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU NH:i:3 CC:Z:= CP:i:19921600 HI:i:0…AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:4 MD:Z:72 YT:Z:UU XS:A:+ NH:i:3 CC:Z:= CP:i:19921465 HI:i:0…AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:4 MD:Z:72 YT:Z:UU XS:A:+ NH:i:2 CC:Z:chr22 CP:i:16189138 HI:i:0…AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:5 MD:Z:72 YT:Z:UU XS:A:+ NH:i:3 CC:Z:= CP:i:19921464 HI:i:0…:我:0 XM:我:0 XO:我:0 XG:我:0 MD: Z: 72海里:我:0 XS:答:+ NH:我:5答:Z: = CP:我:19653717你好:我:0…AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:72 NM:i:0 XS:A:+ NH:i:5 CC:Z:= CP:i:19921455 HI:i:1
输入和操作:VariantAnnotationreadVcf ()
,readInfo ()
,readGeno ()
选择性地与ScanVcfParam ()
.
头
##fileformat=VCFv4.2 ##fileDate=20090805 ##source=myImputationProgramV3.1 ##reference=file:// seq/references/ 1000genome espil洛特- ncbi36。fasta# # contg = ##phasing=partial ##INFO= ##INFO=…##FILTER= #FILTER=…# #格式= < ID = GT,数量= 1 =字符串类型,描述=“基因型”> # #格式= < ID =《GQ》、数量= 1 =整数类型,描述=“基因型质量”>
位置
# chrom pos id ref Alt qual filter…20 14370 rs6054257 G A 29 PASS…17330年20。T A 3 q10…20 1110696 rs6040355 A G,T 67 PASS…20 1230237。T。47传递……20 1234567 microsat1 GTC G,GTCT 50 PASS…
变异信息
#铬POS……信息…14370……NS = 3; DP = 14;房颤= 0.5;数据库;H2…17330……NS = 3; DP = 11;房颤= 0.017……20 1110696…NS = 2, DP = 10;房颤= 0.333,0.667;AA = T; DB……20 1230237…NS = 3; DP = 13; AA = T… 20 1234567 ... NS=3;DP=9;AA=G ...
基因型格式和样本
...POS……格式为na00001 na00002 na00003…14370年……GT:《GQ》:DP:总部0 | 0:48:1:51,51 1 | 0:48:8:51,51 1/1:43:5:,……17330年……GT:《GQ》:DP:总部0 | 0:49:3:58,50 0 | 1:3:5:65,3 0/0:41:3…1110696……GT:《GQ》:DP:总部1 | 2:21:6:23,27日2 | 1:2:0:18,2 2/2:35:4…1230237…… GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2 ... 1234567 ... GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
输入:rtracklayer进口()
GTF:基因模型
组件的坐标
7蛋白编码基因27221129 27224842。- . ... ...7蛋白编码转录本27221134 27224835。- - - - - -……7蛋白编码外显子27224055 27224835。- - - - - -……7蛋白编码CDS 27224055 27224763。- 0…7蛋白编码起始密码子27224761 27224763。- 0… 7 protein_coding exon 27221134 27222647 . - . ... 7 protein_coding CDS 27222418 27222647 . - 2 ... 7 protein_coding stop_codon 27222415 27222417 . - 0 ... 7 protein_coding UTR 27224764 27224835 . - . ... 7 protein_coding UTR 27221134 27222414 . - . ...
注释
gene_id“ENSG00000005073”;gene_name“HOXA11”;gene_source“ensembl_havana”;gene_biotype“protein_coding”;……transcript_id“ENST00000006015”;transcript_name“hoxa11 - 001”;transcript_source“ensembl_havana”;标记“ccd”;ccds_id“CCDS5411”; ... exon_number "1"; exon_id "ENSE00001147062"; ... exon_number "1"; protein_id "ENSP00000006015"; ... exon_number "1"; ... exon_number "2"; exon_id "ENSE00002099557"; ... exon_number "2"; protein_id "ENSP00000006015"; ... exon_number "2"; ...
范围是:
许多常见的生物学问题都是基于范围的
的GenomicRanges包定义基本类和方法
农庄
GRangesList
范围
start ()
/结束()
/宽度()
长度()
、子集等。mcols ()
Seqinfo
,包括seqlevels
而且seqlengths
Intra-range方法
转变()
,狭窄的()
,侧面()
,发起人()
,调整()
,限制()
,削减()
" ? intra-range-methods
Inter-range方法
range ()
,reduce ()
,空白()
,分离()
覆盖()
(!)" ? inter-range-methods
Between-range方法
findOverlaps ()
,countOverlaps ()
、……% / %
,%在%
,% %外
;联盟()
,相交()
,setdiff ()
,punion ()
,pintersect ()
,psetdiff ()
例子
require(genome icranges) gr <- GRanges("A", IRanges(c(10,20,22), width=5), "+") shift(gr, 1) # 1-based coordinate !
有3个范围和0个元数据列的GRanges: ## seqnames ranges strand ## ## [1] A [11,15] + ## [2] A [21,25] + ## [3] A[23,27] + ##——## seqlength: ## a# # NA
范围(gr) # intra-range
有1个范围和0个元数据列的GRanges: ## seqnames ranges strand ## ## [1] A[10,26] + ##——## seqlength: ## a# # NA
减少(gr) # inter-range
有2个范围和0个元数据列的GRanges: ## seqnames ranges strand ## ## [1] A [10,14] + ## [2] A[20,26] + ##——## seqlength: ## a# # NA
覆盖(gr)
长度为1的RleList ## $A ## integer-长度为26的rle, 6次运行##长度:9 5 5 2 3 2 ##值:0 1 0 1 2 1
Setdiff (range(gr), gr) # '内含子'
有1个范围和0个元数据列的GRanges: ## seqnames ranges strand ## ## [1] A[15,19] + ##——## seqlength: ## a# # NA
IRangesList, GRangesList
许多*列表感知方法,但一个常见的“技巧”:对未列出的表示应用向量化函数,然后重新列出
GRangesList(…)orig_gr <- unlist(grl) transformed_gr <- FUN(orig) transformed_grl <- relist(, grl)
参考
类
方法- - - - - -
reverseComplement ()
letterFrequency ()
matchPDict ()
,matchPWM ()
相关的包
例子
BSgenome
包。下面计算chr14的GC内容。require(BSgenome.Hsapiens.UCSC.hg19) chr14_range = GRanges("chr14", IRanges(1, seqlength(Hsapiens)["chr14"])) chr14_dna <- getSeq(Hsapiens, chr14_range) letterFrequency(chr14_dna, "GC", as.prob=TRUE)
## g | c# [1,] 0.3363
类-类似基因组范围的行为
方法
readGAlignments ()
,readGAlignmentsList ()
summarizeOverlaps ()
例子
需要(GenomicRanges)要求(GenomicAlignments)
##加载所需的包:Rsamtools ## ##附加包:' genome icalignments ' ## ##以下对象从'package:locfit': ## ##被屏蔽
require(Rsamtools) ## our 'region of interest' roi <- GRanges("chr14", IRanges(19653773, width=1)) ## sample data require('RNAseqData.HNRNPC.bam.chr14')
##加载所需的包:RNAseqData.HNRNPC.bam.chr14
男朋友< - BamFile (RNAseqData.HNRNPC.bam。chr14_BAMFILES[[1]], asMates=TRUE) ##对齐,连接,重叠我们的roi paln <- readGAlignmentsList(bf) j <- summarizejunction (paln, with.revmap=TRUE) j_overlap <- j[j %over% roi] ##支持读取paln[j_overlap$revmap[[1]]]
## GAlignmentsList长度8:## [[1]]## GAlignments with 2 alignments and 0 metadata columns: ## seqnames strand cigar qwidth start end width njunc ## [1] chr14 - 66M120N6M 72 19653707 19653898 192 1 ## [2] chr14 + 7M1270N65M 72 19652348 19653689 1342 1 ## ## [[2]] ## GAlignments with 2 alignments and 0 metadata column:## seqnames strand雪茄qwidth start end width njunc ## [1] chr14 - 66M120N6M 72 19653707 19653898 192 1 ## [2] chr14 + 72M 72 19653686 19653757 72 0 ## ## [[3]] # GAlignments with 2对齐和0元数据列:## seqnames strand雪茄qwidth start end width njunc ## [1] chr14 + 72M 72 19653675 19653746 72 0 ## [2] chr14 - 65M120N7M 72 19653708 19653899 192 1 ## ##…## <5 more elements> ##——## seqlength:# # chr1 chr10…chrY ## 249250621 135534747…59373566
类-类似基因组范围的行为
函数和方法
readVcf ()
,readGeno ()
,readInfo ()
,readGT ()
,writeVcf ()
,filterVcf ()
locateVariants ()
(变体重叠范围),predictCoding ()
,summarizeVariants ()
genotypeToSnpMatrix ()
,snpSummary ()
例子
##输入变量需要(VariantAnnotation) fl <- system。##已知基因模型require(txdb . hsapens . ucsc .hg19. knowngene) coding <- locateVariants(rowData(vcf), txdb . hsapens . ucsc .hg19. hg19. txt)knownGene CodingVariants())头(编码)
## grange包含6个范围和7个元数据列:## seqnames ranges strand | LOCATION QUERYID TXID ## | ## [1] chr22 [50301422,50301422] - | coding 24 75253 ## [2] chr22 [50301476,50301476] - | coding 25 75253 ## [3] chr22 [50301488,50301488] - | coding 26 75253 ## [4] chr22 [50301484,50301494] - | coding 27 75253 ## [5] chr22 [50301584,50301584] - | coding 28 75253 ## [6] chr22 [50302962,50302962] - | coding 57 75253 ## CDSID GENEID preferdeid FOLLOWID ## ## [1] 218562 79087 ## [2] 218562 79087 ## [3] 218562 79087 ## [4] 218562 79087 ## [5] 218562 79087 ## [6] 218563 79087 ## # seqlength: ## chr22 ## NA
相关的包
参考
限制
ScanBamParam ()
限制在特定基因组范围输入所需数据迭代
yieldSize
的观点BamFile ()
,或FastqStreamer ()
允许在大文件中迭代。压缩
Rle
(行程长度编码)类GRangesList
有效地维持了向量元素被分组的错觉。并行处理
参考
目的是计算基因中重叠外显子的读数。这种类型的计数数据是RNASeq差分表达式分析的基本输入,例如通过DESeq2而且刨边机.
exByGn <- exonsBy(TxDb.Hsapiens.UCSC.hg19.)## only染色体14 seqlevels(exByGn, force=TRUE) = "chr14"
需要(RNAseqData.HNRNPC.bam.chr14)长度(RNAseqData.HNRNPC.bam.chr14_BAMFILES)
# # 8 [1]
##下两行可选;非windows库(BiocParallel)寄存器(MulticoreParam(workers=detectCores())) olaps <- summarizeOverlaps(exByGn, RNAseqData.HNRNPC.bam.chr14_BAMFILES)
olap
## class: summarizeexperiment ## dim: 779 8 # exptData(0): ## assays(1): counts ## rownames(779): 10001 100113389…9950 9985 ## rowData元数据列名称(0):## colnames(8): ERR127306 ERR127307…ERR127304 ERR127305 ## colData名称(0):
(分析(olap))
# # ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 # # 10001 103 139 109 125 152 168 # # 100113389 0 0 0 0 0 0 # # 100113391 0 0 0 0 0 0 # # 100124539 0 0 0 0 0 0 # # 100126297 0 0 0 0 0 0 # # 100126308 0 0 0 0 0 0 # # ERR127304 ERR127305 # # # # 10001 181 150 100113389 0 0 # # 100113389 0 0 # # 100124539 0 0 # # 100126297 0 0 # # 100126308 0 0
#库大小
## err127306 err127307 err127308 err127309 err127302 err127303 err127304 ## 340646 373268 371639 331518 313800 331135 331606 ## err127305 ## 329647
情节(总和(宽度(olap)), rowMeans(分析(olap)),日志=“xy”)
##警告:对数图中省略了252个y值<= 0
require(BSgenome.Hsapiens.UCSC.hg19) sequences <- getSeq(BSgenome.Hsapiens.UCSC. bsgenome .hg19) sequences <-hg19, rowData(olaps)) gcPerExon <- letterFrequency(unlist(sequences), "GC") GC <- relist(as.vector(gcPerExon), sequences) gc_percent <- sum(GC) / sum(width(olaps)) plot(gc_percent, rowMeans(assay(olaps)), log="y")
##警告:对数图中省略了252个y值<= 0