genomation
: a toolkit for annotation and visualization of genomic data<img style="float: right;padding-left:9px" src="https://raw.githubusercontent.com/BIMSBbioinfo/genomation/master/vignettes/Figures/genomationFlowChart1.png", width=350/> genomation
is a toolkit for annotation and in bulk visualization of genomic features (scored or unscored) over predefined regions. The genomic features which the package can handle can be anything with a minimal information of chromosome, start and end. The features could have any length and most of the time they are associated with a score. Typical examples of such data sets include aligned
reads from high-throughput sequencing (HTS) experiments, percent methylation values for CpGs (or other cytosines), locations of transcription factor binding, and so on. On the other hand, throughout the vignette we use the phrase "genomic annotation" to refer to the regions of the genome associated with a potential function which does not necessarily have a score (examples: CpG islands, genes, enhancers, promoter, exons, etc. ). These genomic annotations are usually the regions of interest, and distribution of genomic features over/around the annotations are can make the way for biological interpretation of the data. The pipeline for computational knowledge extraction consists of three steps: data filtering, integration of data from multiple sources or generation of predictive models and biological interpretation of produced models, which leads to novel hypotheses that can be tested in the wetlab.genomation
aims to facilitate the integration of multiple sources of genomic features with genomic annotation or already published experimental results.
High-throughput data which will be used to show the functonality of the genomation
are located in two places. The annotation and cap analysis of gene expression (CAGE) data comes prepared with the genomation
package, while the raw HTS data can be found in the sister package genomationData
. To install the genomation
and the complementary data package the from the Bioconductor repository, copy and paste the following lines into your R interpreter:
biocLite('genomationData')
biocLite('genomation')
list.files(system.file('extdata',package='genomationData'))
## [1] "H1.Bisulfite-Seq.combined.chr21.bedGraph.gz"
## [2] "SamplesInfo.txt"
## [3] "wgEncodeBroadHistoneH1hescCtcfStdAlnRep1.chr21.bam"
## [4] "wgEncodeBroadHistoneH1hescCtcfStdAlnRep1.chr21.bam.bai"
## [5] "wgEncodeBroadHistoneH1hescCtcfStdPk.broadPeak.gz"
## [6] "wgEncodeBroadHistoneH1hescP300kat3bAlnRep1.chr21.bam"
## [7] "wgEncodeBroadHistoneH1hescP300kat3bAlnRep1.chr21.bam.bai"
## [8] "wgEncodeBroadHistoneH1hescP300kat3bPk.broadPeak.gz"
## [9] "wgEncodeBroadHistoneH1hescSuz12051317AlnRep1.chr21.bam"
## [10] "wgEncodeBroadHistoneH1hescSuz12051317AlnRep1.chr21.bam.bai"
## [11] "wgEncodeBroadHistoneH1hescSuz12051317Pk.broadPeak.gz"
## [12] "wgEncodeHaibTfbsA549Rad21V0422111RawRep1.chr21.bw"
## [13] "wgEncodeHaibTfbsH1hescRad21V0416102AlnRep1.chr21.bam"
## [14] "wgEncodeHaibTfbsH1hescRad21V0416102AlnRep1.chr21.bam.bai"
## [15] "wgEncodeHaibTfbsH1hescRad21V0416102PkRep1.broadPeak.gz"
## [16] "wgEncodeRikenCageA549CellPapAlnRep1.chr21.bam"
## [17] "wgEncodeRikenCageA549CellPapAlnRep1.chr21.bam.bai"
## [18] "wgEncodeSydhTfbsH1hescZnf143IggrabAlnRep1.chr21.bam"
## [19] "wgEncodeSydhTfbsH1hescZnf143IggrabAlnRep1.chr21.bam.bai"
## [20] "wgEncodeSydhTfbsH1hescZnf143IggrabPk.narrowPeak.gz"
sampleInfo = read.table(system.file("extdata/SamplesInfo.txt", package = "genomationData"),
header = TRUE, sep = "\t")
sampleInfo[1:5, 1:5]
## fileName dataOrigin
## 1 wgEncodeBroadHistoneH1hescCtcfStdAlnRep1.chr21.bam Encode
## 2 wgEncodeBroadHistoneH1hescP300kat3bAlnRep1.chr21.bam Encode
## 3 wgEncodeBroadHistoneH1hescSuz12051317AlnRep1.chr21.bam Encode
## 4 wgEncodeHaibTfbsH1hescRad21V0416102AlnRep1.chr21.bam Encode
## 5 wgEncodeSydhTfbsH1hescZnf143IggrabAlnRep1.chr21.bam Encode
## cellType sampleName experimentType
## 1 H1hesc Ctcf ChipSeq
## 2 H1hesc P300 ChipSeq
## 3 H1hesc Suz12 ChipSeq
## 4 H1hesc Rad21 ChipSeq
## 5 H1hesc Znf143 ChipSeq
genomation
package. The data can be accesed throught the data command or located in the extdata folder.
library(genomation)
data(cage)
data(cpgi)
list.files(system.file("extdata", package = "genomation"))
One of larger hindrances in computational genomics stems from the myriad of formats that are used to store the data. Although some formats have been selected as de facto standards for specific kind of biological data (e.g. BAM, VCF), almost all publications come with supplementary tables that do not have the same structure, but hold similar information. The tables usually have a tabular format, contain the locationof elements in genomic coordinates and various metadata colums. genomation
contais functions to read genomic features and genomic annotation provided they are in a tabular format. These functions will read the data from flat files into GRanges or GRangesList objects.
genomation
package. It is a function developed specifically for input of genomic data in tabular formats, and their conversion to a GRanges object. By default, the function persumes that the file is a standard .bed file containing columns chr, start, end.
library(genomation)
tab.file1 = system.file("extdata/tab1.bed", package = "genomation")
readGeneric(tab.file1, header = TRUE)
## GRanges object with 6 ranges and 0 metadata columns:
## seqnames ranges strand
##
## [1] chr21 [9437272, 9439473] *
## [2] chr21 [9483485, 9484663] *
## [3] chr21 [9647866, 9648116] *
## [4] chr21 [9708935, 9709231] *
## [5] chr21 [9825442, 9826296] *
## [6] chr21 [9909011, 9909218] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
readGeneric(tab.file1, header = TRUE, keep.all.metadata = TRUE)
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand | cpgNum gcNum perCpg
## |
## [1] chr21 [9437272, 9439473] * | 285 1426 25.9
## [2] chr21 [9483485, 9484663] * | 165 818 28
## [3] chr21 [9647866, 9648116] * | 18 168 14.4
## [4] chr21 [9708935, 9709231] * | 31 218 20.9
## [5] chr21 [9825442, 9826296] * | 120 568 28.1
## [6] chr21 [9909011, 9909218] * | 20 143 19.3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
readGeneric(tab.file1, header = TRUE, meta.col = list(CpGnum = 4))
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | CpGnum
## |
## [1] chr21 [9437272, 9439473] * | 285
## [2] chr21 [9483485, 9484663] * | 165
## [3] chr21 [9647866, 9648116] * | 18
## [4] chr21 [9708935, 9709231] * | 31
## [5] chr21 [9825442, 9826296] * | 120
## [6] chr21 [9909011, 9909218] * | 20
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
readGeneric(tab.file1, header = TRUE, keep.all.metadata = TRUE)
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand | cpgNum gcNum perCpg
## |
## [1] chr21 [9437272, 9439473] * | 285 1426 25.9
## [2] chr21 [9483485, 9484663] * | 165 818 28
## [3] chr21 [9647866, 9648116] * | 18 168 14.4
## [4] chr21 [9708935, 9709231] * | 31 218 20.9
## [5] chr21 [9825442, 9826296] * | 120 568 28.1
## [6] chr21 [9909011, 9909218] * | 20 143 19.3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
tab.file2 = system.file("extdata/tab2.bed", package = "genomation")
readGeneric(tab.file2, chr = 3, start = 2, end = 1)
## GRanges object with 6 ranges and 0 metadata columns:
## seqnames ranges strand
##
## [1] chr21 [9437272, 9439473] *
## [2] chr21 [9483485, 9484663] *
## [3] chr21 [9647866, 9648116] *
## [4] chr21 [9708935, 9709231] *
## [5] chr21 [9825442, 9826296] *
## [6] chr21 [9909011, 9909218] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
gff.file = system.file("extdata/chr21.refseq.hg19.gtf", package = "genomation")
gff = gffToGRanges(gff.file)
head(gff)
## GRanges object with 6 ranges and 6 metadata columns:
## seqnames ranges strand | source type score
## |
## [1] chr21 [9825832, 9826011] + | hg19_refGene exon 0
## [2] chr21 [9826203, 9826263] + | hg19_refGene exon 0
## [3] chr21 [9907189, 9907492] - | hg19_refGene exon 0
## [4] chr21 [9909047, 9909277] - | hg19_refGene exon 0
## [5] chr21 [9966322, 9966380] - | hg19_refGene exon 0
## [6] chr21 [9968516, 9968593] - | hg19_refGene exon 0
## phase gene_id transcript_id
##
## [1] NR_037421 NR_037421
## [2] NR_037458 NR_037458
## [3] NR_038328 NR_038328
## [4] NR_038328 NR_038328
## [5] NR_038328 NR_038328
## [6] NR_038328 NR_038328
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
gff = gffToGRanges(gff.file, split.group = TRUE)
head(gff)
# reading genes stored as a BED files
cpgi.file = system.file("extdata/chr21.CpGi.hg19.bed", package = "genomation")
cpgi.flanks = readFeatureFlank(cpgi.file)
head(cpgi.flanks$flanks)
## GRanges object with 6 ranges and 0 metadata columns:
## seqnames ranges strand
##
## [1] chr21 [9823442, 9825441] *
## [2] chr21 [9826297, 9828296] *
## [3] chr21 [9907011, 9909010] *
## [4] chr21 [9909219, 9911218] *
## [5] chr21 [9966264, 9968263] *
## [6] chr21 [9968621, 9970620] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
A standard step in a computational genomics experiment is visualization of average enrichment over a certain predefined set of ranges, such as mean coverage of a certain histone modification around a transcription factor binding site, or visualization of histone positions around transcription start sites.
ScoreMatrix and ScoreMatrixBin are functions used to extract data over predefined windows.
ScoreMatrix is used when all of the windows have the same width, such as a designated area around the transcription start site, while the ScoreMatrixBin is designed for use with windows of unequal width (e.g. enrichment of methylation over exons).
Both functions have 2 main arguments: target and windows. target is the data that we want to extract, while the windows represents the regions over which we want to see the enrichment. The target data can be in 3 forms: a GRanges, a RLeList or a path to an indexed .bam file. The windows must be GRanges object.
As an example we will extract the density of cage tags around the promoters on the human chromosome 21.data(cage)
data(promoters)
sm = ScoreMatrix(target = cage, windows = promoters)
sm
data(cage)
gff.file = system.file("extdata/chr21.refseq.hg19.gtf", package = "genomation")
exons = gffToGRanges(gff.file, filter = "exon")
sm = ScoreMatrixBin(target = cage, windows = exons, bin.num = 50)
sm
Running ScoreMatrixBin with bin.num=50 on a set of exons warned us that some of the exons are shorter than 50 bp and were thus removed from the set before binning. The rownames of the resulting ScoreMatrix object correspond to the ranges that were used to construct the windows (e.g. row name 10 means that the 10th element in the target GRanges object was used to extract the data). If a certain rowname is not present in the ScoreMatrix object, that means that the corresponding range was filtered out (e.g. the range could have been on a chromosome that was not present in the target).
To simultaneously work on multiple files you can use the ScoreMatrixList function. The function also has 2 obligatory arguments targets and windows. While the windows is the same as in ScoreMatrix, the targets argument contains results from multiple experiments. It can be in one of the three formats: a list of RleLists, a list of GRanges (or a GRangesList object), or a character vector designating a set of .bam or .bigWig files.data(promoters)
data(cpgi)
data(cage)
cage$tpm = NULL
targets = list(cage = cage, cpgi = cpgi)
sm = ScoreMatrixList(targets = targets, windows = promoters, bin.num = 50)
sm
If all of the windows have the same width ScoreMatrixList will use the ScoreMatrix function. That can be overridden by explicitly specifying the bin.num argument, as we did in the example.
There are 2 basic modes of visualization of enrichment over windows: either as a heatmap, or as a histogram. heatMatrix, plotMeta and multiHeatMatrix are functions for visualization of ScoreMatrix and ScoreMatrixList objects.
We will plot the distribution of CAGE tags around promoters on human chr21.
data(cage)
data(promoters)
sm = ScoreMatrix(target = cage, windows = promoters)
heatMatrix(sm, xcoords = c(-1000, 1000))
plotMeta(sm, xcoords = c(-1000, 1000))
<img src="https://raw.githubusercontent.com/BIMSBbioinfo/genomation/master/vignettes/Figures/heatMatrix1-1.png", width=400> <img src="https://raw.githubusercontent.com/BIMSBbioinfo/genomation/master/vignettes/Figures/heatMatrix1-2.png", width=400>
library(GenomicRanges)
data(cage)
data(promoters)
data(cpgi)
sm = ScoreMatrix(target = cage, windows = promoters, strand.aware = TRUE)
cpg.ind = which(countOverlaps(promoters, cpgi) > 0)
nocpg.ind = which(countOverlaps(promoters, cpgi) == 0)
heatMatrix(sm, xcoords = c(-1000, 1000), group = list(CpGi = cpg.ind, noCpGi = nocpg.ind))
sm.scaled = scaleScoreMatrix(sm)
heatMatrix(sm.scaled, xcoords = c(-1000, 1000), group = list(CpGi = cpg.ind,
noCpGi = nocpg.ind))
cage$tpm = NULL
targets = list(cage = cage, cpgi = cpgi)
sml = ScoreMatrixList(targets = targets, windows = promoters, bin.num = 50,
strand.aware = TRUE)
multiHeatMatrix(sml, xcoords = c(-1000, 1000))
multiHeatMatrix(sml, xcoords = c(-1000, 1000), clustfun = function(x) kmeans(x,
centers = 2)$cluster)
More advance usage of the ScoreMatrix family of functions and their visualtization can be found in the specific use-cases at the end of the vignette.
Searching for correlation between sets of genomic features is a standard exploratory method in computational genomics. It is usually done by looking at the overlap between 2 or more sets of ranges and calculating various overlap statistics. genomation
contains two sets of functions for annotation of ranges: the first one is used to facilitate the general annotation of any sets of ranges, while the second one is used to annotate a given feature with gene structures (promoter, exon, intron).
Firstly, we will select the broadPeak files from the genomatonData
package, and read in the peaks for the Ctcf transcription factor
library(genomationData)
genomationDataPath = system.file("extdata", package = "genomationData")
sampleInfo = read.table(file.path(genomationDataPath, "SamplesInfo.txt"), header = TRUE,
sep = "\t", stringsAsFactors = FALSE)
peak.files = list.files(genomationDataPath, full.names = TRUE, pattern = "broadPeak")
names(peak.files) = sampleInfo$sampleName[match(basename(peak.files), sampleInfo$fileName)]
ctcf.peaks = readBroadPeak(peak.files["Ctcf"])
data(cpgi)
peak.annot = annotateWithFeature(ctcf.peaks, cpgi, intersect.chr = TRUE)
peak.annot
## cpgi other
## 7.62 92.38
## [1] 36.94
The output of the annotateWithFeature function shows three types of information: The total number of elements in the target dataset, the percentage of target dataset that overlaps with the feature dataset. And the percentage of the feature elements that overlap the target.
bed.file = system.file("extdata/chr21.refseq.hg19.bed", package = "genomation")
gene.parts = readTranscriptFeatures(bed.file)
ctcf.annot = annotateWithGeneParts(ctcf.peaks, gene.parts, intersect.chr = TRUE)
ctcf.annot
## promoter exon intron intergenic
## 9.58 13.50 47.65 47.17
## promoter exon intron intergenic
## 9.58 7.91 35.34 47.17
## promoter exon intron
## 38.36 15.40 29.33
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 7977 28470 66060 74690 1195000
genomationData
, which we will then annotate.
peaks = GRangesList(lapply(peak.files, readGeneric))
names(peaks) = names(peak.files)
annot.list = annotateWithGeneParts(peaks, gene.parts, intersect.chr = TRUE)
plotGeneAnnotation(annot.list)
plotGeneAnnotation(annot.list, cluster = TRUE)
genomation
packageThe genomation
package provides generalizable functions for genomic data analysis and visualization. Below we will demonstrate the functionality on specific use cases
We will visualize the binding profiles of 6 transcription factors around the Ctcf binding sites.
In the fist step we will select the *.bam files containing mapped reads.genomationDataPath = system.file("extdata", package = "genomationData")
bam.files = list.files(genomationDataPath, full.names = TRUE, pattern = "bam$")
bam.files = bam.files[!grepl("Cage", bam.files)]
ctcf.peaks = readBroadPeak(file.path(genomationDataPath, "wgEncodeBroadHistoneH1hescCtcfStdPk.broadPeak.gz"))
ctcf.peaks = ctcf.peaks[seqnames(ctcf.peaks) == "chr21"]
ctcf.peaks = ctcf.peaks[order(-ctcf.peaks$signalValue)]
ctcf.peaks = resize(ctcf.peaks, width = 1000, fix = "center")
In order to extract the coverage values of all transcription factors around chipseq peaks, we will use the ScoreMatrixList function. ScoreMatrixList assign names to each element of the list based on the names of the bam files. We will use the names of the files to find the corresponding names of each sample in the SamplesInfo.txt. Using the heatmapProfile on our ScoreMatrixList, we can plot the underlying signal side by side.
sml = ScoreMatrixList(bam.files, ctcf.peaks, bin.num = 50, type = "bam")
sampleInfo = read.table(system.file("extdata/SamplesInfo.txt", package = "genomationData"),
header = TRUE, sep = "\t")
names(sml) = sampleInfo$sampleName[match(names(sml), sampleInfo$fileName)]
multiHeatMatrix(sml, xcoords = c(-500, 500), col = c("lightgray", "blue"))
Because of the large range of signal values in chipseq peaks, the heatmapProfile will not show the true extent of colocalization. To get around this, it is advisable to independently scale the rows of each element in the ScoreMatrixList.
sml.scaled = scaleScoreMatrixList(sml)
multiHeatMatrix(sml.scaled, xcoords = c(-500, 500), col = c("lightgray", "blue"))
genomationData
to get he names of the samples. Four of the peak files are in the Encode broadPeak format, while one is in the narrowPeak. To read the files, we will use the readGeneric function. It enables us to select from the files only columns of interest. As the last step, we will restrict ourselves to peaks that are located on chromosome 21 and have width 100 and 1000 bp
genomationDataPath = system.file("extdata", package = "genomationData")
sampleInfo = read.table(file.path(genomationDataPath, "SamplesInfo.txt"), header = TRUE,
sep = "\t", stringsAsFactors = FALSE)
peak.files = list.files(genomationDataPath, full.names = TRUE, pattern = "Peak.gz$")
peaks = list()
for (i in 1:length(peak.files)) {
file = peak.files[i]
name = sampleInfo$sampleName[match(basename(file), sampleInfo$fileName)]
message(name)
peaks[[name]] = readGeneric(file, meta.col = list(score = 5))
}
peaks = GRangesList(peaks)
peaks = peaks[seqnames(peaks) == "chr21" & width(peaks) < 1000 & width(peaks) >
100]
tf.comb = findFeatureComb(peaks, width = 1000)
tf.comb = tf.comb[order(tf.comb$class)]
bam.files = list.files(genomationDataPath, full.names = TRUE, pattern = "bam$")
bam.files = bam.files[!grepl("Cage", bam.files)]
sml = ScoreMatrixList(bam.files, tf.comb, bin.num = 20, type = "bam")
names(sml) = sampleInfo$sampleName[match(names(sml), sampleInfo$fileName)]
sml.scaled = scaleScoreMatrixList(sml)
multiHeatMatrix(sml.scaled, xcoords = c(-500, 500), col = c("lightgray", "blue"))
The plot shows perfectly how misleading the peak calling process can be. Although the plots show that CTFC, Rad21 and Znf143 have almost perfect colozalization, peak callers have trouble identifying peaks in regions with lower enrichments and as a result, we get different statistics when using overlaps.
We can also use data from AnnotationHub since it can return data as GRanges object. Below we download CpG island and DNAse peak locations from AnnotationHub and get a scoreMatrix on the CpG islands.
library(AnnotationHub)
ah = AnnotationHub()
# retrieve CpG island data from annotationHub
cpgi_query <- query(ah, c("CpG Islands", "UCSC", "hg19"))
cpgi <- ah[[names(cpgi_query)]]
dnase_query <- query(ah, c("wgEncodeOpenChromDnase8988tPk.narrowPeak.gz", "UCSC",
"hg19"))
dnaseP <- ah[[names(dnase_query)]]
sm = ScoreMatrixBin(target = dnaseP, windows = cpgi, strand.aware = FALSE)