biscuiteer 1.8.0
biscuiteer
is package to process output from
biscuit into
bsseq objects. It includes a number
of features, such as VCF header parsing, shrunken M-value calculations (which
can be used for compartment inference), and age inference are included. However,
the task of locus- and region-level differential methylation inference is
delegated to other packages (such as dmrseq
).
From Bioconductor,
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("biscuiteer")
A development version is available on GitHub and can be installed via:
biscuiteer
can load either headered of header-free BED files produced from
biscuit vcf2bed
or biscuit mergecg
. In either case, a VCF file is needed
when loading biscuit
output. For practical purposes, only the VCF header is
for biscuiteer
. However, it is encouraged that the user keep the entire VCF,
as biscuit
can be used to call SNVs and allows for structural variant
detection in a similar manner to typical whole-genome sequencing tools.
Furthermore, biscuit
records the version of the software and the calling
arguments used during processing the output VCF, which allows for better
reproducibility.
NOTE: Both the input BED and VCF files must be tabix’ed before being input to
biscuiteer
. This can be done by running bgzip biscuit_output.xxx
followed by
tabix -p xxx biscuit_output.xxx.gz
, where xxx
is either bed
or vcf
.
Data can be loaded using the readBiscuit
function in biscuiteer
:
## Loading required package: biscuiteerData
## Loading required package: ExperimentHub
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
## lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
## tapply, union, unique, unsplit, which.max, which.min
## Loading required package: AnnotationHub
## Loading required package: BiocFileCache
## Loading required package: dbplyr
## Loading biscuiteerData.
## Loading required package: bsseq
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
## The following object is masked from 'package:ExperimentHub':
##
## cache
## The following object is masked from 'package:AnnotationHub':
##
## cache
##
##
##
## Warning: replacing previous import 'BiocParallel::bpstart' by 'QDNAseq::bpstart'
## when loading 'biscuiteer'
orig_bed <- system.file("extdata", "MCF7_Cunha_chr11p15.bed.gz",
package="biscuiteer")
orig_vcf <- system.file("extdata", "MCF7_Cunha_header_only.vcf.gz",
package="biscuiteer")
bisc <- readBiscuit(BEDfile = orig_bed, VCFfile = orig_vcf,
merged = FALSE)
## Checking /tmp/RtmpvXBNuR/Rinst2960aa58aa20b1/biscuiteer/extdata/MCF7_Cunha_chr11p15.bed.gz for import...
## Extracting sample names from /tmp/RtmpvXBNuR/Rinst2960aa58aa20b1/biscuiteer/extdata/MCF7_Cunha_header_only.vcf.gz...
## /tmp/RtmpvXBNuR/Rinst2960aa58aa20b1/biscuiteer/extdata/MCF7_Cunha_chr11p15.bed.gz does not have a header. Using VCF file header information to help set column names.
## Assuming unmerged data. Checking now... ...The file might be alright. Double check if you're worried.
## /tmp/RtmpvXBNuR/Rinst2960aa58aa20b1/biscuiteer/extdata/MCF7_Cunha_chr11p15.bed.gz has 254147 indexed loci.
## /tmp/RtmpvXBNuR/Rinst2960aa58aa20b1/biscuiteer/extdata/MCF7_Cunha_chr11p15.bed.gz looks valid for import.
## Reading unmerged input from /tmp/RtmpvXBNuR/Rinst2960aa58aa20b1/biscuiteer/extdata/MCF7_Cunha_chr11p15.bed.gz...
## Excluding CpG sites with uniformly zero coverage...
## Loaded /tmp/RtmpvXBNuR/Rinst2960aa58aa20b1/biscuiteer/extdata/MCF7_Cunha_chr11p15.bed.gz. Creating bsseq object...
Metadata from the biscuit
output can be viewed via:
## CharacterList of length 3
## [["Reference genome"]] hg19.fa
## [["Biscuit version"]] 0.1.3.20160324
## [["Invocation"]] biscuit pileup -r /primary/vari/genomicdata/genomes/hg19/hg1...
If further information about the VCF header is desired,
## class: VCFHeader
## samples(1): MCF7_Cunha
## meta(5): fileformat reference source contig program
## fixed(1): FILTER
## info(3): NS CX N5
## geno(7): GT DP ... GL GQ
In the instance where you have two separate BED files that you would like to
analyze in a single bsseq object, you can combine the files using unionize
,
which is a wrapper around the bsseq function, combine
.
shuf_bed <- system.file("extdata", "MCF7_Cunha_chr11p15_shuffled.bed.gz",
package="biscuiteer")
shuf_vcf <- system.file("extdata",
"MCF7_Cunha_shuffled_header_only.vcf.gz",
package="biscuiteer")
bisc2 <- readBiscuit(BEDfile = shuf_bed, VCFfile = shuf_vcf,
merged = FALSE)
## Checking /tmp/RtmpvXBNuR/Rinst2960aa58aa20b1/biscuiteer/extdata/MCF7_Cunha_chr11p15_shuffled.bed.gz for import...
## Extracting sample names from /tmp/RtmpvXBNuR/Rinst2960aa58aa20b1/biscuiteer/extdata/MCF7_Cunha_shuffled_header_only.vcf.gz...
## /tmp/RtmpvXBNuR/Rinst2960aa58aa20b1/biscuiteer/extdata/MCF7_Cunha_chr11p15_shuffled.bed.gz does not have a header. Using VCF file header information to help set column names.
## Assuming unmerged data. Checking now... ...The file might be alright. Double check if you're worried.
## /tmp/RtmpvXBNuR/Rinst2960aa58aa20b1/biscuiteer/extdata/MCF7_Cunha_chr11p15_shuffled.bed.gz has 254147 indexed loci.
## /tmp/RtmpvXBNuR/Rinst2960aa58aa20b1/biscuiteer/extdata/MCF7_Cunha_chr11p15_shuffled.bed.gz looks valid for import.
## Reading unmerged input from /tmp/RtmpvXBNuR/Rinst2960aa58aa20b1/biscuiteer/extdata/MCF7_Cunha_chr11p15_shuffled.bed.gz...
## Excluding CpG sites with uniformly zero coverage...
## Loaded /tmp/RtmpvXBNuR/Rinst2960aa58aa20b1/biscuiteer/extdata/MCF7_Cunha_chr11p15_shuffled.bed.gz. Creating bsseq object...
A handful of analysis paths are available in biscuiteer
, including A/B
comparment inference, age estimation from WGBS data, hypermethylation of
Polycomb Repressor Complex (PRC) binding sites, and hypomethylation of
CpG-poor “partially methylated domains” (PMDs).
When performing A/B compartment inference, the goal is to have something that
has roughly gaussian error. getLogitFracMeth
uses Dirichlet smoothing to turn
raw measurements into lightly moderated, logit-transformed methylated-fraction
estimates, which can the be used as inputs to
compartmap
reg <- GRanges(seqnames = rep("chr11",5),
strand = rep("*",5),
ranges = IRanges(start = c(0,2.8e6,1.17e7,1.38e7,1.69e7),
end= c(2.8e6,1.17e7,1.38e7,1.69e7,2.2e7))
)
frac <- getLogitFracMeth(bisc, minSamp = 1, r = reg)
frac
## GRanges object with 5 ranges and 1 metadata column:
## seqnames ranges strand | MCF7_Cunha
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr11 0-2800000 * | 1.340682
## [2] chr11 2800000-11700000 * | 0.575875
## [3] chr11 11700000-13800000 * | 1.162989
## [4] chr11 13800000-16900000 * | 0.581874
## [5] chr11 16900000-22000000 * | 0.442985
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
biscuiteer
has the functionalitity to guess the age of the sample(s) provided
using the Horvath-style “clock” models (see
Horvath, 2013
for more information).
NOTE: The prediction accuracy of this function is entirely dependent on the parameters set by the user. As such, the defaults (as shown in the example below) should only be used as a starting point for exploration by the user.
NOTE: Please cite the appropriate papers for the epigenetic “clock” chosen:
* For horvath
or horvathshrunk
* Horvath, Genome Biology, 2013
* For hannum
* Hannum et al., Molecular Cell, 2013
* For skinandblood
* Horvath et al., Aging, 2018
## Assessing coverage across age-associated regions...
## All regions in all samples appear to be sufficiently covered.
## $call
## WGBSage(comb, "horvath")
##
## $droppedSamples
## NULL
##
## $droppedRegions
## NULL
##
## $intercept
## [1] 0.6955073
##
## $methcoefs
## GRanges object with 2 ranges and 3 metadata columns:
## seqnames ranges strand | MCF7_Cunha
## <Rle> <IRanges> <Rle> | <numeric>
## chr11:6678129-6678158 chr11 6678129-6678158 * | 0.800000
## chr11:12030629-12030658 chr11 12030629-12030658 * | 0.833333
## MCF7_Cunha_shuffled coefs
## <numeric> <numeric>
## chr11:6678129-6678158 0.250000 0.000792206
## chr11:12030629-12030658 0.247732 -0.138857398
## -------
## seqinfo: 22 sequences from hg19 genome
##
## $age
## [,1]
## MCF7_Cunha 33.18896
## MCF7_Cunha_shuffled 34.88742
To generate a shorthand summary of the hypermethylation of PRCs and
hypomethylation of PMDs, the CpGindex
function in biscuiteer
can be used.
## Computing hypermethylation indices...
## Loading HMM_CpG_islands.hg19...
## Loading H9state23unmeth.hg19...
## Computing hypomethylation indices...
## Loading PMDs.hg19.rda from biscuiteerData...
## Loading Zhou_solo_WCGW_inCommonPMDs.hg19.rda from biscuiteerData...
## Computing indices...
## CpGindex with 1 row and 3 columns
## hyper.MCF7_Cunha hypo.MCF7_Cunha ratio.MCF7_Cunha
## <numeric> <numeric> <numeric>
## 1 0.0690734 0.199262 0.346647
## -------
## This object is just a DataFrame that has an idea of where it came from:
## Hypermethylation was tallied across 120 region (see 'object@hyperMethRegions').
## Hypomethylation was tallied across 13127 region (see 'object@hypoMethRegions').
## GRanges object with 120 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## 1 chr1 32230201-32230224 * | 0.04
## 2 chr1 43638401-43638449 * | 0.04
## 3 chr1 44884001-44884005 * | 0.04
## 4 chr1 46860401-46860406 * | 0.06
## 5 chr1 51435801-51436075 * | 0.05
## ... ... ... ... . ...
## 116 chr20 8112392-8112400 * | 0.030
## 117 chr20 17207801-17208191 * | 0.060
## 118 chr22 20004801-20004802 * | 0.035
## 119 chr22 37252601-37252731 * | 0.060
## 120 chr22 43781850-43781952 * | 0.045
## -------
## seqinfo: 21 sequences from hg19 genome