Navigating the complexity of DNA methylation data poses substantial challenges
due to its intricate biological patterns. The regionalpcs
package is
conceived to address the substantial need for enhanced summarization and
interpretation at a regional level. Traditional methodologies, while
foundational, may not fully encapsulate the biological nuances of methylation
patterns, thereby potentially yielding interpretations that may be suboptimal
or veer towards inaccuracies. This package introduces and utilizes regional
principal components (rPCs), designed to adeptly capture biologically relevant
signals embedded within DNA methylation data. Through the implementation of
rPCs, researchers can gain new insights into complex interactions and
effects in methylation data that might otherwise be missed.
The regionalpcs
package can be easily installed from Bioconductor, providing
you with the latest stable version suitable for general use. Alternatively,
for those interested in exploring or utilizing the latest features and
developments, the GitHub version can be installed directly.
Install regionalpcs
from Bioconductor using the command below. Ensure
that your R version is compatible with the package version available on
Bioconductor for smooth installation and functionality.
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("regionalpcs")
To access the development version of regionalpcs
directly from GitHub,
which might include new features or enhancements not yet available in the
Bioconductor version, use the following commands. Note that the development
version might be less stable than the officially released version.
# install devtools package if not already installed
if (!requireNamespace("devtools", quietly=TRUE))
install.packages("devtools")
# download and install the regionalpcs package
devtools::install_github("tyeulalio/regionalpcs")
regionalpcs
R Package TutorialThis tutorial depends on several Bioconductor packages. These packages should be loaded at the beginning of the analysis.
library(regionalpcs)
library(GenomicRanges)
library(tidyr)
library(tibble)
library(dplyr)
library(minfiData)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
Let’s proceed to load the packages, briefly understanding their roles in this tutorial.
regionalpcs
: Primary package for summarizing and interpreting DNA
methylation data at a regional level.
GenomicRanges
: Facilitates representation and manipulation of genomic
intervals and variables defined along a genome.
tidyr
, tibble
, dplyr
: Assist in data tidying, representation,
and manipulation.
minfiData
: Provides example Illumina 450k data, aiding in the
demonstration of regionalpcs
functionalities.
TxDb.Hsapiens.UCSC.hg19.knownGene
: Accommodates transcriptome data,
useful for annotating results.
Once the packages are loaded, we can start using the functions provided by each package.
In this tutorial, we’ll utilize a sample dataset, MsetEx.sub
, which is a
subset derived from Illumina’s Human Methylation 450k dataset, specifically
preprocessed to contain 600 CpGs across 6 samples. The dataset is stored in a
MethylSet
object, which is commonly used to represent methylation data.
The methylation beta values, denoting the proportion of methylated cytosines at a particular CpG site, will be extracted from this dataset for our subsequent analyses.
# Load the MethylSet data
data(MsetEx.sub)
# Display the first few rows of the dataset for a preliminary view
head(MsetEx.sub)
#> class: MethylSet
#> dim: 6 6
#> metadata(0):
#> assays(2): Meth Unmeth
#> rownames(6): cg00050873 cg00212031 ... cg00455876 cg01707559
#> rowData names(0):
#> colnames(6): 5723646052_R02C02 5723646052_R04C01 ... 5723646053_R05C02
#> 5723646053_R06C02
#> colData names(13): Sample_Name Sample_Well ... Basename filenames
#> Annotation
#> array: IlluminaHumanMethylation450k
#> annotation: ilmn12.hg19
#> Preprocessing
#> Method: Raw (no normalization or bg correction)
#> minfi version: 1.21.2
#> Manifest version: 0.4.0
# Extract methylation M-values from the MethylSet
# M-values are logit-transformed beta-values and are often used in differential
# methylation analysis for improved statistical performance.
mvals <- getM(MsetEx.sub)
# Display the extracted methylation beta values
head(mvals)
#> 5723646052_R02C02 5723646052_R04C01 5723646052_R05C02
#> cg00050873 3.502348 0.4414491 4.340695
#> cg00212031 -3.273751 0.9234662 -2.614777
#> cg00213748 2.076816 -0.1309465 1.260995
#> cg00214611 -3.438838 1.7463950 -2.270551
#> cg00455876 1.839010 -0.9927320 1.619479
#> cg01707559 -3.303987 -0.6433201 -3.540887
#> 5723646053_R04C02 5723646053_R05C02 5723646053_R06C02
#> cg00050873 0.24458355 -0.3219281 0.2744392
#> cg00212031 -0.21052257 -0.6861413 -0.1397595
#> cg00213748 -1.10373279 -1.6616553 -0.1270869
#> cg00214611 0.29029649 -0.2103599 -0.6138630
#> cg00455876 -0.09504721 -0.2854655 0.6361273
#> cg01707559 -0.74835377 -0.4678048 -1.1345421
Note that MsetEx.sub
provides a manageable slice of data that we can
utilize to illustrate the capabilities of regionalpcs
without requiring
extensive computational resources.
Now that we have our dataset loaded and methylation values extracted,
let’s proceed with demonstrating the core functionalities of regionalpcs
.
In this step, we’ll obtain the genomic coordinates of the CpG sites in our
methylation dataset using the 450k array probe annotations using the minfi
package.
# Map the methylation data to genomic coordinates using the mapToGenome
# function. This creates a GenomicMethylSet (gset) which includes genomic
# position information for each probe.
gset <- mapToGenome(MsetEx.sub)
# Display the GenomicMethylSet object to observe the structure and initial
# entries.
gset
#> class: GenomicMethylSet
#> dim: 600 6
#> metadata(0):
#> assays(2): Meth Unmeth
#> rownames(600): cg01003813 cg03723195 ... cg03930849 cg08265308
#> rowData names(0):
#> colnames(6): 5723646052_R02C02 5723646052_R04C01 ... 5723646053_R05C02
#> 5723646053_R06C02
#> colData names(13): Sample_Name Sample_Well ... Basename filenames
#> Annotation
#> array: IlluminaHumanMethylation450k
#> annotation: ilmn12.hg19
#> Preprocessing
#> Method: Raw (no normalization or bg correction)
#> minfi version: 1.21.2
#> Manifest version: 0.4.0
# Convert the genomic position data into a GRanges object, enabling genomic
# range operations in subsequent analyses.
# The GRanges object (cpg_gr) provides a versatile structure for handling
# genomic coordinates in R/Bioconductor.
cpg_gr <- granges(gset)
# Display the GRanges object for a preliminary view of the genomic coordinates.
cpg_gr
#> GRanges object with 600 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> cg01003813 chrX 2715058 *
#> cg03723195 chrX 2847354 *
#> cg00074638 chrX 2883307 *
#> cg01728536 chrX 7070451 *
#> cg01864404 chrX 8434367 *
#> ... ... ... ...
#> cg26983430 chrY 24549675 *
#> cg01757887 chrY 25114810 *
#> cg00061679 chrY 25314171 *
#> cg03930849 chrY 26716289 *
#> cg08265308 chrY 28555550 *
#> -------
#> seqinfo: 2 sequences from hg19 genome; no seqlengths
Gene regions, which include functional segments such as promoters, gene bodies,
and intergenic regions, play pivotal roles in gene expression and regulation.
Summarizing methylation patterns across these regions can provide insights
into potential gene regulatory mechanisms and associations with phenotypes
or disease states. Herein, we will delve into how to succinctly summarize
methylation data at these crucial genomic segments using the regionalpcs
package.
First, let’s load the gene region annotations. Make sure to align the genomic builds of your annotations and methylation data.
# Obtain promoter regions
# The TxDb object 'txdb' facilitates the retrieval of transcript-based
# genomic annotations.
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
# Extracting promoter regions with a defined upstream and downstream window.
# This GRanges object 'promoters_gr' will be utilized to map and summarize
# methylation data in promoter regions.
promoters_gr <- suppressMessages(promoters(genes(txdb),
upstream=1000,
downstream=0))
# Display the GRanges object containing the genomic coordinates of promoter
# regions.
promoters_gr
#> GRanges object with 23056 ranges and 1 metadata column:
#> seqnames ranges strand | gene_id
#> <Rle> <IRanges> <Rle> | <character>
#> 1 chr19 58874215-58875214 - | 1
#> 10 chr8 18247755-18248754 + | 10
#> 100 chr20 43280377-43281376 - | 100
#> 1000 chr18 25757446-25758445 - | 1000
#> 10000 chr1 244006887-244007886 - | 10000
#> ... ... ... ... . ...
#> 9991 chr9 115095945-115096944 - | 9991
#> 9992 chr21 35735323-35736322 + | 9992
#> 9993 chr22 19109968-19110967 - | 9993
#> 9994 chr6 90538619-90539618 + | 9994
#> 9997 chr22 50964906-50965905 - | 9997
#> -------
#> seqinfo: 93 sequences (1 circular) from hg19 genome
Creating a region map, which systematically assigns CpGs to specific gene
regions, stands as a crucial precursor to gene-region summarization using the
regionalpcs
package. This mapping elucidates the physical positioning of
CpGs within particular gene regions, facilitating our upcoming endeavors to
comprehend how methylation varies across distinct genomic segments. We’ll use
the create_region_map
function from the regionalpcs
package. This function
takes two genomic ranges objects, cpg_gr
contains CpG positions and genes_gr
contains gene region positions. Make sure both positions are aligned to the
same genome build (e.g. GrCH37, CrCH38).
# get the region map using the regionalpcs function
region_map <- regionalpcs::create_region_map(cpg_gr=cpg_gr,
genes_gr=promoters_gr)
# Display the initial few rows of the region map.
head(region_map)
#> gene_id cpg_id
#> 1 100131434 cg00466309
#> 2 100133941 cg05230942
#> 3 100133957 cg00636562
#> 4 10084 cg01303569
#> 5 1069 cg01056373
#> 6 10857 cg01333849
Note: The second column of region_map
must contain values matching the
rownames of your methylation dataframe.
In this final section, we’ll summarize gene regions using Principal
Components (PCs) to capture the maximum variation. We’ll utilize the
compute_regional_pcs
function from the regionalpcs
package for this.
Let’s calculate the regional PCs using our gene regions for demonstration purposes.
# Display head of region map
head(region_map)
#> gene_id cpg_id
#> 1 100131434 cg00466309
#> 2 100133941 cg05230942
#> 3 100133957 cg00636562
#> 4 10084 cg01303569
#> 5 1069 cg01056373
#> 6 10857 cg01333849
dim(region_map)
#> [1] 211 2
# Compute regional PCs
res <- compute_regional_pcs(meth=mvals, region_map=region_map, pc_method="gd")
#> Using Gavish-Donoho method
The function returns a list containing multiple elements. Let’s first look at what these elements are.
# Inspect the output list elements
names(res)
#> [1] "regional_pcs" "percent_variance" "loadings" "info"
The first element (res$regional_pcs
) is a data frame containing the
calculated regional PCs.
# Extract regional PCs
regional_pcs <- res$regional_pcs
head(regional_pcs)[1:4]
#> 5723646052_R02C02 5723646052_R04C01 5723646052_R05C02
#> 100131434-PC1 -3.887949 1.82168108 -4.2354365
#> 100133941-PC1 1.100370 0.06719947 0.1932361
#> 100133957-PC1 -2.032026 0.19154436 -0.9193417
#> 10084-PC1 -5.011579 2.34574585 -4.2050727
#> 1069-PC1 -2.004785 1.20363622 -2.5436686
#> 10857-PC1 -1.986294 1.31482789 -2.0044911
#> 5723646053_R04C02
#> 100131434-PC1 2.1707679
#> 100133941-PC1 -0.7438658
#> 100133957-PC1 1.1903709
#> 10084-PC1 2.2586300
#> 1069-PC1 1.5851359
#> 10857-PC1 2.1369556
The output is a data frame with regional PCs for each region as rows and our samples as columns. This is our new representation of methylation values, now on a gene regional PC scale. We can feed these into downstream analyses as is.
The number of regional PCs representing each gene region was determined by the
Gavish-Donoho method. This method allows us to identify PCs that capture actual
signal in our data and not the noise that is inherent in any dataset. To explore
alternative methods, we can change the pc_method
parameter.
# Count the number of unique gene regions and PCs
regions <- data.frame(gene_pc = rownames(regional_pcs)) |>
separate(gene_pc, into = c("gene", "pc"), sep = "-")
head(regions)
#> gene pc
#> 1 100131434 PC1
#> 2 100133941 PC1
#> 3 100133957 PC1
#> 4 10084 PC1
#> 5 1069 PC1
#> 6 10857 PC1
# number of genes that have been summarized
length(unique(regions$gene))
#> [1] 142
# how many of each PC did we get
table(regions$pc)
#>
#> PC1
#> 142
We have summarized each of our genes using just one PC. The number of PCs depends on three main factors: the number of samples, the number of CpGs in the gene region, and the noise in the methylation data.
By default, the compute_regional_pcs
function uses the Gavish-Donoho method.
However, we can also use the Marcenko-Pasteur method by setting the pc_method
parameter:
# Using Marcenko-Pasteur method
mp_res <- compute_regional_pcs(mvals, region_map, pc_method = "mp")
#> Using Marchenko-Pastur method
# select the regional pcs
mp_regional_pcs <- mp_res$regional_pcs
# separate the genes from the pc numbers
mp_regions <- data.frame(gene_pc = rownames(mp_regional_pcs)) |>
separate(gene_pc, into = c("gene", "pc"), sep = "-")
head(mp_regions)
#> gene pc
#> 1 100131434 PC1
#> 2 100133941 PC1
#> 3 100133957 PC1
#> 4 10084 PC1
#> 5 1069 PC1
#> 6 10857 PC1
# number of genes that have been summarized
length(unique(mp_regions$gene))
#> [1] 142
# how many of each PC did we get
table(mp_regions$pc)
#>
#> PC1
#> 142
The Marcenko-Pasteur and the Gavish-Donoho methods are both based on Random Matrix Theory, and they aim to identify the number of significant PCs that capture the true signal in the data and not just the noise. However, these methods differ in how they select the number of significant PCs. The Marcenko-Pasteur method typically selects more PCs to represent a gene region compared to the Gavish-Donoho method. This may be due to the different ways in which the two methods estimate the noise level in the data.
Ultimately, the choice between the two methods depends on the specific needs and goals of the analysis. The Gavish-Donoho method tends to provide more conservative results, while the Marcenko-Pasteur method may capture more of the underlying signal in the data. Researchers should carefully consider their objectives and the characteristics of their data when selecting a method.
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
#> [2] GenomicFeatures_1.58.0
#> [3] AnnotationDbi_1.68.0
#> [4] minfiData_0.51.0
#> [5] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
#> [6] IlluminaHumanMethylation450kmanifest_0.4.0
#> [7] minfi_1.52.0
#> [8] bumphunter_1.48.0
#> [9] locfit_1.5-9.10
#> [10] iterators_1.0.14
#> [11] foreach_1.5.2
#> [12] Biostrings_2.74.0
#> [13] XVector_0.46.0
#> [14] SummarizedExperiment_1.36.0
#> [15] Biobase_2.66.0
#> [16] MatrixGenerics_1.18.0
#> [17] matrixStats_1.4.1
#> [18] dplyr_1.1.4
#> [19] tibble_3.2.1
#> [20] tidyr_1.3.1
#> [21] GenomicRanges_1.58.0
#> [22] GenomeInfoDb_1.42.0
#> [23] IRanges_2.40.0
#> [24] S4Vectors_0.44.0
#> [25] BiocGenerics_0.52.0
#> [26] regionalpcs_1.4.0
#> [27] BiocStyle_2.34.0
#>
#> loaded via a namespace (and not attached):
#> [1] splines_4.4.1 BiocIO_1.16.0
#> [3] bitops_1.0-9 preprocessCore_1.68.0
#> [5] XML_3.99-0.17 lifecycle_1.0.4
#> [7] lattice_0.22-6 MASS_7.3-61
#> [9] base64_2.0.2 scrime_1.3.5
#> [11] magrittr_2.0.3 limma_3.62.0
#> [13] sass_0.4.9 rmarkdown_2.28
#> [15] jquerylib_0.1.4 yaml_2.3.10
#> [17] doRNG_1.8.6 askpass_1.2.1
#> [19] cowplot_1.1.3 DBI_1.2.3
#> [21] RColorBrewer_1.1-3 abind_1.4-8
#> [23] zlibbioc_1.52.0 quadprog_1.5-8
#> [25] purrr_1.0.2 RCurl_1.98-1.16
#> [27] GenomeInfoDbData_1.2.13 ggrepel_0.9.6
#> [29] RMTstat_0.3.1 irlba_2.3.5.1
#> [31] rentrez_1.2.3 genefilter_1.88.0
#> [33] annotate_1.84.0 dqrng_0.4.1
#> [35] DelayedMatrixStats_1.28.0 codetools_0.2-20
#> [37] DelayedArray_0.32.0 xml2_1.3.6
#> [39] tidyselect_1.2.1 UCSC.utils_1.2.0
#> [41] ScaledMatrix_1.14.0 beanplot_1.3.1
#> [43] illuminaio_0.48.0 GenomicAlignments_1.42.0
#> [45] jsonlite_1.8.9 multtest_2.62.0
#> [47] survival_3.7-0 tools_4.4.1
#> [49] Rcpp_1.0.13 glue_1.8.0
#> [51] SparseArray_1.6.0 xfun_0.48
#> [53] HDF5Array_1.34.0 withr_3.0.2
#> [55] BiocManager_1.30.25 fastmap_1.2.0
#> [57] rhdf5filters_1.18.0 fansi_1.0.6
#> [59] openssl_2.2.2 digest_0.6.37
#> [61] rsvd_1.0.5 R6_2.5.1
#> [63] colorspace_2.1-1 RSQLite_2.3.7
#> [65] utf8_1.2.4 generics_0.1.3
#> [67] data.table_1.16.2 rtracklayer_1.66.0
#> [69] httr_1.4.7 S4Arrays_1.6.0
#> [71] pkgconfig_2.0.3 gtable_0.3.6
#> [73] blob_1.2.4 siggenes_1.80.0
#> [75] htmltools_0.5.8.1 bookdown_0.41
#> [77] scales_1.3.0 png_0.1-8
#> [79] knitr_1.48 tzdb_0.4.0
#> [81] reshape2_1.4.4 rjson_0.2.23
#> [83] nlme_3.1-166 curl_5.2.3
#> [85] cachem_1.1.0 rhdf5_2.50.0
#> [87] stringr_1.5.1 restfulr_0.0.15
#> [89] GEOquery_2.74.0 pillar_1.9.0
#> [91] grid_4.4.1 reshape_0.8.9
#> [93] vctrs_0.6.5 BiocSingular_1.22.0
#> [95] beachmat_2.22.0 xtable_1.8-4
#> [97] evaluate_1.0.1 readr_2.1.5
#> [99] cli_3.6.3 compiler_4.4.1
#> [101] Rsamtools_2.22.0 rlang_1.1.4
#> [103] crayon_1.5.3 rngtools_1.5.2
#> [105] nor1mix_1.3-3 mclust_6.1.1
#> [107] plyr_1.8.9 stringi_1.8.4
#> [109] BiocParallel_1.40.0 munsell_0.5.1
#> [111] PCAtools_2.18.0 Matrix_1.7-1
#> [113] hms_1.1.3 sparseMatrixStats_1.18.0
#> [115] bit64_4.5.2 ggplot2_3.5.1
#> [117] Rhdf5lib_1.28.0 KEGGREST_1.46.0
#> [119] statmod_1.5.0 memoise_2.0.1
#> [121] bslib_0.8.0 bit_4.5.0