Contents

1 regionalpcs

Navigating the complexity of DNA methylation data poses substantial challenges due to its intricate biological patterns. The regionalpcs package is conceived to address the substantial need for enhanced summarization and interpretation at a regional level. Traditional methodologies, while foundational, may not fully encapsulate the biological nuances of methylation patterns, thereby potentially yielding interpretations that may be suboptimal or veer towards inaccuracies. This package introduces and utilizes regional principal components (rPCs), designed to adeptly capture biologically relevant signals embedded within DNA methylation data. Through the implementation of rPCs, researchers can gain new insights into complex interactions and effects in methylation data that might otherwise be missed.

2 Installation

The regionalpcs package can be easily installed from Bioconductor, providing you with the latest stable version suitable for general use. Alternatively, for those interested in exploring or utilizing the latest features and developments, the GitHub version can be installed directly.

2.0.1 Bioconductor Installation

Install regionalpcs from Bioconductor using the command below. Ensure that your R version is compatible with the package version available on Bioconductor for smooth installation and functionality.

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

BiocManager::install("regionalpcs")

2.0.2 Development Version Installation

To access the development version of regionalpcs directly from GitHub, which might include new features or enhancements not yet available in the Bioconductor version, use the following commands. Note that the development version might be less stable than the officially released version.

# install devtools package if not already installed
if (!requireNamespace("devtools", quietly=TRUE))
    install.packages("devtools")

# download and install the regionalpcs package
devtools::install_github("tyeulalio/regionalpcs")

3 regionalpcs R Package Tutorial

3.1 Loading Required Packages

This tutorial depends on several Bioconductor packages. These packages should be loaded at the beginning of the analysis.

library(regionalpcs)
library(GenomicRanges)
library(tidyr)
library(tibble)
library(dplyr)
library(minfiData)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

Let’s proceed to load the packages, briefly understanding their roles in this tutorial.

  • regionalpcs: Primary package for summarizing and interpreting DNA methylation data at a regional level.

  • GenomicRanges: Facilitates representation and manipulation of genomic intervals and variables defined along a genome.

  • tidyr, tibble, dplyr: Assist in data tidying, representation, and manipulation.

  • minfiData: Provides example Illumina 450k data, aiding in the demonstration of regionalpcs functionalities.

  • TxDb.Hsapiens.UCSC.hg19.knownGene: Accommodates transcriptome data, useful for annotating results.

Once the packages are loaded, we can start using the functions provided by each package.

3.2 Load the Dataset

3.2.1 Loading Minfi Sample Dataset

In this tutorial, we’ll utilize a sample dataset, MsetEx.sub, which is a subset derived from Illumina’s Human Methylation 450k dataset, specifically preprocessed to contain 600 CpGs across 6 samples. The dataset is stored in a MethylSet object, which is commonly used to represent methylation data.

The methylation beta values, denoting the proportion of methylated cytosines at a particular CpG site, will be extracted from this dataset for our subsequent analyses.

# Load the MethylSet data
data(MsetEx.sub)

# Display the first few rows of the dataset for a preliminary view
head(MsetEx.sub)
#> class: MethylSet 
#> dim: 6 6 
#> metadata(0):
#> assays(2): Meth Unmeth
#> rownames(6): cg00050873 cg00212031 ... cg00455876 cg01707559
#> rowData names(0):
#> colnames(6): 5723646052_R02C02 5723646052_R04C01 ... 5723646053_R05C02
#>   5723646053_R06C02
#> colData names(13): Sample_Name Sample_Well ... Basename filenames
#> Annotation
#>   array: IlluminaHumanMethylation450k
#>   annotation: ilmn12.hg19
#> Preprocessing
#>   Method: Raw (no normalization or bg correction)
#>   minfi version: 1.21.2
#>   Manifest version: 0.4.0

# Extract methylation M-values from the MethylSet
# M-values are logit-transformed beta-values and are often used in differential
# methylation analysis for improved statistical performance.
mvals <- getM(MsetEx.sub)

# Display the extracted methylation beta values
head(mvals)
#>            5723646052_R02C02 5723646052_R04C01 5723646052_R05C02
#> cg00050873          3.502348         0.4414491          4.340695
#> cg00212031         -3.273751         0.9234662         -2.614777
#> cg00213748          2.076816        -0.1309465          1.260995
#> cg00214611         -3.438838         1.7463950         -2.270551
#> cg00455876          1.839010        -0.9927320          1.619479
#> cg01707559         -3.303987        -0.6433201         -3.540887
#>            5723646053_R04C02 5723646053_R05C02 5723646053_R06C02
#> cg00050873        0.24458355        -0.3219281         0.2744392
#> cg00212031       -0.21052257        -0.6861413        -0.1397595
#> cg00213748       -1.10373279        -1.6616553        -0.1270869
#> cg00214611        0.29029649        -0.2103599        -0.6138630
#> cg00455876       -0.09504721        -0.2854655         0.6361273
#> cg01707559       -0.74835377        -0.4678048        -1.1345421

Note that MsetEx.sub provides a manageable slice of data that we can utilize to illustrate the capabilities of regionalpcs without requiring extensive computational resources.

Now that we have our dataset loaded and methylation values extracted, let’s proceed with demonstrating the core functionalities of regionalpcs.

3.3 Obtaining Methylation Array Probe Positions

3.3.1 Load Illumina 450k Array Probe Positions

In this step, we’ll obtain the genomic coordinates of the CpG sites in our methylation dataset using the 450k array probe annotations using the minfi package.

# Map the methylation data to genomic coordinates using the mapToGenome
# function. This creates a GenomicMethylSet (gset) which includes genomic 
# position information for each probe.
gset <- mapToGenome(MsetEx.sub)

# Display the GenomicMethylSet object to observe the structure and initial 
# entries.
gset
#> class: GenomicMethylSet 
#> dim: 600 6 
#> metadata(0):
#> assays(2): Meth Unmeth
#> rownames(600): cg01003813 cg03723195 ... cg03930849 cg08265308
#> rowData names(0):
#> colnames(6): 5723646052_R02C02 5723646052_R04C01 ... 5723646053_R05C02
#>   5723646053_R06C02
#> colData names(13): Sample_Name Sample_Well ... Basename filenames
#> Annotation
#>   array: IlluminaHumanMethylation450k
#>   annotation: ilmn12.hg19
#> Preprocessing
#>   Method: Raw (no normalization or bg correction)
#>   minfi version: 1.21.2
#>   Manifest version: 0.4.0

# Convert the genomic position data into a GRanges object, enabling genomic 
# range operations in subsequent analyses.
# The GRanges object (cpg_gr) provides a versatile structure for handling 
# genomic coordinates in R/Bioconductor.
cpg_gr <- granges(gset)

# Display the GRanges object for a preliminary view of the genomic coordinates.
cpg_gr
#> GRanges object with 600 ranges and 0 metadata columns:
#>              seqnames    ranges strand
#>                 <Rle> <IRanges>  <Rle>
#>   cg01003813     chrX   2715058      *
#>   cg03723195     chrX   2847354      *
#>   cg00074638     chrX   2883307      *
#>   cg01728536     chrX   7070451      *
#>   cg01864404     chrX   8434367      *
#>          ...      ...       ...    ...
#>   cg26983430     chrY  24549675      *
#>   cg01757887     chrY  25114810      *
#>   cg00061679     chrY  25314171      *
#>   cg03930849     chrY  26716289      *
#>   cg08265308     chrY  28555550      *
#>   -------
#>   seqinfo: 2 sequences from hg19 genome; no seqlengths

3.4 Summarizing Gene Region Types

3.4.1 Introduction

Gene regions, which include functional segments such as promoters, gene bodies, and intergenic regions, play pivotal roles in gene expression and regulation. Summarizing methylation patterns across these regions can provide insights into potential gene regulatory mechanisms and associations with phenotypes or disease states. Herein, we will delve into how to succinctly summarize methylation data at these crucial genomic segments using the regionalpcs package.

3.4.2 Load Gene Region Annotations

First, let’s load the gene region annotations. Make sure to align the genomic builds of your annotations and methylation data.

# Obtain promoter regions
# The TxDb object 'txdb' facilitates the retrieval of transcript-based 
# genomic annotations.
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# Extracting promoter regions with a defined upstream and downstream window. 
# This GRanges object 'promoters_gr' will be utilized to map and summarize 
# methylation data in promoter regions.
promoters_gr <- suppressMessages(promoters(genes(txdb), 
                                    upstream=1000, 
                                    downstream=0))

# Display the GRanges object containing the genomic coordinates of promoter 
# regions.
promoters_gr
#> GRanges object with 23056 ranges and 1 metadata column:
#>         seqnames              ranges strand |     gene_id
#>            <Rle>           <IRanges>  <Rle> | <character>
#>       1    chr19   58874215-58875214      - |           1
#>      10     chr8   18247755-18248754      + |          10
#>     100    chr20   43280377-43281376      - |         100
#>    1000    chr18   25757446-25758445      - |        1000
#>   10000     chr1 244006887-244007886      - |       10000
#>     ...      ...                 ...    ... .         ...
#>    9991     chr9 115095945-115096944      - |        9991
#>    9992    chr21   35735323-35736322      + |        9992
#>    9993    chr22   19109968-19110967      - |        9993
#>    9994     chr6   90538619-90539618      + |        9994
#>    9997    chr22   50964906-50965905      - |        9997
#>   -------
#>   seqinfo: 93 sequences (1 circular) from hg19 genome

3.4.3 Create a Region Map

Creating a region map, which systematically assigns CpGs to specific gene regions, stands as a crucial precursor to gene-region summarization using the regionalpcs package. This mapping elucidates the physical positioning of CpGs within particular gene regions, facilitating our upcoming endeavors to comprehend how methylation varies across distinct genomic segments. We’ll use the create_region_map function from the regionalpcs package. This function takes two genomic ranges objects, cpg_gr contains CpG positions and genes_gr contains gene region positions. Make sure both positions are aligned to the same genome build (e.g. GrCH37, CrCH38).

# get the region map using the regionalpcs function
region_map <- regionalpcs::create_region_map(cpg_gr=cpg_gr, 
                                                genes_gr=promoters_gr)

# Display the initial few rows of the region map.
head(region_map)
#>     gene_id     cpg_id
#> 1 100131434 cg00466309
#> 2 100133941 cg05230942
#> 3 100133957 cg00636562
#> 4     10084 cg01303569
#> 5      1069 cg01056373
#> 6     10857 cg01333849

Note: The second column of region_map must contain values matching the rownames of your methylation dataframe.

3.4.4 Summarizing Gene Regions with Regional Principal Components

In this final section, we’ll summarize gene regions using Principal Components (PCs) to capture the maximum variation. We’ll utilize the compute_regional_pcs function from the regionalpcs package for this.

3.4.4.1 Compute Regional PCs

Let’s calculate the regional PCs using our gene regions for demonstration purposes.

# Display head of region map
head(region_map)
#>     gene_id     cpg_id
#> 1 100131434 cg00466309
#> 2 100133941 cg05230942
#> 3 100133957 cg00636562
#> 4     10084 cg01303569
#> 5      1069 cg01056373
#> 6     10857 cg01333849
dim(region_map)
#> [1] 211   2

# Compute regional PCs
res <- compute_regional_pcs(meth=mvals, region_map=region_map, pc_method="gd")
#> Using Gavish-Donoho method

3.4.4.2 Inspecting the Output

The function returns a list containing multiple elements. Let’s first look at what these elements are.

# Inspect the output list elements
names(res)
#> [1] "regional_pcs"     "percent_variance" "loadings"         "info"

3.4.4.3 Extracting and Viewing Regional PCs

The first element (res$regional_pcs) is a data frame containing the calculated regional PCs.

# Extract regional PCs
regional_pcs <- res$regional_pcs
head(regional_pcs)[1:4]
#>               5723646052_R02C02 5723646052_R04C01 5723646052_R05C02
#> 100131434-PC1         -3.887949        1.82168108        -4.2354365
#> 100133941-PC1          1.100370        0.06719947         0.1932361
#> 100133957-PC1         -2.032026        0.19154436        -0.9193417
#> 10084-PC1             -5.011579        2.34574585        -4.2050727
#> 1069-PC1              -2.004785        1.20363622        -2.5436686
#> 10857-PC1             -1.986294        1.31482789        -2.0044911
#>               5723646053_R04C02
#> 100131434-PC1         2.1707679
#> 100133941-PC1        -0.7438658
#> 100133957-PC1         1.1903709
#> 10084-PC1             2.2586300
#> 1069-PC1              1.5851359
#> 10857-PC1             2.1369556

3.4.4.4 Understanding the Results

The output is a data frame with regional PCs for each region as rows and our samples as columns. This is our new representation of methylation values, now on a gene regional PC scale. We can feed these into downstream analyses as is.

The number of regional PCs representing each gene region was determined by the Gavish-Donoho method. This method allows us to identify PCs that capture actual signal in our data and not the noise that is inherent in any dataset. To explore alternative methods, we can change the pc_method parameter.

# Count the number of unique gene regions and PCs
regions <- data.frame(gene_pc = rownames(regional_pcs)) |>
    separate(gene_pc, into = c("gene", "pc"), sep = "-")
head(regions)
#>        gene  pc
#> 1 100131434 PC1
#> 2 100133941 PC1
#> 3 100133957 PC1
#> 4     10084 PC1
#> 5      1069 PC1
#> 6     10857 PC1

# number of genes that have been summarized
length(unique(regions$gene))
#> [1] 142

# how many of each PC did we get
table(regions$pc)
#> 
#> PC1 
#> 142

We have summarized each of our genes using just one PC. The number of PCs depends on three main factors: the number of samples, the number of CpGs in the gene region, and the noise in the methylation data.

By default, the compute_regional_pcs function uses the Gavish-Donoho method. However, we can also use the Marcenko-Pasteur method by setting the pc_method parameter:

# Using Marcenko-Pasteur method
mp_res <- compute_regional_pcs(mvals, region_map, pc_method = "mp")
#> Using Marchenko-Pastur method

# select the regional pcs
mp_regional_pcs <- mp_res$regional_pcs

# separate the genes from the pc numbers
mp_regions <- data.frame(gene_pc = rownames(mp_regional_pcs)) |>
    separate(gene_pc, into = c("gene", "pc"), sep = "-")
head(mp_regions)
#>        gene  pc
#> 1 100131434 PC1
#> 2 100133941 PC1
#> 3 100133957 PC1
#> 4     10084 PC1
#> 5      1069 PC1
#> 6     10857 PC1

# number of genes that have been summarized
length(unique(mp_regions$gene))
#> [1] 142

# how many of each PC did we get
table(mp_regions$pc)
#> 
#> PC1 
#> 142

The Marcenko-Pasteur and the Gavish-Donoho methods are both based on Random Matrix Theory, and they aim to identify the number of significant PCs that capture the true signal in the data and not just the noise. However, these methods differ in how they select the number of significant PCs. The Marcenko-Pasteur method typically selects more PCs to represent a gene region compared to the Gavish-Donoho method. This may be due to the different ways in which the two methods estimate the noise level in the data.

Ultimately, the choice between the two methods depends on the specific needs and goals of the analysis. The Gavish-Donoho method tends to provide more conservative results, while the Marcenko-Pasteur method may capture more of the underlying signal in the data. Researchers should carefully consider their objectives and the characteristics of their data when selecting a method.

4 Session Information

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2           
#>  [2] GenomicFeatures_1.58.0                            
#>  [3] AnnotationDbi_1.68.0                              
#>  [4] minfiData_0.51.0                                  
#>  [5] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
#>  [6] IlluminaHumanMethylation450kmanifest_0.4.0        
#>  [7] minfi_1.52.0                                      
#>  [8] bumphunter_1.48.0                                 
#>  [9] locfit_1.5-9.10                                   
#> [10] iterators_1.0.14                                  
#> [11] foreach_1.5.2                                     
#> [12] Biostrings_2.74.0                                 
#> [13] XVector_0.46.0                                    
#> [14] SummarizedExperiment_1.36.0                       
#> [15] Biobase_2.66.0                                    
#> [16] MatrixGenerics_1.18.0                             
#> [17] matrixStats_1.4.1                                 
#> [18] dplyr_1.1.4                                       
#> [19] tibble_3.2.1                                      
#> [20] tidyr_1.3.1                                       
#> [21] GenomicRanges_1.58.0                              
#> [22] GenomeInfoDb_1.42.0                               
#> [23] IRanges_2.40.0                                    
#> [24] S4Vectors_0.44.0                                  
#> [25] BiocGenerics_0.52.0                               
#> [26] regionalpcs_1.4.0                                 
#> [27] BiocStyle_2.34.0                                  
#> 
#> loaded via a namespace (and not attached):
#>   [1] splines_4.4.1             BiocIO_1.16.0            
#>   [3] bitops_1.0-9              preprocessCore_1.68.0    
#>   [5] XML_3.99-0.17             lifecycle_1.0.4          
#>   [7] lattice_0.22-6            MASS_7.3-61              
#>   [9] base64_2.0.2              scrime_1.3.5             
#>  [11] magrittr_2.0.3            limma_3.62.0             
#>  [13] sass_0.4.9                rmarkdown_2.28           
#>  [15] jquerylib_0.1.4           yaml_2.3.10              
#>  [17] doRNG_1.8.6               askpass_1.2.1            
#>  [19] cowplot_1.1.3             DBI_1.2.3                
#>  [21] RColorBrewer_1.1-3        abind_1.4-8              
#>  [23] zlibbioc_1.52.0           quadprog_1.5-8           
#>  [25] purrr_1.0.2               RCurl_1.98-1.16          
#>  [27] GenomeInfoDbData_1.2.13   ggrepel_0.9.6            
#>  [29] RMTstat_0.3.1             irlba_2.3.5.1            
#>  [31] rentrez_1.2.3             genefilter_1.88.0        
#>  [33] annotate_1.84.0           dqrng_0.4.1              
#>  [35] DelayedMatrixStats_1.28.0 codetools_0.2-20         
#>  [37] DelayedArray_0.32.0       xml2_1.3.6               
#>  [39] tidyselect_1.2.1          UCSC.utils_1.2.0         
#>  [41] ScaledMatrix_1.14.0       beanplot_1.3.1           
#>  [43] illuminaio_0.48.0         GenomicAlignments_1.42.0 
#>  [45] jsonlite_1.8.9            multtest_2.62.0          
#>  [47] survival_3.7-0            tools_4.4.1              
#>  [49] Rcpp_1.0.13               glue_1.8.0               
#>  [51] SparseArray_1.6.0         xfun_0.48                
#>  [53] HDF5Array_1.34.0          withr_3.0.2              
#>  [55] BiocManager_1.30.25       fastmap_1.2.0            
#>  [57] rhdf5filters_1.18.0       fansi_1.0.6              
#>  [59] openssl_2.2.2             digest_0.6.37            
#>  [61] rsvd_1.0.5                R6_2.5.1                 
#>  [63] colorspace_2.1-1          RSQLite_2.3.7            
#>  [65] utf8_1.2.4                generics_0.1.3           
#>  [67] data.table_1.16.2         rtracklayer_1.66.0       
#>  [69] httr_1.4.7                S4Arrays_1.6.0           
#>  [71] pkgconfig_2.0.3           gtable_0.3.6             
#>  [73] blob_1.2.4                siggenes_1.80.0          
#>  [75] htmltools_0.5.8.1         bookdown_0.41            
#>  [77] scales_1.3.0              png_0.1-8                
#>  [79] knitr_1.48                tzdb_0.4.0               
#>  [81] reshape2_1.4.4            rjson_0.2.23             
#>  [83] nlme_3.1-166              curl_5.2.3               
#>  [85] cachem_1.1.0              rhdf5_2.50.0             
#>  [87] stringr_1.5.1             restfulr_0.0.15          
#>  [89] GEOquery_2.74.0           pillar_1.9.0             
#>  [91] grid_4.4.1                reshape_0.8.9            
#>  [93] vctrs_0.6.5               BiocSingular_1.22.0      
#>  [95] beachmat_2.22.0           xtable_1.8-4             
#>  [97] evaluate_1.0.1            readr_2.1.5              
#>  [99] cli_3.6.3                 compiler_4.4.1           
#> [101] Rsamtools_2.22.0          rlang_1.1.4              
#> [103] crayon_1.5.3              rngtools_1.5.2           
#> [105] nor1mix_1.3-3             mclust_6.1.1             
#> [107] plyr_1.8.9                stringi_1.8.4            
#> [109] BiocParallel_1.40.0       munsell_0.5.1            
#> [111] PCAtools_2.18.0           Matrix_1.7-1             
#> [113] hms_1.1.3                 sparseMatrixStats_1.18.0 
#> [115] bit64_4.5.2               ggplot2_3.5.1            
#> [117] Rhdf5lib_1.28.0           KEGGREST_1.46.0          
#> [119] statmod_1.5.0             memoise_2.0.1            
#> [121] bslib_0.8.0               bit_4.5.0