MethReg 1.4.0
Transcription factors (TFs) are proteins that facilitate the transcription of DNA into RNA. A number of recent studies have observed that the binding of TFs onto DNA can be affected by DNA methylation, and in turn, DNA methylation can also be added or removed by proteins associated with transcription factors (Bonder et al. 2017; Banovich et al. 2014; Zhu, Wang, and Qian 2016).
To provide functional annotations for differentially methylated regions (DMRs)
and differentially methylated CpG sites (DMS), MethReg
performs integrative
analyses using matched DNA methylation and gene expression along with
Transcription Factor Binding Sites (TFBS) data. MethReg evaluates, prioritizes
and annotates DNA methylation regions (or sites) with high regulatory potential
that works synergistically with TFs to regulate target gene expressions,
without any additional ChIP-seq data.
The results from MethReg
can be used to generate testable hypothesis on the
synergistic collaboration of DNA methylation changes and TFs in gene regulation.
MethReg
can be used either to evaluate regulatory potentials of candidate
regions or to search for methylation coupled TF regulatory processes in the entire genome.
MethReg
is a Bioconductor package and can be installed through BiocManager::install()
.
if (!"BiocManager" %in% rownames(installed.packages()))
install.packages("BiocManager")
BiocManager::install("MethReg", dependencies = TRUE)
After the package is installed, it can be loaded into R workspace by
library(MethReg)
The figure below illustrates the workflow for MethReg. Given matched array DNA methylation data and RNA-seq gene expression data, MethReg additionally incorporates TF binding information from ReMap2020 (Chèneby et al. 2019) or the JASPAR2020 (Baranasic 2020; Fornes et al. 2020) database, and optionally additional TF-target gene interaction databases, to perform both promoter and distal (enhancer) analysis.
In the unsupervised mode, MethReg analyzes all CpGs on the Illumina arrays. In the supervised mode, MethReg analyzes and prioritizes differentially methylated CpGs identified in EWAS.
There are three main steps: (1) create a dataset with triplets of CpGs, TFs that bind near the CpGs, and putative target genes, (2) for each triplet (CpG, TF, target gene), apply integrative statistical models to DNA methylation, target gene expression, and TF expression values, and (3) visualize and interpret results from statistical models to estimate individual and joint impacts of DNA methylation and TF on target gene expression, as well as annotate the roles of TF and CpG methylation in each triplet.
The results from the statistical models will also allow us to identify a list of CpGs that work synergistically with TFs to influence target gene expression.
For illustration, we will use chromosome 21 data from 38 TCGA-COAD (colon cancer) samples.
The DNA methylation dataset is a matrix or SummarizedExperiment object with methylation beta or M-values (The samples are in the columns and methylation regions or probes are in the rows). If there are potential confounding factors (e.g. batch effect, age, sex) in the dataset, this matrix would contain residuals from fitting linear regression instead (see details Section 5 “Controlling effects from confounding variables” below).
We will analyze all CpGs on chromosome 21 in this vignette.
However, oftentimes, the methylation data can also be, for example, differentially methylated sites (DMS) or differentially methylated regions (DMRs) obtained in an epigenome-wide association study (EWAS) study.
data("dna.met.chr21")
dna.met.chr21[1:5,1:5]
#> TCGA-3L-AA1B-01A TCGA-4N-A93T-01A TCGA-4T-AA8H-01A TCGA-5M-AAT4-01A TCGA-5M-AAT5-01A
#> cg00002080 0.6454046 0.5933725 0.54955509 0.81987982 0.79171160
#> cg00004533 0.9655396 0.9640490 0.96690671 0.95510446 0.96061252
#> cg00009944 0.5437705 0.2803064 0.42918909 0.60734630 0.47555585
#> cg00025591 0.4021317 0.7953653 0.41816364 0.33241304 0.67251468
#> cg00026030 0.1114705 0.1012902 0.06834467 0.08594876 0.06715677
We will first create a SummarizedExperiment object with the function
make_dnam_se
. This function will use the Sesame R/Bioconductor package
to map the array probes into genomic regions. You cen set human genome version
(hg38 or hg19) and the array type (“450k” or “EPIC”)
dna.met.chr21.se <- make_dnam_se(
dnam = dna.met.chr21,
genome = "hg38",
arrayType = "450k",
betaToM = FALSE, # transform beta to m-values
verbose = FALSE # hide informative messages
)
dna.met.chr21.se
#> class: RangedSummarizedExperiment
#> dim: 2918 38
#> metadata(2): genome arrayType
#> assays(1): ''
#> rownames(2918): chr21:10450634-10450635 chr21:10520974-10520975 ...
#> chr21:46670216-46670217 chr21:46670596-46670597
#> rowData names(53): address_A address_B ... MASK_general probeID
#> colnames(38): TCGA-3L-AA1B-01A TCGA-4N-A93T-01A ... TCGA-A6-5656-01B TCGA-A6-5657-01A
#> colData names(1): samples
SummarizedExperiment::rowRanges(dna.met.chr21.se)[1:4,1:4]
#> GRanges object with 4 ranges and 4 metadata columns:
#> seqnames ranges strand | address_A address_B channel
#> <Rle> <IRanges> <Rle> | <integer> <integer> <character>
#> chr21:10450634-10450635 chr21 10450634-10450635 * | 74716393 <NA> Both
#> chr21:10520974-10520975 chr21 10520974-10520975 * | 29756401 20622400 Red
#> chr21:10521044-10521045 chr21 10521044-10521045 * | 15617483 <NA> Both
#> chr21:10521122-10521123 chr21 10521122-10521123 * | 33810384 37781360 Grn
#> designType
#> <character>
#> chr21:10450634-10450635 II
#> chr21:10520974-10520975 I
#> chr21:10521044-10521045 II
#> chr21:10521122-10521123 I
#> -------
#> seqinfo: 26 sequences from an unspecified genome; no seqlengths
Differentially Methylated Regions (DMRs) associated with phenotypes such
as tumor stage can be obtained from R packages such as
coMethDMR
, comb-p
, DMRcate
and many others.
The methylation levels in multiple CpGs within the DMRs need to be
summarized (e.g. using medians), then the analysis for
DMR will proceed in the same way
as those for CpGs.
The gene expression dataset is a matrix with log2 transformed and normalized gene expression values. If there are potential confounding factors (e.g. batch effect, age, sex) in the dataset, this matrix can also contain residuals from linear regression instead (see Section 6 “Controlling effects from confounding variables” below).
The samples are in the columns and the genes are in the rows.
data("gene.exp.chr21.log2")
gene.exp.chr21.log2[1:5,1:5]
#> TCGA-3L-AA1B-01A TCGA-4N-A93T-01A TCGA-4T-AA8H-01A TCGA-5M-AAT4-01A
#> ENSG00000141956 14.64438 14.65342 14.09232 14.60680
#> ENSG00000141959 19.33519 20.03720 19.76128 19.57854
#> ENSG00000142149 17.27832 16.02392 18.16079 15.84463
#> ENSG00000142156 20.38689 18.83080 18.02720 18.91380
#> ENSG00000142166 17.89172 18.06625 18.47187 17.40467
#> TCGA-5M-AAT5-01A
#> ENSG00000141956 14.58640
#> ENSG00000141959 18.27442
#> ENSG00000142149 14.79654
#> ENSG00000142156 18.71926
#> ENSG00000142166 16.71412
We will also create a SummarizedExperiment object for the gene expression data. This object will contain the genomic information for each gene.
gene.exp.chr21.se <- make_exp_se(
exp = gene.exp.chr21.log2,
genome = "hg38",
verbose = FALSE
)
gene.exp.chr21.se
#> class: RangedSummarizedExperiment
#> dim: 752 38
#> metadata(1): genome
#> assays(1): ''
#> rownames(752): ENSG00000141956 ENSG00000141959 ... ENSG00000281420 ENSG00000281903
#> rowData names(2): ensembl_gene_id external_gene_name
#> colnames(38): TCGA-3L-AA1B-01A TCGA-4N-A93T-01A ... TCGA-A6-5656-01B TCGA-A6-5657-01A
#> colData names(1): samples
SummarizedExperiment::rowRanges(gene.exp.chr21.se)[1:5,]
#> GRanges object with 5 ranges and 2 metadata columns:
#> seqnames ranges strand | ensembl_gene_id external_gene_name
#> <Rle> <IRanges> <Rle> | <character> <character>
#> ENSG00000141956 chr21 41798225-41879482 - | ENSG00000141956 PRDM15
#> ENSG00000141959 chr21 44300051-44327376 + | ENSG00000141959 PFKL
#> ENSG00000142149 chr21 31873020-32044633 + | ENSG00000142149 HUNK
#> ENSG00000142156 chr21 45981769-46005050 + | ENSG00000142156 COL6A1
#> ENSG00000142166 chr21 33324429-33359864 + | ENSG00000142166 IFNAR1
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
In this section, regions refer to the regions where CpGs are located.
The function create_triplet_distance_based
provides three different methods to
link a region to a target gene:
target.method = "genes.promoter.overlap"
)target.method = "nearby.genes"
) (Silva et al. 2019).target.method = "window"
) (Reese et al. 2019).For the analysis of probes in gene promoter region, we recommend setting
method = "genes.promoter.overlap"
, or
method = "closest.gene"
.
For the analysis of probes in distal regions, we recommend setting either
method = "window"
or method = "nearby.genes"
.
Note that the distal analysis will be more time and resource consuming.
To link regions to TF using JASPAR2020, MethReg uses motifmatchr
(Schep 2020) to scan
these regions for occurrences of motifs in the database. JASPAR2020 is an
open-access database of curated, non-redundant transcription
factor (TF)-binding profiles (Baranasic 2020; Fornes et al. 2020), which contains
more the 500 human TF motifs.
The argument motif.search.window.size
will be used to extend the region when scanning
for the motifs, for example, a motif.search.window.size
of 50
will add 25
bp
upstream and 25
bp downstream of the original region.
As an example, the following scripts link CpGs with the probes in gene promoter region (method 1. above)
triplet.promoter <- create_triplet_distance_based(
region = dna.met.chr21.se,
target.method = "genes.promoter.overlap",
genome = "hg38",
target.promoter.upstream.dist.tss = 2000,
target.promoter.downstream.dist.tss = 2000,
motif.search.window.size = 500,
motif.search.p.cutoff = 1e-08,
cores = 1
)
Alternatively, we can also link each probe with genes within \(500 kb\) window (method 2).
# Map probes to genes within 500kb window
triplet.distal.window <- create_triplet_distance_based(
region = dna.met.chr21.se,
genome = "hg38",
target.method = "window",
target.window.size = 500 * 10^3,
target.rm.promoter.regions.from.distal.linking = TRUE,
motif.search.window.size = 500,
motif.search.p.cutoff = 1e-08,
cores = 1
)
For method 3, to map probes to 5 nearest upstream and downstream genes:
# Map probes to 5 genes upstream and 5 downstream
triplet.distal.nearby.genes <- create_triplet_distance_based(
region = dna.met.chr21.se,
genome = "hg38",
target.method = "nearby.genes",
target.num.flanking.genes = 5,
target.window.size = 500 * 10^3,
target.rm.promoter.regions.from.distal.linking = TRUE,
motif.search.window.size = 500,
motif.search.p.cutoff = 1e-08,
cores = 1
)
Instead of using JASPAR2020 motifs, we will be using REMAP2020 catalogue of
TF peaks which can be access using the package ReMapEnrich
.
if (!"BiocManager" %in% rownames(installed.packages()))
install.packages("BiocManager")
BiocManager::install("remap-cisreg/ReMapEnrich", dependencies = TRUE)
To download REMAP2020 catalogue (~1Gb) the following functions are used:
library(ReMapEnrich)
remapCatalog2018hg38 <- downloadRemapCatalog("/tmp/", assembly = "hg38")
remapCatalog <- bedToGranges(remapCatalog2018hg38)
The function create_triplet_distance_based
will accept any Granges with TF
information in the same format as the remapCatalog
one.
#-------------------------------------------------------------------------------
# Triplets promoter using remap
#-------------------------------------------------------------------------------
triplet.promoter.remap <- create_triplet_distance_based(
region = dna.met.chr21.se,
genome = "hg19",
target.method = "genes.promoter.overlap",
TF.peaks.gr = remapCatalog,
motif.search.window.size = 500,
max.distance.region.target = 10^6,
)
The human regulons from the dorothea database will be used as an example:
if (!"BiocManager" %in% rownames(installed.packages()))
install.packages("BiocManager")
BiocManager::install("dorothea", dependencies = TRUE)
regulons.dorothea <- dorothea::dorothea_hs
regulons.dorothea %>% head
#> # A tibble: 6 × 4
#> tf confidence target mor
#> <chr> <chr> <chr> <dbl>
#> 1 ADNP D ATF7IP 1
#> 2 ADNP D DYRK1A 1
#> 3 ADNP D TLK1 1
#> 4 ADNP D ZMYM4 1
#> 5 ADNP D ABCC1 1
#> 6 ADNP D ABCC6 1
Using the regulons, you can calculate enrichment scores for each TF across all samples using dorothea and viper.
rnaseq.tf.es <- get_tf_ES(
exp = gene.exp.chr21.se %>% SummarizedExperiment::assay(),
regulons = regulons.dorothea
)
Finally, triplets can be identified using TF-target from regulon databases with the function create_triplet_regulon_based
.
triplet.regulon <- create_triplet_regulon_based(
region = dna.met.chr21.se,
genome = "hg38",
motif.search.window.size = 500,
tf.target = regulons.dorothea,
max.distance.region.target = 10^6 # 1Mbp
)
triplet.regulon %>% head
#> # A tibble: 6 × 7
#> regionID target_symbol target TF_symbol TF confidence distance_region…
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 chr21:29073653-29073654 CCT8 ENSG00000156261 ALX3 ENSG0… E 142
#> 2 chr21:29073731-29073732 CCT8 ENSG00000156261 ALX3 ENSG0… E 64
#> 3 chr21:29073853-29073854 CCT8 ENSG00000156261 ALX3 ENSG0… E -55
#> 4 chr21:29073902-29073903 CCT8 ENSG00000156261 ALX3 ENSG0… E -104
#> 5 chr21:29073917-29073918 CCT8 ENSG00000156261 ALX3 ENSG0… E -119
#> 6 chr21:29073920-29073921 CCT8 ENSG00000156261 ALX3 ENSG0… E -122
The triplet is a data frame with the following columns:
target
: gene identifier (obtained from row names of the gene expression matrix),regionID
: region/CpG identifier (obtained from row names of the DNA methylation matrix)TF
: gene identifier (obtained from the row names of the gene expression matrix)str(triplet.promoter)
#> tibble [3,707 × 7] (S3: tbl_df/tbl/data.frame)
#> $ regionID : chr [1:3707] "chr21:10521553-10521554" "chr21:10521553-10521554" "chr21:10521553-10521554" "chr21:14063939-14063940" ...
#> $ probeID : chr [1:3707] "cg05437132" "cg05437132" "cg05437132" "cg25507885" ...
#> $ target_symbol : chr [1:3707] "TPTE" "TPTE" "TPTE" "ANKRD20A18P" ...
#> $ target : chr [1:3707] "ENSG00000274391" "ENSG00000274391" "ENSG00000274391" "ENSG00000249493" ...
#> $ TF_symbol : chr [1:3707] "KLF5" "TFE3" "KLF15" "THAP1" ...
#> $ TF : chr [1:3707] "ENSG00000102554" "ENSG00000068323" "ENSG00000163884" "ENSG00000131931" ...
#> $ distance_region_target_tss: num [1:3707] 0 0 0 -384 -83 ...
triplet.promoter$distance_region_target_tss %>% range
#> [1] -1999 1989
triplet.promoter %>% head
#> # A tibble: 6 × 7
#> regionID probeID target_symbol target TF_symbol TF distance_region…
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 chr21:10521553-10521554 cg05437132 TPTE ENSG00000274391 KLF5 ENSG0… 0
#> 2 chr21:10521553-10521554 cg05437132 TPTE ENSG00000274391 TFE3 ENSG0… 0
#> 3 chr21:10521553-10521554 cg05437132 TPTE ENSG00000274391 KLF15 ENSG0… 0
#> 4 chr21:14063939-14063940 cg25507885 ANKRD20A18P ENSG00000249493 THAP1 ENSG0… -384
#> 5 chr21:14070786-14070787 cg16038510 RNA5SP488 ENSG00000201812 ZNF354C ENSG0… -83
#> 6 chr21:14071916-14071917 cg24803637 RNA5SP488 ENSG00000201812 ZNF354C ENSG0… 1044
Note that there may be multiple rows for a CpG region, when multiple target gene and/or TFs are found close to it.
Because TF binding to DNA can be influenced by (or influences) DNA methylation levels nearby (Yin et al. 2017), target gene expression levels are often resulted from the synergistic effects of both TF and DNA methylation. In other words, TF activities in gene regulation is often affected by DNA methylation.
Our goal then is to highlight DNA methylation regions (or CpGs) where these synergistic DNAm and TF collaborations occur. We will perform analyses using the 3 datasets described above in Section 3:
The function interaction_model
assess the regulatory impact of
DNA methylation on TF regulation of target genes via the following approach:
considering DNAm values as a binary variable - we define a binary variable
DNAm Group
for DNA methylation values (high = 1, low = 0).
That is, samples with the highest DNAm levels (top 25 percent) has high = 1,
samples with lowest DNAm levels (bottom 25 pecent) has high = 0.
Note that in this implementation, only samples with DNAm values in the first and last quartiles are considered.
\[log_2(RNA target) \sim log_2(TF) + \text{DNAm Group} + log_2(TF) * \text{DNAm Group}\]
results.interaction.model <- interaction_model(
triplet = triplet.promoter,
dnam = dna.met.chr21.se,
exp = gene.exp.chr21.se,
sig.threshold = 0.05,
fdr = TRUE,
stage.wise.analysis = FALSE,
filter.correlated.tf.exp.dnam = TRUE,
filter.triplet.by.sig.term = FALSE
)
The output of interaction_model
function will be a data frame with the following variables:
<variable>_pvalue
: p-value for a tested variable (methylation or TF), given the other variables included in the model.<variable>_estimate
: estimated effect for a variable. If estimate > 0, increasing values
of the variable corresponds to increased outcome values (target gene expression).
If estimate < 0, increasing values of the variable correspond to decreased target gene expression levels.The following columns are provided for the results of fitting quartile model to triplet data:
RLM_DNAmGroup_pvalue
: p-value for binary DNA methylation variableRLM_DNAmGroup_estimate
: estimated DNA methylation effectRLM_TF_pvalue
: p-value for TF expressionRLM_TF_estimate
: estimated TF effectRLM_DNAmGroup:TF_pvalue
: : p-value for DNA methylation by TF interactionRLM_DNAmGroup:TF_estimate
: estimated DNA methylation by TF interaction effect# Results for quartile model
results.interaction.model %>% dplyr::select(
c(1,4,5,grep("RLM",colnames(results.interaction.model)))
) %>% head
#> regionID target TF_symbol RLM_DNAmGroup_pvalue RLM_TF_pvalue
#> 1 chr21:42879066-42879067 ENSG00000160194 ETS2 0.2784371 0.9297654
#> 2 chr21:46286247-46286248 ENSG00000182362 ETS2 0.6538484 0.6641369
#> 3 chr21:46286307-46286308 ENSG00000182362 ETS2 0.6076798 0.9150336
#> RLM_DNAmGroup:TF_pvalue RLM_DNAmGroup_estimate RLM_TF_estimate RLM_DNAmGroup:TF_estimate
#> 1 0.3300744 -4.345629 0.01048215 0.1840699
#> 2 0.6902901 -8.366201 0.29472093 0.3496058
#> 3 0.6648676 -6.768267 -0.04482690 0.2711517
For triplets with significant \(log_2(TF) × DNAm\) interaction effect identified
above, we can further assess how gene regulation by TF changes when DNAm
is high or low. To this end, the function
stratified_model
fits two separate models (see below) to only
samples with the highest DNAm levels (top 25 percent), and then to
only samples with lowest DNAm levels (bottom 25 percent), separately.
\[\text{Stratified Model: } log_2(RNA target) \sim log_2(TF)\]
results.stratified.model <- stratified_model(
triplet = results.interaction.model,
dnam = dna.met.chr21.se,
exp = gene.exp.chr21.se
)
results.stratified.model %>% head
#> regionID probeID target_symbol target TF_symbol TF
#> 1 chr21:42879066-42879067 cg16830966 NDUFV3 ENSG00000160194 ETS2 ENSG00000157557
#> 2 chr21:46286247-46286248 cg20202505 YBEY ENSG00000182362 ETS2 ENSG00000157557
#> 3 chr21:46286307-46286308 cg02878701 YBEY ENSG00000182362 ETS2 ENSG00000157557
#> distance_region_target_tss met.IQR RLM_DNAmGroup_pvalue RLM_TF_pvalue RLM_DNAmGroup:TF_pvalue
#> 1 -576 0.004145209 0.2784371 0.9297654 0.3300744
#> 2 -93 0.002682854 0.6538484 0.6641369 0.6902901
#> 3 -33 0.019176040 0.6076798 0.9150336 0.6648676
#> RLM_DNAmGroup_estimate RLM_TF_estimate RLM_DNAmGroup:TF_estimate Model.quantile
#> 1 -4.345629 0.01048215 0.1840699 Robust Linear Model
#> 2 -8.366201 0.29472093 0.3496058 Robust Linear Model
#> 3 -6.768267 -0.04482690 0.2711517 Robust Linear Model
#> Target_gene_DNAm_high_vs_Target_gene_DNAm_low_wilcoxon_pvalue
#> 1 0.01132970
#> 2 0.03120901
#> 3 0.02574808
#> TF_DNAm_high_vs_TF_DNAm_low_wilcoxon_pvalue
#> 1 0.6231762
#> 2 0.5205229
#> 3 0.9698500
#> % of target genes not expressed in DNAm_low and DNAm_high DNAm_low_RLM_target_vs_TF_pvalue
#> 1 0 % 0.9390548
#> 2 0 % 0.4504486
#> 3 0 % 0.9232194
#> DNAm_low_RLM_target_vs_TF_estimate DNAm_high_RLM_target_vs_TF_pvalue
#> 1 0.009461516 0.2038023
#> 2 0.281132965 0.2450440
#> 3 -0.041198655 0.5952503
#> DNAm_high_RLM_target_vs_TF_estimate DNAm.effect TF.role
#> 1 0.1945520 NA NA
#> 2 0.6308065 NA NA
#> 3 0.2447033 NA NA
The functions plot_interaction_model
will create figures to visualize the data,
in a way that corresponds to the linear model we considered above.
It requires the output from the function interaction_model
(a dataframe),
the DNA methylation matrix and the gene expression matrix as input.
plots <- plot_interaction_model(
triplet.results = results.interaction.model[1,],
dnam = dna.met.chr21.se,
exp = gene.exp.chr21.se
)
plots
#> $`chr21:42879066-42879067_TF_ENSG00000157557_target_ENSG00000160194`
The first row of the figures shows pairwise associations between DNA methylation, TF and target gene expression levels.
The second row of the figures show how much TF activity on target gene expression levels vary by DNA methylation levels. When TF by methylation interaction is significant (Section 4.1), we expect the association between TF and target gene expression vary depending on whether DNA methylation is low or high.
In this example, when DNA methylation is low, target gene expression is relatively independent of the amount of TF available. On the other hand, when DNA methylation level is high, more abundant TF corresponds to increased gene expression (an activator TF). One possibility is that DNA methylation might enhance TF binding in this case. This is an example where DNA methylation and TF work synergistically to affect target gene expression.
While the main goal of MethReg is to prioritize methylation CpGs, also note that without stratifying by DNA methylation, the overall TF-target effects (p = 0.142) is not as significant as the association in stratified analysis in high methylation samples (p = 0.00508). This demonstrates that by additionally modeling DNA methylation, we can also nominate TF – target associations that might have been missed otherwise.
Shown below are some expected results from fitting Models 1 & 2 described in Section 4.1 above, depending on TF binding preferences. Please note that there can be more possible scenarios than those listed here, therefore, careful evaluation of the statistical models and visualization of data as described in Section 4 are needed to gain a good understanding of the multi-omics data.
Both gene expressions and DNA methylation levels can be affected by age, sex,
shifting in cell types, batch effects and other confounding (or covariate) variables.
In this section, we illustrate analysis workflow that reduces confounding effects,
by first extracting the residual data with the function get_residuals
,
before fitting the models discussed above in Section 4.
The get_residuals
function will use gene expression (or DNA methylation data)
and phenotype data as input. To remove confounding effects in gene expression data,
we use the get_residuals
function which extract residuals after fitting the
following model for gene expression data:
\[log_2(RNA target) \sim covariate_{1} + covariate_{2} + ... + covariate_{N}\]
or the following model for methylation data:
\[methylation.Mvalues \sim covariate_{1} + covariate_{2} + ... + covariate_{N}\]
data("gene.exp.chr21.log2")
data("clinical")
metadata <- clinical[,c("sample_type","gender")]
gene.exp.chr21.residuals <- get_residuals(gene.exp.chr21, metadata) %>% as.matrix()
gene.exp.chr21.residuals[1:5,1:5]
data("dna.met.chr21")
dna.met.chr21 <- make_se_from_dnam_probes(
dnam = dna.met.chr21,
genome = "hg38",
arrayType = "450k",
betaToM = TRUE
)
dna.met.chr21.residuals <- get_residuals(dna.met.chr21, metadata) %>% as.matrix()
dna.met.chr21.residuals[1:5,1:5]
The models described in Section 4.1 can then be applied to these residuals
data using the interaction_model
function:
results <- interaction_model(
triplet = triplet,
dnam = dna.met.chr21.residuals,
exp = gene.exp.chr21.residuals
)
This example shows how to use dorothea regulons and viper to calculate enrichment scores for each TF across all samples.
regulons.dorothea <- dorothea::dorothea_hs
regulons.dorothea %>% head
#> # A tibble: 6 × 4
#> tf confidence target mor
#> <chr> <chr> <chr> <dbl>
#> 1 ADNP D ATF7IP 1
#> 2 ADNP D DYRK1A 1
#> 3 ADNP D TLK1 1
#> 4 ADNP D ZMYM4 1
#> 5 ADNP D ABCC1 1
#> 6 ADNP D ABCC6 1
rnaseq.tf.es <- get_tf_ES(
exp = gene.exp.chr21.se %>% SummarizedExperiment::assay(),
regulons = regulons.dorothea
)
rnaseq.tf.es[1:4,1:4]
#> TCGA-3L-AA1B-01A TCGA-4N-A93T-01A TCGA-4T-AA8H-01A TCGA-5M-AAT4-01A
#> ENSG00000101126 0.5107344 -2.1708007 -1.4257370 -2.29950338
#> ENSG00000101544 -1.0332572 -0.2855890 0.8007206 0.14008977
#> ENSG00000139154 -0.9773648 0.2275618 0.9888562 -2.01317607
#> ENSG00000160224 0.2112202 -0.9044230 0.1509887 -0.01717518
regulons.dorothea <- dorothea::dorothea_hs
regulons.dorothea$tf <- MethReg:::map_symbol_to_ensg(
gene.symbol = regulons.dorothea$tf,
genome = "hg38"
)
regulons.dorothea$target <- MethReg:::map_symbol_to_ensg(
gene.symbol = regulons.dorothea$target,
genome = "hg38"
)
split_tibble <- function(tibble, col = 'col') tibble %>% split(., .[, col])
regulons.dorothea.list <- regulons.dorothea %>% na.omit() %>%
split_tibble('tf') %>%
lapply(function(x){x[[3]]})
library(GSVA)
rnaseq.tf.es.gsva <- gsva(
expr = gene.exp.chr21.se %>% SummarizedExperiment::assay(),
gset.idx.list = regulons.dorothea.list,
method = "gsva",
kcdf = "Gaussian",
abs.ranking = TRUE,
min.sz = 5,
max.sz = Inf,
parallel.sz = 1L,
mx.diff = TRUE,
ssgsea.norm = TRUE,
verbose = TRUE
)
sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_GB
#> [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] BSgenome.Hsapiens.UCSC.hg38_1.4.4 BSgenome_1.62.0
#> [3] rtracklayer_1.54.0 Biostrings_2.62.0
#> [5] XVector_0.34.0 GenomicRanges_1.46.0
#> [7] GenomeInfoDb_1.30.0 IRanges_2.28.0
#> [9] S4Vectors_0.32.0 MethReg_1.4.0
#> [11] sesameData_1.11.13 rmarkdown_2.11
#> [13] ExperimentHub_2.2.0 AnnotationHub_3.2.0
#> [15] BiocFileCache_2.2.0 dbplyr_2.1.1
#> [17] BiocGenerics_0.40.0 dplyr_1.0.7
#> [19] BiocStyle_2.22.0
#>
#> loaded via a namespace (and not attached):
#> [1] utf8_1.2.2 R.utils_2.11.0 tidyselect_1.1.1
#> [4] poweRlaw_0.70.6 RSQLite_2.2.8 AnnotationDbi_1.56.0
#> [7] grid_4.1.1 BiocParallel_1.28.0 munsell_0.5.0
#> [10] codetools_0.2-18 preprocessCore_1.56.0 withr_2.4.2
#> [13] colorspace_2.0-2 Biobase_2.54.0 filelock_1.0.2
#> [16] highr_0.9 knitr_1.36 pscl_1.5.5
#> [19] ggsignif_0.6.3 labeling_0.4.2 MatrixGenerics_1.6.0
#> [22] GenomeInfoDbData_1.2.7 farver_2.1.0 bit64_4.0.5
#> [25] vctrs_0.3.8 generics_0.1.1 xfun_0.27
#> [28] randomForest_4.6-14 R6_2.5.1 doParallel_1.0.16
#> [31] bitops_1.0-7 cachem_1.0.6 fgsea_1.20.0
#> [34] DelayedArray_0.20.0 assertthat_0.2.1 promises_1.2.0.1
#> [37] BiocIO_1.4.0 scales_1.1.1 gtable_0.3.0
#> [40] wheatmap_0.1.0 seqLogo_1.60.0 rlang_0.4.12
#> [43] splines_4.1.1 rstatix_0.7.0 broom_0.7.9
#> [46] BiocManager_1.30.16 yaml_2.2.1 reshape2_1.4.4
#> [49] abind_1.4-5 backports_1.2.1 httpuv_1.6.3
#> [52] tools_4.1.1 bookdown_0.24 ggplot2_3.3.5
#> [55] ellipsis_0.3.2 jquerylib_0.1.4 RColorBrewer_1.1-2
#> [58] DNAcopy_1.68.0 proxy_0.4-26 Rcpp_1.0.7
#> [61] plyr_1.8.6 progress_1.2.2 zlibbioc_1.40.0
#> [64] purrr_0.3.4 RCurl_1.98-1.5 prettyunits_1.1.1
#> [67] ggpubr_0.4.0 cowplot_1.1.1 sfsmisc_1.1-12
#> [70] SummarizedExperiment_1.24.0 haven_2.4.3 ggrepel_0.9.1
#> [73] motifmatchr_1.16.0 magrittr_2.0.1 data.table_1.14.2
#> [76] magick_2.7.3 openxlsx_4.2.4 matrixStats_0.61.0
#> [79] hms_1.1.1 mime_0.12 evaluate_0.14
#> [82] xtable_1.8-4 XML_3.99-0.8 rio_0.5.27
#> [85] readxl_1.3.1 gridExtra_2.3 compiler_4.1.1
#> [88] tibble_3.1.5 KernSmooth_2.23-20 crayon_1.4.1
#> [91] R.oo_1.24.0 htmltools_0.5.2 mgcv_1.8-38
#> [94] segmented_1.3-4 later_1.3.0 tzdb_0.1.2
#> [97] TFBSTools_1.32.0 tidyr_1.1.4 DBI_1.1.1
#> [100] MASS_7.3-54 rappdirs_0.3.3 Matrix_1.3-4
#> [103] bcellViper_1.29.0 car_3.0-11 readr_2.0.2
#> [106] cli_3.0.1 R.methodsS3_1.8.1 parallel_4.1.1
#> [109] forcats_0.5.1 pkgconfig_2.0.3 sesame_1.12.0
#> [112] GenomicAlignments_1.30.0 TFMPvalue_0.0.8 foreign_0.8-81
#> [115] dorothea_1.5.3 foreach_1.5.1 annotate_1.72.0
#> [118] bslib_0.3.1 DirichletMultinomial_1.36.0 JASPAR2020_0.99.10
#> [121] viper_1.28.0 stringr_1.4.0 digest_0.6.28
#> [124] pracma_2.3.3 CNEr_1.30.0 cellranger_1.1.0
#> [127] fastmatch_1.1-3 kernlab_0.9-29 restfulr_0.0.13
#> [130] curl_4.3.2 shiny_1.7.1 Rsamtools_2.10.0
#> [133] gtools_3.9.2 rjson_0.2.20 nlme_3.1-153
#> [136] lifecycle_1.0.1 jsonlite_1.7.2 carData_3.0-4
#> [139] fansi_0.5.0 pillar_1.6.4 lattice_0.20-45
#> [142] survival_3.2-13 KEGGREST_1.34.0 fastmap_1.1.0
#> [145] httr_1.4.2 GO.db_3.14.0 interactiveDisplayBase_1.32.0
#> [148] glue_1.4.2 zip_2.2.0 png_0.1-7
#> [151] iterators_1.0.13 BiocVersion_3.14.0 bit_4.0.4
#> [154] class_7.3-19 stringi_1.7.5 sass_0.4.0
#> [157] mixtools_1.2.0 blob_1.2.2 caTools_1.18.2
#> [160] memoise_2.0.0 e1071_1.7-9
Banovich, Nicholas E, Xun Lan, Graham McVicker, Bryce Van de Geijn, Jacob F Degner, John D Blischak, Julien Roux, Jonathan K Pritchard, and Yoav Gilad. 2014. “Methylation Qtls Are Associated with Coordinated Changes in Transcription Factor Binding, Histone Modifications, and Gene Expression Levels.” PLoS Genetics 10 (9).
Baranasic, Damir. 2020. JASPAR2020: Data Package for Jaspar Database (Version 2020). http://jaspar.genereg.net/.
Bonder, Marc Jan, René Luijk, Daria V Zhernakova, Matthijs Moed, Patrick Deelen, Martijn Vermaat, Maarten Van Iterson, et al. 2017. “Disease Variants Alter Transcription Factor Levels and Methylation of Their Binding Sites.” Nature Genetics 49 (1): 131.
Chèneby, Jeanne, Zacharie Ménétrier, Martin Mestdagh, Thomas Rosnet, Allyssa Douida, Wassim Rhalloussi, Aurélie Bergon, Fabrice Lopez, and Benoit Ballester. 2019. “ReMap 2020: a database of regulatory regions from an integrative analysis of Human and Arabidopsis DNA-binding sequencing experiments.” Nucleic Acids Research 48 (D1): D180–D188. https://doi.org/10.1093/nar/gkz945.
Fornes, Oriol, Jaime A Castro-Mondragon, Aziz Khan, Robin Van der Lee, Xi Zhang, Phillip A Richmond, Bhavi P Modi, et al. 2020. “JASPAR 2020: Update of the Open-Access Database of Transcription Factor Binding Profiles.” Nucleic Acids Research 48 (D1): D87–D92.
Reese, Sarah E, Cheng-Jian Xu, T Herman, Mi Kyeong Lee, Sinjini Sikdar, Carlos Ruiz-Arenas, Simon K Merid, et al. 2019. “Epigenome-Wide Meta-Analysis of Dna Methylation and Childhood Asthma.” Journal of Allergy and Clinical Immunology 143 (6): 2062–74.
Schep, Alicia. 2020. Motifmatchr: Fast Motif Matching in R.
Silva, Tiago C, Simon G Coetzee, Nicole Gull, Lijing Yao, Dennis J Hazelett, Houtan Noushmehr, De-Chen Lin, and Benjamin P Berman. 2019. “ELMER V. 2: An R/Bioconductor Package to Reconstruct Gene Regulatory Networks from Dna Methylation and Transcriptome Profiles.” Bioinformatics 35 (11): 1974–7.
Yin, Yimeng, Ekaterina Morgunova, Arttu Jolma, Eevi Kaasinen, Biswajyoti Sahu, Syed Khund-Sayeed, Pratyush K Das, et al. 2017. “Impact of Cytosine Methylation on Dna Binding Specificities of Human Transcription Factors.” Science 356 (6337).
Zhu, Heng, Guohua Wang, and Jiang Qian. 2016. “Transcription Factors as Readers and Effectors of Dna Methylation.” Nature Reviews Genetics 17 (9): 551–65.