OVESEG-test (One Versus Everyone Subtype Exclusively-expressed Genes test) is a statistically-principled multiple-group comparison method that can detect tissue/cell-specific marker genes (MGs) among many subtypes (e.g. tissue/cell types) (Chen et al. 2019). To assess the statistical significance of MGs, OVESEG-test uses a specifically designed test statistics that mathematically matches the definition of MGs, and employs a novel permutation scheme to estimate the corresponding distribution under null hypothesis where the expression patterns of non-MGs can be highly complex.
OVESEG
package provides functions to compute OVESEG-test statistics, derive component weights in the mixture null distribution model and estimate p-values from weightedly aggregated permutations. While OVESEG-test p-values can be used to detect MGs, component weights of hypotheses also help portrait all kinds of upregulation patterns among tissue/cell types.
The function OVESEGtest()
includes all necessary steps to obtain OVESEG-test p-values. You need to specify the expression matrix, tissue/cell type labels, number of permutations, and parallel option. For microarray data, the expressions need to be log2-transformed prior to the test. For RNAseq count data, the counts need to be transformed to logCPM (log2-counts per million) prior to the test and use limma::voom()
to incorporate the mean-variance relationship. Parallel option is set according to BiocParallel package.
#Microarray
rtest <- OVESEGtest(y, group, NumPerm=999,
BPPARAM=BiocParallel::SnowParam())
#RNAseq
yvoom <- limma::voom(count, model.matrix(~0+factor(group)))
rtest <- OVESEGtest(yvoom$E, group, weights=yvoom$weights, NumPerm = 999,
BPPARAM=BiocParallel::SnowParam())
Theoretically, OVESEG-test can be applied to any molecular expression data types as long as they can be modeled by normal distribution after a transformation, such as log2-transformation for microarray and proteomics data, logCPM for RNAseq counts, logit2-transformation for DNA methylation beta values. If mean-variance relationship exists after transformation, limma::vooma()/voom()
need to be performed firstly and the resulted weight matrix is used as the input of OVESEG
.
Requirements for the input expression data:
The input expression data should be stored in a matrix. Data frame, or SummarizedExperiment object is also accepted and will be internally coerced into a matrix format before analysis. Rows correspond to probes and columns to samples. Tissue/cell labels must match each column in expression matrix. Weight matrix, if provided, must match each spot in expression matrix.
We use a data set of purified B cells, CD4+ T cells and CD8+ T cells (downsampled from GSE28490) as an example to show OVESEG-test on microarray data.
library(OVESEG)
#Import data
data(RocheBT)
y <- RocheBT$y #5000*15 matrix
group <- RocheBT$group #15 labels
#filter low-expressed probes
groupMean <- sapply(levels(group), function(x) rowMeans(y[,group == x]))
groupMeanMax <- apply(groupMean, 1, max)
keep <- (groupMeanMax > log2(64))
y <- y[keep,]
#OVESEG-test
rtest1 <- OVESEGtest(y, group, NumPerm = 999,
BPPARAM=BiocParallel::SnowParam())
#> Calculating posterior probabilities of null hypotheses
#> Permuting top 2 groups
#> Permuting top 3 groups
#> Calculating p-values
#> Calculating p-values for each group specifically
Note there are many low-expressed probe filtering methods. Here we filtered the probes whose mean expression is less than a threshold even in the highest expressed group. If mean-variance relationship is obvious, we can consider to apply limma::vooma()
firstly.
#OVESEG-test
yvooma <- limma::vooma(y, model.matrix(~0+factor(group)))
rtest2 <- OVESEGtest(yvooma$E, group, weights=yvooma$weights, NumPerm = 999,
BPPARAM=BiocParallel::SnowParam())
#> Calculating posterior probabilities of null hypotheses
#> Permuting top 2 groups
#> Permuting top 3 groups
#> Calculating p-values
#> Calculating p-values for each group specifically
We use a data set of purified B cells, CD4+ T cells and CD8+ T cells (downsampled from GSE60424) as an example to show OVESEG-test on RNAseq count data.
library(OVESEG)
#Import data
data(countBT)
count <- countBT$count #10000*60 matrix
group <- countBT$group #60 labels
#filter low-expressed probes
groupMean <- sapply(levels(group), function(x) rowMeans(count[,group == x]))
groupMeanMax <- apply(groupMean, 1, max)
keep <- (groupMeanMax > 2)
count <- count[keep,]
#OVESEG-test
lib.size <- max(colSums(count))
yvoom <- limma::voom(count, model.matrix(~0+factor(group)),
lib.size = lib.size)
rtest3 <- OVESEGtest(yvoom$E, group, weights=yvoom$weights, NumPerm = 999,
BPPARAM=BiocParallel::SnowParam())
#> Calculating posterior probabilities of null hypotheses
#> Permuting top 2 groups
#> Permuting top 3 groups
#> Calculating p-values
#> Calculating p-values for each group specifically
Note there are many low-expressed probe filtering methods. Here we filtered the probes whose mean count is less than a threshold even in the highest expressed group. lib.size
and other parameters in limma::voom()
can be set manually according to limma package.
OVESEGtest
returns three p-values: pv.overall
, pv.oneside
and pv.multiside
. pv.overall
is calculated by all permutations regardless of upregulated subtypes. pv.oneside
is subtype-specific p-values calculated specifically for the upregulated subtype of each probe. pv.multiside
is pv.oneside
multiplied by K (K comparison correction) and truncated at 1. More details can be found in the paper (Chen et al. 2019).
OVESEG-test statistics are defined as \[\mathbf{\max_{k=1,...,K}\left\{min_{l \neq k}\left\{ \frac{\mu_k(j)-\mu_l(j)}{\sigma(j)\sqrt{\frac{1}{N_k}+\frac{1}{N_l}}} \right\}\right\}}\] where \(\mathbf{\mu_k(j)}\) is the mean of logarithmic expressions of gene j in subtype k. While it takes a long time to execute permutations for p-value estimation, OVESEGtstat()
is useful if users only need test statistics for ranking genes:
#OVESEG-test statistics
rtstat1 <- OVESEGtstat(y, RocheBT$group)
rtstat2 <- OVESEGtstat(yvooma$E, RocheBT$group, weights=yvooma$weights)
rtstat3 <- OVESEGtstat(yvoom$E, countBT$group, weights=yvoom$weights)
Null hypothesis of OVESEG-test is modeled as a mixture of multiple components, where the weights of components are estimated from posterior probabilities over all probes. Those posterior probabilities are also returned by OVESEGtest()
. If users only want to observe probewise posterior probabilities, postProbNull()
is helpful.
ppnull1 <- postProbNull(y, RocheBT$group)
ppnull2 <- postProbNull(yvooma$E, RocheBT$group, weights=yvooma$weights)
ppnull3 <- postProbNull(yvoom$E, countBT$group, weights=yvoom$weights)
By accumulating and normalizing probewise posterior probability with the function nullDistri()
, we can also obtain the probability of one subtype being upregulated conditioned on null hypotheses. The subtype with higher probability tends to get more False Positive MGs.
nullDistri(ppnull1)
#> B CD4 CD8
#> 0.2544411 0.3699901 0.3755688
nullDistri(ppnull2)
#> B CD4 CD8
#> 0.2551931 0.3689916 0.3758153
nullDistri(ppnull3)
#> B CD4 CD8
#> 0.2555459 0.3664028 0.3780512
There are totally \(\mathbf{2^K-1}\) types of expression patterns in a real dataset: probes exclusively expressed in 1,2,…, or K of subtypes. Probe unexpressed in any of subtypes should have been filtered during pre-processing. Accumulating and normalizing probewise posterior probability of null hypotheses and of alternative hypotheses using the function patternDistri()
can present us the ratios of all possible upregulation patterns among subtypes:
patterns <- patternDistri(ppnull1)
#or
patterns <- patternDistri(rtest1)
The ratios of patterns can be illustrated as following.
library(gridExtra)
library(grid)
library(reshape2)
library(ggplot2)
gridpatterns <- function (ppnull) {
K <- length(ppnull$label)
cellcomp <- patternDistri(ppnull)
cellcomp <- cellcomp[order(cellcomp[,K+1], decreasing = TRUE),]
#Bar Chart to show pattern percentage
DF1 <- data.frame(Rank = seq_len(2^K - 1), cellcomp)
p1 <- ggplot(DF1, aes(x = Rank, y = Wpattern)) +
geom_bar(stat = "identity") +
scale_y_continuous(labels=scales::percent) +
theme_bw(base_size = 16) +
theme(axis.text.x = element_blank(),
axis.title.x = element_blank(),
plot.margin=unit(c(1,1,-0.2,1), "cm"),
panel.grid.minor = element_line(size = 1),
panel.grid.major = element_line(size = 1),
panel.border = element_blank(),
axis.ticks.x = element_blank()) +
ylab("Percentage of up in certain subtype(s)")
#patterns as X-axis
DF2 <- data.frame(Rank = seq_len(2^K-1), cellcomp[,-(K+1)])
DF2 <- melt(DF2, id.var="Rank")
p2 <- ggplot(DF2, aes(x = Rank, y = value, fill = variable)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette="Set2") +
theme(line = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
title = element_blank(),
panel.background = element_rect(fill = "white"),
legend.position="bottom",
legend.key.size = unit(1.2,"line"),
plot.margin=unit(c(-0.2,1,1,1), "cm")) +
scale_y_reverse() +
guides(fill = guide_legend(nrow = 1, byrow = TRUE))
g1 <- ggplotGrob(p1)
g2 <- ggplotGrob(p2)
colnames(g1) <- paste0(seq_len(ncol(g1)))
colnames(g2) <- paste0(seq_len(ncol(g2)))
g <- gtable_combine(g1, g2, along=2)
panels <- g$layout$t[grepl("panel", g$layout$name)]
n1 <- length(g$heights[panels[1]])
n2 <- length(g$heights[panels[2]])
# assign new (relative) heights to the panels
# notice the *4 here to get different heights
g$heights[panels[1]] <- unit(n1*4,"null")
g$heights[panels[2]] <- unit(n2,"null")
return(g)
}
grid.newpage()
grid.draw(gridpatterns(ppnull1))
#or grid.draw(gridpatterns(rtest1))
Default variance estimator in OVESEG
is empirical Bayes moderated variance estimator used in limma (argument alpha="moderated"
). Another option is adding a constant \(\alpha\) to pooled variance estimator (argument alpha=
\(\alpha\)). Setting argument alpha=NULL
will treat all variances as the same constant.
Before MG detection, OVESEG-test p-values still need multiple comparison correction or false discovery rate control.
##multiple comparison correction
pv.overall.adj <- stats::p.adjust(rtest1$pv.overall, method="bonferroni")
pv.multiside.adj <- stats::p.adjust(rtest1$pv.multiside, method="bonferroni")
##fdr control
qv.overall <- fdrtool::fdrtool(rtest1$pv.overall, statistic="pvalue",
plot=FALSE, verbose=FALSE)$qval
qv.multiside <- fdrtool::fdrtool(rtest1$pv.multiside, statistic="pvalue",
plot=FALSE, verbose=FALSE)$qval
sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] grid stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] ggplot2_3.3.5 reshape2_1.4.4 gridExtra_2.3 OVESEG_1.10.0
#> [5] BiocStyle_2.22.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.1.1 xfun_0.27 bslib_0.3.1
#> [4] purrr_0.3.4 colorspace_2.0-2 vctrs_0.3.8
#> [7] generics_0.1.1 htmltools_0.5.2 snow_0.4-3
#> [10] yaml_2.2.1 utf8_1.2.2 rlang_0.4.12
#> [13] jquerylib_0.1.4 pillar_1.6.4 glue_1.4.2
#> [16] withr_2.4.2 DBI_1.1.1 BiocParallel_1.28.0
#> [19] RColorBrewer_1.1-2 lifecycle_1.0.1 plyr_1.8.6
#> [22] stringr_1.4.0 munsell_0.5.0 gtable_0.3.0
#> [25] evaluate_0.14 labeling_0.4.2 knitr_1.36
#> [28] fastmap_1.1.0 parallel_4.1.1 fdrtool_1.2.16
#> [31] fansi_0.5.0 highr_0.9 Rcpp_1.0.7
#> [34] scales_1.1.1 BiocManager_1.30.16 limma_3.50.0
#> [37] magick_2.7.3 jsonlite_1.7.2 farver_2.1.0
#> [40] digest_0.6.28 stringi_1.7.5 bookdown_0.24
#> [43] dplyr_1.0.7 tools_4.1.1 magrittr_2.0.1
#> [46] sass_0.4.0 tibble_3.1.5 crayon_1.4.1
#> [49] pkgconfig_2.0.3 ellipsis_0.3.2 assertthat_0.2.1
#> [52] rmarkdown_2.11 R6_2.5.1 compiler_4.1.1
Chen, Lulu, David Herrington, Robert Clarke, Guoqiang Yu, and Yue Wang. 2019. “Data-Driven Robust Detection of Tissue/Cell-Specific Markers.” bioRxiv. https://doi.org/10.1101/517961.