Abstract
Differential expression for repeated measures (dream) uses a linear mixed model to increase power and decrease false positives for RNA-seq datasets with multiple measurements per individual. The analysis fits seamlessly into the widely used limma/voom workflow (Law et al. 2014). This tutorial assumes that the reader is familiar with the limma/voom workflow for RNA-seq.
First, process the raw count data using limma/voom.
library('variancePartition')
library('edgeR')
library('doParallel')
data(varPartDEdata)
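# Flag genes with CPM > 0.1 in at least 3 samples
# (isexpr is computed here but is not applied in the steps below)
isexpr = rowSums(cpm(countMatrix)>0.1) >= 3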
# Standard usage of limma/voom
genes = DGEList( countMatrix )
genes = calcNormFactors( genes )
design = model.matrix( ~ Disease, metadata)
vobj_tmp = voom( genes, design, plot=FALSE)
# Estimate the within-individual correlation from the first voom fit
dupcor <- duplicateCorrelation(vobj_tmp,design,block=metadata$Individual)
# Re-run voom with the duplicateCorrelation results
# in order to compute more accurate precision weights.
# If this second pass is skipped, use vobj_tmp from the first voom run instead
vobj = voom( genes, design, plot=FALSE, block=metadata$Individual, correlation=dupcor$consensus)
Limma has a built-in approach for analyzing repeated measures data using duplicateCorrelation. The model can handle a single random effect, and forces the magnitude of the random effect to be the same across all genes.
design = model.matrix( ~ Disease, metadata)
# Estimate a linear mixed model with a single variance component.
# The correlation is estimated for each gene,
dupcor <- duplicateCorrelation(vobj, design, block=metadata$Individual)
# but lmFit uses only the genome-wide consensus value for the random effect
fitDupCor <- lmFit(vobj, design, block=metadata$Individual, correlation=dupcor$consensus)
# Fit Empirical Bayes for moderated t-statistics
fitDupCor <- eBayes( fitDupCor )
The dream model operates directly on the results of voom. The only change compared to the standard limma workflow is to replace lmFit with dream.
cl <- makeCluster(4)
registerDoParallel(cl)
# The variable to be tested should be a fixed effect
form <- ~ Disease + (1|Individual)
# Get the contrast matrix for the hypothesis test
L = getContrast( vobj, form, metadata, "Disease1")
# Fit the dream model on each gene
# Apply the contrast matrix L for the hypothesis test
# By default, uses the Satterthwaite approximation for the hypothesis test
fitmm = dream( vobj, form, metadata, L)
## Projected memory usage: > 557.3 Mb
##
## Finished...
## Total: 675 s
# Fit Empirical Bayes for moderated t-statistics
fitmm = eBayes( fitmm )
# get results
topTable( fitmm )
##                                    logFC  AveExpr        t      P.Value    adj.P.Val        B
## ENST00000283033.5 gene=TXNDC11  1.555819 3.567624 38.95633 6.095309e-23 1.180296e-18 37.34041
## ENST00000257181.9 gene=PRPF38A  1.382756 4.398270 27.97235 1.294078e-19 1.252927e-15 32.37189
## ENST00000525790.1 gene=TDRKH    1.496308 3.184931 21.74514 3.993164e-17 2.577454e-13 27.93310
## ENST00000264485.5 gene=SLC4A4   1.415366 4.476664 21.06020 8.205843e-17 3.179977e-13 27.33874
## ENST00000421974.2 gene=ATP6V0E2 1.372569 3.478030 21.05961 8.211054e-17 3.179977e-13 27.33822
## ENST00000373277.4 gene=SH2D3C   1.383826 3.509569 20.86911 1.007020e-16 3.249990e-13 27.16846
## ENST00000295633.3 gene=FSTL1    1.387415 4.625435 20.11319 2.301911e-16 6.367744e-13 26.47499
## ENST00000339861.4 gene=SEMA4D   1.396601 4.186466 19.27069 5.980850e-16 1.447665e-12 25.66285
## ENST00000231454.1 gene=IL5      1.493454 1.789759 18.88682 9.355308e-16 2.012846e-12 25.27835
## ENST00000577031.1 gene=PAM16    1.460190 1.137609 18.72486 1.132609e-15 2.193183e-12 25.11330
Note that if no random effect is specified, dream() automatically uses lmFit().
For small datasets, the Kenward-Roger method can be more powerful, but it is substantially more computationally intensive.
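As a sketch of how the Kenward-Roger method can be requested, the call below passes ddf="Kenward-Roger" to dream(). It assumes the objects vobj, form, metadata and L created in the steps above. The object name fitmmKR and the restriction to the first 10 genes are illustrative choices made here to keep the run time short; they are not part of the original analysis.

```r
# Assumes vobj, form, metadata and L from the steps above.
# Kenward-Roger is slow, so only the first 10 genes are fit in this sketch.
fitmmKR = dream( vobj[1:10,], form, metadata, L, ddf="Kenward-Roger")
# Fit Empirical Bayes for moderated t-statistics
fitmmKR = eBayes( fitmmKR )
# get results
topTable( fitmmKR )
```

Dropping the row subset fits all genes, at a correspondingly higher computational cost.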
You can also perform a hypothesis test between two levels. Make sure to inspect your contrast matrix to confirm it is testing what you intend.
form <- ~ 0 + Disease + (1|Individual)
L = getContrast( vobj, form, metadata, c("Disease1", "Disease0"))
L
## Disease0 Disease1
## -1 1
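To see what such a contrast computes, the following base-R sketch may help. It is independent of dream: the toy data are invented for illustration, and the names y, fit, beta, Lvec and est are hypothetical. The model is fit without an intercept, so each coefficient is the mean of one level, and applying the vector (-1, 1) to the coefficients gives the difference between the two levels.

```r
# Toy data: 3 observations per level of a two-level factor (values are made up)
Disease <- factor(rep(c("0", "1"), each = 3))
y <- c(1.0, 1.2, 0.8, 2.1, 1.9, 2.0)

# With '~ 0 + Disease' each coefficient is the mean of one level
fit <- lm(y ~ 0 + Disease)
beta <- coef(fit)

# Contrast vector, matched to the coefficients by name
Lvec <- c(Disease0 = -1, Disease1 = 1)

# Estimated difference Disease1 - Disease0
est <- sum(Lvec[names(beta)] * beta)
est
```

Here the level means are 1.0 and 2.0, so est equals their difference. Swapping the signs in Lvec would test the opposite direction, which is why inspecting the contrast matrix is worthwhile.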
Multiple contrasts can be evaluated at the same time, in order to save computation time:
form <- ~ 0 + Disease + (1|Individual)
# define and then cbind contrasts
L1 = getContrast( vobj, form, metadata, "Disease0")
L2 = getContrast( vobj, form, metadata, "Disease1")
L = cbind(L1, L2)
# fit both contrasts
fit = dream( vobj[1:10,], form, metadata, L)
## Projected memory usage: > 300.2 Kb
##
## Finished...
## Total: 1 s
# empirical Bayes step
fiteb = eBayes( fit )
# extract results from first contrast
topTable( fiteb, coef="L1" )
##                                   logFC  AveExpr         t      P.Value    adj.P.Val        B
## ENST00000555834.1 gene=RPS6KL1 4.823537 5.272063 43.777882 5.747146e-36 5.747146e-35 72.05229
## ENST00000418210.2 gene=TMEM64  4.199542 4.715367 42.831568 1.378129e-35 6.890646e-35 71.19118
## ENST00000589123.1 gene=NFIC    5.470103 5.855335 39.321378 4.187655e-34 1.395885e-33 67.81729
## ENST00000248564.5 gene=GNG11   3.987147 4.511181 30.468065 1.006355e-29 2.515887e-29 57.75471
## ENST00000337859.6 gene=ZC3H15  3.748305 4.169523 25.511167 1.000455e-26 2.000910e-26 50.81188
## ENST00000263773.5 gene=FNBP4   4.895809 5.290530 11.051769 2.139521e-21 3.565869e-21 38.92088
## ENST00000456159.1 gene=MET     1.953973 2.458926 17.662430 9.717124e-21 1.388161e-20 36.86499
## ENST00000317802.7 gene=TSPYL6  3.802265 4.321189 10.576766 2.126037e-20 2.657547e-20 36.64481
## ENST00000360314.3 gene=CASS4   4.305729 4.633301  9.965925 4.442768e-19 4.936409e-19 33.60077
## ENST00000570099.1 gene=YPEL3   1.580904 2.063331  7.695021 7.951235e-14 7.951235e-14 21.17064
Dream and variancePartition share the same underlying linear mixed model framework. A variancePartition analysis can indicate important variables that should be included as fixed or random effects in the dream analysis.
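The call that produced the output below is not shown in this section. A minimal sketch of such an analysis might look like the following. It assumes that vobj and metadata are the objects created above and that both Individual and Disease are modeled as random effects; the formula and the name formVP are assumptions made for illustration, not taken from the text.

```r
# Assumed formula: model Individual and Disease as random effects
# so that each receives its own fraction of the variance
formVP <- ~ (1|Individual) + (1|Disease)
# Fit the variancePartition model for each gene and extract variance fractions
vp = fitExtractVarPartModel( vobj, formVP, metadata)
# Violin plot of the variance fractions, with columns sorted by contribution
plotVarPart( sortCols( vp ) )
```

The resulting vp object holds one row per gene and one column per variable, so vp$Individual gives the per-gene donor contribution.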
##
## Finished...
## Total: 481 s
In order to understand the empirical difference between dream and duplicateCorrelation, we can plot the \(-\log_{10}\) p-values from both methods.
The duplicateCorrelation method estimates a single variance term genome-wide even though the donor contribution of a particular gene can vary substantially from the genome-wide trend. Using a single genome-wide value for the donor component can reduce power and increase the false positive rate in a particular, reproducible way. Let \(\tau^2_g\) be the value of the donor component for gene \(g\) and \(\bar{\tau}^2\) be the genome-wide mean. For genes where \(\tau^2_g>\bar{\tau}^2\), using \(\bar{\tau}^2\) under-corrects for the donor component, which increases the false positive rate compared to using \(\tau^2_g\). Conversely, for genes where \(\tau^2_g<\bar{\tau}^2\), using \(\bar{\tau}^2\) over-corrects for the donor component, which decreases power. Increasing sample size does not overcome this issue. The dream method overcomes this issue by estimating a gene-specific \(\tau^2_g\).
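The direction of these errors can be illustrated with a small base-R calculation. This is a simplified sketch, not the dream or duplicateCorrelation implementation. It assumes that the tested variable differs between donors, that each donor contributes m repeated measures with unit total variance, and that the donor component is expressed as a within-donor correlation rho. The function var_donor_mean and all numeric values are hypothetical.

```r
# Variance of the mean of m equally correlated measurements from one donor,
# with unit total variance and within-donor correlation rho
var_donor_mean <- function(rho, m) (1 + (m - 1) * rho) / m

m <- 4                             # repeated measures per donor (illustrative)
rho_bar <- 0.4                     # genome-wide value (illustrative)
rho_g <- c(high = 0.8, low = 0.1)  # gene-specific values (illustrative)

# Ratio of the standard error assumed under rho_bar
# to the standard error implied by the gene-specific rho_g
se_ratio <- sqrt(var_donor_mean(rho_bar, m) / var_donor_mean(rho_g, m))
se_ratio
```

A ratio below 1 (the "high" gene) means the standard error is understated when the genome-wide value is used, which inflates test statistics. A ratio above 1 (the "low" gene) means it is overstated, which costs power. Neither ratio depends on the number of donors, in line with the statement that increasing sample size does not resolve the issue.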
Here, the \(-\log_{10}\) p-values from both methods are plotted and colored by the donor contribution estimated by variancePartition. The green value indicates \(\bar{\tau}^2\), while red and blue indicate higher and lower values, respectively. When only one variance component is used and the contrast matrix is simple, the effect of using dream versus duplicateCorrelation is determined by the comparison of \(\tau^2_g\) to \(\bar{\tau}^2\):
dream can increase the \(-\log_{10}\) p-value for genes with a lower donor component (i.e. \(\tau^2_g<\bar{\tau}^2\)) and decrease the \(-\log_{10}\) p-value for genes with a higher donor component (i.e. \(\tau^2_g>\bar{\tau}^2\)).
# Compare p-values and make plot
p1 = topTable(fitDupCor, coef="Disease1", number=Inf, sort.by="none")$P.Value
p2 = topTable(fitmm, number=Inf, sort.by="none")$P.Value
plotCompareP( p1, p2, vp$Individual, dupcor$consensus)
Note that using more variance components or a more complex contrast matrix can make this relationship less straightforward.
## R version 3.5.3 (2019-03-11)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] doParallel_1.0.14 iterators_1.0.10 edgeR_3.24.3
## [4] variancePartition_1.12.3 Biobase_2.42.0 BiocGenerics_0.28.0
## [7] scales_1.0.0 foreach_1.4.4 limma_3.38.3
## [10] ggplot2_3.1.0 pander_0.6.3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 locfit_1.5-9.1 lattice_0.20-38 prettyunits_1.0.2 gtools_3.8.1
## [6] assertthat_0.2.1 digest_0.6.18 R6_2.4.0 plyr_1.8.4 evaluate_0.13
## [11] pillar_1.3.1 gplots_3.0.1.1 rlang_0.3.2 progress_1.2.0 lazyeval_0.2.2
## [16] minqa_1.2.4 gdata_2.18.0 nloptr_1.2.1 Matrix_1.2-16 rmarkdown_1.12
## [21] labeling_0.3 splines_3.5.3 lme4_1.1-21 statmod_1.4.30 stringr_1.4.0
## [26] munsell_0.5.0 numDeriv_2016.8-1 compiler_3.5.3 xfun_0.5 pkgconfig_2.0.2
## [31] lmerTest_3.1-0 htmltools_0.3.6 tidyselect_0.2.5 tibble_2.1.1 codetools_0.2-16
## [36] crayon_1.3.4 dplyr_0.8.0.1 withr_2.1.2 MASS_7.3-51.1 bitops_1.0-6
## [41] grid_3.5.3 nlme_3.1-137 gtable_0.2.0 magrittr_1.5 KernSmooth_2.23-15
## [46] stringi_1.4.3 reshape2_1.4.3 colorRamps_2.3 boot_1.3-20 tools_3.5.3
## [51] glue_1.3.1 purrr_0.3.2 hms_0.4.2 pbkrtest_0.4-7 yaml_2.2.0
## [56] colorspace_1.4-1 caTools_1.17.1.2 knitr_1.22
Law, C. W., Y. Chen, W. Shi, and G. K. Smyth. 2014. “Voom: precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biology 15 (2):R29. https://doi.org/10.1186/gb-2014-15-2-r29.