multistateQTL 1.2.0
multistateQTL
is a Bioconductor package for applying basic statistical tests (e.g., feature-wise FDR correction, calculating pairwise sharing), summarizing, and visualizing QTL summary statistics from multiple states (e.g., tissues, celltypes, environmental conditions). It works on the QTLExperiment
(QTLE
) object class, where rows represent features (e.g., genes, transcripts, genomic regions), columns represent states, and assays are the various summary statistics. It also provides wrapper implementations of a number of multi-test correction methods (e.g., mashr, meta-soft, etc), which result in a set of multi-test corrected summary statistics.
QTLExperiment and multistateQTL can be installed from GitHub:
if (!require("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install(c("QTLExperiment", "multistateQTL"), version="devel")
They are also available on GitHub:
devtools::install_git("https://github.com/dunstone-a/QTLExperiment", build_vignettes=TRUE)
devtools::install_git(https://github.com/dunstone-a/multistateQTL", build_vignettes=TRUE)
library(QTLExperiment)
library(multistateQTL)
Provided with real QTL summary statistics as either a QTLE
object or a named list with betas and error (and optionally pval or lfsr), key parameters are estimated that are used to simulate realistic multi-state QTL data. We demonstrate the parameter estimation function on publicly available summary statistics from GTEx (v8). Note that this data only contains tests that were called as significant by GTEx and vroom only loads the first chunk of results as it does not read .gz compressed objects well. This truncated dataset is used in the vignette for convenience, however, to estimate the default parameters in qtleParams()
we downloaded QTL summary statistics for all associations tested for the 10 GTEx tissues with the largest sample sizes from Google Cloud). To speed up calculations we filtered this to include only associations on chromosome 1 and considered significant tests with pval < 0.05 and null tests with a pval > 0.1.
See QTLExperiment for more info on the sumstats2qtle
function and for other approaches for reading in QTL summary statistics.
The parameters estimated here include:
rnorm
to sample an effect size for each QTL for each state. The variance parameter for rnorm
is user defined (default = 0.1).input_path <- system.file("extdata", package="multistateQTL")
state <- c("lung", "thyroid", "spleen", "blood")
input <- data.frame(
state=state,
path=paste0(input_path, "/GTEx_tx_", state, ".tsv"))
gtex <- sumstats2qtle(
input,
feature_id="molecular_trait_id",
variant_id="rsid",
betas="beta",
errors="se",
pvalues="pvalue",
verbose=TRUE)
gtex
## class: QTLExperiment
## dim: 1163 4
## metadata(0):
## assays(3): betas errors pvalues
## rownames(1163): ENST00000428771|rs554008981 ENST00000477976|rs554008981
## ... ENST00000445118|rs368254419 ENST00000483767|rs368254419
## rowData names(2): variant_id feature_id
## colnames(4): lung thyroid spleen blood
## colData names(1): state_id
head(betas(gtex))
## lung thyroid spleen blood
## ENST00000428771|rs554008981 -0.1733690 NA 0.134913 NA
## ENST00000477976|rs554008981 0.1616170 0.3173110 NA NA
## ENST00000483767|rs554008981 -0.4161480 -0.0483018 NA -0.204647
## ENST00000623070|rs554008981 -0.1137930 NA NA NA
## ENST00000669922|rs554008981 -0.1921680 -0.1067540 0.724622 -0.117424
## ENST00000428771|rs201055865 -0.0630909 NA NA NA
Estimating parameters:
params <- qtleEstimate(gtex, threshSig=0.05, threshNull=0.5)
params
## $cv.sig.shape
## [1] 7.070567
##
## $cv.sig.rate
## [1] 7.717952
##
## $cv.null.shape
## [1] 0.007593863
##
## $cv.null.rate
## [1] 1269.631
##
## $betas.sig.shape
## [1] 3.470341
##
## $betas.sig.rate
## [1] 16.27275
##
## $betas.null.shape
## [1] 1.360216
##
## $betas.null.rate
## [1] 11.03421
Looking at the distributions defined by these estimated parameters, the simulated effect sizes for significant QTL will tend to be larger, while the simulated coefficient of variation values will be smaller than for the non-significant QTL.
plotSimulationParams(params=params)
The default parameters available through qtleParams()
were estimated from the GTEx v8 tissue-level eQTL summary statistics from chromosome 1 using the 10 tissues with the largest sample sizes. From these data, significant QTL parameters were estimated from tests in the lowest p-value quantile, while null parameters were estimated from tests in the highest p-value quantile. Data for tests on chromosome 1 were included in all four tissues (n=32613).
The simulation tool allows for the simulation of four types of associations: (1) Global, where the simulated effect size is approximately equal across all states; (2) Unique, where the association is only significant in one state; (3) Multi-state, where the association is significant in a subset of states (i.e., state-groups), and (4) Null, where the association has no significant effects in any state. First each test is randomly assigned as one of the above types according to the proportions specified by the user. For multi-state QTL, each state is assigned to a state-group, either randomly or according to user defined groups, then each multi-state QTL is assigned randomly to one of the state-groups. For unique QTL, the QTL is randomly assigned to a single state.
Simulated mean effect sizes for all non-null QTL are sampled from gamma(beta.sig.shape, beta.sig.rate) and are randomly assigned a positive or negative effect direction. Then for each state where that QTL is significant, an effect size is sampled from N(mean effect size, σ), where σ is user defined (default=0.1). Effect sizes for null QTL are sampled from gamma(beta.null.shape, beta.null.rate) and are randomly assigned a positive or negative effect direction. Standard errors for each QTL for each state are simulated by sampling from gamma(cv.sig.shape, cv.sig.rate) or gamma(cv.null.shape, cv.null.rate) for significant and null QTL, respectively, and multiplying the sampled cv by the absolute value of the simulated beta for that QTL in that state.
Here is an example of a simple simulation with half of the simulated QTL tests having globally significant effects. This example uses the default parameters.
sim <- qtleSimulate(nTests=1000, nStates=6, global=0.5)
sim
## class: QTLExperiment
## dim: 1000 6
## metadata(0):
## assays(3): betas errors lfsrs
## rownames(1000): F0502|v65513 F0829|v28175 ... F0647|v42219 F0431|v72934
## rowData names(11): feature_id variant_id ... S5 S6
## colnames(6): S1 S2 ... S5 S6
## colData names(1): state_id
head(rowData(sim))
## DataFrame with 6 rows and 11 columns
## feature_id variant_id QTL id mean_beta
## <character> <character> <character> <character> <numeric>
## F0502|v65513 F0502 v65513 null F0502|v65513 0.000000
## F0829|v28175 F0829 v28175 global F0829|v28175 -0.646669
## F0680|v76569 F0680 v76569 global F0680|v76569 0.314879
## F0007|v72522 F0007 v72522 null F0007|v72522 0.000000
## F0637|v6704 F0637 v6704 global F0637|v6704 0.658535
## F0924|v60732 F0924 v60732 global F0924|v60732 0.660113
## S1 S2 S3 S4 S5 S6
## <logical> <logical> <logical> <logical> <logical> <logical>
## F0502|v65513 FALSE FALSE FALSE FALSE FALSE FALSE
## F0829|v28175 TRUE TRUE TRUE TRUE TRUE TRUE
## F0680|v76569 TRUE TRUE TRUE TRUE TRUE TRUE
## F0007|v72522 FALSE FALSE FALSE FALSE FALSE FALSE
## F0637|v6704 TRUE TRUE TRUE TRUE TRUE TRUE
## F0924|v60732 TRUE TRUE TRUE TRUE TRUE TRUE
We can also generate more complex simulations, for example this simulation has 20% global, 40% multi-state, 20% unique, and 20% null QTL effects, where multi-state effects are assigned to one of two state-groups.
sim <- qtleSimulate(
nStates=10, nFeatures=100, nTests=1000,
global=0.2, multi=0.4, unique=0.2, k=2)
Here is a snapshot of the simulation key for QTL simulated as unique to a single state:
head(rowData(subset(sim, QTL == "unique")))
## DataFrame with 6 rows and 16 columns
## feature_id variant_id QTL id mean_beta S01
## <character> <character> <character> <character> <numeric> <logical>
## F056|v68405 F056 v68405 unique F056|v68405 1.132321 FALSE
## F089|v18603 F089 v18603 unique F089|v18603 0.641141 FALSE
## F001|v67681 F001 v67681 unique F001|v67681 -0.337594 FALSE
## F067|v31085 F067 v31085 unique F067|v31085 -0.300979 FALSE
## F072|v52705 F072 v52705 unique F072|v52705 -0.309610 FALSE
## F061|v86447 F061 v86447 unique F061|v86447 0.441208 FALSE
## S02 S03 S04 S05 S06 S07
## <logical> <logical> <logical> <logical> <logical> <logical>
## F056|v68405 FALSE FALSE FALSE FALSE FALSE FALSE
## F089|v18603 FALSE FALSE FALSE FALSE FALSE TRUE
## F001|v67681 TRUE FALSE FALSE FALSE FALSE FALSE
## F067|v31085 FALSE FALSE FALSE FALSE FALSE FALSE
## F072|v52705 FALSE FALSE FALSE FALSE FALSE FALSE
## F061|v86447 TRUE FALSE FALSE FALSE FALSE FALSE
## S08 S09 S10 multistateGroup
## <logical> <logical> <logical> <character>
## F056|v68405 FALSE TRUE FALSE NA
## F089|v18603 FALSE FALSE FALSE NA
## F001|v67681 FALSE FALSE FALSE NA
## F067|v31085 FALSE TRUE FALSE NA
## F072|v52705 TRUE FALSE FALSE NA
## F061|v86447 FALSE FALSE FALSE NA
Here is a snapshot of the simulation key for QTL simulated as multi-state:
head(rowData(subset(sim, QTL == "multistate")))
## DataFrame with 6 rows and 16 columns
## feature_id variant_id QTL id mean_beta S01
## <character> <character> <character> <character> <numeric> <logical>
## F031|v89404 F031 v89404 multistate F031|v89404 0.531274 TRUE
## F081|v39490 F081 v39490 multistate F081|v39490 -0.976719 FALSE
## F015|v91714 F015 v91714 multistate F015|v91714 -0.306235 TRUE
## F014|v61533 F014 v61533 multistate F014|v61533 0.456947 FALSE
## F022|v3677 F022 v3677 multistate F022|v3677 -0.480573 FALSE
## F018|v16936 F018 v16936 multistate F018|v16936 -0.487362 FALSE
## S02 S03 S04 S05 S06 S07
## <logical> <logical> <logical> <logical> <logical> <logical>
## F031|v89404 FALSE FALSE TRUE TRUE TRUE TRUE
## F081|v39490 TRUE TRUE FALSE FALSE FALSE FALSE
## F015|v91714 FALSE FALSE TRUE TRUE TRUE TRUE
## F014|v61533 TRUE TRUE FALSE FALSE FALSE FALSE
## F022|v3677 TRUE TRUE FALSE FALSE FALSE FALSE
## F018|v16936 TRUE TRUE FALSE FALSE FALSE FALSE
## S08 S09 S10 multistateGroup
## <logical> <logical> <logical> <character>
## F031|v89404 FALSE FALSE TRUE Group1
## F081|v39490 TRUE TRUE FALSE Group2
## F015|v91714 FALSE FALSE TRUE Group1
## F014|v61533 TRUE TRUE FALSE Group2
## F022|v3677 TRUE TRUE FALSE Group2
## F018|v16936 TRUE TRUE FALSE Group2
message("Number of QTL specific to each state-group:")
table(rowData(subset(sim, QTL == "multistate"))$multistateGroup)
##
## Group1 Group2
## 192 208
The multistateQTL toolkit provides two functions to help deal with missing data, getComplete
and replaceNAs
. The getComplete
function is a smart subsetting function that remove QTL associations (rows) with more than an allowed amount of missing data. The replaceNAs
function allows for NAs in each assay to be replaced with a constant or with the row mean or row median. For example, here is a snapshot of our simulated data from above with added NAs:
na_pattern <- sample(seq(1, ncol(sim)*nrow(sim)), 1000)
sim_na <- sim
assay(sim_na, "betas")[na_pattern] <- NA
assay(sim_na, "errors")[na_pattern] <- NA
assay(sim_na, "lfsrs")[na_pattern] <- NA
message("Number of simulated tests: ", nrow(sim_na))
head(betas(sim_na))
## S01 S02 S03 S04 S05 S06
## F056|v68405 -0.2069116 -0.1777617 NA NA 0.1185643 -0.2044272
## F050|v48935 0.2019550 -0.3288880 0.1095000 0.6518384 0.4406165 0.4421207
## F031|v89404 0.4454272 -0.2475028 -0.1697007 0.5574706 NA 0.8846720
## F081|v39490 0.1993529 -0.9769058 -0.9641381 0.3075263 -0.3230863 -0.2608925
## F012|v22815 -0.6768760 -0.2111132 -0.1790861 -0.2369226 0.2726944 -0.1961095
## F056|v87470 -0.7515367 -0.9673488 -0.7851777 -0.8699182 -0.9189384 -0.8862677
## S07 S08 S09 S10
## F056|v68405 0.3645608 0.07241395 1.1229543 0.1795885
## F050|v48935 0.3286546 0.34967895 -0.2875481 0.1133443
## F031|v89404 0.4759706 0.28585838 0.2503787 0.3781954
## F081|v39490 -0.6649082 -0.94329947 -0.7270914 NA
## F012|v22815 -0.1972697 0.28779758 0.2598863 NA
## F056|v87470 -1.0427818 -0.80325057 -0.8584320 -0.8730868
First we can use getComplete
to keep only the tests that have data for at least half of the states:
sim_na <- getComplete(sim_na, n=0.5, verbose=TRUE)
Then for the remaining QTL, we can fill in the missing values using the following scheme
sim_na <- replaceNAs(sim_na, verbose=TRUE)
head(betas(sim_na))
## S01 S02 S03 S04 S05 S06
## F056|v68405 -0.2069116 -0.1777617 0.0000000 0.0000000 0.1185643 -0.2044272
## F050|v48935 0.2019550 -0.3288880 0.1095000 0.6518384 0.4406165 0.4421207
## F031|v89404 0.4454272 -0.2475028 -0.1697007 0.5574706 0.0000000 0.8846720
## F081|v39490 0.1993529 -0.9769058 -0.9641381 0.3075263 -0.3230863 -0.2608925
## F012|v22815 -0.6768760 -0.2111132 -0.1790861 -0.2369226 0.2726944 -0.1961095
## F056|v87470 -0.7515367 -0.9673488 -0.7851777 -0.8699182 -0.9189384 -0.8862677
## S07 S08 S09 S10
## F056|v68405 0.3645608 0.07241395 1.1229543 0.1795885
## F050|v48935 0.3286546 0.34967895 -0.2875481 0.1133443
## F031|v89404 0.4759706 0.28585838 0.2503787 0.3781954
## F081|v39490 -0.6649082 -0.94329947 -0.7270914 0.0000000
## F012|v22815 -0.1972697 0.28779758 0.2598863 0.0000000
## F056|v87470 -1.0427818 -0.80325057 -0.8584320 -0.8730868
The multistateQTL toolkit also provides the callSignificance
function, which calls QTL tests significant in each state using either a single or two-step threshold approach. For example, we can set a single lfsr threshold of 0.1 to call significance of our simulate QTL:
sim <- callSignificance(sim, assay="lfsrs", thresh=0.05)
message("Median number of significant tests per state: ",
median(colData(sim)$nSignificant))
Because we have the simulated ground-truth, we can compare these significance calls to what was simulated using the simPerformance
function, which provides the following global (i.e. across all state) performance metrics:
sim <- callSignificance(sim, assay="lfsrs", thresh=0.001)
perf_metrics <- simPerformance(sim)
lapply(perf_metrics, FUN=function(x) {round(x, 2)})
## $accuracy
## [1] 0.42
##
## $precision
## [1] 0.42
##
## $recall
## [1] 1
##
## $f1
## [1] 0.59
##
## $cm
## called
## simulated FALSE TRUE
## FALSE 30 5786
## TRUE 10 4174
As you can see the recall of TRUE significant QTL is quite low. However if we change our significance calling approach to be more flexible.
sim <- callSignificance(
sim, mode="simple", assay="lfsrs",
thresh=0.0001, secondThresh=0.0002)
simPerformance(sim)$recall
## [1] 0.9935468
The multistateQTL
package contains five functions for visualising multi-state eQTL data.
These functions are based on the ggplot2 and ComplexHeatmap R packages.
plotPairwiseSharing
: based on ComplexHeatmap::Heatmap
plotQTLClusters
: based on ComplexHeatmap::Heatmap
plotUpSet
: based on ComplexHeatmap::UpSet
plotCompareStates
: based on ggplot2
plotSimulationParams
: based on ggplot2
The functions built on ggplot2
are compatible with ggplot2
syntax such as the +
operator.
The function plotPairwiseSharing
shows the degree of pairwise sharing of significant hits for each combination of two states.
Column annotations can be added by specifying a valid column name from the colData
of the object.
In the plot below, columns are ordered by the broad cell type (multistateGroup
) of the states.
We expect states belonging to the same multi-state group to be more related and have a greater degree of sharing of significant eQTLs.
Column annotations are used to show the number of significant eQTLs for each state.
sim_sig <- getSignificant(sim)
sim_top <- getTopHits(sim_sig, assay="lfsrs", mode="state")
sim_top <- runPairwiseSharing(sim_top)
p1 <- plotPairwiseSharing(sim_top, annotate_cols=c("nSignificant", "multistateGroup"))
These plots show the set of tests that are significant, but not necessarily shared, by groups of states.
plotUpSet(sim_top, annotateColsBy=c("nSignificant", "multistateGroup"))
Once multi-state test correction is performed, you will want to identify global, multi-state, and unique QTL.
Note that plotCompareStates
returns a list with a ggplot2
object as the first element and a table
as the second element.
These can be accessed using the names “plot” or “counts”.
sim_top <- runTestMetrics(sim_top)
plotCompareStates(sim_top, x="S01", y="S02")
## $plot
##
## $counts
##
## S01 S02 both_diverging both_shared
## 9 5 166 294
table(rowData(sim_top)$qtl_type)
##
## global_diverging global_shared multistate_diverging
## 291 116 51
## multistate_shared
## 16
hist(rowData(sim_top)$nSignificant)
The function plotQTLClusters
can be used to produce a heatmap of the eQTL betas values for each state.
Each row is a SNP-gene pair, and columns are states.
Row and column annotations can be added by naming column names from the rowData
or colData
of the input QTLExperiment
object.
sim_top_ms <- subset(sim_top, qtl_type_simple == "multistate")
plotQTLClusters(
sim_top_ms,
annotateColsBy=c("multistateGroup"),
annotateRowsBy=c("qtl_type", "mean_beta", "QTL"))
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] multistateQTL_1.2.0 collapse_2.0.16
## [3] data.table_1.16.2 ComplexHeatmap_2.22.0
## [5] QTLExperiment_1.4.0 SummarizedExperiment_1.36.0
## [7] Biobase_2.66.0 GenomicRanges_1.58.0
## [9] GenomeInfoDb_1.42.0 IRanges_2.40.0
## [11] S4Vectors_0.44.0 BiocGenerics_0.52.0
## [13] MatrixGenerics_1.18.0 matrixStats_1.4.1
## [15] BiocStyle_2.34.0
##
## loaded via a namespace (and not attached):
## [1] gridExtra_2.3 rlang_1.1.4 magrittr_2.0.3
## [4] clue_0.3-65 GetoptLong_1.0.5 compiler_4.4.1
## [7] png_0.1-8 vctrs_0.6.5 pkgconfig_2.0.3
## [10] shape_1.4.6.1 crayon_1.5.3 fastmap_1.2.0
## [13] magick_2.8.5 backports_1.5.0 XVector_0.46.0
## [16] labeling_0.4.3 utf8_1.2.4 rmarkdown_2.28
## [19] tzdb_0.4.0 UCSC.utils_1.2.0 tinytex_0.53
## [22] purrr_1.0.2 bit_4.5.0 xfun_0.48
## [25] zlibbioc_1.52.0 cachem_1.1.0 jsonlite_1.8.9
## [28] highr_0.11 DelayedArray_0.32.0 irlba_2.3.5.1
## [31] parallel_4.4.1 cluster_2.1.6 R6_2.5.1
## [34] bslib_0.8.0 RColorBrewer_1.1-3 SQUAREM_2021.1
## [37] jquerylib_0.1.4 Rcpp_1.0.13 bookdown_0.41
## [40] assertthat_0.2.1 iterators_1.0.14 knitr_1.48
## [43] Matrix_1.7-1 splines_4.4.1 tidyselect_1.2.1
## [46] abind_1.4-8 yaml_2.3.10 viridis_0.6.5
## [49] doParallel_1.0.17 codetools_0.2-20 lattice_0.22-6
## [52] tibble_3.2.1 plyr_1.8.9 withr_3.0.2
## [55] evaluate_1.0.1 archive_1.1.9 survival_3.7-0
## [58] fitdistrplus_1.2-1 circlize_0.4.16 pillar_1.9.0
## [61] BiocManager_1.30.25 checkmate_2.3.2 foreach_1.5.2
## [64] generics_0.1.3 vroom_1.6.5 invgamma_1.1
## [67] truncnorm_1.0-9 ggplot2_3.5.1 munsell_0.5.1
## [70] scales_1.3.0 ashr_2.2-63 glue_1.8.0
## [73] tools_4.4.1 mvtnorm_1.3-1 Cairo_1.6-2
## [76] tidyr_1.3.1 colorspace_2.1-1 mashr_0.2.79
## [79] GenomeInfoDbData_1.2.13 cli_3.6.3 fansi_1.0.6
## [82] viridisLite_0.4.2 mixsqp_0.3-54 S4Arrays_1.6.0
## [85] dplyr_1.1.4 gtable_0.3.6 sass_0.4.9
## [88] digest_0.6.37 SparseArray_1.6.0 farver_2.1.2
## [91] rjson_0.2.23 htmltools_0.5.8.1 lifecycle_1.0.4
## [94] rmeta_3.0 httr_1.4.7 GlobalOptions_0.1.2
## [97] bit64_4.5.2 MASS_7.3-61