countsimQC - Comparing characteristic features across count data sets

Charlotte Soneson

2024-10-29

Introduction

The countsimQC package provides a simple way to compare the characteristic features of a collection of (e.g., RNA-seq) count data sets. An important application is in situations where a synthetic count data set has been generated using a real count data set as an underlying source of parameters, in which case it is often important to verify that the final synthetic data captures the main features of the original data set. However, the package can be used to create a visual overview of any collection of one or more count data sets.

Data preparation

In this vignette we will show how to generate a comparative report from a collection of two simulated data sets and the original, underlying real data set. First, we load the object containing the three data sets. The object is a named list, where each element is a DESeqDataSet object, containing the count matrix, a sample information data frame and a model formula (necessary to calculate dispersions). For more information about the DESeqDataSet class, please see the DESeq2 Bioconductor package. For speed reasons, we use only a subset of the features in each data set for the following calculations.

suppressPackageStartupMessages({
  library(countsimQC)
  library(DESeq2)
})

data(countsimExample)
countsimExample
## $Original
## class: DESeqDataSet 
## dim: 10000 11 
## metadata(1): version
## assays(1): counts
## rownames(10000): ENSMUSG00000000001.4 ENSMUSG00000000028.14 ...
##   ENSMUSG00000048027.7 ENSMUSG00000048029.10
## rowData names(0):
## colnames(11): GSM1923445 GSM1923446 ... GSM1923578 GSM1923579
## colData names(2): group sample
## 
## $Sim1
## class: DESeqDataSet 
## dim: 10000 11 
## metadata(1): version
## assays(1): counts
## rownames(10000): Gene1 Gene2 ... Gene9999 Gene10000
## rowData names(0):
## colnames(11): Cell1 Cell2 ... Cell88 Cell89
## colData names(4): Cell Batch Group ExpLibSize
## 
## $Sim2
## class: DESeqDataSet 
## dim: 10000 11 
## metadata(1): version
## assays(1): counts
## rownames(10000): Gene1 Gene2 ... Gene9999 Gene10000
## rowData names(0):
## colnames(11): Cell1 Cell2 ... Cell88 Cell89
## colData names(3): Cell CellFac Group
countsimExample <- lapply(countsimExample, function(cse) {
  cse[seq_len(1500), ]
})

Report generation

Next, we generate the report using the countsimQCReport() function. Depending on the level of detail and the type of information that are required for the final report, this function can be run in different “modes”:

Here, for the sake of speed, we calculate statistics for a small subset of the observations (subsampleSize = 25) and refrain from calculating permutation p-values.

tempDir <- tempdir()
countsimQCReport(ddsList = countsimExample, outputFile = "countsim_report.html",
                 outputDir = tempDir, outputFormat = "html_document", 
                 showCode = FALSE, forceOverwrite = TRUE,
                 savePlots = TRUE, description = "This is my test report.", 
                 maxNForCorr = 25, maxNForDisp = Inf, 
                 calculateStatistics = TRUE, subsampleSize = 25,
                 kfrac = 0.01, kmin = 5, 
                 permutationPvalues = FALSE, nPermutations = NULL)

The countsimQCReport() function can generate either an HTML file (by setting outputFormat = "html_document" or outputFormat = NULL) or a pdf file (by setting outputFormat = "pdf_document"). The description argument can be used to provide a more extensive description of the data set(s) that are included in the report.

Generation of individual figures

If the argument savePlots is set to TRUE, an .rds file containing the individual ggplot objects will be generated. These objects can be used to perform fine-tuning of the visualizations if desired. Note, however, that the .rds file can become large if the number of data sets is large, or if the individual data sets have many samples or features. The convenience function generateIndividualPlots() can be used to quickly generate individual figures for all plots included in the report, using a variety of devices. For example, to generate each plot in pdf format:

ggplots <- readRDS(file.path(tempDir, "countsim_report_ggplots.rds"))
if (!dir.exists(file.path(tempDir, "figures"))) {
  dir.create(file.path(tempDir, "figures"))
}
generateIndividualPlots(ggplots, device = "pdf", nDatasets = 3, 
                        outputDir = file.path(tempDir, "figures"))
## `geom_smooth()` using formula = 'y ~ x'

Input data format

In the example above, all data sets were provided as DESeqDataSet objects. The advantage of this is that it allows the specification of the experimental design, which is used in the dispersion calculations. countsimQC also allows a data set to be provided as either a data.frame or a matrix. However, in these situations, it will be assumed that all samples are replicates (i.e., a design ~1). An example is provided in the countsimExample_dfmat data set, provided with the package.

data(countsimExample_dfmat)
names(countsimExample_dfmat)
## [1] "Original" "Sim1"     "Sim2"
lapply(countsimExample_dfmat, class)
## $Original
## [1] "DESeqDataSet"
## attr(,"package")
## [1] "DESeq2"
## 
## $Sim1
## [1] "matrix" "array" 
## 
## $Sim2
## [1] "data.frame"
tempDir <- tempdir()
countsimQCReport(ddsList = countsimExample_dfmat, 
                 outputFile = "countsim_report_dfmat.html",
                 outputDir = tempDir, outputFormat = "html_document", 
                 showCode = FALSE, forceOverwrite = TRUE,
                 savePlots = TRUE, description = "This is my test report.", 
                 maxNForCorr = 25, maxNForDisp = Inf, 
                 calculateStatistics = TRUE, subsampleSize = 25,
                 kfrac = 0.01, kmin = 5, 
                 permutationPvalues = FALSE, nPermutations = NULL)

Session info

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] DESeq2_1.46.0               SummarizedExperiment_1.36.0
##  [3] Biobase_2.66.0              MatrixGenerics_1.18.0      
##  [5] matrixStats_1.4.1           GenomicRanges_1.58.0       
##  [7] GenomeInfoDb_1.42.0         IRanges_2.40.0             
##  [9] S4Vectors_0.44.0            BiocGenerics_0.52.0        
## [11] countsimQC_1.24.0          
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1        farver_2.1.2            dplyr_1.1.4            
##  [4] blob_1.2.4              bitops_1.0-9            Biostrings_2.74.0      
##  [7] fastmap_1.2.0           XML_3.99-0.17           digest_0.6.37          
## [10] lifecycle_1.0.4         survival_3.7-0          statmod_1.5.0          
## [13] KEGGREST_1.46.0         RSQLite_2.3.7           magrittr_2.0.3         
## [16] genefilter_1.88.0       compiler_4.4.1          rlang_1.1.4            
## [19] sass_0.4.9              tools_4.4.1             utf8_1.2.4             
## [22] yaml_2.3.10             knitr_1.48              labeling_0.4.3         
## [25] S4Arrays_1.6.0          htmlwidgets_1.6.4       bit_4.5.0              
## [28] DelayedArray_0.32.0     abind_1.4-8             BiocParallel_1.40.0    
## [31] withr_3.0.2             purrr_1.0.2             grid_4.4.1             
## [34] fansi_1.0.6             caTools_1.18.3          xtable_1.8-4           
## [37] colorspace_2.1-1        edgeR_4.4.0             ggplot2_3.5.1          
## [40] scales_1.3.0            cli_3.6.3               rmarkdown_2.28         
## [43] crayon_1.5.3            ragg_1.3.3              generics_0.1.3         
## [46] httr_1.4.7              DBI_1.2.3               cachem_1.1.0           
## [49] zlibbioc_1.52.0         splines_4.4.1           parallel_4.4.1         
## [52] AnnotationDbi_1.68.0    XVector_0.46.0          vctrs_0.6.5            
## [55] Matrix_1.7-1            jsonlite_1.8.9          bit64_4.5.2            
## [58] crosstalk_1.2.1         systemfonts_1.1.0       locfit_1.5-9.10        
## [61] limma_3.62.0            tidyr_1.3.1             jquerylib_0.1.4        
## [64] annotate_1.84.0         glue_1.8.0              randtests_1.0.2        
## [67] codetools_0.2-20        DT_0.33                 gtable_0.3.6           
## [70] UCSC.utils_1.2.0        munsell_0.5.1           tibble_3.2.1           
## [73] pillar_1.9.0            htmltools_0.5.8.1       GenomeInfoDbData_1.2.13
## [76] R6_2.5.1                textshaping_0.4.0       evaluate_1.0.1         
## [79] lattice_0.22-6          highr_0.11              png_0.1-8              
## [82] memoise_2.0.1           bslib_0.8.0             Rcpp_1.0.13            
## [85] nlme_3.1-166            SparseArray_1.6.0       mgcv_1.9-1             
## [88] xfun_0.48               pkgconfig_2.0.3