1 Introduction

The SimBenchData package contains 35 single-cell RNA-seq datasets covering a wide range of data characteristics, including major sequencing protocols, multiple tissue types, and both human and mouse sources. It serves as a key resource for benchmarking single-cell simulation methods and was used to comprehensively assess how well 12 such methods retain key properties of single-cell sequencing data, including gene-wise and cell-wise properties as well as biological signals such as differential expression and differential proportion of genes. The package is also intended as a resource for the single-cell community, both for developing and benchmarking new single-cell simulation methods and for other applications.

2 The SimBenchData dataset

The data stored in this package can be retrieved using ExperimentHub.

# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# 
# BiocManager::install("ExperimentHub")

library(ExperimentHub)
eh <- ExperimentHub()
alldata <- query(eh, "SimBenchData")
alldata 
## ExperimentHub with 35 records
## # snapshotDate(): 2024-10-24
## # $dataprovider: Broad Institute of MIT & Harvard, Cambridge, MA USA, Peking...
## # $species: Homo sapiens, Mus musculus
## # $rdataclass: SeuratObject
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH5384"]]' 
## 
##            title          
##   EH5384 | 293T cell line 
##   EH5385 | Jurkat and 293T
##   EH5386 | BC01 blood     
##   EH5387 | BC01 normal    
##   EH5388 | BC02 lymph     
##   ...      ...            
##   EH5414 | Soumillon      
##   EH5415 | stem cell      
##   EH5416 | Tabula Muris   
##   EH5417 | Tung ipsc      
##   EH5418 | Yang liver
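The listing can be narrowed before anything is downloaded. The following is a minimal sketch, assuming an internet connection and that the `species` metadata column is populated as in the summary above. The name `mouse` is only illustrative. Note that `query()` matches patterns against the record metadata as a whole, so the result is worth checking by eye.

```r
library(ExperimentHub)

eh <- ExperimentHub()
alldata <- query(eh, "SimBenchData")

# Record metadata columns are reachable with `$`, as hinted by the
# "$species" line in the printed summary
table(alldata$species)

# query() accepts several patterns; a record is kept only if all of them match
mouse <- query(eh, c("SimBenchData", "Mus musculus"))
mouse$title
```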

Each dataset can be downloaded using its ID.

data_1 <- alldata[["EH5384"]]  
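When several datasets are needed, for instance in a benchmarking loop, the same `[[` lookup can be applied over a vector of IDs. This is a sketch rather than a recommended workflow: the IDs are taken from the listing above, the names `ids` and `datasets` are illustrative, and it assumes the SeuratObject package is installed, since the records are reported with `rdataclass` SeuratObject. ExperimentHub keeps downloaded files in a local cache, so repeated calls should not download them again.

```r
library(ExperimentHub)

eh <- ExperimentHub()
alldata <- query(eh, "SimBenchData")

# IDs copied from the listing above
ids <- c("EH5384", "EH5385")

# Retrieve each record in turn and keep the results in a named list
datasets <- lapply(ids, function(id) alldata[[id]])
names(datasets) <- ids

# Inspect what was returned for each ID
lapply(datasets, class)
```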

Information about each dataset, such as its description and source URL, can be found in the metadata files under the inst/extdata directory and can also be retrieved with the function showMetaData(). Additional details on each dataset are available through the function showAdditionalDetail(). The information on the first three datasets is shown as an example.

library(SimBenchData)

metadata <- showMetaData()
metadata[1:3, ]
##              Name                                      Description BiocVersion
## 1  293T cell line                                   293T cell line        3.13
## 2 Jurkat and 293T mixture of Jurkat (human T lymphocyte)  and 293T        3.13
## 3      BC01 blood            PBMC of breast cancer patient ID BC01        3.13
##   Genome SourceType
## 1   hg19     tar.gz
## 2   hg19     tar.gz
## 3   hg19        Zip
##                                                                           SourceUrl
## 1   https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/293t
## 2 https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/jurkat
## 3                      https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114725
##                             SourceVersion      Species TaxonomyId
## 1   293t_filtered_gene_bc_matrices.tar.gz Homo sapiens       9606
## 2 jurkat_filtered_gene_bc_matrices.tar.gz Homo sapiens       9606
## 3                       GSE114725_RAW.tar Homo sapiens       9606
##   Coordinate_1_based
## 1                 NA
## 2                 NA
## 3                 NA
##                                                                              DataProvider
## 1                                                                            10x genomics
## 2                                                                            10x genomics
## 3 Memorial Sloan Kettering Cancer Center,\tComputational and Systems Biology Program, SKI
##                        Maintainer   RDataClass DispatchClass
## 1 Yue Cao <yue.cao@sydney.edu.au> SeuratObject           Rds
## 2 Yue Cao <yue.cao@sydney.edu.au> SeuratObject           Rds
## 3 Yue Cao <yue.cao@sydney.edu.au> SeuratObject           Rds
##                        RDataPath ExperimentHub_ID
## 1 SimBenchData/293t_cellline.rds           EH5384
## 2   SimBenchData/293t_jurkat.rds           EH5385
## 3    SimBenchData/BC01_blood.rds           EH5386
additionaldetail <- showAdditionalDetail()
additionaldetail[1:3, ]
##   ExperimentHub_ID            Name Species     Protocol Number_of_cells
## 1           EH5384  293T cell line   Human 10x Genomics            2885
## 2           EH5385 Jurkat and 293T   Human 10x Genomics            6143
## 3           EH5386      BC01 blood   Human      inDrops            3034
##   Multiple_celltypes_or_conditions
## 1                               No
## 2                              Yes
## 3                               No
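Because the additional details are returned as a table, they can be used to pick datasets by protocol, size, or composition before downloading. The sketch below runs without the package: `detail` is a hand-typed copy of the three rows printed above, not the full table, and the 3000-cell threshold is an arbitrary illustration. In practice `detail` would presumably be replaced by the result of showAdditionalDetail().

```r
# Hand-typed copy of the three rows shown above (illustrative only)
detail <- data.frame(
  ExperimentHub_ID = c("EH5384", "EH5385", "EH5386"),
  Name = c("293T cell line", "Jurkat and 293T", "BC01 blood"),
  Species = c("Human", "Human", "Human"),
  Protocol = c("10x Genomics", "10x Genomics", "inDrops"),
  Number_of_cells = c(2885, 6143, 3034),
  Multiple_celltypes_or_conditions = c("No", "Yes", "No")
)

# Datasets that contain more than one cell type or condition
multi <- detail[detail$Multiple_celltypes_or_conditions == "Yes", ]
multi$ExperimentHub_ID
## [1] "EH5385"

# 10x Genomics datasets with at least 3000 cells
picked <- subset(detail, Protocol == "10x Genomics" & Number_of_cells >= 3000)
picked$ExperimentHub_ID
## [1] "EH5385"
```

The selected IDs can then be passed to the `[[` lookup shown earlier to download only the datasets of interest.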

The data processing script for each dataset can be found under the inst/scripts directory.

Session info

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] SimBenchData_1.14.0  ExperimentHub_2.14.0 AnnotationHub_3.14.0
## [4] BiocFileCache_2.14.0 dbplyr_2.5.0         BiocGenerics_0.52.0 
## [7] BiocStyle_2.34.0    
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.46.0         xfun_0.48               bslib_0.8.0            
##  [4] Biobase_2.66.0          vctrs_0.6.5             tools_4.4.1            
##  [7] generics_0.1.3          stats4_4.4.1            curl_5.2.3             
## [10] tibble_3.2.1            fansi_1.0.6             AnnotationDbi_1.68.0   
## [13] RSQLite_2.3.7           blob_1.2.4              pkgconfig_2.0.3        
## [16] S4Vectors_0.44.0        lifecycle_1.0.4         GenomeInfoDbData_1.2.13
## [19] compiler_4.4.1          Biostrings_2.74.0       GenomeInfoDb_1.42.0    
## [22] htmltools_0.5.8.1       sass_0.4.9              yaml_2.3.10            
## [25] pillar_1.9.0            crayon_1.5.3            jquerylib_0.1.4        
## [28] cachem_1.1.0            mime_0.12               tidyselect_1.2.1       
## [31] digest_0.6.37           dplyr_1.1.4             purrr_1.0.2            
## [34] bookdown_0.41           BiocVersion_3.20.0      fastmap_1.2.0          
## [37] cli_3.6.3               magrittr_2.0.3          utf8_1.2.4             
## [40] withr_3.0.2             filelock_1.0.3          UCSC.utils_1.2.0       
## [43] rappdirs_0.3.3          bit64_4.5.2             rmarkdown_2.28         
## [46] XVector_0.46.0          httr_1.4.7              bit_4.5.0              
## [49] png_0.1-8               memoise_2.0.1           evaluate_1.0.1         
## [52] knitr_1.48              IRanges_2.40.0          rlang_1.1.4            
## [55] glue_1.8.0              DBI_1.2.3               BiocManager_1.30.25    
## [58] jsonlite_1.8.9          R6_2.5.1                zlibbioc_1.52.0