The celarefData package is a repository of a few public datasets that have been processed using the celaref package. This includes some example data for the celaref package vignette (please refer to full examples there), and some other potentially useful preprocessed reference datasets.
The data in this package is a series of data frames output from the contrast_each_group_to_the_rest function of the celaref package. That is, these are differential expression results (calculated using MAST; Finak et al. 2015) for each cell cluster versus the rest of the experiment.
For details and explanation see the celaref package vignette. The commands used to make these data files are also in the make-data.R file of this package.
Farmer et al. (2017) published a survey of cell types in the mouse lacrimal gland at two developmental stages in their paper ‘Defining epithelial cell dynamics and lineage relationships in the developing lacrimal gland’. Only the more mature P4 timepoint is included here.
Data: de_table_Farmer2017_lacrimalP4
The ‘HaemAtlas’ microarray dataset of purified PBMC cell types (Watkins et al. 2009), referred to here as ‘Watkins2009’, was downloaded as a normalised table from the ‘haemosphere’ website (Graaf et al. 2016): http://haemosphere.org/datasets/show
Processing of these data with contrast_each_group_to_the_rest_for_norm_ma_with_limma is described in the vignette.
Data: de_table_Watkins2009_pbmcs
10X Genomics has several datasets available to download from their website, including the pbmc4k dataset, which contains PBMCs derived from a healthy individual. The kmeans k=7 cell-cluster assignments were used. The source dataset is available here: https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc4k
Data: de_table_10X_pbmc4k_k7
In their paper ‘Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq’, Zeisel et al. (2015) performed single cell RNA sequencing in mouse, in two tissues (sscortex and ca1hippocampus).
This data was downloaded from the link provided in the paper: http://linnarssonlab.org/cortex
As described in make-data.R, both counts and cell annotations were parsed from this file: https://storage.googleapis.com/linnarsson-lab-www-blobs/blobs/cortex/expression_mRNA_17-Aug-2014.txt
Data: de_table_Zeisel2015_cortex, de_table_Zeisel2015_hc
As part of their analysis described in ‘Massively parallel digital transcriptional profiling of single cells’, Zheng et al. (2017) generated a reference dataset of PBMC (peripheral blood mononuclear cell) sub-populations. These cell types are:
They used a bead-based purification approach described in their paper, followed by the analysis which they have shared at https://github.com/10XGenomics/single-cell-3prime-paper/tree/master/pbmc68k_analysis
To create the derived differential expression tables suitable for use as a reference dataset with celaref:
Data and scripts obtained from https://github.com/10XGenomics/single-cell-3prime-paper/tree/master/pbmc68k_analysis
Cell cluster labels were obtained from the (rather nicely reproducible) analysis scripts (specifically ‘main_process_pure_pbmc.R’) also provided by Zheng et al. at: https://github.com/10XGenomics/single-cell-3prime-paper/tree/master/pbmc68k_analysis
Cells were subset to a maximum of 1000 per group, which is enough for the differential expression analysis for this purpose.
Gene-level ID was set to the GeneSymbol, choosing the more highly expressed Ensembl ID if multiple mappings exist. The de_table_Zheng2017purePBMC dataset has GeneSymbol IDs, whereas de_table_Zheng2017purePBMC_ensembl was converted back to Ensembl IDs.
Exact commands are provided in the celarefData make-data.R script.
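The two data-reduction steps above (capping cells per group, and keeping one Ensembl ID per gene symbol) can be illustrated with a minimal base R sketch on toy data. This is not the code from make-data.R: the data frames `cells` and `genes`, their values, and the use of `total_count` as the measure of "more highly expressed" are all assumptions made for illustration.

```r
# Toy data standing in for the real objects; all names and values are hypothetical.
set.seed(42)  # make the random subsample reproducible

# Step 1: cap each group at a maximum number of cells.
cells <- data.frame(
  cell_id = paste0("cell", 1:12),
  group   = rep(c("A", "B"), times = c(9, 3)),
  stringsAsFactors = FALSE
)
max_cells <- 5  # the text uses 1000; 5 keeps the toy example small

# For each group, sample max_cells ids if the group is larger, else keep all.
keep_ids <- unlist(
  lapply(split(cells$cell_id, cells$group), function(ids) {
    if (length(ids) > max_cells) sample(ids, max_cells) else ids
  }),
  use.names = FALSE
)
cells_subset <- cells[cells$cell_id %in% keep_ids, ]
table(cells_subset$group)

# Step 2: where several Ensembl IDs map to one gene symbol,
# keep the one with the highest total count (assumed expression measure).
genes <- data.frame(
  ensembl_ID  = c("ENSG_a", "ENSG_b", "ENSG_c"),
  GeneSymbol  = c("GENE1", "GENE1", "GENE2"),
  total_count = c(10, 250, 40),
  stringsAsFactors = FALSE
)
# Sort by descending count, then keep the first row seen for each symbol.
genes_ordered <- genes[order(-genes$total_count), ]
genes_unique  <- genes_ordered[!duplicated(genes_ordered$GeneSymbol), ]
genes_unique
```

Sorting before calling `duplicated()` means the retained row for each symbol is always the most highly expressed one, with no explicit grouping needed.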
library(ExperimentHub)
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
## lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff, table,
## tapply, union, unique, unsplit, which.max, which.min
## Loading required package: AnnotationHub
## Loading required package: BiocFileCache
## Loading required package: dbplyr
eh = ExperimentHub()
ExperimentHub::listResources(eh, "celarefData")
## [1] "de_table_10X_pbmc4k_k7" "de_table_Watkins2009_pbmcs"
## [3] "de_table_Zeisel2015_cortex" "de_table_Zeisel2015_hc"
## [5] "de_table_Farmer2017_lacrimalP4" "de_table_Zheng2017purePBMC"
## [7] "de_table_Zheng2017purePBMC_ensembl"
de_table.10X_pbmc4k_k7 <- ExperimentHub::loadResources(eh, "celarefData", 'de_table_10X_pbmc4k_k7')[[1]]
## see ?celarefData and browseVignettes('celarefData') for documentation
## loading from cache
de_table.Watkins2009PBMCs <- ExperimentHub::loadResources(eh, "celarefData", 'de_table_Watkins2009_pbmcs')[[1]]
## see ?celarefData and browseVignettes('celarefData') for documentation
## loading from cache
de_table.zeisel.cortex <- ExperimentHub::loadResources(eh, "celarefData", 'de_table_Zeisel2015_cortex')[[1]]
## see ?celarefData and browseVignettes('celarefData') for documentation
## loading from cache
de_table.zeisel.hippo <- ExperimentHub::loadResources(eh, "celarefData", 'de_table_Zeisel2015_hc')[[1]]
## see ?celarefData and browseVignettes('celarefData') for documentation
## loading from cache
de_table.Farmer2017lacrimalP4 <- ExperimentHub::loadResources(eh, "celarefData", 'de_table_Farmer2017_lacrimalP4')[[1]]
## see ?celarefData and browseVignettes('celarefData') for documentation
## loading from cache
de_table.Zheng2017purePBMC <- ExperimentHub::loadResources(eh, "celarefData", 'de_table_Zheng2017purePBMC')[[1]]
## see ?celarefData and browseVignettes('celarefData') for documentation
## loading from cache
de_table.Zheng2017purePBMC_ensembl <- ExperimentHub::loadResources(eh, "celarefData", 'de_table_Zheng2017purePBMC_ensembl')[[1]]
## see ?celarefData and browseVignettes('celarefData') for documentation
## loading from cache
head(de_table.10X_pbmc4k_k7)
##      ID      ensembl_ID GeneSymbol total_count          pval   log2FC ci_inner
## 1  TRAC ENSG00000277734       TRAC       11252  0.000000e+00 1.609764 1.553279
## 2  LDHB ENSG00000111716       LDHB       10039  0.000000e+00 1.410714 1.366698
## 3   LTB ENSG00000227507        LTB       21543 2.595127e-262 1.383209 1.310434
## 4 RPS29 ENSG00000213741      RPS29      145084  0.000000e+00 1.216305 1.174395
## 5  IL32 ENSG00000008517       IL32       14403 3.342627e-214 1.243103 1.169736
## 6  CD3D ENSG00000167286       CD3D        7808  0.000000e+00 1.219050 1.167091
##   ci_outer           fdr group  sig sig_up gene_count rank rescaled_rank
## 1 1.666249  0.000000e+00     1 TRUE   TRUE      15407    1  6.490556e-05
## 2 1.454730  0.000000e+00     1 TRUE   TRUE      15407    2  1.298111e-04
## 3 1.455983 5.631425e-260     1 TRUE   TRUE      15407    3  1.947167e-04
## 4 1.258216  0.000000e+00     1 TRUE   TRUE      15407    4  2.596222e-04
## 5 1.316471 6.058806e-212     1 TRUE   TRUE      15407    5  3.245278e-04
## 6 1.271010  0.000000e+00     1 TRUE   TRUE      15407    6  3.894334e-04
##         dataset
## 1 10X_pbmc4k_k7
## 2 10X_pbmc4k_k7
## 3 10X_pbmc4k_k7
## 4 10X_pbmc4k_k7
## 5 10X_pbmc4k_k7
## 6 10X_pbmc4k_k7
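As a small illustration of working with one of these tables, the sketch below rebuilds a subset of the six rows printed above as a plain data frame (so it runs without downloading anything) and checks two patterns visible in them. The name `de_head` is hypothetical, and the type of the `group` column in the real table may differ. The patterns themselves, that rows appear to be ordered by `ci_inner` rather than `log2FC` and that `rescaled_rank` matches `rank / gene_count`, are observations from the printed values only, not statements taken from the package documentation.

```r
# First six rows of de_table.10X_pbmc4k_k7, copied from the printed output
# (subset of columns). With the real data, use de_table.10X_pbmc4k_k7 instead.
de_head <- data.frame(
  ID            = c("TRAC", "LDHB", "LTB", "RPS29", "IL32", "CD3D"),
  log2FC        = c(1.609764, 1.410714, 1.383209, 1.216305, 1.243103, 1.219050),
  ci_inner      = c(1.553279, 1.366698, 1.310434, 1.174395, 1.169736, 1.167091),
  fdr           = c(0, 0, 5.631425e-260, 0, 6.058806e-212, 0),
  group         = "1",   # printed as 1; stored type in the real table not assumed
  sig_up        = TRUE,
  gene_count    = 15407,
  rank          = 1:6,
  rescaled_rank = c(6.490556e-05, 1.298111e-04, 1.947167e-04,
                    2.596222e-04, 3.245278e-04, 3.894334e-04),
  stringsAsFactors = FALSE
)

# Genes significantly up in this group, with a conventional FDR cutoff (assumed).
top_up <- de_head[de_head$sig_up & de_head$fdr < 0.01, c("ID", "log2FC", "fdr")]
top_up

# Are these rows in strictly decreasing order of ci_inner?
ordered_by_ci_inner <- all(diff(de_head$ci_inner) < 0)

# Does rescaled_rank equal rank / gene_count (to printed precision)?
rank_matches <- all.equal(de_head$rescaled_rank,
                          de_head$rank / de_head$gene_count,
                          tolerance = 1e-6)
ordered_by_ci_inner
rank_matches
```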