scater
: Single-cell analysis toolkit for expression in Rscater 1.10.1
This document gives an introduction to and overview of the quality control functionality of the scater package. scater contains tools to help with the analysis of single-cell transcriptomic data, focusing on RNA-seq data. The package features:
SingleCellExperiment
class as a data container for interoperability with a wide range of other Bioconductor packages;SingleCellExperiment
objectWe assume that you have a matrix containing expression count data summarised at the level of some features (gene, exon, region, etc.).
First, we create a SingleCellExperiment
object containing the data, as demonstrated below with some example data ("sc_example_counts"
) and metadata ("sc_example_cell_info"
):
Rows of the object correspond to features, while columns correspond to samples, i.e., cells in the context of single-cell ’omics data.
library(scater)
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
assays = list(counts = sc_example_counts),
colData = sc_example_cell_info
)
example_sce
## class: SingleCellExperiment
## dim: 2000 40
## metadata(0):
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(0):
## colnames(40): Cell_001 Cell_002 ... Cell_039 Cell_040
## colData names(4): Cell Mutation_Status Cell_Cycle Treatment
## reducedDimNames(0):
## spikeNames(0):
We usually expect (raw) count data to be labelled as "counts"
in the assays, which can be easily retrieved with the counts
accessor.
Getters and setters are also provided for exprs
, tpm
, cpm
, fpkm
and versions of these with the prefix norm_
.
str(counts(example_sce))
Row and column-level metadata are easily accessed (or modified) as shown below.
There are also dedicated getters and setters for spike-in specifiers (isSpike
); size factor values (sizeFactors
); and reduced dimensionality results (reducedDim
).
example_sce$whee <- sample(LETTERS, ncol(example_sce), replace=TRUE)
colData(example_sce)
## DataFrame with 40 rows and 5 columns
## Cell Mutation_Status Cell_Cycle Treatment whee
## <character> <character> <character> <character> <character>
## Cell_001 Cell_001 positive S treat1 L
## Cell_002 Cell_002 positive G0 treat1 I
## Cell_003 Cell_003 negative G1 treat1 U
## Cell_004 Cell_004 negative S treat1 O
## Cell_005 Cell_005 negative G1 treat2 O
## ... ... ... ... ... ...
## Cell_036 Cell_036 negative G0 treat1 V
## Cell_037 Cell_037 negative G0 treat1 S
## Cell_038 Cell_038 negative G0 treat2 Y
## Cell_039 Cell_039 negative G1 treat1 J
## Cell_040 Cell_040 negative G0 treat2 A
rowData(example_sce)$stuff <- runif(nrow(example_sce))
rowData(example_sce)
## DataFrame with 2000 rows and 1 column
## stuff
## <numeric>
## Gene_0001 0.676609133603051
## Gene_0002 0.893353080609813
## Gene_0003 0.857206639135256
## Gene_0004 0.729772792663425
## Gene_0005 0.62721553677693
## ... ...
## Gene_1996 0.00954115530475974
## Gene_1997 0.289470643270761
## Gene_1998 0.391338959336281
## Gene_1999 0.163256920641288
## Gene_2000 0.294641267275438
Subsetting is very convenient with this class, as both data and metadata are processed in a synchronized manner. For example, we can filter out features (genes) that are not expressed in any cells:
keep_feature <- rowSums(counts(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]
More details about the SingleCellExperiment
class can be found in the documentation for SingleCellExperiment package.
We calculate counts-per-million using the aptly-named calculateCPM
function.
The output is most appropriately stored as an assay named "cpm"
in the assays of the SingleCellExperiment
object.
cpm(example_sce) <- calculateCPM(example_sce)
Another option is to use the normalize
function, which calculates log2-transformed normalized expression values.
This is done by dividing each count by its size factor (or scaled library size, if no size factors are defined), adding a pseudo-count and log-transforming.
The resulting values can be interpreted on the same scale as log-transformed counts, and are stored in "logcounts"
.
example_sce <- normalize(example_sce)
assayNames(example_sce)
## [1] "counts" "cpm" "logcounts"
Note that exprs
is a synonym for logcounts
when accessing or setting data.
This is done for backwards compatibility with older verions of scater.
identical(exprs(example_sce), logcounts(example_sce))
## [1] TRUE
Of course, users can construct any arbitrary matrix of the same dimensions as the count matrix and store it as an assay.
assay(example_sce, "is_expr") <- counts(example_sce)>0
The calcAverage
function will compute the average count for each gene after scaling each cell’s counts by its size factor.
If size factors are not available, it will compute a size factor from the library size.
head(calcAverage(example_sce))
## Gene_0001 Gene_0002 Gene_0003 Gene_0004 Gene_0005 Gene_0006
## 305.551749 325.719897 183.090462 162.143201 1.231123 187.167913
Count matrices stored as CSV files or equivalent can be easily read into R session using read.table
from utils or fread
from the data.table package.
It is advisable to coerce the resulting object into a matrix before storing it in a SingleCellExperiment
object.
For large data sets, the matrix can be read in chunk-by-chunk with progressive coercion into a sparse matrix from the Matrix package.
This is performed using readSparseCounts
and reduces memory usage by not explicitly storing zeroes in memory.
Data from 10X Genomics experiments can be read in using the read10xCounts
function from the DropletUtils package.
This will automatically generate a SingleCellExperiment
with a sparse matrix, see the documentation for more details.
scater also provides wrapper functions readSalmonResults
or readKallistoResults
to import transcript abundances from the kallisto
and Salmon
pseudo-aligners.
This is done using methods from the tximport package.
SCESet
classAs of July 2017, scater
has switched from the SCESet
class previously defined within the package to the more widely applicable SingleCellExperiment
class.
From Bioconductor 3.6 (October 2017), the release version of scater
will use SingleCellExperiment
.
SingleCellExperiment
is a more modern and robust class that provides a common data structure used by many single-cell Bioconductor packages.
Advantages include support for sparse data matrices and the capability for on-disk storage of data to minimise memory usage for large single-cell datasets.
It should be straight-forward to convert existing scripts based on SCESet
objects to SingleCellExperiment
objects, with key changes outlined immediately below.
toSingleCellExperiment
and updateSCESet
(for backwards compatibility) can be used to convert an old SCESet
object to a SingleCellExperiment
object;SingleCellExperiment
object with the function SingleCellExperiment
(actually less fiddly than creating a new SCESet
);scater
functions have been refactored to take SingleCellExperiment
objects, so once data is in a SingleCellExperiment
object, the user experience is almost identical to that with the SCESet
class.Users may need to be aware of the following when updating their own scripts:
colnames
function (instead of sampleNames
or cellNames
for an SCESet
object);rownames
function (instead of featureNames
);phenoData
in an SCESet
, corresponds to colData
in a SingleCellExperiment
object and is accessed/assigned with the colData
function (this replaces the pData
function);$
operator (e.g. sce$total_counts
);featureData
in an SCESet
, corresponds to rowData
in a SingleCellExperiment
object and is accessed/assigned with the rowData
function (this replaces the fData
function);plotScater
, which produces a cumulative expression, overview plot, replaces
the generic plot
function for SCESet
objects.sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] scater_1.10.1 SingleCellExperiment_1.4.1
## [3] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
## [5] BiocParallel_1.16.5 matrixStats_0.54.0
## [7] Biobase_2.42.0 GenomicRanges_1.34.0
## [9] GenomeInfoDb_1.18.1 IRanges_2.16.0
## [11] S4Vectors_0.20.1 BiocGenerics_0.28.0
## [13] ggplot2_3.1.0 knitr_1.21
## [15] BiocStyle_2.10.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 destiny_2.12.0
## [3] xts_0.11-2 tools_3.5.2
## [5] R6_2.3.0 HDF5Array_1.10.1
## [7] vipor_0.4.5 lazyeval_0.2.1
## [9] colorspace_1.3-2 nnet_7.3-12
## [11] withr_2.1.2 sp_1.3-1
## [13] smoother_1.1 tidyselect_0.2.5
## [15] gridExtra_2.3 curl_3.2
## [17] compiler_3.5.2 labeling_0.3
## [19] bookdown_0.9 scales_1.0.0
## [21] lmtest_0.9-36 DEoptimR_1.0-8
## [23] robustbase_0.93-3 proxy_0.4-22
## [25] stringr_1.3.1 digest_0.6.18
## [27] foreign_0.8-71 rmarkdown_1.11
## [29] rio_0.5.16 XVector_0.22.0
## [31] pkgconfig_2.0.2 htmltools_0.3.6
## [33] TTR_0.23-4 ggthemes_4.0.1
## [35] rlang_0.3.0.1 readxl_1.2.0
## [37] DelayedMatrixStats_1.4.0 bindr_0.1.1
## [39] zoo_1.8-4 dplyr_0.7.8
## [41] zip_1.0.0 car_3.0-2
## [43] RCurl_1.95-4.11 magrittr_1.5
## [45] GenomeInfoDbData_1.2.0 Matrix_1.2-15
## [47] Rcpp_1.0.0 ggbeeswarm_0.6.0
## [49] munsell_0.5.0 Rhdf5lib_1.4.2
## [51] abind_1.4-5 viridis_0.5.1
## [53] scatterplot3d_0.3-41 stringi_1.2.4
## [55] yaml_2.2.0 carData_3.0-2
## [57] MASS_7.3-51.1 zlibbioc_1.28.0
## [59] rhdf5_2.26.2 Rtsne_0.15
## [61] plyr_1.8.4 grid_3.5.2
## [63] forcats_0.3.0 crayon_1.3.4
## [65] lattice_0.20-38 haven_2.0.0
## [67] cowplot_0.9.3 hms_0.4.2
## [69] pillar_1.3.1 igraph_1.2.2
## [71] boot_1.3-20 reshape2_1.4.3
## [73] glue_1.3.0 evaluate_0.12
## [75] laeken_0.4.6 data.table_1.11.8
## [77] BiocManager_1.30.4 vcd_1.4-4
## [79] VIM_4.7.0 cellranger_1.1.0
## [81] gtable_0.2.0 purrr_0.2.5
## [83] assertthat_0.2.0 xfun_0.4
## [85] openxlsx_4.1.0 RcppEigen_0.3.3.5.0
## [87] e1071_1.7-0 class_7.3-15
## [89] viridisLite_0.3.0 tibble_2.0.0
## [91] beeswarm_0.2.3 bindrcpp_0.2.2