1 Introduction

This document gives an introduction to and overview of the quality control functionality of the scater package. scater contains tools to help with the analysis of single-cell transcriptomic data, focusing on RNA-seq data. The package features:

Use of the SingleCellExperiment class as a data container for interoperability with a wide range of other Bioconductor packages;
Functions to import kallisto and Salmon results;
Simple calculation of many quality control metrics from the expression data;
Many tools for visualising scRNA-seq data, especially diagnostic plots for quality control;
Subsetting and many other methods for filtering out problematic cells and features;
Methods for identifying important experimental variables and normalising data ahead of downstream statistical analysis and modeling.

2 Creating a `SingleCellExperiment` object

We assume that you have a matrix containing expression count data summarised at the level of some features (gene, exon, region, etc.). First, we create a SingleCellExperiment object containing the data, as demonstrated below with some example data ("sc_example_counts") and metadata ("sc_example_cell_info"). Rows of the object correspond to features, while columns correspond to samples, i.e., cells in the context of single-cell ’omics data.

library(scater)
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts), 
    colData = sc_example_cell_info
)
example_sce

## class: SingleCellExperiment 
## dim: 2000 40 
## metadata(0):
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(0):
## colnames(40): Cell_001 Cell_002 ... Cell_039 Cell_040
## colData names(4): Cell Mutation_Status Cell_Cycle Treatment
## reducedDimNames(0):
## spikeNames(0):

We usually expect (raw) count data to be labelled as "counts" in the assays, which can be easily retrieved with the counts accessor. Getters and setters are also provided for exprs, tpm, cpm, fpkm and versions of these with the prefix norm_.

str(counts(example_sce))

Row and column-level metadata are easily accessed (or modified) as shown below. There are also dedicated getters and setters for spike-in specifiers (isSpike); size factor values (sizeFactors); and reduced dimensionality results (reducedDim).

example_sce$whee <- sample(LETTERS, ncol(example_sce), replace=TRUE)
colData(example_sce)

## DataFrame with 40 rows and 5 columns
##                 Cell Mutation_Status  Cell_Cycle   Treatment        whee
##          <character>     <character> <character> <character> <character>
## Cell_001    Cell_001        positive           S      treat1           L
## Cell_002    Cell_002        positive          G0      treat1           I
## Cell_003    Cell_003        negative          G1      treat1           U
## Cell_004    Cell_004        negative           S      treat1           O
## Cell_005    Cell_005        negative          G1      treat2           O
## ...              ...             ...         ...         ...         ...
## Cell_036    Cell_036        negative          G0      treat1           V
## Cell_037    Cell_037        negative          G0      treat1           S
## Cell_038    Cell_038        negative          G0      treat2           Y
## Cell_039    Cell_039        negative          G1      treat1           J
## Cell_040    Cell_040        negative          G0      treat2           A

rowData(example_sce)$stuff <- runif(nrow(example_sce))
rowData(example_sce)

## DataFrame with 2000 rows and 1 column
##                         stuff
##                     <numeric>
## Gene_0001   0.676609133603051
## Gene_0002   0.893353080609813
## Gene_0003   0.857206639135256
## Gene_0004   0.729772792663425
## Gene_0005    0.62721553677693
## ...                       ...
## Gene_1996 0.00954115530475974
## Gene_1997   0.289470643270761
## Gene_1998   0.391338959336281
## Gene_1999   0.163256920641288
## Gene_2000   0.294641267275438

Subsetting is very convenient with this class, as both data and metadata are processed in a synchronized manner. For example, we can filter out features (genes) that are not expressed in any cells:

keep_feature <- rowSums(counts(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]

More details about the SingleCellExperiment class can be found in the documentation for the SingleCellExperiment package.

3 Calculating a variety of expression values

We calculate counts-per-million using the aptly-named calculateCPM function. The output is most appropriately stored as an assay named "cpm" in the assays of the SingleCellExperiment object.

cpm(example_sce) <- calculateCPM(example_sce)
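
As a rough illustration of what a counts-per-million calculation does, the base R sketch below scales each cell's counts by its library size. The toy matrix and the helper name cpm_sketch are invented for this example. This is not scater's implementation, and calculateCPM may differ in detail (for instance in how size factors are handled).

```r
# Toy count matrix: 3 genes (rows) x 2 cells (columns); values are invented.
toy_counts <- matrix(
    c(10, 30, 60,
       0,  5, 15),
    nrow = 3,
    dimnames = list(paste0("Gene_", 1:3), paste0("Cell_", 1:2))
)

# Hypothetical helper: divide each column by its library size (column sum)
# and rescale so that every cell sums to one million.
cpm_sketch <- function(counts) {
    lib_sizes <- colSums(counts)
    sweep(counts, 2, lib_sizes, "/") * 1e6
}

toy_cpm <- cpm_sketch(toy_counts)
colSums(toy_cpm)
```

Every column of toy_cpm sums to one million, which is what makes the values comparable between cells with different sequencing depths.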

Another option is to use the normalize function, which calculates log₂-transformed normalized expression values. This is done by dividing each count by its size factor (or scaled library size, if no size factors are defined), adding a pseudo-count and log-transforming. The resulting values can be interpreted on the same scale as log-transformed counts, and are stored in "logcounts".

example_sce <- normalize(example_sce)
assayNames(example_sce)

## [1] "counts"    "cpm"       "logcounts"
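
The transformation described above can be sketched in base R as follows. The sketch assumes that no size factors are defined, that "scaled library size" means library sizes divided by their mean, and that the pseudo-count is 1. These are assumptions made for illustration, so the exact values returned by normalize may differ.

```r
# Toy count matrix: 3 genes (rows) x 2 cells (columns); values are invented.
toy_counts <- matrix(
    c(10, 30, 60,
       0,  5, 15),
    nrow = 3,
    dimnames = list(paste0("Gene_", 1:3), paste0("Cell_", 1:2))
)

# Assumption: library sizes scaled to have a mean of 1 act as size factors.
lib_sizes <- colSums(toy_counts)
size_factors <- lib_sizes / mean(lib_sizes)

# Divide each cell's counts by its size factor, add a pseudo-count of 1
# (assumed), and take log2.
toy_logcounts <- log2(sweep(toy_counts, 2, size_factors, "/") + 1)
toy_logcounts
```

With a pseudo-count of 1, a count of zero maps to a log-expression value of zero, so zeros in the count matrix remain zeros after transformation.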

Note that exprs is a synonym for logcounts when accessing or setting data. This is done for backwards compatibility with older versions of scater.

identical(exprs(example_sce), logcounts(example_sce))

## [1] TRUE

Of course, users can construct any arbitrary matrix of the same dimensions as the count matrix and store it as an assay.

assay(example_sce, "is_expr") <- counts(example_sce) > 0

The calcAverage function will compute the average count for each gene after scaling each cell’s counts by its size factor. If size factors are not available, it will compute a size factor from the library size.

head(calcAverage(example_sce))

##  Gene_0001  Gene_0002  Gene_0003  Gene_0004  Gene_0005  Gene_0006 
## 305.551749 325.719897 183.090462 162.143201   1.231123 187.167913
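
The idea behind this size-factor-adjusted average can be sketched in base R. The sketch assumes size factors derived from library sizes scaled to a mean of 1, and the toy values are invented. It is an illustration of the calculation described above rather than the code used by calcAverage.

```r
# Toy count matrix: 3 genes (rows) x 2 cells (columns); values are invented.
toy_counts <- matrix(
    c(10, 30, 60,
       0,  5, 15),
    nrow = 3,
    dimnames = list(paste0("Gene_", 1:3), paste0("Cell_", 1:2))
)

# Assumption: size factors are library sizes scaled to have a mean of 1.
size_factors <- colSums(toy_counts) / mean(colSums(toy_counts))

# Scale each cell by its size factor, then average across cells per gene.
toy_average <- rowMeans(sweep(toy_counts, 2, size_factors, "/"))
toy_average
```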

4 Other methods of data import

Count matrices stored as CSV files or equivalent can be easily read into an R session using read.table from utils or fread from the data.table package. It is advisable to coerce the resulting object into a matrix before storing it in a SingleCellExperiment object.
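
A minimal sketch of this workflow with read.table is shown below. The file is a small temporary CSV written on the fly so that the example is self-contained. The layout (feature names in the first column, one column per cell) is an assumption about how the counts are stored.

```r
# Write a small CSV to a temporary file; values and names are invented.
csv_path <- tempfile(fileext = ".csv")
toy <- data.frame(
    Cell_1 = c(10L, 30L, 60L),
    Cell_2 = c(0L, 5L, 15L),
    row.names = paste0("Gene_", 1:3)
)
write.csv(toy, csv_path)

# Read it back; the first column holds the feature names.
count_df <- read.table(csv_path, header = TRUE, sep = ",",
    row.names = 1, check.names = FALSE)

# Coerce the data.frame to a matrix before storing it as an assay.
count_mat <- as.matrix(count_df)
str(count_mat)
```

The resulting count_mat can then be passed as the counts entry of the assays list when calling SingleCellExperiment.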

For large data sets, the matrix can be read in chunk-by-chunk with progressive coercion into a sparse matrix from the Matrix package. This is performed using readSparseCounts and reduces memory usage by not explicitly storing zeroes in memory.
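
To illustrate the chunk-by-chunk idea, the sketch below reads a small temporary CSV a few rows at a time and converts each chunk to a sparse matrix before combining them. The file, the chunk size and the variable names are invented for this example. It is a simplified sketch of the general approach, not the implementation of readSparseCounts.

```r
library(Matrix)

# Write a small, mostly-zero count table to a temporary CSV (invented values).
toy <- matrix(
    c(0, 0, 3, 0, 1, 0,
      0, 0, 0, 2, 0, 0,
      5, 0, 0, 0, 0, 0),
    nrow = 6,
    dimnames = list(paste0("Gene_", 1:6), paste0("Cell_", 1:3))
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path)

chunk_size <- 2  # rows per chunk; far larger in practice

con <- file(csv_path, open = "r")
# The header line gives the cell names (the first field is the row-name column).
header <- readLines(con, n = 1)
cell_names <- scan(text = header, what = "", sep = ",", quiet = TRUE)[-1]

chunks <- list()
repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    dense <- as.matrix(read.table(text = lines, sep = ",", row.names = 1,
        col.names = c("feature", cell_names)))
    storage.mode(dense) <- "double"
    # Only this small dense chunk is ever held in memory at once.
    chunks[[length(chunks) + 1]] <- as(dense, "CsparseMatrix")
}
close(con)

# Combine the sparse chunks row-wise into one sparse count matrix.
sparse_counts <- do.call(rbind, chunks)
sparse_counts
```

Because zeroes are not stored explicitly, the combined object only keeps the non-zero entries (four in this toy example) plus their indices.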

Data from 10X Genomics experiments can be read in using the read10xCounts function from the DropletUtils package. This will automatically generate a SingleCellExperiment with a sparse matrix; see the documentation for more details.

scater also provides the wrapper functions readSalmonResults and readKallistoResults to import transcript abundances from the Salmon and kallisto pseudo-aligners, respectively. This is done using methods from the tximport package.

5 Transitioning from the `SCESet` class

As of July 2017, scater has switched from the SCESet class previously defined within the package to the more widely applicable SingleCellExperiment class. From Bioconductor 3.6 (October 2017) onwards, the release version of scater uses SingleCellExperiment. SingleCellExperiment is a more modern and robust class that provides a common data structure used by many single-cell Bioconductor packages. Advantages include support for sparse data matrices and the capability for on-disk storage of data to minimise memory usage for large single-cell datasets.

It should be straightforward to convert existing scripts based on SCESet objects to SingleCellExperiment objects, with key changes outlined immediately below.

The functions toSingleCellExperiment and updateSCESet (for backwards compatibility) can be used to convert an old SCESet object to a SingleCellExperiment object;
A new SingleCellExperiment object can be created with the function SingleCellExperiment (actually less fiddly than creating a new SCESet);
scater functions have been refactored to take SingleCellExperiment objects, so once data is in a SingleCellExperiment object, the user experience is almost identical to that with the SCESet class.

Users may need to be aware of the following when updating their own scripts:

Cell names can now be accessed/assigned with the colnames function (instead of sampleNames or cellNames for an SCESet object);
Feature (gene/transcript) names should now be accessed/assigned with the rownames function (instead of featureNames);
Cell metadata, stored as phenoData in an SCESet, corresponds to colData in a SingleCellExperiment object and is accessed/assigned with the colData function (this replaces the pData function);
Individual cell-level variables can still be accessed with the $ operator (e.g. sce$total_counts);
Feature metadata, stored as featureData in an SCESet, corresponds to rowData in a SingleCellExperiment object and is accessed/assigned with the rowData function (this replaces the fData function);
plotScater, which produces an overview plot of cumulative expression, replaces the generic plot function for SCESet objects.
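
The accessor changes listed above can be tried out on a small object, as in the sketch below. The toy counts and the metadata fields batch and symbol are invented for illustration. The comments name the SCESet accessor that each call replaces, as described in the list.

```r
library(SingleCellExperiment)

# Toy object: 3 genes x 2 cells; values and names are invented.
toy_counts <- matrix(1:6, nrow = 3,
    dimnames = list(paste0("Gene_", 1:3), paste0("Cell_", 1:2)))
sce <- SingleCellExperiment(assays = list(counts = toy_counts))

colnames(sce)    # cell names; replaces sampleNames() / cellNames()
rownames(sce)    # feature names; replaces featureNames()

# Cell metadata lives in colData (replaces pData); $ still works.
colData(sce)$batch <- c("a", "b")
sce$batch

# Feature metadata lives in rowData (replaces fData).
rowData(sce)$symbol <- c("A", "B", "C")
rowData(sce)
```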

Session information

sessionInfo()

## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scater_1.10.1               SingleCellExperiment_1.4.1 
##  [3] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
##  [5] BiocParallel_1.16.5         matrixStats_0.54.0         
##  [7] Biobase_2.42.0              GenomicRanges_1.34.0       
##  [9] GenomeInfoDb_1.18.1         IRanges_2.16.0             
## [11] S4Vectors_0.20.1            BiocGenerics_0.28.0        
## [13] ggplot2_3.1.0               knitr_1.21                 
## [15] BiocStyle_2.10.0           
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6             destiny_2.12.0          
##  [3] xts_0.11-2               tools_3.5.2             
##  [5] R6_2.3.0                 HDF5Array_1.10.1        
##  [7] vipor_0.4.5              lazyeval_0.2.1          
##  [9] colorspace_1.3-2         nnet_7.3-12             
## [11] withr_2.1.2              sp_1.3-1                
## [13] smoother_1.1             tidyselect_0.2.5        
## [15] gridExtra_2.3            curl_3.2                
## [17] compiler_3.5.2           labeling_0.3            
## [19] bookdown_0.9             scales_1.0.0            
## [21] lmtest_0.9-36            DEoptimR_1.0-8          
## [23] robustbase_0.93-3        proxy_0.4-22            
## [25] stringr_1.3.1            digest_0.6.18           
## [27] foreign_0.8-71           rmarkdown_1.11          
## [29] rio_0.5.16               XVector_0.22.0          
## [31] pkgconfig_2.0.2          htmltools_0.3.6         
## [33] TTR_0.23-4               ggthemes_4.0.1          
## [35] rlang_0.3.0.1            readxl_1.2.0            
## [37] DelayedMatrixStats_1.4.0 bindr_0.1.1             
## [39] zoo_1.8-4                dplyr_0.7.8             
## [41] zip_1.0.0                car_3.0-2               
## [43] RCurl_1.95-4.11          magrittr_1.5            
## [45] GenomeInfoDbData_1.2.0   Matrix_1.2-15           
## [47] Rcpp_1.0.0               ggbeeswarm_0.6.0        
## [49] munsell_0.5.0            Rhdf5lib_1.4.2          
## [51] abind_1.4-5              viridis_0.5.1           
## [53] scatterplot3d_0.3-41     stringi_1.2.4           
## [55] yaml_2.2.0               carData_3.0-2           
## [57] MASS_7.3-51.1            zlibbioc_1.28.0         
## [59] rhdf5_2.26.2             Rtsne_0.15              
## [61] plyr_1.8.4               grid_3.5.2              
## [63] forcats_0.3.0            crayon_1.3.4            
## [65] lattice_0.20-38          haven_2.0.0             
## [67] cowplot_0.9.3            hms_0.4.2               
## [69] pillar_1.3.1             igraph_1.2.2            
## [71] boot_1.3-20              reshape2_1.4.3          
## [73] glue_1.3.0               evaluate_0.12           
## [75] laeken_0.4.6             data.table_1.11.8       
## [77] BiocManager_1.30.4       vcd_1.4-4               
## [79] VIM_4.7.0                cellranger_1.1.0        
## [81] gtable_0.2.0             purrr_0.2.5             
## [83] assertthat_0.2.0         xfun_0.4                
## [85] openxlsx_4.1.0           RcppEigen_0.3.3.5.0     
## [87] e1071_1.7-0              class_7.3-15            
## [89] viridisLite_0.3.0        tibble_2.0.0            
## [91] beeswarm_0.2.3           bindrcpp_0.2.2

Introduction to `scater`: Single-cell analysis toolkit for expression in R

4 January 2019
