scater
: Single-cell analysis toolkit for expression in RThis document gives an introduction to and overview of the functionality of the scater
package.
The scater
package is contains tools to help with the analysis of single-cell transcriptomic data, with the focus on RNA-seq data. The package features:
kallisto
for rapid quantification of transcript abundance and tight integration with scater
;Future versions of scater
may also incorporate:
To get up and running as quickly as possible, see the Quick Start section below. For see the various in-depth sections on various aspects of the functionality that follow.
Assuming you have a matrix containing expression count data summarised at the level of some features (gene, exon, region, etc.), then we first need to form an SCESet
object containing the data. An SCESet
object is the basic data container that we use in scater
.
Here we use the example data provided with the package, which gives us two objects, a matrix of counts and a dataframe with information about the cells we are studying:
library(scater, quietly = TRUE)
data("sc_example_counts")
data("sc_example_cell_info")
We use these objects to form an SCESet
object containing all of the necessary information for our analysis:
pd <- new("AnnotatedDataFrame", data = sc_example_cell_info)
rownames(pd) <- pd$Cell
example_sceset <- newSCESet(countData = sc_example_counts, phenoData = pd)
Subsetting is very convenient with this class. For example, we can filter out features (genes) that are not expressed in any cells:
keep_feature <- rowSums(is_exprs(example_sceset)) > 0
example_sceset <- example_sceset[keep_feature,]
Now we have the expression data neatly stored in a structure that can be used for lots of exciting analyses.
It is straight-forward to compute many quality control metrics:
example_sceset <- calculateQCMetrics(example_sceset, feature_controls = 1:40)
Now you can play around with your data using the graphical user interface (GUI), which opens an interactive dashboard in your browser!
scater_gui(example_sceset)
Many plotting functions are available for visualising the data:
plot
: a plot method exists for SCESet
objects, which gives an overview of expression across cells.plotQC
: various methods are available for producing QC diagnostic plots.plotPCA
: produce a principal components plot for the cells.plotTSNE
: produce a t-distributed stochastic neighbour embedding (reduced dimension) plot for the cells.plotDiffusionMap
: produce a diffusion map (reduced dimension) plot for the cells.plotMDS
: produce a multi-dimensional scaling plot for the cells.plotReducedDim
: plot a reduced-dimension representation of the cells.plotExpression
: plot expression levels for a defined set of features.plotPhenoData
: plot cell metadata and QC metrics.plotFeatureData
: plot feature metadata and QC metrics.More detail on all plotting methods is given throughout the vignette below.
Visualisations can highlight features and cells to be filtered out, which can be done easily with the subsetting capabilities of scater.
The QC plotting functions also enable the identification of important experimental variables, which can be conditioned out in the data normalisation step.
After QC and data normalisation (methods are available in scater
), the dataset is ready for downstream statistical modeling.
The capabilities of scater
are built on top of Bioconductor’s Biobase
, which ties us to some specific terminology. Some of this terminology may be unfamiliar or seem peculiar for the single-cell RNA-seq setting scater
is designed for.
In Bioconductor terminology we assay numerous “features” for a number of “samples”.
Features, in the context of scater
, correspond most commonly to genes or transcripts, but could be any general genomic or transcriptomic regions (e.g. exon) of interest for which we take measurements. Thus, a command such as featureNames
, returns the names of the features defined for an SCESet
object, which in typical scater
usage could correspond to gene IDs. In much of what follows, it may be more intuitive to mentally replace “feature” with “gene” or “transcript” (depending on the context of the study) wherever “feature” appears.
In the scater
context, “samples” refer to individual cells that we have assayed. This differs from common usage of “sample” in other contexts, where we might usually use “sample” to refer to an individual subject, a biological replicate or similar. A “sample” in this sense in scater
may be referred to as a “block” in the more classical statistical sense. Within a “block” (e.g. individual) we may have assayed numerous cells. Thus, the function sampleNames
, when applied to an SCESet
object returns the cell IDs.
In scater
we organise single-cell expression data in objects of the SCESet
class. The class inherits the Bioconductor ExpressionSet
class, which provides a common interface familiar to those who have analyzed microarray experiments with Bioconductor. The class requires three input files:
exprs
, a numeric matrix of expression values, where rows are features, and columns are cellsphenoData
, an AnnotatedDataFrame
object, where rows are cells, and columns are cell attributes (such as cell type, culture condition, day captured, etc.)featureData
, an AnnotatedDataFrame
object, where rows are features (e.g. genes), and columns are feature attributes, such as biotype, gc content, etc.For more details about other features inherited from Bioconductor’s ExpressionSet
class, type ?ExpressionSet
at the R prompt.
The requirements for the SCESet
class (as with other S4 classes in R and Bioconductor) are strict. The idea is that strictness with generating a valid class object ensures that downstream methods applied to the class will work reliably. Thus, the expression value matrix must have the same number of columns as the phenoData
object has rows, and it must have the same number of rows as the featureData
dataframe has rows. Row names of the phenoData
object need to match the column names of the expression matrix. Row names of the featureData
object need to match row names of the expression matrix.
You can create a new SCESet
object using count data as follows. In this case, the exprs
slot in the object will be generated as log2(counts-per-million) using the cpm
function from edgeR
, with a “prior count” value of 1:
pd <- new("AnnotatedDataFrame", data = sc_example_cell_info)
rownames(pd) <- pd$Cell
gene_df <- data.frame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
fd <- new("AnnotatedDataFrame", data = gene_df)
example_sceset <- newSCESet(countData = sc_example_counts, phenoData = pd,
featureData = fd)
example_sceset
## SCESet (storageMode: environment)
## assayData: 2000 features, 40 samples
## element names: counts, cpm, exprs, is_exprs
## protocolData: none
## phenoData
## sampleNames: Cell_001 Cell_002 ... Cell_040 (40 total)
## varLabels: Cell Mutation_Status Cell_Cycle Treatment
## varMetadata: labelDescription
## featureData
## featureNames: Gene_0001 Gene_0002 ... Gene_2000 (2000 total)
## fvarLabels: Gene
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
We can also make an SCESet
object with just a matrix of expression values (no count data required) like this:
example2 <- newSCESet(exprsData = edgeR::cpm.default(sc_example_counts))
pData(example2)
## data frame with 0 columns and 40 rows
fData(example2)
## data frame with 0 columns and 2000 rows
We have accessor functions to access elements of the SCESet
object:
counts(object)
: returns the matrix of read counts. As you can see above, if no counts are defined for the object, then the counts matrix slot is simpy NULL
.counts(example2)[1:3, 1:6]
## NULL
exprs(object)
: returns the matrix of feature expression values. Typically these should be log2(counts-per-million) values or log2(reads-per-kilobase-per-million-mapped), appropriately normalised of course. The package will generally assume that these are the values to use for expression.exprs(example2)[1:3, 1:6]
## Cell_001 Cell_002 Cell_003 Cell_004 Cell_005 Cell_006
## Gene_0001 0.00 749.5293 6.561271 0.000 0.000 0.0000
## Gene_0002 1344.85 396.0927 9.841906 5558.424 2826.476 923.5422
## Gene_0003 0.00 0.0000 0.000000 0.000 1483.563 0.0000
is_exprs(object)
: returns a logical matrix indicating whether each feature expression observation is above the defined lowerDetectionLimit
(default is 0). This can be determined on the count scale or the “expression” (i.e. exprs(object)
) scale.is_exprs(example2)[1:3, 1:6]
## Cell_001 Cell_002 Cell_003 Cell_004 Cell_005 Cell_006
## Gene_0001 FALSE TRUE TRUE FALSE FALSE FALSE
## Gene_0002 TRUE TRUE TRUE TRUE TRUE TRUE
## Gene_0003 FALSE FALSE FALSE FALSE TRUE FALSE
It is straight-forward to change the threshold by which we decide which observations are expressed or not:
is_exprs(example2) <- calcIsExprs(example2, lowerDetectionLimit = 100,
exprs_data = "exprs")
is_exprs(example2)[1:3, 1:6]
## Cell_001 Cell_002 Cell_003 Cell_004 Cell_005 Cell_006
## Gene_0001 FALSE TRUE FALSE FALSE FALSE FALSE
## Gene_0002 TRUE TRUE FALSE TRUE TRUE TRUE
## Gene_0003 FALSE FALSE FALSE FALSE TRUE FALSE
If you later find some count data that you want to add to the SCESet
object (always worth checking down the back of the couch), then you can add it easily:
counts(example2) <- sc_example_counts
example2
## SCESet (storageMode: environment)
## assayData: 2000 features, 40 samples
## element names: counts, exprs, is_exprs
## protocolData: none
## phenoData: none
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
counts(example2)[1:3, 1:6]
## Cell_001 Cell_002 Cell_003 Cell_004 Cell_005 Cell_006
## Gene_0001 0 123 2 0 0 0
## Gene_0002 575 65 3 1561 2311 160
## Gene_0003 0 0 0 0 1213 0
Handily, it is also easy to replace other data in slots of the SCESet
object using generic accessor and replacement functions.
gene_df <- data.frame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
fd <- new("AnnotatedDataFrame", data = gene_df)
## replace featureData
fData(example_sceset) <- fd
## replace phenotype data
pData(example_sceset) <- pd
## replace expression data to be used
exprs(example_sceset) <- edgeR::cpm.default(counts(example_sceset),
prior.count = 5, log = TRUE)
It is possible to get an overall view of the dataset by using the plot
method available for SCESet
objects. This method plots the cumulative proportion of each cell’s library that is accounted for by the top highest-expressed features (by default showing the cumulative proportion across the top 500 features).
plot(example_sceset)
## Warning in plotSCESet(x, ...): The object does not contain tpm expression
## values. Using exprs(object) values instead.
This gives an overall idea of differences in expression distributions for different cells. It is used in the same way as per-sample boxplots are for microarray or bulk RNA-seq data. Due to the large numbers of zeroes in expression values for single-cell RNA-seq data, boxplots are not as useful, so instead we focus on the contributions from the most expressed features for each cell.
With this function, we can split up the cells based on phenoData variables to get a finer-grained look at differences between cells. By default, the plot method will try to use transcripts-per-million values for the plot, but if these are not present in the SCESet
object, then the values in exprs(object)
will be used. Other expression values can be used, specified with the exprs_values
argument
plot(example_sceset, block1 = "Mutation_Status", block2 = "Treatment",
colour_by = "Cell_Cycle", nfeatures = 300, exprs_values = "counts")
This sort of approach can help to pick up large differences in expression distributions across different experimental blocks (e.g. processing batches or similar.)
In scater
, the plotExpression
function makes it easy to plot expression values for a subset of genes or features. This can be particularly useful when investigating the some features identified as being of interest from differential expression testing or other means.
plotExpression(example_sceset, rownames(example_sceset)[1:6],
x = "Mutation_Status", exprs_values = "exprs")
This function uses ggplot2
, making it easy to change the theme to whatever you prefer. We can also show the median expression level per group on the plot and show a violin plot to summarise the distribution of expression values:
plotExpression(example_sceset, rownames(example_sceset)[7:12],
x = "Mutation_Status", exprs_values = "counts", colour = "Cell_Cycle",
show_median = FALSE, show_violin = TRUE, xlab = "Mutation Status",
log = TRUE)
The scater
package puts a focus on aiding with quality control (QC) and pre-processing of single-cell RNA-seq data before further downstream analysis.
We see QC as consisting of three distinct steps:
Following QC, we can proceed with data normalisation before downstream analysis and modelling.
In the next few sections we discuss the QC and filtering capabilities available in scater
.
To compute commonly-used QC metrics we have the function calculateQCMetrics()
:
example_sceset <- calculateQCMetrics(example_sceset, feature_controls = 1:20)
varLabels(example_sceset)
## [1] "Cell"
## [2] "Mutation_Status"
## [3] "Cell_Cycle"
## [4] "Treatment"
## [5] "total_counts"
## [6] "log10_total_counts"
## [7] "filter_on_total_counts"
## [8] "total_features"
## [9] "log10_total_features"
## [10] "filter_on_total_features"
## [11] "pct_dropout"
## [12] "exprs_feature_controls"
## [13] "pct_exprs_feature_controls"
## [14] "filter_on_pct_exprs_feature_controls"
## [15] "pct_exprs_top_50_features"
## [16] "pct_exprs_top_100_features"
## [17] "pct_exprs_top_200_features"
## [18] "counts_feature_controls"
## [19] "pct_counts_feature_controls"
## [20] "filter_on_pct_counts_feature_controls"
## [21] "pct_counts_top_50_features"
## [22] "pct_counts_top_100_features"
## [23] "pct_counts_top_200_features"
## [24] "n_detected_feature_controls"
## [25] "exprs_endogenous_features"
## [26] "counts_endogenous_features"
## [27] "log10_exprs_feature_controls"
## [28] "log10_counts_feature_controls"
## [29] "log10_exprs_endogenous_features"
## [30] "log10_counts_endogenous_features"
## [31] "is_cell_control"
More than one set of feature controls can be defined if desired.
example_sceset <- calculateQCMetrics(
example_sceset, feature_controls = list(controls1 = 1:20, controls2 = 500:1000),
cell_controls = list(set_1 = 1:5, set_2 = 31:40))
varLabels(example_sceset)
## [1] "Cell"
## [2] "Mutation_Status"
## [3] "Cell_Cycle"
## [4] "Treatment"
## [5] "total_counts"
## [6] "log10_total_counts"
## [7] "filter_on_total_counts"
## [8] "total_features"
## [9] "log10_total_features"
## [10] "filter_on_total_features"
## [11] "pct_dropout"
## [12] "exprs_feature_controls_controls1"
## [13] "pct_exprs_feature_controls_controls1"
## [14] "filter_on_pct_exprs_feature_controls_controls1"
## [15] "counts_feature_controls_controls1"
## [16] "pct_counts_feature_controls_controls1"
## [17] "filter_on_pct_counts_feature_controls_controls1"
## [18] "n_detected_feature_controls_controls1"
## [19] "exprs_feature_controls_controls2"
## [20] "pct_exprs_feature_controls_controls2"
## [21] "filter_on_pct_exprs_feature_controls_controls2"
## [22] "counts_feature_controls_controls2"
## [23] "pct_counts_feature_controls_controls2"
## [24] "filter_on_pct_counts_feature_controls_controls2"
## [25] "n_detected_feature_controls_controls2"
## [26] "exprs_feature_controls"
## [27] "pct_exprs_feature_controls"
## [28] "filter_on_pct_exprs_feature_controls"
## [29] "pct_exprs_top_50_features"
## [30] "pct_exprs_top_100_features"
## [31] "pct_exprs_top_200_features"
## [32] "counts_feature_controls"
## [33] "pct_counts_feature_controls"
## [34] "filter_on_pct_counts_feature_controls"
## [35] "pct_counts_top_50_features"
## [36] "pct_counts_top_100_features"
## [37] "pct_counts_top_200_features"
## [38] "n_detected_feature_controls"
## [39] "exprs_endogenous_features"
## [40] "counts_endogenous_features"
## [41] "log10_exprs_feature_controls_controls1"
## [42] "log10_counts_feature_controls_controls1"
## [43] "log10_exprs_feature_controls_controls2"
## [44] "log10_counts_feature_controls_controls2"
## [45] "log10_exprs_feature_controls"
## [46] "log10_counts_feature_controls"
## [47] "log10_exprs_endogenous_features"
## [48] "log10_counts_endogenous_features"
## [49] "is_cell_control_set_1"
## [50] "is_cell_control_set_2"
## [51] "is_cell_control"
This function adds the following columns to pData(object)
:
total_counts
: total number of counts for the cell (aka ‘library size’)log10_total_counts
: total_counts on the log10-scaletotal_features
: the number of features for the cell that have expression above the detection limit (default detection limit is zero)filter_on_total_counts
: would this cell be filtered out based on its log10-total_counts being (by default) more than 5 median absolute deviations from the median log10-total_counts for the dataset?filter_on_total_features
: would this cell be filtered out based on its total_features being (by default) more than 5 median absolute deviations from the median total_features for the dataset?counts_feature_controls
: total number of counts for the cell that come from (a set of user-defined) control features. Defaults to zero if no control features are indicated.counts_endogenous_features
: total number of counts for the cell that come from endogenous features (i.e. not control features). Defaults to total_counts
if no control features are indicated.log10_counts_feature_controls
: total number of counts from control features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.log10_counts_endogenous_features
: total number of counts from endogenous features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.n_detected_feature_controls
: number of defined feature controls that have expression greater than the threshold defined in the object. *pct_counts_feature_controls
: percentage of all counts that come from the defined control features. Defaults to zero if no control features are defined.If we define multiple sets of feature controls, then the above will be supplied for all feature sets, plus the set of all feature controls combined, as appropriate.
Furthermore, where “counts” appear in the above, the same metrics will also be computed for “exprs”, “tpm” and “fpkm” (if tpm and fpkm are present in the SCESet
object).
The function further adds the following columns to fData(object)
:
mean_exprs
: the mean expression level of the gene/feature.exprs_rank
: the rank of the feature’s expression level in the cell.total_feature_counts
: the total number of counts mapped to that feature across all cells.log10_total_feature_counts
: total feature counts on the log10-scale.pct_total_counts
: the percentage of all counts that are accounted for by the counts mapping to the feature.is_feature_control
: is the feature a control feature? Default is FALSE
unless control features are defined by the user.n_cells_exprs
: the number of cells for which the expression level of the feature is above the detection limit (default detection limit is zero).names(fData(example_sceset))
## [1] "Gene" "mean_exprs"
## [3] "exprs_rank" "n_cells_exprs"
## [5] "total_feature_exprs" "log10_total_feature_exprs"
## [7] "pct_total_exprs" "pct_dropout"
## [9] "total_feature_counts" "log10_total_feature_counts"
## [11] "pct_total_counts" "is_feature_control_controls1"
## [13] "is_feature_control_controls2" "is_feature_control"
As above, where “counts” appear in the above, the same metrics will also be computed for “exprs”, “tpm” and “fpkm” (if tpm and fpkm are present in the SCESet
object).
Visualising the data and metadata in various ways can be very helpful for QC. We have a suite of plotting functions to produce diagnostic plots for:
pData(object)
).These three QC plots can all be accessed through the function plotQC
(we need to make sure there are no features with zero or constant expression).
The first step in the QC process is filtering out unwanted features. We will typically filter out features with very low overall expression, and any others that plots or other metrics indicate may be problematic.
First we look at a plot that shows the top 50 (by default) most-expressed features. By default, “expression” is defined using the feature counts (if available), but \(tpm\), \(cpm\), \(fpkm\) or the exprs
values can be used instead, if desired.
keep_feature <- rowSums(is_exprs(example_sceset)) > 4
example_sceset <- example_sceset[keep_feature,]
## Plot QC
plotQC(example_sceset, type = "highest-expression")
p1 <- plotQC(example_sceset, type = "highest-expression", exprs_values = "exprs")
p2 <- plotQC(example_sceset, type = "highest-expression", exprs_values = "tpm")
## Warning in get_exprs.SCESet(object, exprs_values): The object does not
## contain tpm expression values. Returning NULL.
## Using counts as expression values.
multiplot(p1, p2, cols = 2)
As we can see in the output above, if the desired expression values are not present in the SCESet
object (e.g. there are no \(tpm\) values, above), then exprs(object)
values will be used instead, with a warning. So in the above, we end up with the same plot twice.
The multiplot
function allows a very simple way to plot multiple ggplot2
plots on the same page. For more sophisticated possibilities for arranging multiple ggplot2
plots, check out the excellent cowplot
package, available on CRAN. If you have cowplot
installed (highly recommended), then scater
will automatically use it to create particularly attractive plots.
It can also be particularly useful to inspect the most-expressed features in just the cell controls (for example blanks or bulk samples). Subsetting capabilities for SCESet
objects allow us to do this easily. In the previous section, we defined two sets of cell controls in the call to calculateQCMetrics
. That function added the is_cell_control
column to the phenotype data of the SCESet
object example_sceset
, which indicates if a cell is defined as a cell control across any of the cell control sets.
The $
operator makes it easy to access the is_cell_control
column and use it to subset the SCESet
as below. We can compare the most-expressed features in the cell controls and in the cells of biological interest with this subsetting, as shown below.
p1 <- plotQC(example_sceset[, !example_sceset$is_cell_control],
type = "highest-expression")
p2 <- plotQC(example_sceset[, example_sceset$is_cell_control],
type = "highest-expression")
multiplot(p1, p2, cols = 2)
Another way to obtain an idea of the level of technical noise in the dataset is to plot the frequency of expression (that is, number of cells with expression for the gene above the defined threshold (default is zero)) against mean expression expression level . A set of specific features to plot can be defined, but need not be. By default, the function will look for defined feature controls (as supplied to calculateQCMetrics
). If feature controls are found, then these will be plotted, if not then all features will be plotted.
plotQC(example_sceset, type = "exprs-freq-vs-mean")
We can also plot just a subset of features with code like that below (plot not shown):
feature_set_1 <- fData(example_sceset)$is_feature_control_controls1
plotQC(example_sceset, type = "exprs-freq-vs-mean", feature_set = feature_set_1)
Beyond these QC plots, we have a neat, general and flexible function for plotting two feature metadata variables:
plotFeatureData(example_sceset, aes(x = n_cells_exprs, y = pct_total_counts))
We can see that there is a small number of features that are ubiquitously expressed expressed in all cells (n_cells_exprs
) and account for a large proportion of all counts observed (pct_total_counts
; more than 0.5% of all counts).
The subsetting of rows of SCESet
objects makes it easy to drop unwanted features.
See plotPhenoData
and other QC plots below. The subsetting of columns (which correspond to cells) of SCESet
objects makes it easy to drop unwanted cells.
We also have neat functions to plot two cell metadata variables:
plotPhenoData(example_sceset, aes(x = total_counts, y = total_features,
colour = Mutation_Status))
plotPhenoData(example_sceset, aes(x = Mutation_Status, y = total_features,
colour = log10_total_counts))
Note that ggplot aesthetics will work correctly (in general) for everything except colour
(color
) and fill
, which must be either columns of pData
or feature names (i.e. gene/transcript names).
These sorts of plots can be very useful for finding potentially problematic cells.
plotPhenoData(example_sceset, aes(x = total_counts, y = total_features,
colour = Gene_1000))
plotPhenoData(example_sceset, aes(x = pct_exprs_feature_controls,
y = total_features, colour = Gene_0500))
plotPhenoData(example_sceset, aes(x = pct_exprs_feature_controls,
y = pct_exprs_top_50_features,
colour = Gene_0001))
The output of these functions is a ggplot
object, which can be added to, amended and altered. For example, if we don’t like the legend position in the total_features
vs total_counts
plot above, we can change it, and we could also add a trend line for each group:
plotPhenoData(example_sceset, aes(x = log10_exprs_feature_controls,
y = total_features, colour = Mutation_Status)) +
theme(legend.position = "top") +
stat_smooth(method = "lm", se = FALSE, size = 2, fullrange = TRUE)
Tapping into the powerful capabilities of ggplot2
, the possibilities are many.
A particularly useful plot for cell QC is plotting the percentage of expression accounted for by feature controls against total_features.
plotPhenoData(example_sceset, aes(x = total_features,
y = pct_exprs_feature_controls,
colour = Mutation_Status)) +
theme(legend.position = "top") +
stat_smooth(method = "lm", se = FALSE, size = 2, fullrange = TRUE)
On real data, we expect to see well-behaved cells with relatively high total_features (number of features with detectable expression) and low percentage of expression from feature controls. High percentage expression from feature controls and low total_features are indicative of blank and failed cells.
The plotPhenoData
function is useful for exploring the relationships between the many QC metrics computed by calculateQCMetrics
above. Often, problematic cells can be identified from such plots.
The plotPCA
function makes it easy to produce a PCA plot directly from an SCESet
object, which is useful for visualising cells.
The default plot shows the first two principal components and if any cell controls have been defined, plots these cells in a different colour.
plotPCA(example_sceset)
By default, the PCA plot is produced using the 500 features with the most variable expression across all cells. The number of most-variable features used can be changed with the ntop
argument. Alternatively, a specific set of features to use for the PCA can be defined with the feature_set
argument. This allows, for example, using only housekeeping features or control features to produce a PCA plot.
By default the PCA plot uses the expression values in the exprs
slot of the SCESet
object, but other expression values can be used by specifying the exprs_values
argument:
plotPCA(example_sceset, exprs_values = "cpm")
A subset of features can be used to produce a PCA plot. In the code below only the features defined as “feature controls” are used for the PCA (plot not shown).
plotPCA(example_sceset, feature_set = fData(example_sceset)$is_feature_control)
The function allows more than just the first two components to be plotted, and also allows phenotype variables to be used to define the colour, shape and size of points in the scatter plot.
plotPCA(example_sceset, ncomponents = 4, colour_by = "Treatment",
shape_by = "Mutation_Status")
When more than two components are plotted, the diagonal boxes in the scatter plot matrix show the density for each component.
We can also use the colour and size of point in the plot to reflect feature expression values.
plotPCA(example_sceset, colour_by = "Gene_0001", size_by = "Gene_1000")
plotPCA(example_sceset, size_by = "Gene_1000")
plotPCA(example_sceset, colour_by = "Gene_1000")
Thus, expression levels of two marker genes or transcripts can be shown overlaid onto reduced-dimension representations of cells. This also works for plotTSNE
plots.
plotTSNE(example_sceset, colour_by = "Gene_0001", size_by = "Gene_1000")
And for diffusion map plots.
plotDiffusionMap(example_sceset, colour_by = "Gene_0001", size_by = "Gene_1000")
The SCESet
matrix has a reducedDimension
slot, where coordinates for a reduced dimension representation of the cells can be stored. If we so wish, the top principal components can be added to the reducedDimension
slot:
example_sceset <- plotPCA(example_sceset, ncomponents = 4,
colour_by = "Treatment", shape_by = "Mutation_Status",
return_SCESet = TRUE, theme_size = 12)
head(reducedDimension(example_sceset))
## PC1 PC2 PC3 PC4 PC5
## Cell_001 -11.176605 2.0532262 -5.5545027 0.4341370 -1.42997924
## Cell_002 -1.428448 2.6701895 0.5500525 0.9448481 0.09050695
## Cell_003 2.905914 0.9408687 -1.9674264 -0.7710538 -6.86084148
## Cell_004 -9.937164 1.9355382 5.8211075 -0.6894513 6.09765627
## Cell_005 5.605234 3.3310936 5.0907582 -3.3590816 7.01020625
## Cell_006 -1.852437 -2.8529658 -8.3223904 -2.7249169 -0.72997737
## PC6 PC7 PC8 PC9 PC10
## Cell_001 0.81372499 -3.267668 1.3423409 4.7062523 -0.9748436
## Cell_002 -2.61295082 13.323190 -1.4775866 -2.5866046 -6.5973111
## Cell_003 -6.47426447 4.800351 0.9573528 -0.1990285 8.1607279
## Cell_004 4.28597968 2.821533 2.5280484 4.0903161 3.8362059
## Cell_005 -3.48664067 -5.564452 9.6274245 0.1674105 -0.2853712
## Cell_006 0.03192879 -5.031326 -0.3744514 -6.4476639 -1.5961861
As shown above, the function reducedDimension
(as well as the shorthand redDim
) can be used to access the reduced dimension coordinates.
This means that any other dimension reduction method can be applied and the coordinates stored. For example, we might wish to use t-distributed stochastic nearest neighbour embedding (t-SNE) or Gaussian process latent variable models or any other dimensionality reduction method. We can store these in the SCESet
object and produce plots just as we did with PCA:
plotReducedDim(example_sceset, ncomponents = 4, colour_by = "Treatment",
shape_by = "Mutation_Status")
As for the PCA plots we can also colour and size points by feature expression.
plotReducedDim(example_sceset, ncomponents = 4, colour_by = "Gene_1000",
size_by = "Gene_0500")
(Here, the dimensionality reduction is just PCA, so we have the same plot as the PCA plot above.)
Based on PCA or dimensionality reduction plots we may identify outlier cells and, if we wish, filter them out of the analysis. There is also an outlier detection option available with the plotPCA function.
example_sceset <- plotPCA(example_sceset, pca_data_input = "pdata",
detect_outliers = TRUE, return_SCESet = TRUE)
## sROC 0.1-2 loaded
## The following cells/samples are detected as outliers:
##
## Variables with highest loadings for PC1 and PC2:
##
## PC1 PC2
## --------------------------------- ----------- -----------
## pct_counts_top_100_features 0.4134957 0.2473997
## pct_counts_feature_controls 0.1653627 0.4157540
## log10_counts_feature_controls -0.4030576 0.5920277
## log10_counts_endogenous_features -0.4271752 0.4937645
## n_detected_feature_controls -0.4755199 -0.2595464
## total_features -0.4802325 -0.3229203
The $outlier
element of the pData (phenotype data) slot of the SCESet
contains indicator values about whether or not each cell has been designated as an outlier based on the PCA. Here, these values can be accessed for filtering low quality cells with example_sceset$outlier
.
We can also use t-SNE and diffusion maps to visualise cells in a reduced-dimension space:
plotTSNE(example_sceset, colour_by = "Treatment", shape_by = "Mutation_Status")
plotTSNE(example_sceset, colour_by = "Gene_0001", size_by = "Gene_1000")
plotDiffusionMap(example_sceset, size_by = "Gene_1000")
plotDiffusionMap(example_sceset, colour_by = "Gene_1000")
On this example dataset there are no cells that need filtering, but the subsetting capabilities of scater
make it easy to filter out unwanted cells. Column subsetting selects cells, while row subsetting selects features (genes or transcripts).
See the plotQC
options below. The various plotting functions enable visualisation of the relationship betwen experimental variables and the expression data.
We can look at the relative importance of different explanatory variables with some of the plotQC
function options. We can compute the median marginal \(R^2\) for each variable in pData(example_sceset)
when fitting a linear model regressing exprs
values against just that variable.
The default approach looks at all variables in pData(object)
and plots the top nvars_to_plot
variables (default is 10).
plotQC(example_sceset, type = "expl")
## The variable filter_on_total_counts only has one unique value, so R^2 is not meaningful.
## This variable will not be plotted.
## The variable filter_on_total_features only has one unique value, so R^2 is not meaningful.
## This variable will not be plotted.
## The variable filter_on_pct_exprs_feature_controls_controls1 only has one unique value, so R^2 is not meaningful.
## This variable will not be plotted.
## The variable filter_on_pct_counts_feature_controls_controls1 only has one unique value, so R^2 is not meaningful.
## This variable will not be plotted.
## The variable filter_on_pct_exprs_feature_controls_controls2 only has one unique value, so R^2 is not meaningful.
## This variable will not be plotted.
## The variable filter_on_pct_counts_feature_controls_controls2 only has one unique value, so R^2 is not meaningful.
## This variable will not be plotted.
## The variable filter_on_pct_exprs_feature_controls only has one unique value, so R^2 is not meaningful.
## This variable will not be plotted.
## The variable filter_on_pct_counts_feature_controls only has one unique value, so R^2 is not meaningful.
## This variable will not be plotted.
## The variable outlier only has one unique value, so R^2 is not meaningful.
## This variable will not be plotted.
Alternatively, we can choose a subset of variables to plot in this manner.
plotQC(example_sceset, type = "expl",
variables = c("total_features", "total_counts", "Mutation_Status", "Treatment",
"Cell_Cycle"))
We can also easily produce plots to identify PCs that correlate with experimental and QC variables of interest. The function ranks the principal components in decreasing order of \(R^2\) from a linear model regressing PC value against the variable of interest.
We can also produce a pairs plot of potential explanatory variables ranked by their median percentage of expression variance explained in a marginal (only one explanatory variable) linear model.
plotQC(example_sceset, type = "expl", method = "pairs", theme_size = 6)
## The variable filter_on_total_counts only has one unique value, so R^2 is not meaningful.
## This variable will not be plotted.
## The variable filter_on_total_features only has one unique value, so R^2 is not meaningful.
## This variable will not be plotted.
## The variable filter_on_pct_exprs_feature_controls_controls1 only has one unique value, so R^2 is not meaningful.
## This variable will not be plotted.
## The variable filter_on_pct_counts_feature_controls_controls1 only has one unique value, so R^2 is not meaningful.
## This variable will not be plotted.
## The variable filter_on_pct_exprs_feature_controls_controls2 only has one unique value, so R^2 is not meaningful.
## This variable will not be plotted.
## The variable filter_on_pct_counts_feature_controls_controls2 only has one unique value, so R^2 is not meaningful.
## This variable will not be plotted.
## The variable filter_on_pct_exprs_feature_controls only has one unique value, so R^2 is not meaningful.
## This variable will not be plotted.
## The variable filter_on_pct_counts_feature_controls only has one unique value, so R^2 is not meaningful.
## This variable will not be plotted.
## The variable outlier only has one unique value, so R^2 is not meaningful.
## This variable will not be plotted.
In this small dataset total_counts and total_features explain a very large proportion of the variance in feature expression. The proportion of variance that they explain for a real dataset should be much smaller (say 1-5%).
The default is to plot six most-associated principal components against the variable of interest.
p1 <- plotQC(example_sceset, type = "find-pcs", variable = "total_features",
plot_type = "pcs-vs-vars")
p2 <- plotQC(example_sceset, type = "find-pcs", variable = "Cell_Cycle",
plot_type = "pcs-vs-vars")
multiplot(p1, p2, cols = 2)
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
An alternative is to produce a pairs plot of the top five PCs.
p1 <- plotQC(example_sceset, type = "find-pcs", variable = "total_features",
plot_type = "pairs-pcs")
p2 <- plotQC(example_sceset, type = "find-pcs", variable = "Cell_Cycle",
plot_type = "pairs-pcs")
multiplot(p1, p2, cols = 2)
Combined with the excellent subsetting capabilities of the SCESet
class, we have convenient tools for conducting QC and pre-processing (e.g. filtering) data for downstream analysis.
High levels of variability between cells characterise single-cell expression data. In almost all settings, many sources of unwanted variation should be accounted for before proceeding with more sophisticated analysis. Below, we show some of scater
’s capabilities for normalising data for downstream analyses.
We can use feature controls to help address differences between cells arising from different sets of transcripts being expressed and differences in library composition.
Important experimental variables and latent factors (if used) can be regressed out, so that normalised data has these effects removed.
In many single-cell expression analyses we may want to generate and store pairwise distance matrices for both cells and features.
We can first look at a multidimensional scaling (MDS) plot using Euclidean distance between cells.
cell_dist <- as.matrix(dist(t(exprs(example_sceset))))
cellPairwiseDistances(example_sceset) <- cell_dist
limma::plotMDS(cellDist(example_sceset))
Second, we could also look at an MDS plot using the count data and the Canberra distance metric (plot not shown).
cell_dist <- as.matrix(dist(t(counts(example_sceset)), method = "canberra"))
cellPairwiseDistances(example_sceset) <- cell_dist
limma::plotMDS(cellDist(example_sceset))
We could also look at an MDS plot for the features, here using Euclidean distance (plot not shown).
feature_dist <- as.matrix(dist(exprs(example_sceset)))
featurePairwiseDistances(example_sceset) <- feature_dist
limma::plotMDS(featDist(example_sceset))
Lior Pachter’s group at Berkeley has recently released a new software tool called kallisto for rapid quantification of transcript abundance from RNA-seq data. In scater
, a couple of wrapper functions to kallisto
enable easy and extremely fast quantification of transcript abundance from within R.
################################################################################
### Tests and Examples
# Example if in the kallisto/test directory
setwd("/home/davis/kallisto/test")
kallisto_log <- runKallisto("targets.txt", "transcripts.idx", single_end=FALSE,
output_prefix="output", verbose=TRUE, n_bootstrap_samples=10)
sce_test <- readKallistoResults(kallisto_log, read_h5=TRUE)
sce_test
An example using real data from a project investigating cell cycle. Not that this analysis is not ‘live’ (the raw data are not included in the package), but it demonstrates what can be done with scater
and kallisto
.
setwd("/home/davis/021_Cell_Cycle/data/fastq")
system("wc -l targets.txt")
ave_frag_len <- mean(c(855, 860, 810, 760, 600, 690, 770, 690))
kallisto_test <- runKallisto("targets.txt",
"Mus_musculus.GRCm38.rel79.cdna.all.ERCC.idx",
output_prefix="kallisto_output_Mmus", n_cores=12,
fragment_length=ave_frag_len, verbose=TRUE)
sce_kall_mmus <- readKallistoResults(kallisto_test, read_h5=TRUE)
sce_kall_mmus
sce_kall_mmus <- readKallistoResults(kallisto_test)
sce_kall_mmus <- getBMFeatureAnnos(sce_kall_mmus)
head(fData(sce_kall_mmus))
head(pData(sce_kall_mmus))
sce_kall_mmus[["start_time"]]
counts(sce_kall_mmus)[sample(nrow(sce_kall_mmus), size=15), 1:6]
## Summarise expression at the gene level
sce_kall_mmus_gene <- summariseExprsAcrossFeatures(
sce_kall_mmus, exprs_values="tpm", summarise_by="feature_id")
datatable(fData(sce_kall_mmus_gene))
sce_kall_mmus_gene <- getBMFeatureAnnos(
sce_kall_mmus_gene, filters="ensembl_gene_id",
attributes=c("ensembl_gene_id", "mgi_symbol", "chromosome_name",
"gene_biotype", "start_position", "end_position",
"percentage_gc_content", "description"),
feature_symbol="mgi_symbol", feature_id="ensembl_gene_id",
biomart="ensembl", dataset="mmusculus_gene_ensembl")
datatable(fData(sce_kall_mmus_gene))
## Add gene symbols to featureNames to make them more intuitive
new_feature_names <- featureNames(sce_kall_mmus_gene)
notna_mgi_symb <- !is.na(fData(sce_kall_mmus_gene)$mgi_symbol)
new_feature_names[notna_mgi_symb] <- fData(sce_kall_mmus_gene)$mgi_symbol[notna_mgi_symb]
notna_ens_gid <- !is.na(fData(sce_kall_mmus_gene)$ensembl_gene_id)
new_feature_names[notna_ens_gid] <- paste(new_feature_names[notna_ens_gid],
fData(sce_kall_mmus_gene)$ensembl_gene_id[notna_ens_gid], sep="_")
sum(duplicated(new_feature_names))
featureNames(sce_kall_mmus_gene) <- new_feature_names
head(featureNames(sce_kall_mmus_gene))
tail(featureNames(sce_kall_mmus_gene))
sum(duplicated(fData(sce_kall_mmus_gene)$mgi_symbol))