Contents

1 Introduction

Droplet-based microfluidic devices have become widely used to perform single-cell RNA sequencing (scRNA-seq). However, ambient RNA present in the cell suspension can be aberrantly counted along with a cell’s native mRNA and result in cross-contamination of transcripts between different cell populations. DecontX is a Bayesian method to estimate and remove contamination in individual cells. DecontX assumes the observed expression of a cell is a mixture of counts from two multinomial distributions: (1) a distribution of native transcript counts from the cell’s actual population and (2) a distribution of contaminating transcript counts from all other cell populations captured in the assay. Overall, computational decontamination of single cell counts can aid in downstream clustering and visualization.

The package can be loaded using the library command.

library(celda)

DecontX can take either SingleCellExperiment object from package SingleCellExperiment package or a single counts matrix as input. decontX will attempt to convert any input matrix to class dgCMatrix from package Matrix before beginning any analyses.

2 Load PBMC4k data from 10X

We will utlize the 10X PBMC 4K dataset as an example. This can be easily retrieved from the package TENxPBMCData. Make sure the the column names are set before running decontX.

# Install TENxPBMCData if is it not already
if (!requireNamespace("TENxPBMCData", quietly = TRUE)) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install("TENxPBMCData")
}

# Load PBMC data
library(TENxPBMCData)
sce <- TENxPBMCData("pbmc4k")
colnames(sce) <- paste(sce$Sample, sce$Barcode, sep = "_")
rownames(sce) <- rowData(sce)$Symbol_TENx

3 Running decontX

A SingleCellExperiment (SCE) object or a sparse matrix containing the counts for filtered cells can be passed to decontX via the x parameter. There are two major ways to run decontX: with and without the raw/droplet matrix containing empty droplets. The raw/droplet matrix can be used to empirically estimate the distribution of ambient RNA, which is especially useful when cells that contributed to the ambient RNA are not accurately represented in the filtered count matrix containing the cells. For example, cells that were removed via flow cytometry or that were more sensitive to lysis during dissociation may have contributed to the ambient RNA but were not measured in the filtered/cell matrix. The raw/droplet matrix can be input as a sparse matrix or SCE object using the background parameter:

sce <- decontX(sce, background = raw)

If cell/column names in the raw/droplet matrix are also found in the filtered counts matrix, then they will be excluded from the raw/droplet matrix before calculation of the ambient RNA distribution. If the raw matrix is not available, then decontX will estimate the contamination distribution for each cell cluster based on the profiles of the other cell clusters in the filtered dataset:

sce <- decontX(sce)

Note that in this case decontX will perform heuristic clustering to quickly define major cell clusters. However if you have your own cell cluster labels, they can be specified with the z parameter. If you supply a raw matrix via the background parameter, then the z parameter will not have an effect as clustering will not be performed.

The contamination can be found in the colData(sce)$decontX_contamination and the decontaminated counts can be accessed with decontXcounts(sce). If the input object was a matrix, make sure to save the output into a variable with a different name (e.g. result). The result object will be a list with contamination in result$contamination and the decontaminated counts in result$decontXcounts.

4 Plotting DecontX results

4.1 Cluster labels on UMAP

DecontX creates a UMAP which we can use to plot the cluster labels automatically identified in the analysis. Note that the clustering approach used here is designed to find “broad” cell types rather than individual cell subpopulations within a cell type.

umap <- reducedDim(sce, "decontX_UMAP")
plotDimReduceCluster(x = sce$decontX_clusters,
    dim1 = umap[, 1], dim2 = umap[, 2])

4.2 Contamination on UMAP

The percentage of contamination in each cell can be plotting on the UMAP to visualize what what clusters may have higher levels of ambient RNA.

plotDecontXContamination(sce)

4.3 Expression of markers on UMAP

Known marker genes can also be plotted on the UMAP to identify the cell types for each cluster. We will use CD3D and CD3E for T-cells, LYZ, S100A8, and S100A9 for monocytes, CD79A, CD79B, and MS4A1 for B-cells, GNLY for NK-cells, and PPBP for megakaryocytes.

library(scater)
sce <- logNormCounts(sce)
plotDimReduceFeature(as.matrix(logcounts(sce)),
    dim1 = umap[, 1],
    dim2 = umap[, 2],
    features = c("CD3D", "CD3E", "GNLY",
        "LYZ", "S100A8", "S100A9",
        "CD79A", "CD79B", "MS4A1"),
    exactMatch = TRUE)

4.4 Barplot of markers detected in cell clusters

The percetage of cells within a cluster that have detectable expression of marker genes can be displayed in a barplot. Markers for cell types need to be supplied in a named list. First, the detection of marker genes in the original counts assay is shown:

markers <- list(Tcell_Markers = c("CD3E", "CD3D"),
    Bcell_Markers = c("CD79A", "CD79B", "MS4A1"),
    Monocyte_Markers = c("S100A8", "S100A9", "LYZ"),
    NKcell_Markers = "GNLY")
cellTypeMappings <- list(Tcells = 2, Bcells = 5, Monocytes = 1, NKcells = 6)
plotDecontXMarkerPercentage(sce,
    markers = markers,
    groupClusters = cellTypeMappings,
    assayName = "counts")

We can then look to see how much decontX removed aberrant expression of marker genes in each cell type by changing the assayName to decontXcounts:

plotDecontXMarkerPercentage(sce,
    markers = markers,
    groupClusters = cellTypeMappings,
    assayName = "decontXcounts")

Percentages of marker genes detected in other cell types were reduced or completely removed. For example, the percentage of cells that expressed Monocyte marker genes was greatly reduced in T-cells, B-cells, and NK-cells. The original counts and decontamined counts can be plotted side-by-side by listing multiple assays in the assayName parameter. This option is only available if the data is stored in SingleCellExperiment object.

plotDecontXMarkerPercentage(sce,
    markers = markers,
    groupClusters = cellTypeMappings,
    assayName = c("counts", "decontXcounts"))

Some helpful hints when using plotDecontXMarkerPercentage:

  1. Cell clusters can be renamed and re-grouped using the groupCluster parameter, which also needs to be a named list. If groupCluster is used, cell clusters not included in the list will be excluded in the barplot. For example, if we wanted to group T-cells and NK-cells together, we could set cellTypeMappings <- list(NK_Tcells = c(2,6), Bcells = 5, Monocytes = 1)
  2. The level a gene needs to be expressed to be considered detected in a cell can be adjusted using the threshold parameter.
  3. If you are not using a SingleCellExperiment, then you will need to supply the original counts matrix or the decontaminated counts matrix as the first argument to generate the barplots.

4.5 Violin plot to compare the distributions of original and decontaminated counts

Another useful way to assess the amount of decontamination is to view the expression of marker genes before and after decontX across cell types. Here we view the monocyte markers in each cell type. The violin plot shows that the markers have been removed from T-cells, B-cells, and NK-cells, but are largely unaffected in monocytes.

plotDecontXMarkerExpression(sce,
    markers = markers[["Monocyte_Markers"]],
    groupClusters = cellTypeMappings,
    ncol = 3)

Some helpful hints when using plotDecontXMarkerExpression:

  1. groupClusters works the same way as in plotDecontXMarkerPercentage.
  2. This function will plot each pair of markers and clusters (or cell type specified by groupClusters). Therefore, you may want to keep the number of markers small in each plot and call the function multiple times for different sets of marker genes.
  3. You can also plot the individual points by setting plotDots = TRUE and/or log transform the points on the fly by setting log1p = TRUE.
  4. This function can plot any assay in a SingleCellExperiment. Therefore you could also examine normalized expression of the original and decontaminated counts. For example:
sce <- scater::logNormCounts(sce,
    exprs_values = "decontXcounts",
    name = "dlogcounts")

plotDecontXMarkerExpression(sce,
    markers = markers[["Monocyte_Markers"]],
    groupClusters = cellTypeMappings,
    ncol = 3,
    assayName = c("logcounts", "dlogcounts"))

5 Other important notes

5.1 Choosing appropriate cell clusters

The ability of DecontX to accurately identify contamination is dependent on the cell cluster labels. DecontX assumes that contamination for a cell cluster comes from combination of counts from all other clusters. The default clustering approach used by DecontX tends to select fewer clusters that represent broader cell types. For example, all T-cells tend to be clustered together rather than splitting naive and cytotoxic T-cells into separate clusters. Custom cell type labels can be suppled via the z parameter if some cells are not being clustered appropriately by the default method.

5.2 Adjusting the priors to influence contamination estimates

There are ways to force decontX to estimate more or less contamination across a dataset by manipulating the priors. The delta parameter is a numeric vector of length two. It is the concentration parameter for the Dirichlet distribution which serves as the prior for the proportions of native and contamination counts in each cell. The first element is the prior for the proportion of native counts while the second element is the prior for the proportion of contamination counts. These essentially act as pseudocounts for the native and contamination in each cell. If estimateDelta = TRUE, delta is only used to produce a random sample of proportions for an initial value of contamination in each cell. Then delta is updated in each iteration. If estimateDelta = FALSE, then delta is fixed with these values for the entire inference procedure. Fixing delta and setting a high number in the second element will force decontX to be more aggressive and estimate higher levels of contamination in each cell at the expense of potentially removing native expression. For example, in the previous PBMC example, we can see what the estimated delta was by looking in the estimates:

metadata(sce)$decontX$estimates$all_cells$delta
## [1] 9.287164 1.038217

Setting a higher value in the second element of delta and estimateDelta = FALSE will force decontX to estimate higher levels of contamination per cell:

sce.delta <- decontX(sce, delta = c(9, 20), estimateDelta = FALSE)

plot(sce$decontX_contamination, sce.delta$decontX_contamination,
     xlab = "DecontX estimated priors",
     ylab = "Setting priors to estimate higher contamination")
abline(0, 1, col = "red", lwd = 2)

6 Session Information

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scater_1.22.0               ggplot2_3.3.5              
##  [3] scuttle_1.4.0               TENxPBMCData_1.11.1        
##  [5] HDF5Array_1.22.0            rhdf5_2.38.0               
##  [7] DelayedArray_0.20.0         Matrix_1.3-4               
##  [9] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0
## [11] Biobase_2.54.0              GenomicRanges_1.46.0       
## [13] GenomeInfoDb_1.30.0         IRanges_2.28.0             
## [15] S4Vectors_0.32.0            BiocGenerics_0.40.0        
## [17] MatrixGenerics_1.6.0        matrixStats_0.61.0         
## [19] celda_1.10.0                BiocStyle_2.22.0           
## 
## loaded via a namespace (and not attached):
##   [1] circlize_0.4.13               AnnotationHub_3.2.0          
##   [3] BiocFileCache_2.2.0           RcppEigen_0.3.3.9.1          
##   [5] plyr_1.8.6                    assertive.files_0.0-2        
##   [7] enrichR_3.0                   multipanelfigure_2.1.2       
##   [9] BiocParallel_1.28.0           digest_0.6.28                
##  [11] foreach_1.5.1                 htmltools_0.5.2              
##  [13] viridis_0.6.2                 magick_2.7.3                 
##  [15] fansi_0.5.0                   magrittr_2.0.1               
##  [17] memoise_2.0.0                 ScaledMatrix_1.2.0           
##  [19] assertive.numbers_0.0-2       cluster_2.1.2                
##  [21] doParallel_1.0.16             ComplexHeatmap_2.10.0        
##  [23] Biostrings_2.62.0             colorspace_2.0-2             
##  [25] blob_1.2.2                    rappdirs_0.3.3               
##  [27] ggrepel_0.9.1                 xfun_0.27                    
##  [29] dplyr_1.0.7                   crayon_1.4.1                 
##  [31] RCurl_1.98-1.5                jsonlite_1.7.2               
##  [33] iterators_1.0.13              glue_1.4.2                   
##  [35] gtable_0.3.0                  zlibbioc_1.40.0              
##  [37] XVector_0.34.0                GetoptLong_1.0.5             
##  [39] BiocSingular_1.10.0           Rhdf5lib_1.16.0              
##  [41] shape_1.4.6                   scales_1.1.1                 
##  [43] DBI_1.1.1                     Rcpp_1.0.7                   
##  [45] viridisLite_0.4.0             xtable_1.8-4                 
##  [47] clue_0.3-60                   gridGraphics_0.5-1           
##  [49] rsvd_1.0.5                    bit_4.0.4                    
##  [51] httr_1.4.2                    FNN_1.1.3                    
##  [53] RColorBrewer_1.1-2            ellipsis_0.3.2               
##  [55] pkgconfig_2.0.3               farver_2.1.0                 
##  [57] sass_0.4.0                    uwot_0.1.10                  
##  [59] dbplyr_2.1.1                  utf8_1.2.2                   
##  [61] tidyselect_1.1.1              labeling_0.4.2               
##  [63] rlang_0.4.12                  reshape2_1.4.4               
##  [65] later_1.3.0                   AnnotationDbi_1.56.0         
##  [67] munsell_0.5.0                 BiocVersion_3.14.0           
##  [69] tools_4.1.1                   cachem_1.0.6                 
##  [71] dbscan_1.1-8                  generics_0.1.1               
##  [73] RSQLite_2.2.8                 ExperimentHub_2.2.0          
##  [75] evaluate_0.14                 stringr_1.4.0                
##  [77] fastmap_1.1.0                 yaml_2.2.1                   
##  [79] knitr_1.36                    bit64_4.0.5                  
##  [81] purrr_0.3.4                   KEGGREST_1.34.0              
##  [83] sparseMatrixStats_1.6.0       mime_0.12                    
##  [85] compiler_4.1.1                beeswarm_0.4.0               
##  [87] filelock_1.0.2                curl_4.3.2                   
##  [89] png_0.1-7                     interactiveDisplayBase_1.32.0
##  [91] tibble_3.1.5                  bslib_0.3.1                  
##  [93] stringi_1.7.5                 highr_0.9                    
##  [95] RSpectra_0.16-0               lattice_0.20-45              
##  [97] assertive.base_0.0-9          vctrs_0.3.8                  
##  [99] pillar_1.6.4                  lifecycle_1.0.1              
## [101] rhdf5filters_1.6.0            BiocManager_1.30.16          
## [103] combinat_0.0-8                jquerylib_0.1.4              
## [105] GlobalOptions_0.1.2           RcppAnnoy_0.0.19             
## [107] BiocNeighbors_1.12.0          data.table_1.14.2            
## [109] bitops_1.0-7                  irlba_2.3.3                  
## [111] httpuv_1.6.3                  assertive.types_0.0-3        
## [113] R6_2.5.1                      bookdown_0.24                
## [115] assertive.properties_0.0-4    promises_1.2.0.1             
## [117] gridExtra_2.3                 vipor_0.4.5                  
## [119] codetools_0.2-18              MCMCprecision_0.4.0          
## [121] assertthat_0.2.1              rjson_0.2.20                 
## [123] withr_2.4.2                   GenomeInfoDbData_1.2.7       
## [125] parallel_4.1.1                beachmat_2.10.0              
## [127] grid_4.1.1                    rmarkdown_2.11               
## [129] DelayedMatrixStats_1.16.0     Rtsne_0.15                   
## [131] shiny_1.7.1                   ggbeeswarm_0.6.0