Further strategies for analyzing single-cell RNA-seq data

Aaron T. L. Lun1, Davis J. McCarthy2,3 and John C. Marioni1,2,4

1Cancer Research UK Cambridge Institute, Li Ka Shing Centre, Robinson Way, Cambridge CB2 0RE, United Kingdom
2EMBL European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
3St Vincent's Institute of Medical Research, 41 Victoria Parade, Fitzroy, Victoria 3065, Australia
4Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom

2024-11-01

1 Overview

Here, we describe a few additional analyses that can be performed with single-cell RNA sequencing data. This includes detection of significant correlations between genes and regressing out the effect of cell cycle from the gene expression matrix.

2 Identifying correlated gene pairs with Spearman’s rho

scRNA-seq data is commonly used to identify correlations between the expression profiles of different genes. This is quantified by computing Spearman’s rho, which accommodates non-linear relationships in the expression values. Non-zero correlations between pairs of genes provide evidence for their co-regulation. However, the noise in the data requires some statistical analysis to determine whether a correlation is significantly non-zero.

To demonstrate, we use the correlatePairs function to identify significant correlations between the various histocompatability antigens in the haematopoietic stem cell (HSC) Smart-seq2 dataset (Wilson et al. 2015). Counts were obtained from NCBI GEO as a supplementary file using the accession number GSE61533, and are used to generate a SingleCellExperiment as shown below.

library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask=FALSE)
wilson.fname <- bfcrpath(bfc, file.path("ftp://ftp.ncbi.nlm.nih.gov/geo/series",
    "GSE61nnn/GSE61533/suppl/GSE61533_HTSEQ_count_results.xls.gz"))
library(R.utils)
wilson.name2 <- "GSE61533_HTSEQ_count_results.xls"
gunzip(wilson.fname, destname=wilson.name2, remove=FALSE, overwrite=TRUE)

library(readxl)
all.counts <- read_excel(wilson.name2)
gene.names <- all.counts$ID
all.counts <- as.matrix(all.counts[,-1])
rownames(all.counts) <- gene.names

library(SingleCellExperiment)
sce.hsc <- SingleCellExperiment(list(counts=all.counts))
is.spike <- grepl("^ERCC", rownames(sce.hsc))
sce.hsc <- splitAltExps(sce.hsc, ifelse(is.spike, "ERCC", "gene"))

library(scater)
sce.hsc <- addPerCellQC(sce.hsc)
spike.drop <- quickPerCellQC(colData(sce.hsc))
sce.hsc <- sce.hsc[,!spike.drop$discard]

library(scran)
sce.hsc <- computeSumFactors(sce.hsc)
sce.hsc <- logNormCounts(sce.hsc)

The significance of each correlation is determined using a permutation test. For each pair of genes, the null hypothesis is that the expression profiles of two genes are independent. Shuffling the profiles and recalculating the correlation yields a null distribution that is used to obtain a p-value for each observed correlation value (Phipson and Smyth 2010).

set.seed(100)
var.cor <- correlatePairs(sce.hsc, subset.row=grep("^H2-", rownames(sce.hsc)))
head(var.cor)

## DataFrame with 6 rows and 5 columns
##         gene1       gene2       rho     p.value         FDR
##   <character> <character> <numeric>   <numeric>   <numeric>
## 1       H2-Aa      H2-Ab1  0.590421 4.69767e-10 2.04349e-07
## 2      H2-Ab1      H2-Eb1  0.529031 4.99524e-08 1.08647e-05
## 3       H2-Aa      H2-Eb1  0.502605 2.84104e-07 4.11951e-05
## 4       H2-D1       H2-K1  0.436215 1.23780e-05 1.34611e-03
## 5       H2-K1       H2-K2  0.346442 6.69501e-04 5.81557e-02
## 6       H2-K2       H2-Q6  0.339944 8.56575e-04 5.81557e-02

Correction for multiple testing across many gene pairs is performed by controlling the FDR at 5%.

sig.cor <- var.cor$FDR <= 0.05
summary(sig.cor)

##    Mode   FALSE    TRUE    NA's 
## logical     431       4     511

We can also compute correlations between specific pairs of genes, or between all pairs between two distinct sets of genes. The example below computes the correlation between Fos and Jun, which dimerize to form the AP-1 transcription factor (Angel and Karin 1991).

correlatePairs(sce.hsc, subset.row=cbind("Fos", "Jun"))

## DataFrame with 1 row and 5 columns
##         gene1       gene2       rho    p.value        FDR
##   <character> <character> <numeric>  <numeric>  <numeric>
## 1         Fos         Jun   0.44997 6.0349e-06 6.0349e-06

Examination of the expression profiles in Figure 1 confirms the presence of a modest correlation between these two genes.

library(scater)
plotExpression(sce.hsc, features="Fos", x="Jun")

Figure 1: Expression of Fos plotted against the expression of Jun for all cells in the HSC dataset.

The use of correlatePairs is primarily intended to identify correlated gene pairs for validation studies. Obviously, non-zero correlations do not provide evidence for a direct regulatory interaction, let alone specify causality. To construct regulatory networks involving many genes, we suggest using dedicated packages such as WCGNA.

Comments from Aaron:

We suggest only computing correlations between a subset of genes of interest, known either a priori or empirically defined, e.g., as HVGs. Computing correlations across all genes will take too long; unnecessarily increase the severity of the multiple testing correction; and may prioritize strong but uninteresting correlations, e.g., between tightly co-regulated house-keeping genes.
The correlateGenes() function can be used on the output of correlatePairs() to return gene-centric output. This calculates a combined p-value (Simes 1986) for each gene that indicates whether it is significantly correlated to any other gene. From a statistical perspective, this is a more natural approach to correcting for multiple testing when genes, rather than pairs of genes, are of interest.
The Limited field indicates whether the p-value was lower-bounded by the number of permutations. If this is TRUE for any non-significant gene at the chosen FDR threshold, consider increasing the number of permutations to improve power.

3 Comments on filtering by abundance

Low-abundance genes are problematic as zero or near-zero counts do not contain much information for reliable statistical inference. In applications involving hypothesis testing, these genes typically do not provide enough evidence to reject the null hypothesis yet they still increase the severity of the multiple testing correction. The discreteness of the counts may also interfere with statistical procedures, e.g., by compromising the accuracy of continuous approximations. Thus, low-abundance genes are often removed in many RNA-seq analysis pipelines before the application of downstream methods.

The “optimal” choice of filtering strategy depends on the downstream application. A more aggressive filter is usually required to remove discreteness and to avoid zeroes, e.g., for normalization purposes. By comparison, the filter statistic for hypothesis testing is mainly required to be independent of the test statistic under the null hypothesis (Bourgon, Gentleman, and Huber 2010). Given these differences in priorities, we (or the relevant function) will filter at each step as appropriate, rather than applying a single filter for the entire analysis. For example, computeSumFactors() will apply a somewhat stringent filter based on the average count, while fitTrendVar() will apply a relatively relaxed filter based on the average log-expression. Other applications will not do any abundance-based filtering at all (e.g., denoisePCA()) to preserve biological signal from lowly expressed genes.

Nonetheless, if global filtering is desired, it is simple to achieve by simply subsetting the SingleCellExperiment object. The example below demonstrates how we could remove genes with average counts less than 1 in the HSC dataset. The number of TRUE values in demo.keep corresponds to the number of retained rows/genes after filtering.

ave.counts <- calculateAverage(sce.hsc)
demo.keep <- ave.counts >= 1
filtered.sce.hsc <- sce.hsc[demo.keep,]
summary(demo.keep)

##    Mode   FALSE    TRUE 
## logical   24377   14029

4 Concluding remarks

All software packages used in this workflow are publicly available from the Comprehensive R Archive Network (https://cran.r-project.org) or the Bioconductor project (http://bioconductor.org). The specific version numbers of the packages used are shown below, along with the version of the R installation.

sessionInfo()

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scran_1.34.0                scater_1.34.0              
##  [3] ggplot2_3.5.1               scuttle_1.16.0             
##  [5] SingleCellExperiment_1.28.0 SummarizedExperiment_1.36.0
##  [7] Biobase_2.66.0              GenomicRanges_1.58.0       
##  [9] GenomeInfoDb_1.42.0         IRanges_2.40.0             
## [11] S4Vectors_0.44.0            BiocGenerics_0.52.0        
## [13] MatrixGenerics_1.18.0       matrixStats_1.4.1          
## [15] readxl_1.4.3                R.utils_2.12.3             
## [17] R.oo_1.26.0                 R.methodsS3_1.8.2          
## [19] BiocFileCache_2.14.0        dbplyr_2.5.0               
## [21] knitr_1.48                  BiocStyle_2.34.0           
## 
## loaded via a namespace (and not attached):
##  [1] DBI_1.2.3               gridExtra_2.3           rlang_1.1.4            
##  [4] magrittr_2.0.3          compiler_4.4.1          RSQLite_2.3.7          
##  [7] vctrs_0.6.5             pkgconfig_2.0.3         crayon_1.5.3           
## [10] fastmap_1.2.0           magick_2.8.5            XVector_0.46.0         
## [13] labeling_0.4.3          utf8_1.2.4              rmarkdown_2.28         
## [16] UCSC.utils_1.2.0        ggbeeswarm_0.7.2        tinytex_0.53           
## [19] purrr_1.0.2             bit_4.5.0               xfun_0.49              
## [22] bluster_1.16.0          zlibbioc_1.52.0         cachem_1.1.0           
## [25] beachmat_2.22.0         jsonlite_1.8.9          blob_1.2.4             
## [28] highr_0.11              DelayedArray_0.32.0     BiocParallel_1.40.0    
## [31] irlba_2.3.5.1           parallel_4.4.1          cluster_2.1.6          
## [34] R6_2.5.1                bslib_0.8.0             limma_3.62.0           
## [37] jquerylib_0.1.4         cellranger_1.1.0        Rcpp_1.0.13            
## [40] bookdown_0.41           igraph_2.1.1            Matrix_1.7-1           
## [43] tidyselect_1.2.1        abind_1.4-8             yaml_2.3.10            
## [46] viridis_0.6.5           codetools_0.2-20        curl_5.2.3             
## [49] lattice_0.22-6          tibble_3.2.1            withr_3.0.2            
## [52] evaluate_1.0.1          pillar_1.9.0            BiocManager_1.30.25    
## [55] filelock_1.0.3          generics_0.1.3          munsell_0.5.1          
## [58] scales_1.3.0            glue_1.8.0              metapod_1.14.0         
## [61] tools_4.4.1             BiocNeighbors_2.0.0     ScaledMatrix_1.14.0    
## [64] locfit_1.5-9.10         cowplot_1.1.3           grid_4.4.1             
## [67] edgeR_4.4.0             colorspace_2.1-1        GenomeInfoDbData_1.2.13
## [70] beeswarm_0.4.0          BiocSingular_1.22.0     vipor_0.4.7            
## [73] cli_3.6.3               rsvd_1.0.5              fansi_1.0.6            
## [76] S4Arrays_1.6.0          viridisLite_0.4.2       dplyr_1.1.4            
## [79] gtable_0.3.6            sass_0.4.9              digest_0.6.37          
## [82] dqrng_0.4.1             SparseArray_1.6.0       ggrepel_0.9.6          
## [85] farver_2.1.2            memoise_2.0.1           htmltools_0.5.8.1      
## [88] lifecycle_1.0.4         httr_1.4.7              statmod_1.5.0          
## [91] bit64_4.5.2

References

Angel, P., and M. Karin. 1991. “The role of Jun, Fos and the AP-1 complex in cell-proliferation and transformation.” Biochim. Biophys. Acta 1072 (2-3): 129–57.

Bourgon, R., R. Gentleman, and W. Huber. 2010. “Independent filtering increases detection power for high-throughput experiments.” Proc. Natl. Acad. Sci. U.S.A. 107 (21): 9546–51.

Phipson, B., and G. K. Smyth. 2010. “Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn.” Stat. Appl. Genet. Mol. Biol. 9: Article 39.

Simes, R. J. 1986. “An Improved Bonferroni Procedure for Multiple Tests of Significance.” Biometrika 73 (3): 751–54.

Wilson, N. K., D. G. Kent, F. Buettner, M. Shehata, I. C. Macaulay, F. J. Calero-Nieto, M. Sanchez Castillo, et al. 2015. “Combined single-cell functional and gene expression analysis resolves heterogeneity within stem cell populations.” Cell Stem Cell 16 (6): 712–24.