Contents

1 Introduction

Single-cell RNA sequencing is becoming an increasingly common tool to investigate the cellular population and patients’ outcome in cancer research. However, due to the sparse data and the complex tumor microenvironment, it is challenging to identify neoplastic cells that play important roles in tumor growth and disease progression. This challenge is exaggerated in the research of blood cancer patients, from whom the neoplastic cells can be highly similar to the normal cells.

In this package we present partCNV/partCNVH, a statistical framework for rapid and accurate detection of aneuploid cells with local copy number deletion or amplification. Our method uses an EM algorithm with mixtures of Poisson distributions while incorporating cytogenetics information to guide the classification (partCNV). When applicable, we further improve the accuracy by integrating a Hidden Markov Model for feature selection (partCNVH).

2 Installation

The package can be installed using BiocManager by the following commands

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("partCNV")

Alternatively, the package can be installed using devtools and launched by

library(devtools)
install_github("rx-li/partCNV")

3 Understand your cytogenetics data

First of all, our method can only be used to infer cell status if the patient has a part of the cells with cytogenetics alterations. For example, if a patient has cytogenetics data as 46,XY,del(20)(q11.1q13.1)[5]/46,XY[15], it means that out of 20 metaphases, 5 (i.e., 25%) of the cells has deletion on chromosome 20 in region q11.1 to q13.1. If the patient has a normal cytogenetics, e.g., 46,XY[20], or all altered cells, e.g., 46,XY,del(20)(q11.1q13.1)[25], there won’t be any need to apply the proposed method.

Second, when you have a complicated cytogenetics feature, use them one by one to identify the desired cell group. For example, 47,XY,+8[5]/46,idem,del(8)(q21.2q24.3)/46,XY[7], we start with the chromosome amplification on chromosome 8 excluding the region (q21.2q24.3) to identify the cells with this alteration using the proposed method. After the cells with chromosome 8 amplification are identified, we apply the proposed method to identify cells with del(8)(q21.2q24.3).

Lastly, our tool is primarily designed for analyzing the scRNA-seq data from patients with hematologic malignancies, such as MDS and AML. For solid tumors such as lung cancer and breast cancer, it is better to use CNV based method such as copyKAT and inferCNV inferCNV.

4 Load example data

The analysis can start from whole-genome scRNA-seq data or a subset of scRNA-seq data based on the location of interest. In this package, we prepared a whole-genome scRNA-seq data:

library(partCNV)
data(SimData)
dim(SimData)
## [1] 24519   400
SimData[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##            C1 C2 C3 C4 C5
## AL627309.1  .  .  .  .  .
## AL627309.5  .  .  .  .  .
## AL627309.4  .  .  .  .  .
## AP006222.2  .  .  .  .  .
## LINC01409   .  .  .  .  .

This simple example dataset contains 24519 genes and 400 cells. If you start with Seurat object (after quality control), run:

library(Seurat)
Seurat_obj <- NormalizeData(Your_SeuratObj, normalization.method = "LogNormalize", scale.factor = 10000)
Counts = Seurat_obj@assays$RNA@counts

If you start with a SingleCellExperiment object, the counts can be normalized using function NormalizeCounts.

Counts <- NormalizeCounts(Your_SingleCellExperimentObj, scale_factor=10000)

5 Run partCNV

Since this is simulation data, prior knowledge (e.g., cytogenetics data in real studies) shows that this example data has 40% of cells with deletion on chromosome 20 q11.1 to q13.1. Let’s get started with locating this region.

res <- GetCytoLocation(cyto_feature = "chr20(q11.1-q13.1)")
## loading from cache
## require("GenomicRanges")
## Interested region: chr20:28100001-51200000.
## A total of 381 genes are located in this region.

Then let’s subset the data and normalize the data:

GEout <- GetExprCountCyto(cytoloc_output = res, Counts = as.matrix(SimData), normalization = TRUE, qt_cutoff = 0.99)

For this function, the qt_cutoff is to filter out the cells with very low expressions. 0.99 here means that we filter out cells that only express in 1% (1-0.99) cells. Make this qt_cutoff larger if your total gene number within the region is small.

Now we apply partCNV:

pcout <- partCNV(int_counts = GEout$ProcessedCount,
                 cyto_type = "del",
                 cyto_p = 0.40)

Understand the results:

table(pcout)
## pcout
##   0   1 
## 242 158
sum(pcout==1)/length(pcout)
## [1] 0.395
p1 <- sum(pcout==1)/length(pcout)

39.5% of cells are labeled as locally aneuploid (1) and others are diploid (0).

Let’s visualize it:

library(Seurat)
sim_seurat <- CreateSeuratObject(counts = SimData)
sim_seurat <- NormalizeData(sim_seurat, normalization.method = "LogNormalize", scale.factor = 10000)
sim_seurat <- FindVariableFeatures(sim_seurat, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(sim_seurat)
sim_seurat <- ScaleData(sim_seurat, features = all.genes)
sim_seurat <- RunPCA(sim_seurat, features = VariableFeatures(object = sim_seurat))
sim_seurat <- RunUMAP(sim_seurat, dims = 1:10)
sim_seurat <- AddMetaData(
    object = sim_seurat,
    metadata = pcout,
    col.name = "partCNV_label"
)
sim_seurat$partCNV_label <- factor(sim_seurat$partCNV_label, levels = c(1,0), labels = c(
    "aneuploid", "diploid"
))
library(ggplot2)
DimPlot(sim_seurat, reduction = "umap", group = "partCNV_label") + ggtitle(paste0("partCNV (", signif(p1,2)*100, "%)"))

6 Run partCNVH

Compared with partCNV, partCNVH added an additional step of feature selection. This is especially helpful if your cytogenetics provide a very broad region and part of it does not have chromosomal alterations. The first two steps of using partCNVH are exactly the same as using partCNV.

res <- GetCytoLocation(cyto_feature = "chr20(q11.1-q13.1)")
## loading from cache
## Interested region: chr20:28100001-51200000.
## A total of 381 genes are located in this region.
GEout <- GetExprCountCyto(cytoloc_output = res, Counts = as.matrix(SimData), normalization = TRUE, qt_cutoff = 0.99)

For this function, the qt_cutoff is to filter out the cells with very low expressions. 0.99 here means that we filter out cells that only express in 1% (1-0.99) cells. Make this qt_cutoff larger if your total gene number within the region is small.

Now we apply partCNVH:

pcHout <- partCNVH(int_counts = GEout$ProcessedCount,
                 cyto_type = "del",
                 cyto_p = 0.40)

Understand the results (in pcHout, EMlabel is the partCNV label and EMHMMlabel is the partCNVH label).

table(pcHout$EMHMMlabel)
## 
##   0   1 
## 235 165
sum(pcHout$EMHMMlabel==1)/length(pcHout$EMHMMlabel)
## [1] 0.4125
p2 <- sum(pcHout$EMHMMlabel==1)/length(pcHout$EMHMMlabel)

41.25% of cells are labeled as locally aneuploid (1) and others are diploid (0).

Let’s visualize it:

# I commented these steps because they are exactly the same as partCNV run. 
# library(Seurat)
# sim_seurat <- CreateSeuratObject(counts = SimData)
# sim_seurat <- NormalizeData(sim_seurat, normalization.method = "LogNormalize", scale.factor = 10000)
# sim_seurat <- FindVariableFeatures(sim_seurat, selection.method = "vst", nfeatures = 2000)
# all.genes <- rownames(sim_seurat)
# sim_seurat <- ScaleData(sim_seurat, features = all.genes)
# sim_seurat <- RunPCA(sim_seurat, features = VariableFeatures(object = sim_seurat))
# sim_seurat <- RunUMAP(sim_seurat, dims = 1:10)
sim_seurat <- AddMetaData(
    object = sim_seurat,
    metadata = pcHout$EMHMMlabel,
    col.name = "partCNVH_label"
)
sim_seurat$partCNVH_label <- factor(sim_seurat$partCNVH_label, levels = c(1,0), labels = c(
    "aneuploid", "diploid"
))
library(ggplot2)
DimPlot(sim_seurat, reduction = "umap", group = "partCNVH_label") + ggtitle(paste0("partCNVH (", signif(p2,2)*100, "%)"))

Session info

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggplot2_3.5.1        Seurat_5.1.0         SeuratObject_5.0.2  
##  [4] sp_2.1-4             GenomicRanges_1.58.0 GenomeInfoDb_1.42.0 
##  [7] IRanges_2.40.0       S4Vectors_0.44.0     BiocGenerics_0.52.0 
## [10] partCNV_1.4.0        BiocStyle_2.34.0    
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.22            splines_4.4.1              
##   [3] later_1.3.2                 filelock_1.0.3             
##   [5] tibble_3.2.1                polyclip_1.10-7            
##   [7] fastDummies_1.7.4           lifecycle_1.0.4            
##   [9] globals_0.16.3              Rsolnp_1.16                
##  [11] lattice_0.22-6              MASS_7.3-61                
##  [13] magrittr_2.0.3              plotly_4.10.4              
##  [15] sass_0.4.9                  rmarkdown_2.28             
##  [17] jquerylib_0.1.4             yaml_2.3.10                
##  [19] httpuv_1.6.15               sctransform_0.4.1          
##  [21] spam_2.11-0                 spatstat.sparse_3.1-0      
##  [23] reticulate_1.39.0           cowplot_1.1.3              
##  [25] pbapply_1.7-2               DBI_1.2.3                  
##  [27] RColorBrewer_1.1-3          abind_1.4-8                
##  [29] zlibbioc_1.52.0             Rtsne_0.17                 
##  [31] purrr_1.0.2                 nnet_7.3-19                
##  [33] rappdirs_0.3.3              GenomeInfoDbData_1.2.13    
##  [35] ggrepel_0.9.6               irlba_2.3.5.1              
##  [37] listenv_0.9.1               spatstat.utils_3.1-0       
##  [39] goftest_1.2-3               RSpectra_0.16-2            
##  [41] spatstat.random_3.3-2       fitdistrplus_1.2-1         
##  [43] parallelly_1.38.0           leiden_0.4.3.1             
##  [45] codetools_0.2-20            DelayedArray_0.32.0        
##  [47] tidyselect_1.2.1            UCSC.utils_1.2.0           
##  [49] farver_2.1.2                matrixStats_1.4.1          
##  [51] BiocFileCache_2.14.0        spatstat.explore_3.3-3     
##  [53] jsonlite_1.8.9              progressr_0.15.0           
##  [55] ggridges_0.5.6              survival_3.7-0             
##  [57] tools_4.4.1                 ica_1.0-3                  
##  [59] Rcpp_1.0.13                 glue_1.8.0                 
##  [61] SparseArray_1.6.0           gridExtra_2.3              
##  [63] xfun_0.48                   MatrixGenerics_1.18.0      
##  [65] dplyr_1.1.4                 withr_3.0.2                
##  [67] BiocManager_1.30.25         fastmap_1.2.0              
##  [69] fansi_1.0.6                 digest_0.6.37              
##  [71] truncnorm_1.0-9             R6_2.5.1                   
##  [73] mime_0.12                   colorspace_2.1-1           
##  [75] scattermore_1.2             tensor_1.5                 
##  [77] spatstat.data_3.1-2         RSQLite_2.3.7              
##  [79] utf8_1.2.4                  tidyr_1.3.1                
##  [81] generics_0.1.3              data.table_1.16.2          
##  [83] httr_1.4.7                  htmlwidgets_1.6.4          
##  [85] S4Arrays_1.6.0              uwot_0.2.2                 
##  [87] pkgconfig_2.0.3             gtable_0.3.6               
##  [89] blob_1.2.4                  lmtest_0.9-40              
##  [91] SingleCellExperiment_1.28.0 XVector_0.46.0             
##  [93] htmltools_0.5.8.1           dotCall64_1.2              
##  [95] bookdown_0.41               scales_1.3.0               
##  [97] Biobase_2.66.0              png_0.1-8                  
##  [99] spatstat.univar_3.0-1       knitr_1.48                 
## [101] reshape2_1.4.4              nlme_3.1-166               
## [103] curl_5.2.3                  cachem_1.1.0               
## [105] zoo_1.8-12                  stringr_1.5.1              
## [107] BiocVersion_3.20.0          KernSmooth_2.23-24         
## [109] parallel_4.4.1              miniUI_0.1.1.1             
## [111] AnnotationDbi_1.68.0        pillar_1.9.0               
## [113] grid_4.4.1                  vctrs_0.6.5                
## [115] RANN_2.6.2                  promises_1.3.0             
## [117] dbplyr_2.5.0                xtable_1.8-4               
## [119] cluster_2.1.6               evaluate_1.0.1             
## [121] magick_2.8.5                tinytex_0.53               
## [123] cli_3.6.3                   compiler_4.4.1             
## [125] rlang_1.1.4                 crayon_1.5.3               
## [127] future.apply_1.11.3         labeling_0.4.3             
## [129] plyr_1.8.9                  stringi_1.8.4              
## [131] viridisLite_0.4.2           deldir_2.0-4               
## [133] munsell_0.5.1               Biostrings_2.74.0          
## [135] lazyeval_0.2.2              spatstat.geom_3.3-3        
## [137] Matrix_1.7-1                RcppHNSW_0.6.0             
## [139] patchwork_1.3.0             depmixS4_1.5-0             
## [141] bit64_4.5.2                 future_1.34.0              
## [143] KEGGREST_1.46.0             shiny_1.9.1                
## [145] highr_0.11                  SummarizedExperiment_1.36.0
## [147] AnnotationHub_3.14.0        ROCR_1.0-11                
## [149] igraph_2.1.1                memoise_2.0.1              
## [151] bslib_0.8.0                 bit_4.5.0