Here, we demonstrate a grid search of clustering parameters with a mouse
hippocampus VeraFISH dataset. BANKSY currently provides four algorithms for
clustering the BANKSY matrix with clusterBanksy: Leiden (default), Louvain,
k-means, and model-based clustering. In this vignette, we run only Leiden
clustering. See ?clusterBanksy for more details on the parameters for
different clustering methods.
The dataset comprises gene expression for 10,944 cells and 120 genes in 2
spatial dimensions. See ?Banksy::hippocampus for more details.
# Load libs
library(Banksy)
library(SummarizedExperiment)
library(SpatialExperiment)
library(scuttle)
library(scater)
library(cowplot)
library(ggplot2)
# Load data
data(hippocampus)
gcm <- hippocampus$expression
locs <- as.matrix(hippocampus$locations)
Here, gcm is a gene-by-cell matrix, and locs is a matrix specifying the
coordinates of the centroid for each cell.
head(gcm[,1:5])
#> cell_1276 cell_8890 cell_691 cell_396 cell_9818
#> Sparcl1 45 0 11 22 0
#> Slc1a2 17 0 6 5 0
#> Map 10 0 12 16 0
#> Sqstm1 26 0 0 2 0
#> Atp1a2 0 0 4 3 0
#> Tnc 0 0 0 0 0
head(locs)
#> sdimx sdimy
#> cell_1276 -13372.899 15776.37
#> cell_8890 8941.101 15866.37
#> cell_691 -14882.899 15896.37
#> cell_396 -15492.899 15835.37
#> cell_9818 11308.101 15846.37
#> cell_11310 14894.101 15810.37
Initialize a SpatialExperiment object and perform basic quality control. We keep cells whose total transcript count lies strictly between the 5th and 98th percentiles:
se <- SpatialExperiment(assay = list(counts = gcm), spatialCoords = locs)
colData(se) <- cbind(colData(se), spatialCoords(se))
# QC based on total counts
qcstats <- perCellQCMetrics(se)
thres <- quantile(qcstats$total, c(0.05, 0.98))
keep <- (qcstats$total > thres[1]) & (qcstats$total < thres[2])
se <- se[, keep]
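The same filter can be illustrated on a toy vector of per-cell totals. This is a minimal base-R sketch; toy_total holds made-up values and is not part of the dataset.

```r
# Toy per-cell total counts (hypothetical values, for illustration only)
toy_total <- c(3, 10, 12, 15, 18, 20, 22, 25, 30, 400)

# Lower and upper cut-offs at the 5th and 98th percentiles
toy_thres <- quantile(toy_total, c(0.05, 0.98))

# Strict inequalities: a cell sitting exactly on a cut-off is dropped as well
toy_keep <- (toy_total > toy_thres[1]) & (toy_total < toy_thres[2])

# Totals of the cells that pass the filter
toy_total[toy_keep]
```

Because both comparisons are strict, the filter removes the extreme cells at either end, here the very small and the very large library.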
Next, perform normalization of the data.
# Normalization to mean library size
se <- computeLibraryFactors(se)
aname <- "normcounts"
assay(se, aname) <- normalizeCounts(se, log = FALSE)
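As a rough illustration of the arithmetic behind this step: library size factors are each cell's total count divided by the mean total across cells, and with log = FALSE the normalized values are the counts divided by those factors. The base-R sketch below reproduces this on a made-up 3 x 4 matrix (toy_counts is hypothetical); it is meant to show the idea, not to replace the scuttle functions.

```r
# Toy gene-by-cell count matrix (hypothetical values); filled column by column
toy_counts <- matrix(
  c(10, 0, 5,
    20, 4, 6,
     2, 1, 0,
     8, 8, 8),
  nrow = 3,
  dimnames = list(paste0("gene_", 1:3), paste0("cell_", 1:4))
)

# Library size per cell, rescaled so the size factors average to 1
lib_size <- colSums(toy_counts)
size_factor <- lib_size / mean(lib_size)

# Divide each cell (column) by its size factor
toy_norm <- sweep(toy_counts, 2, size_factor, "/")

# After normalization every cell has the same total: the mean library size
colSums(toy_norm)
```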
BANKSY has a few key parameters. We describe these below.
For characterising neighborhoods, BANKSY computes the weighted neighborhood
mean (H_0) and the azimuthal Gabor filter (H_1), which estimates gene
expression gradients. Setting compute_agf=TRUE computes both H_0 and H_1.
k_geom specifies the number of neighbors used to compute each H_m for
m=0,1. If a single value is specified, the same k_geom will be used
for each feature matrix. Alternatively, a separate value of k_geom can be
provided for each feature matrix. Here, we use k_geom[1]=15 and
k_geom[2]=30 for H_0 and H_1 respectively. More neighbors are used to
compute gradients.
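To make the two neighborhood features more concrete, the sketch below computes simplified versions for a single gene on simulated cells. It assumes uniform weights over the k nearest neighbors, whereas BANKSY applies its own weighting scheme, and all names (toy_locs, toy_expr, neighborhood_features) are hypothetical. The gradient-like feature is taken as the magnitude of the neighbors' expression averaged with a phase factor exp(i * phi), where phi is the angle from the cell to each neighbor.

```r
set.seed(1)
n_cells <- 50
toy_locs <- cbind(x = runif(n_cells), y = runif(n_cells))  # simulated centroids
toy_expr <- rpois(n_cells, lambda = 5)                      # one simulated gene

# Simplified neighborhood features with uniform weights (an assumption;
# BANKSY's actual weighting differs).
neighborhood_features <- function(locs, expr, k) {
  d <- as.matrix(dist(locs))
  h0 <- h1 <- numeric(nrow(locs))
  for (i in seq_len(nrow(locs))) {
    # k nearest neighbors of cell i, excluding the cell itself
    nb <- setdiff(order(d[i, ]), i)[seq_len(k)]
    # Angle from cell i to each neighbor
    phi <- atan2(locs[nb, 2] - locs[i, 2], locs[nb, 1] - locs[i, 1])
    h0[i] <- mean(expr[nb])                       # neighborhood mean (m = 0)
    h1[i] <- Mod(mean(expr[nb] * exp(1i * phi)))  # gradient magnitude (m = 1)
  }
  list(h0 = h0, h1 = h1)
}

# Mirror the choice in the text: fewer neighbors for m = 0, more for m = 1
feat0 <- neighborhood_features(toy_locs, toy_expr, k = 15)$h0
feat1 <- neighborhood_features(toy_locs, toy_expr, k = 30)$h1
```

In this simplified form, the m = 1 feature is close to zero when expression is evenly spread around a cell and grows when expression is concentrated in one direction.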
We compute the neighborhood feature matrices using normalized expression
(normcounts in the se object).
k_geom <- c(15, 30)
se <- computeBanksy(se, assay_name = aname, compute_agf = TRUE, k_geom = k_geom)
#> Computing neighbors...
#> Spatial mode is kNN_median
#> Parameters: k_geom=15
#> Done
#> Computing neighbors...
#> Spatial mode is kNN_median
#> Parameters: k_geom=30
#> Done
#> Computing harmonic m = 0
#> Using 15 neighbors
#> Done
#> Computing harmonic m = 1
#> Using 30 neighbors
#> Centering
#> Done
computeBanksy populates the assays slot with H_0 and H_1 in this instance:
se
#> class: SpatialExperiment
#> dim: 120 10205
#> metadata(1): BANKSY_params
#> assays(4): counts normcounts H0 H1
#> rownames(120): Sparcl1 Slc1a2 ... Notch3 Egfr
#> rowData names(0):
#> colnames(10205): cell_1276 cell_691 ... cell_11635 cell_10849
#> colData names(4): sample_id sdimx sdimy sizeFactor
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : sdimx sdimy
#> imgData names(1): sample_id
The lambda parameter is a mixing parameter in [0,1] which determines how much
spatial information is incorporated for downstream analysis. With smaller
values of lambda, BANKSY operates in cell-typing mode, while at higher values
of lambda, BANKSY operates in domain-finding mode. As a starting point, we
recommend lambda=0.2 for cell-typing and lambda=0.8 for domain-finding. Here,
we run lambda=0, which corresponds to non-spatial clustering, and lambda=0.2
for spatially-informed cell-typing. We compute PCs with and without the AGF
(H_1).
lambda <- c(0, 0.2)
se <- runBanksyPCA(se, use_agf = c(FALSE, TRUE), lambda = lambda, seed = 1000)
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000
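One way to picture the role of lambda is as a weighting between a cell's own expression and its neighborhood mean before the two blocks are concatenated. The sketch below assumes weights of sqrt(1 - lambda) and sqrt(lambda), and ignores the AGF block and any feature scaling; mix_features and the toy matrices are hypothetical and not part of the Banksy API.

```r
set.seed(1)
own <- matrix(rnorm(12), nrow = 3)  # 3 genes x 4 cells: own expression (toy)
nbr <- matrix(rnorm(12), nrow = 3)  # 3 genes x 4 cells: neighborhood mean (toy)

# Stack the two blocks, weighted so that lambda sets their relative contribution
# (assumed weighting, for illustration only)
mix_features <- function(own, nbr, lambda) {
  stopifnot(lambda >= 0, lambda <= 1)
  rbind(sqrt(1 - lambda) * own, sqrt(lambda) * nbr)
}

m0  <- mix_features(own, nbr, lambda = 0)    # neighborhood block is all zeros
m02 <- mix_features(own, nbr, lambda = 0.2)  # mostly own expression

# Squared distance between two cells splits into own and neighborhood parts
d_mixed <- sum((m02[, 1] - m02[, 2])^2)
d_parts <- 0.8 * sum((own[, 1] - own[, 2])^2) + 0.2 * sum((nbr[, 1] - nbr[, 2])^2)
```

Under this assumption, lambda = 0 leaves only the cell's own expression, which matches the description of lambda = 0 as non-spatial clustering.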
runBanksyPCA populates the reducedDims slot, with one entry for each
combination of use_agf and lambda provided.
reducedDimNames(se)
#> [1] "PCA_M0_lam0" "PCA_M0_lam0.2" "PCA_M1_lam0" "PCA_M1_lam0.2"
Next, we cluster the BANKSY embedding with Leiden graph-based clustering. This
admits two parameters: k_neighbors and resolution. k_neighbors determines
the number of nearest neighbors used to construct the shared nearest
neighbors graph. Leiden clustering is then performed on the resultant graph
at the resolution given by resolution. For reproducibility, we set a seed for
each parameter combination.
k <- 50
res <- 1
se <- clusterBanksy(se, use_agf = c(FALSE, TRUE), lambda = lambda, k_neighbors = k, resolution = res, seed = 1000)
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000
clusterBanksy populates colData(se) with cluster labels:
colnames(colData(se))
#> [1] "sample_id" "sdimx"
#> [3] "sdimy" "sizeFactor"
#> [5] "clust_M0_lam0_k50_res1" "clust_M0_lam0.2_k50_res1"
#> [7] "clust_M1_lam0_k50_res1" "clust_M1_lam0.2_k50_res1"
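The column names above follow a visible pattern: clust_M, then 0 or 1 for use_agf, then _lam, _k and _res followed by the respective parameter values. When scanning a larger parameter grid, it can help to build the expected names programmatically. The helper below, banksy_clust_names, is hypothetical and inferred only from the printed output; the ordering of names for larger grids is a guess, so it is safer to match on names than on position.

```r
# Hypothetical helper: build cluster column names from a parameter grid,
# following the naming pattern seen in colnames(colData(se)).
banksy_clust_names <- function(use_agf, lambda, k_neighbors, resolution) {
  # expand.grid varies its first argument fastest
  grid <- expand.grid(
    res = resolution,
    k = k_neighbors,
    lam = lambda,
    M = as.integer(use_agf)
  )
  sprintf("clust_M%d_lam%s_k%s_res%s", grid$M, grid$lam, grid$k, grid$res)
}

# Parameter values used in this vignette
grid_names <- banksy_clust_names(
  use_agf = c(FALSE, TRUE),
  lambda = c(0, 0.2),
  k_neighbors = 50,
  resolution = 1
)
grid_names
```

The resulting vector can be intersected with colnames(colData(se)) to check which parameter combinations have already been run.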
To compare clustering runs visually, different runs can be relabeled to
minimise their differences with connectClusters:
se <- connectClusters(se)
#> clust_M1_lam0_k50_res1 --> clust_M0_lam0_k50_res1
#> clust_M0_lam0.2_k50_res1 --> clust_M1_lam0_k50_res1
#> clust_M1_lam0.2_k50_res1 --> clust_M0_lam0.2_k50_res1
Visualise spatial coordinates with cluster labels.
cnames <- colnames(colData(se))
cnames <- cnames[grep("^clust", cnames)]
cplots <- lapply(cnames, function(cnm) {
    plotColData(se, x = "sdimx", y = "sdimy", point_size = 0.1, colour_by = cnm) +
        coord_equal() +
        labs(title = cnm) +
        theme(legend.title = element_blank()) +
        guides(colour = guide_legend(override.aes = list(size = 2)))
})
plot_grid(plotlist = cplots, ncol = 2)
Compare all cluster outputs with compareClusters. This function computes
pairwise cluster comparison metrics between the clusters in colData(se) based
on adjusted Rand index (ARI):
compareClusters(se, func = "ARI")
#> clust_M0_lam0_k50_res1 clust_M0_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1 1.000 0.67
#> clust_M0_lam0.2_k50_res1 0.670 1.00
#> clust_M1_lam0_k50_res1 1.000 0.67
#> clust_M1_lam0.2_k50_res1 0.747 0.87
#> clust_M1_lam0_k50_res1 clust_M1_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1 1.000 0.747
#> clust_M0_lam0.2_k50_res1 0.670 0.870
#> clust_M1_lam0_k50_res1 1.000 0.747
#> clust_M1_lam0.2_k50_res1 0.747 1.000
or normalized mutual information (NMI):
compareClusters(se, func = "NMI")
#> clust_M0_lam0_k50_res1 clust_M0_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1 1.000 0.741
#> clust_M0_lam0.2_k50_res1 0.741 1.000
#> clust_M1_lam0_k50_res1 1.000 0.741
#> clust_M1_lam0.2_k50_res1 0.782 0.915
#> clust_M1_lam0_k50_res1 clust_M1_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1 1.000 0.782
#> clust_M0_lam0.2_k50_res1 0.741 0.915
#> clust_M1_lam0_k50_res1 1.000 0.782
#> clust_M1_lam0.2_k50_res1 0.782 1.000
See ?compareClusters for the full list of comparison measures.
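For readers unfamiliar with the ARI, the sketch below computes it from the contingency table of two labelings using the standard pair-counting formula. The function adjusted_rand_index is a hypothetical base-R illustration and is not claimed to be what compareClusters calls internally. It returns NaN in the degenerate case where both labelings place all cells in a single cluster.

```r
# Adjusted Rand index between two labelings of the same cells
adjusted_rand_index <- function(x, y) {
  stopifnot(length(x) == length(y))
  n_pairs <- function(n) n * (n - 1) / 2   # number of unordered pairs
  tab <- table(x, y)                       # contingency table of the labelings

  sum_ij <- sum(n_pairs(tab))              # pairs grouped together in both
  sum_a <- sum(n_pairs(rowSums(tab)))      # pairs grouped together in x
  sum_b <- sum(n_pairs(colSums(tab)))      # pairs grouped together in y

  expected <- sum_a * sum_b / n_pairs(length(x))  # agreement expected by chance
  max_index <- (sum_a + sum_b) / 2                # upper bound on agreement
  (sum_ij - expected) / (max_index - expected)
}

labels_a <- c(1, 1, 2, 2, 3, 3)
labels_b <- c("b", "b", "c", "c", "a", "a")  # same partition, different names

ari_same <- adjusted_rand_index(labels_a, labels_b)
ari_diff <- adjusted_rand_index(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

The index depends only on which cells are grouped together, not on the label names, which is why runs can be compared before or after relabeling with connectClusters.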
Vignette runtime:
#> Time difference of 52.79655 secs
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] ExperimentHub_2.14.0 AnnotationHub_3.14.0
#> [3] BiocFileCache_2.14.0 dbplyr_2.5.0
#> [5] spatialLIBD_1.17.10 cowplot_1.1.3
#> [7] scater_1.34.0 ggplot2_3.5.1
#> [9] harmony_1.2.1 Rcpp_1.0.13
#> [11] data.table_1.16.2 scran_1.34.0
#> [13] scuttle_1.16.0 Seurat_5.1.0
#> [15] SeuratObject_5.0.2 sp_2.1-4
#> [17] SpatialExperiment_1.16.0 SingleCellExperiment_1.28.0
#> [19] SummarizedExperiment_1.36.0 Biobase_2.66.0
#> [21] GenomicRanges_1.58.0 GenomeInfoDb_1.42.0
#> [23] IRanges_2.40.0 S4Vectors_0.44.0
#> [25] BiocGenerics_0.52.0 MatrixGenerics_1.18.0
#> [27] matrixStats_1.4.1 Banksy_1.2.0
#> [29] BiocStyle_2.34.0
#>
#> loaded via a namespace (and not attached):
#> [1] bitops_1.0-9 spatstat.sparse_3.1-0 doParallel_1.0.17
#> [4] httr_1.4.7 RColorBrewer_1.1-3 tools_4.4.1
#> [7] sctransform_0.4.1 DT_0.33 utf8_1.2.4
#> [10] R6_2.5.1 lazyeval_0.2.2 uwot_0.2.2
#> [13] withr_3.0.2 gridExtra_2.3 progressr_0.15.0
#> [16] cli_3.6.3 spatstat.explore_3.3-3 fastDummies_1.7.4
#> [19] labeling_0.4.3 sass_0.4.9 spatstat.data_3.1-2
#> [22] ggridges_0.5.6 pbapply_1.7-2 Rsamtools_2.22.0
#> [25] dbscan_1.2-0 aricode_1.0.3 dichromat_2.0-0.1
#> [28] sessioninfo_1.2.2 parallelly_1.38.0 attempt_0.3.1
#> [31] maps_3.4.2 limma_3.62.0 pals_1.9
#> [34] RSQLite_2.3.7 BiocIO_1.16.0 generics_0.1.3
#> [37] ica_1.0-3 spatstat.random_3.3-2 dplyr_1.1.4
#> [40] Matrix_1.7-1 ggbeeswarm_0.7.2 fansi_1.0.6
#> [43] abind_1.4-8 lifecycle_1.0.4 yaml_2.3.10
#> [46] edgeR_4.4.0 SparseArray_1.6.0 Rtsne_0.17
#> [49] paletteer_1.6.0 grid_4.4.1 blob_1.2.4
#> [52] promises_1.3.0 dqrng_0.4.1 crayon_1.5.3
#> [55] miniUI_0.1.1.1 lattice_0.22-6 beachmat_2.22.0
#> [58] mapproj_1.2.11 KEGGREST_1.46.0 magick_2.8.5
#> [61] pillar_1.9.0 knitr_1.48 metapod_1.14.0
#> [64] rjson_0.2.23 future.apply_1.11.3 codetools_0.2-20
#> [67] leiden_0.4.3.1 glue_1.8.0 spatstat.univar_3.0-1
#> [70] vctrs_0.6.5 png_0.1-8 spam_2.11-0
#> [73] gtable_0.3.6 rematch2_2.1.2 cachem_1.1.0
#> [76] xfun_0.48 S4Arrays_1.6.0 mime_0.12
#> [79] survival_3.7-0 RcppHungarian_0.3 iterators_1.0.14
#> [82] tinytex_0.53 fields_16.3 statmod_1.5.0
#> [85] bluster_1.16.0 fitdistrplus_1.2-1 ROCR_1.0-11
#> [88] nlme_3.1-166 bit64_4.5.2 filelock_1.0.3
#> [91] RcppAnnoy_0.0.22 bslib_0.8.0 irlba_2.3.5.1
#> [94] vipor_0.4.7 KernSmooth_2.23-24 colorspace_2.1-1
#> [97] DBI_1.2.3 tidyselect_1.2.1 bit_4.5.0
#> [100] compiler_4.4.1 curl_5.2.3 BiocNeighbors_2.0.0
#> [103] DelayedArray_0.32.0 plotly_4.10.4 rtracklayer_1.66.0
#> [106] bookdown_0.41 scales_1.3.0 lmtest_0.9-40
#> [109] rappdirs_0.3.3 stringr_1.5.1 digest_0.6.37
#> [112] goftest_1.2-3 spatstat.utils_3.1-0 rmarkdown_2.28
#> [115] benchmarkmeData_1.0.4 RhpcBLASctl_0.23-42 XVector_0.46.0
#> [118] htmltools_0.5.8.1 pkgconfig_2.0.3 highr_0.11
#> [121] fastmap_1.2.0 rlang_1.1.4 htmlwidgets_1.6.4
#> [124] UCSC.utils_1.2.0 shiny_1.9.1 farver_2.1.2
#> [127] jquerylib_0.1.4 zoo_1.8-12 jsonlite_1.8.9
#> [130] BiocParallel_1.40.0 mclust_6.1.1 config_0.3.2
#> [133] RCurl_1.98-1.16 BiocSingular_1.22.0 magrittr_2.0.3
#> [136] GenomeInfoDbData_1.2.13 dotCall64_1.2 patchwork_1.3.0
#> [139] munsell_0.5.1 viridis_0.6.5 reticulate_1.39.0
#> [142] leidenAlg_1.1.4 stringi_1.8.4 zlibbioc_1.52.0
#> [145] MASS_7.3-61 plyr_1.8.9 parallel_4.4.1
#> [148] listenv_0.9.1 ggrepel_0.9.6 deldir_2.0-4
#> [151] Biostrings_2.74.0 sccore_1.0.5 splines_4.4.1
#> [154] tensor_1.5 locfit_1.5-9.10 igraph_2.1.1
#> [157] spatstat.geom_3.3-3 RcppHNSW_0.6.0 reshape2_1.4.4
#> [160] ScaledMatrix_1.14.0 XML_3.99-0.17 BiocVersion_3.20.0
#> [163] evaluate_1.0.1 golem_0.5.1 BiocManager_1.30.25
#> [166] foreach_1.5.2 httpuv_1.6.15 RANN_2.6.2
#> [169] tidyr_1.3.1 purrr_1.0.2 polyclip_1.10-7
#> [172] benchmarkme_1.0.8 future_1.34.0 scattermore_1.2
#> [175] rsvd_1.0.5 xtable_1.8-4 restfulr_0.0.15
#> [178] RSpectra_0.16-2 later_1.3.2 viridisLite_0.4.2
#> [181] tibble_3.2.1 GenomicAlignments_1.42.0 memoise_2.0.1
#> [184] beeswarm_0.4.0 AnnotationDbi_1.68.0 cluster_2.1.6
#> [187] shinyWidgets_0.8.7 globals_0.16.3