ClustAll 1.2.0
ClustAll is an R package designed for patient stratification in complex diseases. It addresses common challenges encountered in clinical data analysis and provides a versatile framework for identifying patient subgroups.
Patient stratification is essential in biomedical research for understanding disease heterogeneity, identifying prognostic factors, and guiding personalized treatment strategies. The ClustAll underlying concept is that a robust stratification should be reproducible through various clustering methods. ClustAll employs diverse distance metrics (Correlation-based distance and Gower distance) and clustering methods (K-Means, K-Medoids, and H-Clust).
The names of ClustAll stratification outputs consist of a letter followed by a number, such as cuts_a_9. The letter denotes the combination of distance metric and clustering method utilized to generate the particular stratification, while the number corresponds to the embedding derived from the depth at which the dendrogram with grouped variables was cut.
Nomenclature | Distance.Metric | Clustering.Method |
---|---|---|
a | Correlation | K-means |
b | Correlation | Hierarchical Clustering |
c | Gower | K-medoids |
d | Gower | Hierarchical-Clustering |
ClustAll is developed using S4 object-oriented programming, and requires R (>=4.2.0). It utilizes other R packages that are currently available from CRAN and Bioconductor.
You can find the package repository on GitHub, ClustAll.
The ClustAll package can be downloaded and installed by running the following code within R:
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager ")
BiocManager::install("ClustAll")
After installation, load the ClustAll package:
library(ClustAll)
We will use the data provided in the data package ClustAll to demonstrate how to stratify a population using clinical data.
Breast Cancer Wisconsin (Diagnostic) ClustAll includes a real dataset of breast cancer, described at doi: 10.24432/C5DW2B. This dataset comprises two types of features —categorical and numerical— derived from a digitized image of a fine needle aspirate (FNA) of a breast mass from 659 patients. Each patient is characterized by 30 features (10x3) and belongs to one of two target classes: ‘malignant’ or ‘benign’. To showcase ClustAll’s when dealing with missing data, a modification with random missing values was applied to the dataset, demonstrating the package’s resilience in handling missing data. The breast cancer dataset includes the following features:
The dataset also includes the patient ID and diagnosis (M = malignant, B = benign).
We denote the data set as BreastCancerWisconsin (wdbc).
We load the breast cancer dataset, which is available in Kaggle. The data set can be loaded with the following code:
data("BreastCancerWisconsin", package = "ClustAll")
data_use <- subset(wdbc, select=-ID)
An initial exploration reveals the absence of missing values. The dataset comprises 30 numerical features and one categorical feature (the ground truth). As the initial data does not contain missing values we will apply the ClustAll workflow accordingly.
sum(is.na(data_use))
#> [1] 0
dim(data_use)
#> [1] 569 31
First, the ClustAllObject is created and stored. In this step, we indicate if there is a feature that contains the ground truth (true labels) in the argument colValidation
. This feature is not consider to compute the stratification. In this specific case, it corresponds to “Diagnosis”.
obj_noNA <- createClustAll(data = data_use, nImputation = NULL,
dataImputed = NULL, colValidation = "Diagnosis")
#> The dataset contains character values.
#> They will be transformed into categorical (more than one class) or binary (one class).
#> Before continuing, check that the transformation has been processed correctly.
#>
#> ClustALL object was created successfully. You can run runClustAll.
Next, we apply the ClustAll algorithm. The output is stored in a ClustAllObject, which contains the clustering results.
obj_noNA1 <- runClustAll(Object = obj_noNA, threads = 2, simplify = FALSE)
#> ______ __ __ ___ __ __
#> / ____// /__ __ _____ / /_ / | / / / /
#> / / / // / / // ___// __// /| | / / / /
#> / /___ / // /_/ /(__ )/ /_ / ___ | / /___ / /___
#> /_____//_/ |__,_//____/ |__//_/ |_|/_____//_____/
#> Running Data Complexity Reduction and Stratification Process.
#> This step may take some time...
#>
#>
#> Calculating the correlation distance matrix of the stratifications...
#>
#>
#> Filtering non-robust stratifications...
#>
#> ClustAll pipeline finished successfully!
We show the object:
obj_noNA1
#> ClustAllObject
#> Data: Number of variables: 30. Number of patients: 569
#> Imputated: NO.
#> Number of imputations: 0
#> Processed: TRUE
#> Number of stratifications: 72
To display population-robust stratifications (>85% bootstrapping stability), we call plotJaccard
, using the ClustAllObject as input. In addition, we specify the threshold to consider similar a pair of stratifications in the stratification_similarity
argument.
In this specific case, a similarity of 0.9 reveals three different groups of alternatives for stratifying the population, indicated by the the red rectangles:
plotJACCARD(Object = obj_noNA1, stratification_similarity = 0.9, paint = TRUE)
We can displayed the centroids (a representative) from each group of alternative stratification solutions (highlighted in red squares in the previous step) using resStratification
. Each representative stratification illustrates the number of clusters and the patients belonging to each cluster.
In this case, the alternative stratifications have been computed with the following specifications:
resStratification(Object = obj_noNA1, population = 0.05,
stratification_similarity = 0.9, all = FALSE)
#> $cuts_a_28
#> $cuts_a_28[[1]]
#>
#> 1 2
#> 183 386
#>
#>
#> $cuts_c_9
#> $cuts_c_9[[1]]
#>
#> 1 2
#> 197 372
#>
#>
#> $cuts_c_4
#> $cuts_c_4[[1]]
#>
#> 1 2
#> 199 370
In order to compare two sets of representative stratifications, plotSankey
was implemented. The ClustAllObject is used as input. In addition, we specify the pairs of stratifications we want to compare in the clusters
argument.
In this case, the first Sankey plot illustrates patient transitions between two sets of representative stratifications (cuts_c_9 and cuts_a_28), revealing the flow and distribution of patients across the clusters. The second Sankey plot illustrates patient transitions between a representative stratifications (cuts_a_28) and the ground truth, revealing the flow and distribution of patients across the clusters.
The stratification representatives can be added to the initial dataset to facilitate further exploration.
In this case, we add the three stratification representatives to the dataset. For simplicity, we show the two top rows of the dataset:
df <- cluster2data(Object = obj_noNA1,
stratificationName = c("cuts_c_9","cuts_a_28","cuts_c_4"))
head(df,3)
#> radius1 texture1 perimeter1 area1 smoothness1 compactness1 concavity1
#> 1 17.99 10.38 122.8 1001 0.11840 0.27760 0.3001
#> 2 20.57 17.77 132.9 1326 0.08474 0.07864 0.0869
#> 3 19.69 21.25 130.0 1203 0.10960 0.15990 0.1974
#> concave_points1 symmetry1 fractal_dimension1 radius2 texture2 perimeter2
#> 1 0.14710 0.2419 0.07871 1.0950 0.9053 8.589
#> 2 0.07017 0.1812 0.05667 0.5435 0.7339 3.398
#> 3 0.12790 0.2069 0.05999 0.7456 0.7869 4.585
#> area2 smoothness2 compactness2 concavity2 concave_points2 symmetry2
#> 1 153.40 0.006399 0.04904 0.05373 0.01587 0.03003
#> 2 74.08 0.005225 0.01308 0.01860 0.01340 0.01389
#> 3 94.03 0.006150 0.04006 0.03832 0.02058 0.02250
#> fractal_dimension2 radius3 texture3 perimeter3 area3 smoothness3 compactness3
#> 1 0.006193 25.38 17.33 184.6 2019 0.1622 0.6656
#> 2 0.003532 24.99 23.41 158.8 1956 0.1238 0.1866
#> 3 0.004571 23.57 25.53 152.5 1709 0.1444 0.4245
#> concavity3 concave_points3 symmetry3 fractal_dimension3 cuts_c_9 cuts_a_28
#> 1 0.7119 0.2654 0.4601 0.11890 1 1
#> 2 0.2416 0.1860 0.2750 0.08902 1 1
#> 3 0.4504 0.2430 0.3613 0.08758 1 1
#> cuts_c_4
#> 1 1
#> 2 1
#> 3 1
To evaluate the performance of the selected ClustAll stratifications against ground truth, sensitivity and specificity can be assessed using validateStratification
. Higher values indicate greater precision in the stratification process.
In this particular case, our method retrieves three stratification representatives with a sensitivity and specificity exceeding 80% and 90%, respectively, despite being computed using different methods. These results underscore the notion that a robust stratification should be consistent across diverse clustering methods.
# STRATIFICATION 1
validateStratification(obj_noNA1, "cuts_a_28")
#> sensitivity specificity
#> 0.8207547 0.9747899
# STRATIFICATION 2
validateStratification(obj_noNA1, "cuts_c_9")
#> sensitivity specificity
#> 0.8584906 0.9579832
# STRATIFICATION 3
validateStratification(obj_noNA1, "cuts_c_4")
#> sensitivity specificity
#> 0.8820755 0.9663866
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ClustAll_1.2.0 BiocStyle_2.34.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 jsonlite_1.8.9 shape_1.4.6.1
#> [4] magrittr_2.0.3 magick_2.8.5 TH.data_1.1-2
#> [7] estimability_1.5.1 modeltools_0.2-23 jomo_2.7-6
#> [10] nloptr_2.1.1 rmarkdown_2.28 GlobalOptions_0.1.2
#> [13] vctrs_0.6.5 Cairo_1.6-2 minqa_1.2.8
#> [16] tinytex_0.53 htmltools_0.5.8.1 broom_1.0.7
#> [19] mitml_0.4-5 sass_0.4.9 bslib_0.8.0
#> [22] htmlwidgets_1.6.4 sandwich_3.1-1 emmeans_1.10.5
#> [25] zoo_1.8-12 cachem_1.1.0 networkD3_0.4
#> [28] igraph_2.1.1 lifecycle_1.0.4 iterators_1.0.14
#> [31] pkgconfig_2.0.3 Matrix_1.7-1 R6_2.5.1
#> [34] fastmap_1.2.0 clue_0.3-65 digest_0.6.37
#> [37] colorspace_2.1-1 spatial_7.3-17 ps_1.8.1
#> [40] S4Vectors_0.44.0 rmio_0.4.0 fansi_1.0.6
#> [43] compiler_4.4.1 doParallel_1.0.17 backports_1.5.0
#> [46] highr_0.11 bigassertr_0.1.6 pan_1.9
#> [49] MASS_7.3-61 rjson_0.2.23 scatterplot3d_0.3-44
#> [52] fBasics_4041.97 flashClust_1.01-2 tools_4.4.1
#> [55] bigstatsr_1.6.1 prabclus_2.3-4 FactoMineR_2.11
#> [58] nnet_7.3-19 glue_1.8.0 stabledist_0.7-2
#> [61] nlme_3.1-166 bigparallelr_0.3.2 grid_4.4.1
#> [64] cluster_2.1.6 generics_0.1.3 snow_0.4-4
#> [67] gtable_0.3.6 flock_0.7 class_7.3-22
#> [70] tidyr_1.3.1 utf8_1.2.4 rmutil_1.1.10
#> [73] flexmix_2.3-19 BiocGenerics_0.52.0 ggrepel_0.9.6
#> [76] foreach_1.5.2 pillar_1.9.0 clValid_0.7
#> [79] robustbase_0.99-4-1 circlize_0.4.16 splines_4.4.1
#> [82] dplyr_1.1.4 lattice_0.22-6 bit_4.5.0
#> [85] survival_3.7-0 tidyselect_1.2.1 ComplexHeatmap_2.22.0
#> [88] pbapply_1.7-2 knitr_1.48 bookdown_0.41
#> [91] IRanges_2.40.0 stats4_4.4.1 xfun_0.48
#> [94] diptest_0.77-1 timeDate_4041.110 matrixStats_1.4.1
#> [97] DEoptimR_1.1-3 DT_0.33 yaml_2.3.10
#> [100] boot_1.3-31 evaluate_1.0.1 codetools_0.2-20
#> [103] kernlab_0.9-33 timeSeries_4041.111 tibble_3.2.1
#> [106] BiocManager_1.30.25 multcompView_0.1-10 cli_3.6.3
#> [109] rpart_4.1.23 xtable_1.8-4 munsell_0.5.1
#> [112] jquerylib_0.1.4 Rcpp_1.0.13 doSNOW_1.0.20
#> [115] stable_1.1.6 coda_0.19-4.1 png_0.1-8
#> [118] parallel_4.4.1 modeest_2.4.0 leaps_3.2
#> [121] ggplot2_3.5.1 mclust_6.1.1 ff_4.5.0
#> [124] lme4_1.1-35.5 glmnet_4.1-8 mvtnorm_1.3-1
#> [127] scales_1.3.0 statip_0.2.3 purrr_1.0.2
#> [130] crayon_1.5.3 fpc_2.2-13 GetoptLong_1.0.5
#> [133] rlang_1.1.4 cowplot_1.1.3 multcomp_1.4-26
#> [136] mice_3.16.0