The GrafGen package classifies Helicobacter pylori genomes according to their genetic distance from nine reference populations, as defined by equation 2 in Jin (2019). The main function in this package is grafGen(), which requires a file of genotypes that can be either a PLINK bed file or a VCF file.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GrafGen")
Before using the GrafGen package, it must be loaded into an R session.
library(GrafGen)
The GrafGen package includes example data, a subset of the reference data that was used to train the model. The data are stored in the extdata folder.
dir <- system.file("extdata", package="GrafGen", mustWork=TRUE)
geno.file <- paste0(dir, .Platform$file.sep, "data.vcf.gz")
print(geno.file)
## [1] "/home/biocbuild/bbs-3.20-bioc/tmpdir/RtmpyHZDgV/Rinst979112e9f2f4/GrafGen/extdata/data.vcf.gz"
The grafGen() function returns a list of class “grafpop” with two objects: table and vertex. The object table is a data frame containing hypothetical ancestry percents (F_percent, E_percent and A_percent) based on known African, European and Asian samples, respectively; normalized genetic distance scores (GD1_x, GD2_y, GD3_z); the predicted reference population (Refpop); the nearest neighboring reference population (Nearest_neighbor); the percent separation (Separation_percent) as defined in the user manual; and the genetic distances to each reference population (hpgpAfrica, hpgpAfrica-distant, hpgpAfroamerica, hpgpEuroamerica, hpgpMediterranea, hpgpEurope, hpgpEurasia, hpgpAsia, and hpgpAklavik86-like).
The object vertex is a list containing the (fixed) x-y coordinates of the African, European and Asian vertex population centroids.
ret <- grafGen(geno.file, print=0)
ret$table[seq_len(5), ]
##         Sample N_SNPs    GD1_x    GD2_y     GD3_z F_percent E_percent A_percent
## 1 HpGP-ALG-002  35528 1.325330 1.246303 -0.008719     27.79     72.21         0
## 2 HpGP-ALG-004  35528 1.355911 1.264769  0.004511     19.76     80.24         0
## 3 HpGP-ALG-005  35528 1.350071 1.267337 -0.003531     19.70     80.30         0
## 4 HpGP-ALG-006  35528 1.340957 1.265292 -0.002128     21.14     78.86         0
## 5 HpGP-ALG-010  35528 1.343997 1.266336  0.003096     20.57     79.43         0
##   hpgpAfrica hpgpAfrica-distant hpgpAfroamerica hpgpEuroamerica
## 1   0.398096           0.661930        0.324232        0.276226
## 2   0.429534           0.660384        0.336875        0.279032
## 3   0.420432           0.655047        0.331030        0.275570
## 4   0.416124           0.658398        0.327399        0.275852
## 5   0.422221           0.657674        0.333455        0.279073
##   hpgpMediterranea hpgpEurope hpgpEurasia hpgpAsia hpgpAklavik86-like
## 1         0.277100   0.305868    0.395434 0.577032           0.597266
## 2         0.260454   0.289007    0.380532 0.564095           0.587367
## 3         0.256573   0.284675    0.379566 0.565708           0.589778
## 4         0.257503   0.288340    0.383189 0.573169           0.592772
## 5         0.261808   0.288747    0.384723 0.573536           0.594683
##             Refpop Nearest_neighbor Separation_percent
## 1  hpgpEuroamerica hpgpMediterranea               0.32
## 2 hpgpMediterranea  hpgpEuroamerica               7.13
## 3 hpgpMediterranea  hpgpEuroamerica               7.40
## 4 hpgpMediterranea  hpgpEuroamerica               7.13
## 5 hpgpMediterranea  hpgpEuroamerica               6.59
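To see how the last three columns relate to the nine distance columns, the sketch below re-enters the distances printed above for the five samples, so it runs without the package installed. It takes the closest population as Refpop and the second closest as Nearest_neighbor. The separation formula 100 * (d2 - d1) / d1 is an assumption inferred from these five rows only; the user manual remains the authoritative definition. The names gd, ord, d1, d2 and check are hypothetical helpers, not part of GrafGen.

```r
# Reference population names, in the column order of ret$table
pops <- c("hpgpAfrica", "hpgpAfrica-distant", "hpgpAfroamerica",
          "hpgpEuroamerica", "hpgpMediterranea", "hpgpEurope",
          "hpgpEurasia", "hpgpAsia", "hpgpAklavik86-like")

# Genetic distances copied from the five printed rows (one row per sample)
gd <- matrix(c(
  0.398096, 0.661930, 0.324232, 0.276226, 0.277100, 0.305868, 0.395434, 0.577032, 0.597266,
  0.429534, 0.660384, 0.336875, 0.279032, 0.260454, 0.289007, 0.380532, 0.564095, 0.587367,
  0.420432, 0.655047, 0.331030, 0.275570, 0.256573, 0.284675, 0.379566, 0.565708, 0.589778,
  0.416124, 0.658398, 0.327399, 0.275852, 0.257503, 0.288340, 0.383189, 0.573169, 0.592772,
  0.422221, 0.657674, 0.333455, 0.279073, 0.261808, 0.288747, 0.384723, 0.573536, 0.594683),
  nrow = 5, byrow = TRUE, dimnames = list(NULL, pops))

# For each sample, column indices sorted from smallest to largest distance
ord <- t(apply(gd, 1, order))
idx <- seq_len(nrow(gd))
d1  <- gd[cbind(idx, ord[, 1])]   # distance to the closest population
d2  <- gd[cbind(idx, ord[, 2])]   # distance to the second closest

check <- data.frame(
  Refpop             = pops[ord[, 1]],
  Nearest_neighbor   = pops[ord[, 2]],
  # ASSUMPTION: relative gap between the two smallest distances, in percent
  Separation_percent = round(100 * (d2 - d1) / d1, 2)
)
print(check)
```

For these five samples the derived values agree with the Refpop, Nearest_neighbor and Separation_percent columns shown above; agreement on other samples is not verified here.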
Printing the return object from grafGen() will display a table of frequency counts of the predicted reference populations for the user input data.
print(ret)
##
## Predicted reference population counts:
##         hpgpAfrica hpgpAfrica-distant    hpgpAfroamerica    hpgpEuroamerica
##                 15                  1                 10                 28
##   hpgpMediterranea         hpgpEurope        hpgpEurasia           hpgpAsia
##                 44                 57                 13                 36
## hpgpAklavik86-like
##                  2
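The counts are often easier to compare as shares of the whole sample set. The sketch below re-enters the printed counts by hand (so it does not need the package) and converts them to percentages; the names counts and pct are hypothetical. With the package loaded, tabulating ret$table$Refpop would presumably give the same counts, but that is an assumption not checked here.

```r
# Predicted reference population counts, copied from the printed output
counts <- c("hpgpAfrica"         = 15,
            "hpgpAfrica-distant" = 1,
            "hpgpAfroamerica"    = 10,
            "hpgpEuroamerica"    = 28,
            "hpgpMediterranea"   = 44,
            "hpgpEurope"         = 57,
            "hpgpEurasia"        = 13,
            "hpgpAsia"           = 36,
            "hpgpAklavik86-like" = 2)

# Total number of classified samples (206 for the example data)
n <- sum(counts)

# Share of each predicted population, in percent, largest first
pct <- round(100 * counts / n, 1)
print(sort(pct, decreasing = TRUE))
```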
Plotting the return object will display a plot of the genetic distance scores (GD1_x vs GD2_y) for the user input data and the reference data. Additional plots can be obtained by calling the grafGenPlot() function.
plot(ret)
The functions interactiveReferencePlot and interactivePlot create interactive plots for the reference data and user input data, respectively. A call to interactiveReferencePlot will show the results for all samples in the reference data. Hovering over a point in the plot will display three lines of information. Line 1 contains the type and id of that sample. Line 2 contains the sample’s reference population, next nearest reference population, and separation percent to the next nearest reference population as defined in the user manual. Line 3 contains the percent African, European and Asian ancestry for that sample. The legend shows the types (which are the source countries in interactiveReferencePlot) for all samples, and clicking the name of a type will add or remove those samples from the plot.
if (interactive()) interactiveReferencePlot()
The GrafGen package also includes an R shiny app to view and filter the plot using up to two variables. The function createApp returns a list containing the app and the data objects needed by the app. The app can then be launched with the runApp function.
tmp <- createApp(ret)
if (interactive()) {
    reference_results <- tmp$reference_results
    user_results <- tmp$user_results
    user_metadata <- tmp$user_metadata
    shiny::runApp(tmp$app)
}
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] GrafGen_1.2.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.48 bslib_0.8.0
## [4] ggplot2_3.5.1 htmlwidgets_1.6.4 rstatix_0.7.2
## [7] vctrs_0.6.5 tools_4.4.1 generics_0.1.3
## [10] stats4_4.4.1 tibble_3.2.1 fansi_1.0.6
## [13] highr_0.11 pkgconfig_2.0.3 data.table_1.16.2
## [16] RColorBrewer_1.1-3 S4Vectors_0.44.0 GenomeInfoDbData_1.2.13
## [19] lifecycle_1.0.4 farver_2.1.2 compiler_4.4.1
## [22] stringr_1.5.1 munsell_0.5.1 carData_3.0-5
## [25] httpuv_1.6.15 GenomeInfoDb_1.42.0 htmltools_0.5.8.1
## [28] sass_0.4.9 yaml_2.3.10 lazyeval_0.2.2
## [31] Formula_1.2-5 plotly_4.10.4 pillar_1.9.0
## [34] later_1.3.2 car_3.1-3 ggpubr_0.6.0
## [37] jquerylib_0.1.4 tidyr_1.3.1 MASS_7.3-61
## [40] cachem_1.1.0 abind_1.4-8 mime_0.12
## [43] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.4
## [46] dplyr_1.1.4 purrr_1.0.2 cowplot_1.1.3
## [49] fastmap_1.2.0 grid_4.4.1 colorspace_2.1-1
## [52] cli_3.6.3 magrittr_2.0.3 utf8_1.2.4
## [55] broom_1.0.7 withr_3.0.2 scales_1.3.0
## [58] UCSC.utils_1.2.0 promises_1.3.0 backports_1.5.0
## [61] XVector_0.46.0 rmarkdown_2.28 httr_1.4.7
## [64] ggsignif_0.6.4 shiny_1.9.1 evaluate_1.0.1
## [67] knitr_1.48 GenomicRanges_1.58.0 IRanges_2.40.0
## [70] viridisLite_0.4.2 rlang_1.1.4 Rcpp_1.0.13
## [73] xtable_1.8-4 glue_1.8.0 BiocGenerics_0.52.0
## [76] jsonlite_1.8.9 R6_2.5.1 zlibbioc_1.52.0