JASPAR (http://jaspar.genereg.net/) is an open-access database containing manually curated, non-redundant transcription factor (TF) binding profiles for TFs across six taxonomic groups. In this 9th release, we expanded the CORE collection with 341 new profiles (148 for plants, 101 for vertebrates, 85 for urochordates, and 7 for insects), which corresponds to a 19% expansion over the previous release. We added 298 new profiles to the Unvalidated collection when no orthogonal evidence was found in the literature. All the profiles were clustered to provide familial binding profiles for each taxonomic group. Moreover, we revised the structural classification of DNA binding domains to consider plant-specific TFs. This release introduces word clouds to represent the scientific knowledge associated with each TF. We updated the genome tracks of TFBSs predicted with JASPAR profiles in eight organisms; the human and mouse TFBS predictions can be visualized as native tracks in the UCSC Genome Browser. Finally, we provide a new tool to perform JASPAR TFBS enrichment analysis in user-provided genomic regions. All the data is accessible through the JASPAR website, its associated RESTful API, the R/Bioconductor data package, and a new Python package, pyJASPAR, that facilitates serverless access to the data.
The easiest way to use the JASPAR2022 data package (Castro-Mondragon et al. 2021) is by TFBSTools
package interface (Tan and Lenhard 2016), which provides functions to retrieve and manipulate data from the JASPAR database. This vignette demonstrates how to use those functions.
library(JASPAR2022)
library(TFBSTools)
Matrices from JASPAR can be retrieved using either getMatrixByID
or getMatrixByName
function by providing a matrix ID or a matrix name from JASPAR, respectively. These functions accept either a single element as the ID/name parameter or a vector of values. The former case returns a PFMatrix
object, while the later one returns a PFMatrixList
with multiple PFMatrix
objects.
## the user assigns a single matrix ID to the argument ID
pfm <- getMatrixByID(JASPAR2022, ID = "MA0139.1")
## the function returns a PFMatrix object
pfm
#> An object of class PFMatrix
#> ID: MA0139.1
#> Name: CTCF
#> Matrix Class: C2H2 zinc finger factors
#> strand: +
#> Tags:
#> $comment
#> [1] "TF has several motif variants."
#>
#> $family
#> [1] "More than 3 adjacent zinc fingers"
#>
#> $medline
#> [1] "17512414"
#>
#> $remap_tf_name
#> [1] "CTCF"
#>
#> $tax_group
#> [1] "vertebrates"
#>
#> $tfbs_shape_id
#> [1] "133"
#>
#> $type
#> [1] "ChIP-seq"
#>
#> $collection
#> [1] "CORE"
#>
#> $species
#> 9606
#> "Homo sapiens"
#>
#> $acc
#> [1] "P49711"
#>
#> Background:
#> A C G T
#> 0.25 0.25 0.25 0.25
#> Matrix:
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> A 87 167 281 56 8 744 40 107 851 5 333 54 12 56
#> C 291 145 49 800 903 13 528 433 11 0 3 12 0 8
#> G 76 414 449 21 0 65 334 48 32 903 566 504 890 775
#> T 459 187 134 36 2 91 11 324 18 3 9 341 8 71
#> [,15] [,16] [,17] [,18] [,19]
#> A 104 372 82 117 402
#> C 733 13 482 322 181
#> G 5 507 307 73 266
#> T 67 17 37 396 59
The user can utilise the PFMatrix object for further analysis and visualisation. Here is an example of how to plot a sequence logo of a given matrix using functions available in TFBSTools
package.
seqLogo(toICM(pfm))
## the user assigns multiple matrix IDs to the argument ID
pfmList <- getMatrixByID(JASPAR2022, ID=c("MA0139.1", "MA1102.1"))
## the function returns a PFMatrix object
pfmList
#> PFMatrixList of length 2
#> names(2): MA0139.1 MA1102.1
## PFMatrixList can be subsetted to extract enclosed PFMatrix objects
pfmList[[2]]
#> An object of class PFMatrix
#> ID: MA1102.1
#> Name: CTCFL
#> Matrix Class: C2H2 zinc finger factors
#> strand: +
#> Tags:
#> $family
#> [1] "More than 3 adjacent zinc fingers"
#>
#> $medline
#> [1] "26268681"
#>
#> $remap_tf_name
#> [1] "CTCFL"
#>
#> $source
#> [1] "29126285"
#>
#> $tax_group
#> [1] "vertebrates"
#>
#> $tfbs_shape_id
#> [1] "1"
#>
#> $type
#> [1] "ChIP-seq"
#>
#> $collection
#> [1] "CORE"
#>
#> $species
#> 9606
#> "Homo sapiens"
#>
#> $acc
#> [1] "Q8NI51"
#>
#> Background:
#> A C G T
#> 0.25 0.25 0.25 0.25
#> Matrix:
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> A 627 3344 909 499 8422 76 568 266 123 127 117 4022 714 1329
#> C 6915 1487 4650 6703 166 93 145 325 179 136 8625 730 3772 3392
#> G 860 2520 2898 535 152 8596 8035 7716 8447 8380 37 3702 3448 1810
#> T 423 1474 368 1088 85 60 77 518 76 182 46 371 891 2294
getMatrixByName
retrieves matrices by name. If multiple matrix names are supplied, the function returns a PFMatrixList object.
pfm <- getMatrixByName(JASPAR2022, name="Arnt")
pfm
#> An object of class PFMatrix
#> ID: MA0004.1
#> Name: Arnt
#> Matrix Class: Basic helix-loop-helix factors (bHLH)
#> strand: +
#> Tags:
#> $alias
#> [1] "HIF-1beta,bHLHe2"
#>
#> $description
#> [1] "aryl hydrocarbon receptor nuclear translocator"
#>
#> $family
#> [1] "PAS domain factors"
#>
#> $medline
#> [1] "7592839"
#>
#> $remap_tf_name
#> [1] "ARNT"
#>
#> $symbol
#> [1] "ARNT"
#>
#> $tax_group
#> [1] "vertebrates"
#>
#> $tfbs_shape_id
#> [1] "11"
#>
#> $type
#> [1] "SELEX"
#>
#> $unibind
#> [1] "1"
#>
#> $collection
#> [1] "CORE"
#>
#> $species
#> 10090
#> "Mus musculus"
#>
#> $acc
#> [1] "P53762"
#>
#> Background:
#> A C G T
#> 0.25 0.25 0.25 0.25
#> Matrix:
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> A 4 19 0 0 0 0
#> C 16 0 20 0 0 0
#> G 0 1 0 20 0 20
#> T 0 0 0 0 20 0
pfmList <- getMatrixByName(JASPAR2022, name=c("Arnt", "Ahr::Arnt"))
pfmList
#> PFMatrixList of length 2
#> names(2): Arnt Ahr::Arnt
The getMatrixSet
function fetches all matrices that match criteria defined by the named arguments, and it returns a PFMatrixList
object.
## select all matrices found in a specific species and constructed from the
## SELEX experiment
opts <- list()
opts[["species"]] <- 9606
opts[["type"]] <- "SELEX"
opts[["all_versions"]] <- TRUE
PFMatrixList <- getMatrixSet(JASPAR2022, opts)
PFMatrixList
#> PFMatrixList of length 47
#> names(47): MA0002.1 MA0003.1 MA0018.1 MA0025.1 ... MA0124.1 MA0130.1 MA0131.1
## retrieve all matrices constructed from SELEX experiment
opts2 <- list()
opts2[["type"]] <- "SELEX"
PFMatrixList2 <- getMatrixSet(JASPAR2022, opts2)
PFMatrixList2
#> PFMatrixList of length 156
#> names(156): MA0004.1 MA0006.1 MA0015.1 MA0016.1 ... MA1925.1 MA1926.1 MA1927.1
Additional details about TFBS matrix analysis can be found in the TFBSTools documantation.
Here is the output of sessionInfo()
on the system on which this document was compiled:
#> R Under development (unstable) (2022-03-17 r81925)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] TFBSTools_1.33.0 JASPAR2022_0.99.7 BiocFileCache_2.3.4
#> [4] dbplyr_2.1.1 BiocStyle_2.23.1
#>
#> loaded via a namespace (and not attached):
#> [1] bitops_1.0-7 matrixStats_0.61.0
#> [3] DirichletMultinomial_1.37.0 bit64_4.0.5
#> [5] filelock_1.0.2 httr_1.4.2
#> [7] GenomeInfoDb_1.31.6 tools_4.2.0
#> [9] bslib_0.3.1 utf8_1.2.2
#> [11] R6_2.5.1 colorspace_2.0-3
#> [13] seqLogo_1.61.1 DBI_1.1.2
#> [15] BiocGenerics_0.41.2 tidyselect_1.1.2
#> [17] bit_4.0.4 curl_4.3.2
#> [19] compiler_4.2.0 cli_3.2.0
#> [21] Biobase_2.55.0 formatR_1.11
#> [23] DelayedArray_0.21.2 rtracklayer_1.55.3
#> [25] bookdown_0.25 sass_0.4.0
#> [27] caTools_1.18.2 scales_1.1.1
#> [29] readr_2.1.2 rappdirs_0.3.3
#> [31] stringr_1.4.0 digest_0.6.29
#> [33] Rsamtools_2.11.0 R.utils_2.11.0
#> [35] rmarkdown_2.13 XVector_0.35.0
#> [37] pkgconfig_2.0.3 htmltools_0.5.2
#> [39] MatrixGenerics_1.7.0 highr_0.9
#> [41] BSgenome_1.63.5 fastmap_1.1.0
#> [43] rlang_1.0.2 RSQLite_2.2.10
#> [45] jquerylib_0.1.4 BiocIO_1.5.0
#> [47] generics_0.1.2 jsonlite_1.8.0
#> [49] BiocParallel_1.29.18 gtools_3.9.2
#> [51] R.oo_1.24.0 dplyr_1.0.8
#> [53] RCurl_1.98-1.6 magrittr_2.0.2
#> [55] GO.db_3.14.0 GenomeInfoDbData_1.2.7
#> [57] Matrix_1.4-0 Rcpp_1.0.8.3
#> [59] munsell_0.5.0 S4Vectors_0.33.12
#> [61] fansi_1.0.2 R.methodsS3_1.8.1
#> [63] lifecycle_1.0.1 stringi_1.7.6
#> [65] yaml_2.3.5 SummarizedExperiment_1.25.3
#> [67] zlibbioc_1.41.0 plyr_1.8.6
#> [69] grid_4.2.0 blob_1.2.2
#> [71] parallel_4.2.0 crayon_1.5.0
#> [73] CNEr_1.31.1 lattice_0.20-45
#> [75] Biostrings_2.63.2 annotate_1.73.0
#> [77] KEGGREST_1.35.0 hms_1.1.1
#> [79] magick_2.7.3 knitr_1.37
#> [81] pillar_1.7.0 GenomicRanges_1.47.6
#> [83] rjson_0.2.21 reshape2_1.4.4
#> [85] stats4_4.2.0 TFMPvalue_0.0.8
#> [87] XML_3.99-0.9 glue_1.6.2
#> [89] evaluate_0.15 BiocManager_1.30.16
#> [91] png_0.1-7 vctrs_0.3.8
#> [93] tzdb_0.2.0 poweRlaw_0.70.6
#> [95] gtable_0.3.0 purrr_0.3.4
#> [97] assertthat_0.2.1 cachem_1.0.6
#> [99] ggplot2_3.3.5 xfun_0.30
#> [101] xtable_1.8-4 pracma_2.3.8
#> [103] restfulr_0.0.13 tibble_3.1.6
#> [105] AnnotationDbi_1.57.1 GenomicAlignments_1.31.2
#> [107] memoise_2.0.1 IRanges_2.29.1
#> [109] ellipsis_0.3.2
Castro-Mondragon, Jaime A, Rafael Riudavets-Puig, Ieva Rauluseviciute, Roza Berhanu Lemma, Laura Turchi, Romain Blanc-Mathieu, Jeremy Lucas, et al. 2021. “JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles.” Nucleic Acids Research 50 (D1): D165–D173. https://doi.org/10.1093/nar/gkab1113.
Tan, Ge, and Boris Lenhard. 2016. “TFBSTools: An R/Bioconductor Package for Transcription Factor Binding Site Analysis.” Bioinformatics 32: 1555–6.