CTCF 0.99.4
CTCF defines an AnnotationHub resource representing genomic coordinates
of FIMO-predicted CTCF binding sites for motif MA0139.1 (JASPAR).
Get the latest stable R release from CRAN. Then install CTCF from Bioconductor
using the following code:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("CTCF")
suppressMessages(library(AnnotationHub))
ah <- AnnotationHub()
#> snapshotDate(): 2021-06-15
query_data <- query(ah, "CTCF")
query_data
#> AnnotationHub with 466 records
#> # snapshotDate(): 2021-06-15
#> # $dataprovider: UCSC, Haemcode, UCSC Jaspar, Pazar
#> # $species: Homo sapiens, Mus musculus, NA
#> # $rdataclass: GRanges, BigWigFile
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH22248"]]'
#>
#> title
#> AH22248 | pazar_CTCF_Cui_20120522.csv
#> AH22249 | pazar_CTCF_HEPG2_Schmidt_20120522.csv
#> AH22519 | wgEncodeAwgTfbsBroadDnd41CtcfUniPk.narrowPeak.gz
#> AH22521 | wgEncodeAwgTfbsBroadGm12878CtcfUniPk.narrowPeak.gz
#> AH22524 | wgEncodeAwgTfbsBroadH1hescCtcfUniPk.narrowPeak.gz
#> ... ...
#> AH28453 | CTCF_GSM918744_Immortalized_Erythroid.csv
#> AH95565 | CTCF_hg19.RData
#> AH95566 | CTCF_hg38.RData
#> AH95567 | CTCF_mm9.RData
#> AH95568 | CTCF_mm10.RData
The FIMO-predicted CTCF sites are named "CTCF_" followed by the genome assembly abbreviation, e.g., "CTCF_hg38". Use query_data <- query(ah, "CTCF_hg38") for a more targeted search.
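The naming convention can also be used to pick this package's records out of the larger query result. The sketch below is a minimal base R illustration on a few titles copied from the listing above. The helper name is_ctcf_fimo_title is hypothetical, and the regular expression assumes that the titles follow the pattern "CTCF_", an assembly abbreviation starting with hg or mm, and ".RData".

```r
# A few titles copied from the query listing above, keyed by record ID.
titles <- c(
  AH22248 = "pazar_CTCF_Cui_20120522.csv",
  AH28453 = "CTCF_GSM918744_Immortalized_Erythroid.csv",
  AH95565 = "CTCF_hg19.RData",
  AH95566 = "CTCF_hg38.RData",
  AH95567 = "CTCF_mm9.RData",
  AH95568 = "CTCF_mm10.RData"
)

# Hypothetical helper: TRUE for titles shaped like "CTCF_hg38.RData".
# The pattern is an assumption based on the four titles shown above.
is_ctcf_fimo_title <- function(x) {
  grepl("^CTCF_(hg|mm)[0-9]+\\.RData$", x)
}

# Record IDs whose titles match the naming convention.
fimo_ids <- names(titles)[is_ctcf_fimo_title(titles)]
fimo_ids
```

With the real hub object, the same helper could presumably be applied to query_data$title instead of the toy vector.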
We can check the details about the object.
query_data["AH95566"]
#> AnnotationHub with 1 record
#> # snapshotDate(): 2021-06-15
#> # names(): AH95566
#> # $dataprovider: UCSC Jaspar
#> # $species: Homo sapiens
#> # $rdataclass: GRanges
#> # $rdatadateadded: 2021-05-18
#> # $title: CTCF_hg38.RData
#> # $description: hg38 genomic coordinates of CTCF binding motif MA0139.1, det...
#> # $taxonomyid: 9606
#> # $genome: hg38
#> # $sourcetype: RData
#> # $sourceurl: https://drive.google.com/drive/folders/19ZXr7IETfks0OdYlmuc1Hq...
#> # $sourcesize: NA
#> # $tags: c("FunctionalAnnotation", "GenomicSequence", "hg38")
#> # retrieve record with 'object[["AH95566"]]'
And retrieve the object.
CTCF_hg38 <- query_data[["AH95566"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
CTCF_hg38
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> GRanges object with 56049 ranges and 5 metadata columns:
#> seqnames ranges strand | motif score p.value
#> <Rle> <IRanges> <Rle> | <character> <numeric> <numeric>
#> [1] chr1 11223-11241 - | MA0139.1 24.4754 1.34e-09
#> [2] chr1 11281-11299 - | MA0139.1 22.7377 1.01e-08
#> [3] chr1 24782-24800 - | MA0139.1 17.3770 7.11e-07
#> [4] chr1 91420-91438 + | MA0139.1 16.2951 1.41e-06
#> [5] chr1 104985-105003 - | MA0139.1 16.7869 1.04e-06
#> ... ... ... ... . ... ... ...
#> [56045] chrY 57044316-57044334 - | MA0139.1 16.4590 1.27e-06
#> [56046] chrY 57189659-57189677 + | MA0139.1 15.7541 1.95e-06
#> [56047] chrY 57203409-57203427 - | MA0139.1 15.6393 2.09e-06
#> [56048] chrY 57215279-57215297 + | MA0139.1 19.5738 1.53e-07
#> [56049] chrY 57215337-57215355 + | MA0139.1 24.4754 1.34e-09
#> q.value sequence
#> <numeric> <character>
#> [1] 0.0216 TCGCCAGCAGGGGGCGCCC
#> [2] 0.0398 GCGCCAGCAGGGGGCGCTG
#> [3] 0.2350 CGTCCAGCAGATGGCGGAT
#> [4] 0.3080 GTGGCACCAGGTGGCAGCA
#> [5] 0.2750 CCAACAGCAGGTGGCAGCC
#> ... ... ...
#> [56045] 0.2990 TGGTCACCTGGGGGCACTA
#> [56046] 0.3430 TGTCCTCTAGGGGTCAGCC
#> [56047] 0.3510 CTGCCGCAAGGGGGCGCAT
#> [56048] 0.1190 gcgccacgagggggcggtg
#> [56049] 0.0216 tcgccagcagggggcgccc
#> -------
#> seqinfo: 24 sequences from hg38 genome
Note that the default q-value cutoff is 0.5. After inspecting the q-value distribution, one may decide to use a more stringent cutoff. For example, keeping only sites with a q-value below 0.3 filters out more than half of the predicted sites. The remaining sites may be considered high-confidence CTCF sites.
# Check length before filtering
length(CTCF_hg38)
#> [1] 56049
# Filter and check length after filtering
CTCF_hg38 <- CTCF_hg38[CTCF_hg38$q.value < 0.3]
length(CTCF_hg38)
#> [1] 25474
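Before settling on a cutoff, it can help to see how many sites survive at several thresholds. The following is a minimal base R sketch, not part of the package: the helper count_kept is hypothetical, and the toy input is just the ten q-values visible in the printout above, so the resulting fractions say nothing about the full data set.

```r
# Hypothetical helper: number and fraction of sites kept at each cutoff.
# A site is kept when its q-value is strictly below the cutoff.
count_kept <- function(q, cutoffs) {
  kept <- vapply(cutoffs, function(cutoff) sum(q < cutoff), integer(1))
  data.frame(cutoff = cutoffs, kept = kept, fraction = kept / length(q))
}

# The ten q-values shown in the GRanges printout, used only as toy input.
q_shown <- c(0.0216, 0.0398, 0.2350, 0.3080, 0.2750,
             0.2990, 0.3430, 0.3510, 0.1190, 0.0216)

res <- count_kept(q_shown, cutoffs = c(0.05, 0.1, 0.3, 0.5))
res
```

On the real data, the same call would take the q.value column of the unfiltered object, e.g. count_kept(CTCF_hg38$q.value, c(0.05, 0.1, 0.3, 0.5)), run before the filtering step overwrites CTCF_hg38.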
# hg19 CTCF coordinates
CTCF_hg19 <- query_data[["AH95565"]]
# mm9 CTCF coordinates
CTCF_mm9 <- query_data[["AH95567"]]
# mm10 CTCF coordinates
CTCF_mm10 <- query_data[["AH95568"]]
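When the assembly is chosen programmatically, a small lookup keeps the record IDs in one place. This is a minimal sketch in base R; the helper ctcf_record_id is hypothetical, and the IDs are the ones listed in the query output above, which may change in later AnnotationHub snapshots.

```r
# Record IDs as listed in the query output above (snapshot 2021-06-15).
ctcf_ids <- c(hg19 = "AH95565", hg38 = "AH95566",
              mm9 = "AH95567", mm10 = "AH95568")

# Hypothetical helper: return the record ID for a genome assembly and
# stop with an informative message for assemblies not in the table.
ctcf_record_id <- function(genome) {
  if (length(genome) != 1 || !genome %in% names(ctcf_ids)) {
    stop("Unsupported genome; choose one of: ",
         paste(names(ctcf_ids), collapse = ", "))
  }
  unname(ctcf_ids[genome])
}

ctcf_record_id("mm10")
```

The returned ID can then be used in place of the literal string, e.g. query_data[[ctcf_record_id("mm10")]].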
See ../inst/scripts/make-data.R for how the CTCF GRanges objects were created.
Below is the citation output from running citation('CTCF') in R. Please
run this yourself to check for any updates on how to cite CTCF.
print(citation("CTCF"), bibtex = TRUE)
#>
#> Dozmorov MG, Davis E, Mu W, Lee S, Triche T, Phanstiel D, Love M
#> (2021). _CTCF_. https://github.com/mdozmorov/CTCF/CTCF - R package
#> version 0.99.4, <URL: https://github.com/mdozmorov/CTCF>.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Manual{,
#> title = {CTCF},
#> author = {Mikhail G. Dozmorov and Eric Davis and Wancen Mu and Stuart Lee and Tim Triche and Douglas Phanstiel and Michael Love},
#> year = {2021},
#> url = {https://github.com/mdozmorov/CTCF},
#> note = {https://github.com/mdozmorov/CTCF/CTCF - R package version 0.99.4},
#> }
Date the vignette was generated.
#> [1] "2021-06-17 09:09:59 EDT"
R session information.
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.1.0 Patched (2021-05-24 r80367)
#> os macOS High Sierra 10.13.6
#> system x86_64, darwin17.7.0
#> ui unknown
#> language (EN)
#> collate C
#> ctype en_US.UTF-8
#> tz America/New_York
#> date 2021-06-17
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> package * version date lib source
#> AnnotationDbi 1.55.1 2021-06-07 [2] Bioconductor
#> AnnotationHub * 3.1.0 2021-05-20 [2] Bioconductor
#> assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.1.0)
#> Biobase 2.53.0 2021-05-19 [2] Bioconductor
#> BiocFileCache * 2.1.0 2021-05-19 [2] Bioconductor
#> BiocGenerics * 0.39.1 2021-06-08 [2] Bioconductor
#> BiocManager 1.30.16 2021-06-15 [3] CRAN (R 4.1.0)
#> BiocStyle * 2.21.2 2021-06-07 [2] Bioconductor
#> BiocVersion 3.14.0 2021-05-19 [2] Bioconductor
#> Biostrings 2.61.1 2021-06-04 [2] Bioconductor
#> bit 4.0.4 2020-08-04 [2] CRAN (R 4.1.0)
#> bit64 4.0.5 2020-08-30 [2] CRAN (R 4.1.0)
#> bitops 1.0-7 2021-04-24 [2] CRAN (R 4.1.0)
#> blob 1.2.1 2020-01-20 [2] CRAN (R 4.1.0)
#> bookdown 0.22 2021-04-22 [2] CRAN (R 4.1.0)
#> bslib 0.2.5.1 2021-05-18 [2] CRAN (R 4.1.0)
#> cachem 1.0.5 2021-05-15 [2] CRAN (R 4.1.0)
#> cli 2.5.0 2021-04-26 [2] CRAN (R 4.1.0)
#> crayon 1.4.1 2021-02-08 [2] CRAN (R 4.1.0)
#> CTCF 0.99.4 2021-06-17 [1] Bioconductor
#> curl 4.3.1 2021-04-30 [2] CRAN (R 4.1.0)
#> DBI 1.1.1 2021-01-15 [2] CRAN (R 4.1.0)
#> dbplyr * 2.1.1 2021-04-06 [2] CRAN (R 4.1.0)
#> digest 0.6.27 2020-10-24 [2] CRAN (R 4.1.0)
#> dplyr 1.0.6 2021-05-05 [2] CRAN (R 4.1.0)
#> ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.1.0)
#> evaluate 0.14 2019-05-28 [2] CRAN (R 4.1.0)
#> fansi 0.5.0 2021-05-25 [2] CRAN (R 4.1.0)
#> fastmap 1.1.0 2021-01-25 [2] CRAN (R 4.1.0)
#> filelock 1.0.2 2018-10-05 [2] CRAN (R 4.1.0)
#> generics 0.1.0 2020-10-31 [2] CRAN (R 4.1.0)
#> GenomeInfoDb * 1.29.0 2021-05-19 [2] Bioconductor
#> GenomeInfoDbData 1.2.6 2021-05-24 [2] Bioconductor
#> GenomicRanges * 1.45.0 2021-05-19 [2] Bioconductor
#> glue 1.4.2 2020-08-27 [2] CRAN (R 4.1.0)
#> highr 0.9 2021-04-16 [2] CRAN (R 4.1.0)
#> htmltools 0.5.1.1 2021-01-22 [2] CRAN (R 4.1.0)
#> httpuv 1.6.1 2021-05-07 [2] CRAN (R 4.1.0)
#> httr 1.4.2 2020-07-20 [2] CRAN (R 4.1.0)
#> interactiveDisplayBase 1.31.0 2021-05-19 [2] Bioconductor
#> IRanges * 2.27.0 2021-05-19 [2] Bioconductor
#> jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.1.0)
#> jsonlite 1.7.2 2020-12-09 [2] CRAN (R 4.1.0)
#> KEGGREST 1.33.0 2021-05-19 [2] Bioconductor
#> knitr 1.33 2021-04-24 [2] CRAN (R 4.1.0)
#> later 1.2.0 2021-04-23 [2] CRAN (R 4.1.0)
#> lifecycle 1.0.0 2021-02-15 [2] CRAN (R 4.1.0)
#> magrittr 2.0.1 2020-11-17 [2] CRAN (R 4.1.0)
#> memoise 2.0.0 2021-01-26 [2] CRAN (R 4.1.0)
#> mime 0.10 2021-02-13 [2] CRAN (R 4.1.0)
#> pillar 1.6.1 2021-05-16 [2] CRAN (R 4.1.0)
#> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.1.0)
#> png 0.1-7 2013-12-03 [2] CRAN (R 4.1.0)
#> promises 1.2.0.1 2021-02-11 [2] CRAN (R 4.1.0)
#> purrr 0.3.4 2020-04-17 [2] CRAN (R 4.1.0)
#> R6 2.5.0 2020-10-28 [2] CRAN (R 4.1.0)
#> rappdirs 0.3.3 2021-01-31 [2] CRAN (R 4.1.0)
#> Rcpp 1.0.6 2021-01-15 [2] CRAN (R 4.1.0)
#> RCurl 1.98-1.3 2021-03-16 [2] CRAN (R 4.1.0)
#> rlang 0.4.11 2021-04-30 [2] CRAN (R 4.1.0)
#> rmarkdown 2.9 2021-06-15 [2] CRAN (R 4.1.0)
#> RSQLite 2.2.7 2021-04-22 [2] CRAN (R 4.1.0)
#> S4Vectors * 0.31.0 2021-05-19 [2] Bioconductor
#> sass 0.4.0 2021-05-12 [2] CRAN (R 4.1.0)
#> sessioninfo * 1.1.1 2018-11-05 [2] CRAN (R 4.1.0)
#> shiny 1.6.0 2021-01-25 [2] CRAN (R 4.1.0)
#> stringi 1.6.2 2021-05-17 [2] CRAN (R 4.1.0)
#> stringr 1.4.0 2019-02-10 [2] CRAN (R 4.1.0)
#> tibble 3.1.2 2021-05-16 [2] CRAN (R 4.1.0)
#> tidyselect 1.1.1 2021-04-30 [2] CRAN (R 4.1.0)
#> utf8 1.2.1 2021-03-12 [2] CRAN (R 4.1.0)
#> vctrs 0.3.8 2021-04-29 [2] CRAN (R 4.1.0)
#> withr 2.4.2 2021-04-18 [2] CRAN (R 4.1.0)
#> xfun 0.24 2021-06-15 [2] CRAN (R 4.1.0)
#> xtable 1.8-4 2019-04-21 [2] CRAN (R 4.1.0)
#> XVector 0.33.0 2021-05-19 [2] Bioconductor
#> yaml 2.2.1 2020-02-01 [2] CRAN (R 4.1.0)
#> zlibbioc 1.39.0 2021-05-19 [2] Bioconductor
#>
#> [1] /private/var/folders/sk/hy088prx12l_cqspv3lbpl9s6_3r11/T/RtmpUmbPA2/Rinst16f5420f1d20
#> [2] /Users/ka36530_ca/R-stuff/bin/R-4-1/4.1-Bioc-3.14/library
#> [3] /Users/ka36530_ca/R-stuff/bin/R-4-1/library