How to use the NCBI Gene, CCDS, Pubchem Comp and Pubchem Subst connectors and their methods.
biodbNcbi 1.10.0
biodbNcbi is a biodb extension package that implements a connector to the NCBI databases (Sayers et al. 2022) Gene, CCDS (Pruitt et al. 2009; Harte et al. 2012; Farrell et al. 2013), Pubchem Comp and Pubchem Subst (Kim et al. 2015).
Install using Bioconductor:
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install('biodbNcbi')
The first step in using biodbNcbi is to create an instance of the biodb class BiodbMain from the main biodb package. This is done by calling the constructor of the class:
mybiodb <- biodb::newInst()
During this step the configuration is set up, the cache system is initialized and extension packages are loaded.
We will see at the end of this vignette that the biodb instance needs to be terminated with a call to the terminate() method.
In biodb the connection to a database is handled by a connector instance that you can get from the factory. biodbNcbi implements a connector to a remote database. Here is the code to instantiate a connector:
gene <- mybiodb$getFactory()$createConn('ncbi.gene')
## Loading required package: biodbNcbi
Creating other connectors follows the same process:
ccds <- mybiodb$getFactory()$createConn('ncbi.ccds')
pubchem.comp <- mybiodb$getFactory()$createConn('ncbi.pubchem.comp')
pubchem.subst <- mybiodb$getFactory()$createConn('ncbi.pubchem.subst')
To get the number of entries stored inside the database, run:
gene$getNbEntries()
## [1] 81974787
To get some of the first entry IDs (accession numbers) from the database, run:
ids <- gene$getEntryIds(2)
ids
## [1] "14910" "7157"
To retrieve entries, use:
entries <- gene$getEntry(ids)
entries
## [[1]]
## Biodb NCBI Gene entry instance 14910.
##
## [[2]]
## Biodb NCBI Gene entry instance 7157.
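Once entries are loaded, individual field values can be read directly from each entry object. The sketch below assumes the getFieldValue() method of biodb entry objects and the gene.symbol field name that appears in the data frame output of this vignette. entryFieldValues() is a hypothetical helper written for illustration, not part of biodb or biodbNcbi.

```r
# Hypothetical helper (not part of biodb): read one field from several entries.
# Assumes conn$getEntry(ids) returns a list with one element per ID, in the
# same order, and that an entry that could not be retrieved is NULL.
entryFieldValues <- function(conn, ids, field) {
    entries <- conn$getEntry(ids)
    values <- lapply(entries, function(e) {
        if (is.null(e)) NA_character_ else e$getFieldValue(field)
    })
    stats::setNames(values, ids)
}

# Usage with the connector and IDs created above (requires network access):
# entryFieldValues(gene, ids, 'gene.symbol')
```

The result is a list named by accession, which keeps multi-valued fields intact instead of flattening them into a single string.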
To convert a list of entries into a dataframe, run:
x <- mybiodb$entriesToDataframe(entries)
x
## accession description aa.seq.location gene.symbol
## 1 14910 gene trap ROSA 26, Philippe Soriano 6 52.73 cM Gt(ROSA)26Sor
## 2 7157 tumor protein p53 17p13.1 TP53
## name ncbi.gene.id
## 1 R26;ROSA26;Gtrgeo26;Gtrosa26;Thumpd3as1 14910
## 2 P53;BCC7;LFS1;BMFS5;TRP53 7157
## uniprot.id
## 1 <NA>
## 2 P04637;Q15086;Q15087;Q15088;Q16535;Q16807;Q16808;Q16809;Q16810;Q16811;Q16848;Q2XN98;Q3LRW1;Q3LRW2;Q3LRW3;Q3LRW4;Q3LRW5;Q86UG1;Q8J016;Q99659;Q9BTM4;Q9HAQ8;Q9NP68;Q9NPJ2;Q9NZD0;Q9UBI2;Q9UQ61
## ncbi.ccds.id
## 1 <NA>
## 2 CCDS11118.1
## aa.seq
## 1 <NA>
## 2 ATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAATGCCAGAGGCTGCTCCCCCCGTGGCCCCTGCACCAGCAGCTCCTACACCGGCGGCCCCTGCACCAGCCCCCTCCTGGCCCCTGTCATCTTCTGTCCCTTCCCAGAAAACCTACCAGGGCAGCTACGGTTTCCGTCTGGGCTTCTTGCATTCTGGGACAGCCAAGTCTGTGACTTGCACGTACTCCCCTGCCCTCAACAAGATGTTTTGCCAACTGGCCAAGACCTGCCCTGTGCAGCTGTGGGTTGATTCCACACCCCCGCCCGGCACCCGCGTCCGCGCCATGGCCATCTACAAGCAGTCACAGCACATGACGGAGGTTGTGAGGCGCTGCCCCCACCATGAGCGCTGCTCAGATAGCGATGGTCTGGCCCCTCCTCAGCATCTTATCCGAGTGGAAGGAAATTTGCGTGTGGAGTATTTGGATGACAGAAACACTTTTCGACATAGTGTGGTGGTGCCCTATGAGCCGCCTGAGGTTGGCTCTGACTGTACCACCATCCACTACAACTACATGTGTAACAGTTCCTGCATGGGCGGCATGAACCGGAGGCCCATCCTCACCATCATCACACTGGAAGACTCCAGTGGTAATCTACTGGGACGGAACAGCTTTGAGGTGCGTGTTTGTGCCTGTCCTGGGAGAGACCGGCGCACAGAGGAAGAGAATCTCCGCAAGAAAGGGGAGCCTCACCACGAGCTGCCCCCAGGGAGCACTAAGCGAGCACTGCCCAACAACACCAGCTCCTCTCCCCAGCCAAAGAAGAAACCACTGGATGGAGAATATTTCACCCTTCAGATCCGTGGGCGTGAGCGCTTCGAGATGTTCCGAGAGCTGAATGAGGCCTTGGAACTCAAGGATGCCCAGGCTGGGAAGGAGCCAGGGGGGAGCAGGGCTCACTCCAGCCACCTGAAGTCCAAAAAGGGTCAGTCTACCTCCCGCCATAAAAAACTCATGTTCAAGACAGAAGGGCCTGACTCAGACTGA
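In the output above, fields with several values (name, uniprot.id) appear as a single string with values separated by semicolons, and missing values appear as NA. A minimal base R sketch for working with such columns follows. It uses a small stand-in data frame whose values are copied from the output above, and it assumes that ";" is the separator, as the output suggests.

```r
# Stand-in for part of the data frame printed above (values copied from it).
x <- data.frame(
    accession = c("14910", "7157"),
    name = c("R26;ROSA26;Gtrgeo26;Gtrosa26;Thumpd3as1",
             "P53;BCC7;LFS1;BMFS5;TRP53"),
    ncbi.ccds.id = c(NA, "CCDS11118.1"),
    stringsAsFactors = FALSE
)

# Split the semicolon-separated names into one character vector per gene.
names_by_gene <- stats::setNames(strsplit(x$name, ";", fixed = TRUE),
                                 x$accession)
names_by_gene[["7157"]]

# Keep only the accessions of genes that have a CCDS identifier.
with_ccds <- x$accession[!is.na(x$ncbi.ccds.id)]
with_ccds
```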
The efetch web service is accessible through the wsEfetch() method, available on Entrez connectors: ncbi.gene, ncbi.pubchem.comp and ncbi.pubchem.subst.
Get a Gene entry as an XML object and print the Entrezgene_prot node:
entryxml <- gene$wsEfetch('2833', retmode='xml', retfmt='parsed')
XML::getNodeSet(entryxml, "//Entrezgene_prot")
## [[1]]
## <Entrezgene_prot>
## <Prot-ref>
## <Prot-ref_name>
## <Prot-ref_name_E>G protein-coupled receptor 9</Prot-ref_name_E>
## <Prot-ref_name_E>IP-10 receptor</Prot-ref_name_E>
## <Prot-ref_name_E>Mig receptor</Prot-ref_name_E>
## <Prot-ref_name_E>chemokine (C-X-C motif) receptor 3</Prot-ref_name_E>
## <Prot-ref_name_E>chemokine receptor 3</Prot-ref_name_E>
## <Prot-ref_name_E>interferon-inducible protein 10 receptor</Prot-ref_name_E>
## </Prot-ref_name>
## <Prot-ref_desc>C-X-C chemokine receptor type 3</Prot-ref_desc>
## </Prot-ref>
## </Entrezgene_prot>
##
## attr(,"class")
## [1] "XMLNodeSet"
The object returned is an XML::XMLInternalDocument.
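To turn such a document into plain R values, an XPath query can be combined with XML::xmlValue. The sketch below parses a shortened copy of the Entrezgene_prot node printed above instead of calling the web service, so it runs offline. With a live result, the entryxml object would take the place of doc.

```r
# Shortened copy of the node printed above, used as offline stand-in data.
xml_text <- paste0(
    '<Entrezgene_prot><Prot-ref><Prot-ref_name>',
    '<Prot-ref_name_E>IP-10 receptor</Prot-ref_name_E>',
    '<Prot-ref_name_E>Mig receptor</Prot-ref_name_E>',
    '</Prot-ref_name>',
    '<Prot-ref_desc>C-X-C chemokine receptor type 3</Prot-ref_desc>',
    '</Prot-ref></Entrezgene_prot>'
)
doc <- XML::xmlParse(xml_text, asText = TRUE)

# One character value per matching node.
prot_names <- XML::xpathSApply(doc, "//Prot-ref_name_E", XML::xmlValue)
prot_desc <- XML::xpathSApply(doc, "//Prot-ref_desc", XML::xmlValue)
prot_names
prot_desc
```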
The esearch web service is accessible through the wsEsearch() method, available on Entrez connectors: ncbi.gene, ncbi.pubchem.comp and ncbi.pubchem.subst.
Search for Gene entries by name and get the IDs of the matching entries (equivalent to running gene$searchForEntries()):
gene$wsEsearch(term='"chemokine"[Gene Name]', retmax=10, retfmt='ids')
## [1] "395552" "417536" "128014773" "108261914" "128599176"
The same result can be obtained with a call to searchForEntries():
gene$searchForEntries(fields=list(name='chemokine'), max.results=10)
## [1] "395552" "417536" "128014773" "108261914" "128599176"
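When calling wsEsearch() directly, the term string has to follow the Entrez query syntax used above: a quoted value followed by a field tag in square brackets. The helper below is a hypothetical convenience for building such a term from a gene name. It only reproduces the pattern of the example above and does not claim to mirror how searchForEntries() builds its query internally.

```r
# Hypothetical helper (not part of biodbNcbi): build an Entrez search term
# of the form "value"[Gene Name], as used in the wsEsearch() call above.
geneNameTerm <- function(name) {
    stopifnot(is.character(name), length(name) == 1,
              !grepl('"', name, fixed = TRUE))  # no embedded double quotes
    sprintf('"%s"[Gene Name]', name)
}

term <- geneNameTerm("chemokine")
term

# Usage with the connector created above (requires network access):
# gene$wsEsearch(term=term, retmax=10, retfmt='ids')
```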
The einfo web service is accessible through the wsEinfo() method, available on Entrez connectors: ncbi.gene, ncbi.pubchem.comp and ncbi.pubchem.subst.
Get PubChem Comp database information as an XML object and print information on the first field:
infoxml <- pubchem.comp$wsEinfo(retfmt='parsed')
XML::getNodeSet(infoxml, "//Field[1]")
## [[1]]
## <Field>
## <Name>ALL</Name>
## <FullName>All Fields</FullName>
## <Description>All terms from all searchable fields</Description>
## <TermCount>1151299333</TermCount>
## <IsDate>N</IsDate>
## <IsNumerical>N</IsNumerical>
## <SingleToken>N</SingleToken>
## <Hierarchy>N</Hierarchy>
## <IsHidden>N</IsHidden>
## <IsTruncatable>Y</IsTruncatable>
## <IsRangable>N</IsRangable>
## </Field>
##
## attr(,"class")
## [1] "XMLNodeSet"
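All values in the einfo result are text, so counts and Y/N flags need converting before they can be used in computations. The sketch below does this for a shortened copy of the Field node printed above, which serves as offline stand-in data. With a live result, infoxml would take the place of doc.

```r
# Shortened copy of the first Field node printed above (stand-in data).
field_xml <- paste0(
    '<Field>',
    '<Name>ALL</Name>',
    '<FullName>All Fields</FullName>',
    '<TermCount>1151299333</TermCount>',
    '<IsDate>N</IsDate>',
    '</Field>'
)
doc <- XML::xmlParse(field_xml, asText = TRUE)

field_name <- XML::xpathSApply(doc, "//Field[1]/Name", XML::xmlValue)
# as.numeric() rather than as.integer(): term counts may exceed the 32-bit
# integer range.
term_count <- as.numeric(
    XML::xpathSApply(doc, "//Field[1]/TermCount", XML::xmlValue))
# Flags are encoded as "Y" or "N".
is_date <- XML::xpathSApply(doc, "//Field[1]/IsDate", XML::xmlValue) == "Y"
```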
When you are done with your biodb instance, you have to terminate it to ensure that resources (file handles, database connections, etc.) are released:
mybiodb$terminate()
## INFO [19:43:41.347] Closing BiodbMain instance...
## INFO [19:43:41.348] Connector "ncbi.gene" deleted.
## INFO [19:43:41.356] Connector "ncbi.ccds" deleted.
## INFO [19:43:41.357] Connector "ncbi.pubchem.comp" deleted.
## INFO [19:43:41.358] Connector "ncbi.pubchem.subst" deleted.
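In scripts, an error raised between newInst() and terminate() would skip the termination step. One way to guard against this, sketched here with base R's on.exit(), is to wrap the work in a function. withBiodb() is a hypothetical wrapper, not part of biodb, and its newInst argument exists only so that the wrapper can be tried with a stand-in object.

```r
# Hypothetical wrapper (not part of biodb): run `fun` with a fresh biodb
# instance and always terminate that instance afterwards.
withBiodb <- function(fun, newInst = function() biodb::newInst()) {
    mybiodb <- newInst()
    # on.exit() runs when the function exits, normally or because of an error.
    on.exit(mybiodb$terminate(), add = TRUE)
    fun(mybiodb)
}

# Usage (requires biodbNcbi and network access):
# n <- withBiodb(function(b) {
#     b$getFactory()$createConn('ncbi.gene')$getNbEntries()
# })
```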
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] biodbNcbi_1.10.0 BiocStyle_2.34.0
##
## loaded via a namespace (and not attached):
## [1] rappdirs_0.3.3 sass_0.4.9 utf8_1.2.4
## [4] generics_0.1.3 stringi_1.8.4 RSQLite_2.3.7
## [7] hms_1.1.3 digest_0.6.37 magrittr_2.0.3
## [10] evaluate_1.0.1 bookdown_0.41 fastmap_1.2.0
## [13] blob_1.2.4 plyr_1.8.9 jsonlite_1.8.9
## [16] progress_1.2.3 DBI_1.2.3 BiocManager_1.30.25
## [19] httr_1.4.7 fansi_1.0.6 XML_3.99-0.17
## [22] jquerylib_0.1.4 cli_3.6.3 rlang_1.1.4
## [25] chk_0.9.2 crayon_1.5.3 dbplyr_2.5.0
## [28] bit64_4.5.2 withr_3.0.2 cachem_1.1.0
## [31] yaml_2.3.10 tools_4.4.1 memoise_2.0.1
## [34] biodb_1.14.0 dplyr_1.1.4 filelock_1.0.3
## [37] curl_5.2.3 vctrs_0.6.5 R6_2.5.1
## [40] BiocFileCache_2.14.0 lifecycle_1.0.4 stringr_1.5.1
## [43] bit_4.5.0 pkgconfig_2.0.3 pillar_1.9.0
## [46] bslib_0.8.0 glue_1.8.0 Rcpp_1.0.13
## [49] lgr_0.4.4 xfun_0.48 tibble_3.2.1
## [52] tidyselect_1.2.1 knitr_1.48 htmltools_0.5.8.1
## [55] rmarkdown_2.28 compiler_4.4.1 prettyunits_1.2.0
## [58] askpass_1.2.1 openssl_2.2.2
Farrell, Catherine M., Nuala A. O’Leary, Rachel A. Harte, Jane E. Loveland, Laurens G. Wilming, Craig Wallin, Mark Diekhans, et al. 2013. “Current status and new features of the Consensus Coding Sequence database.” Nucleic Acids Research 42 (D1): D865–D872. https://doi.org/10.1093/nar/gkt1059.
Harte, Rachel A., Catherine M. Farrell, Jane E. Loveland, Marie-Marthe Suner, Laurens Wilming, Bronwen Aken, Daniel Barrell, et al. 2012. “Tracking and coordinating an international curation effort for the CCDS Project.” Database 2012 (February). https://doi.org/10.1093/database/bas008.
Kim, Sunghwan, Paul A. Thiessen, Evan E. Bolton, Jie Chen, Gang Fu, Asta Gindulyte, Lianyi Han, et al. 2015. “PubChem Substance and Compound databases.” Nucleic Acids Research 44 (D1): D1202–D1213. https://doi.org/10.1093/nar/gkv951.
Pruitt, Kim D., Jennifer Harrow, Rachel A. Harte, Craig Wallin, Mark Diekhans, Donna R. Maglott, Steve Searle, et al. 2009. “The Consensus Coding Sequence (CCDS) Project: Identifying a Common Protein-Coding Gene Set for the Human and Mouse Genomes.” Genome Research 19 (7): 1316–23. https://doi.org/10.1101/gr.080531.108.
Sayers, Eric W., Evan E. Bolton, J. Rodney Brister, Kathi Canese, Jessica Chan, Donald C. Comeau, Ryan Connor, et al. 2022. “Database Resources of the National Center for Biotechnology Information.” Nucleic Acids Research 50 (D1): D20–D26. https://doi.org/10.1093/nar/gkab1112.