rpx 2.14.0
The goal of the rpx package is to provide programmatic access to proteomics data from R, in particular to the ProteomeXchange (Vizcaino J.A. et al, 2014) central repository (see http://www.proteomexchange.org/ and http://central.proteomexchange.org/). Additional repositories are likely to be added in the future.
The central object that handles data access is the PXDataset
(version 2) class. An instance can be generated by passing a valid
PX experiment identifier to the PXDataset() constructor.
library("rpx")
id <- "PXD000001"
px <- PXDataset(id)
## Loading PXD000001 from cache.
px
## Project PXD000001 with 11 files
##
## Resource ID BFC19 in cache in /home/biocbuild/.cache/R/rpx.
## [1] 'F063721.dat' ... [11] 'erwinia_carotovora.fasta'
## Use 'pxfiles(.)' to see all files.
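Before constructing a project, it can help to check that an identifier is well formed. The sketch below uses a hypothetical helper, is_px_id(), which is not part of rpx. It assumes that identifiers consist of PXD followed by six digits, as in PXD000001; other valid forms may exist.

```r
## Hypothetical helper, not part of rpx.
## Assumption: a PX identifier is "PXD" followed by exactly six digits.
is_px_id <- function(x) {
    is.character(x) & !is.na(x) & grepl("^PXD[0-9]{6}$", x)
}

ids <- c("PXD000001", "pxd000001", "PXD1", NA)
valid <- is_px_id(ids)
valid
```

Only the first element passes the check; the others fail on case, length, or being missing.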
Several attributes can be extracted from a PXDataset project, as
described below.
The experiment identifier that was originally used to create the
project can be extracted with the pxid() method:
pxid(px)
## [1] "PXD000001"
The file transfer URL from which the data files can be accessed can
be queried with the pxurl() method:
pxurl(px)
## [1] "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/generated"
The species from which the data has been generated can be obtained by
calling the pxtax() function:
pxtax(px)
## [1] "Erwinia carotovora"
Relevant bibliographic references can be queried with the
pxref()
method:
strwrap(pxref(px))
## [1] "Gatto L, Christoforou A; Using R and Bioconductor for proteomics data"
## [2] "analysis., Biochim Biophys Acta, 2013 May 18,"
## [3] "doi:10.1016/j.bbapap.2013.04.032 PMID:23692960"
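The accessors shown so far can be gathered into a single named list, which is convenient when several projects are being compared. This is only a sketch: it uses the accessors described above, and the list element names are arbitrary choices made for illustration.

```r
library("rpx")

## Needs network access the first time; later calls are served from the cache
px <- PXDataset("PXD000001")

## Collect the accessor results into one named list
## (the element names are illustrative, not an rpx convention)
meta <- list(id        = pxid(px),
             url       = pxurl(px),
             taxonomy  = pxtax(px),
             reference = pxref(px))
str(meta)
```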
All files available for the PX experiment can be obtained with the
pxfiles() method:
pxfiles(px)
## Project PXD000001 files (11):
## [remote] F063721.dat
## [local] F063721.dat-mztab.txt
## [remote] PRIDE_Exp_Complete_Ac_22134.xml.gz
## [remote] PRIDE_Exp_mzData_Ac_22134.xml.gz
## [remote] PXD000001_mztab.txt
## [remote] README.txt
## [local] TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## [remote] TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML
## [local] TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
## [remote] TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw
## ...
The complete or partial data set can be downloaded with the pxget()
function. The function takes a project instance as its first,
mandatory argument.
The next argument, list, specifies which files to download. If it is
missing, a menu is printed and the user can select a file. If set to
"all", all files of the experiment are downloaded. One or multiple
file names, their indices or logicals can also be used to download
specific files.
f <- pxget(px, "F063721.dat-mztab.txt")
## Loading F063721.dat-mztab.txt from cache.
f
## [1] "/home/biocbuild/.cache/R/rpx/22df89b886fbe_F063721.dat-mztab.txt"
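The different forms accepted by the list argument can be illustrated on a plain character vector. The sketch below assumes that indices and logicals are interpreted against the order of the files returned by pxfiles(px); the vector files is a stand-in containing only the first three names of the listing above.

```r
## First three file names from the pxfiles(px) listing above
files <- c("F063721.dat",
           "F063721.dat-mztab.txt",
           "PRIDE_Exp_Complete_Ac_22134.xml.gz")

## Three ways of designating the same file
by_name    <- "F063721.dat-mztab.txt"
by_index   <- 2L
by_logical <- grepl("mztab", files)

## Each selector picks the same element of the file vector
picked <- list(name    = files[files %in% by_name],
               index   = files[by_index],
               logical = files[by_logical])
picked

## With a real project the selectors would be passed as the second argument
## (not run here, requires rpx and network access):
## pxget(px, by_name)
## pxget(px, by_index)
## pxget(px, "all")
```

File names are the most robust choice, since they do not depend on the order in which the files are listed.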
The rpx package makes use of the BiocFileCache package to avoid
repeatedly downloading data. When PXDataset projects are created and
project files are downloaded, they are stored in the package's
central cache or in a user-defined cache. The next time the project
is instantiated with PXDataset() or a project file is downloaded with
pxget(), existing artefacts will be retrieved from the cache instead
of being created or downloaded from the remote server again. See
?rpxCache for details about caching.
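The content of the cache can be inspected directly. The sketch below assumes that rpxCache() returns the BiocFileCache object backing the package cache and that the usual BiocFileCache accessors can be applied to it; ?rpxCache remains the authoritative description.

```r
library("rpx")

## Assumption: rpxCache() returns the BiocFileCache object used by rpx
cache <- rpxCache()
cache

## Assumption: standard BiocFileCache accessors apply to this object
BiocFileCache::bfccache(cache)   # directory holding the cached files
BiocFileCache::bfcinfo(cache)    # one row per cached resource
```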
Below, we download the fasta file from the PXD000001 dataset and load it with the Biostrings package.
fas <- grep("fasta", pxfiles(px), value = TRUE)
## Project PXD000001 files (11):
## [remote] F063721.dat
## [local] F063721.dat-mztab.txt
## [remote] PRIDE_Exp_Complete_Ac_22134.xml.gz
## [remote] PRIDE_Exp_mzData_Ac_22134.xml.gz
## [remote] PXD000001_mztab.txt
## [remote] README.txt
## [local] TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
## [remote] TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML
## [local] TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
## [remote] TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw
## ...
fas
## [1] "erwinia_carotovora.fasta"
f <- pxget(px, fas) ## file available in the rpx cache
## Loading erwinia_carotovora.fasta from cache.
f
## [1] "/home/biocbuild/.cache/R/rpx/22df8913d7b66a_erwinia_carotovora.fasta"
library("Biostrings")
readAAStringSet(f)
## AAStringSet object of length 4499:
## width seq names
## [1] 147 MADITLISGSTLGSAEYVAEHL...QHQIPEDPAEEWLGSWVNLLK ECA0001 putative ...
## [2] 153 VAEIYQIDNLDRGILSALMENA...EIQSTETLISLQNPIMRTIAP ECA0002 AsnC-fami...
## [3] 330 MKKQYIEKQQQISFVKSFFSSQ...IGQVQCGVWPQPLRESVSGLL ECA0003 putative ...
## [4] 492 MITLESLEMLLSIDENELLDDL...WRFDTGLKSRLMRRWQHGKAY ECA0004 conserved...
## [5] 499 MRQTAALAERISRLSHALEHGL...AKIEASLQQVAEQIQQSEQQD ECA0005 conserved...
## ... ... ...
## [4495] 634 MSDKIIHLTDDSFDTDVLKADG...RRKVDPLRVFASDMARRLELL trx-rv3790 trx-rv...
## [4496] 93 MTKMNNKARRTARELKHLGASI...RELRDEFPMGYLGDYKDDDDK TimBlower TimBlower
## [4497] 309 MFSNLSKRWAQRTLSKSFYSTA...KFKWAGIKTRKFVFNPPKPRK sp|P07143|CY1_YEA...
## [4498] 231 FPTDDDDKIVGGYTCAANSIPY...PGVYTKVCNYVNWIQQTIAAN sp|P00761|TRYP_PI...
## [4499] 269 GVSGSCNIDVVCPEGNGHRDVI...DAAGTGAQFIDGLDSTGTPPV sp|Q7M135|LYSC_LY...
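Once loaded, the sequences can be summarised or subset by name. The sketch below uses a small toy FASTA file with made-up entries rather than the downloaded one, so that it runs without network access. Treating names that start with ECA as the Erwinia entries is an assumption based on the names shown in the output above.

```r
library("Biostrings")

## Toy FASTA file with made-up entries, standing in for the downloaded file
toy <- tempfile(fileext = ".fasta")
writeLines(c(">ECA_toy1 made-up entry", "MADITLISGS",
             ">ECA_toy2 made-up entry", "VAEIYQ",
             ">sp|TOY|CONTAMINANT made-up entry", "FPTDDDDKIVGG"), toy)

aa <- readAAStringSet(toy)
width(aa)                            # sequence lengths

## Assumption: names starting with "ECA" mark the organism's own proteins
is_eca <- grepl("^ECA", names(aa))
aa[is_eca]                           # subset by name prefix
table(is_eca)
```

The same calls could be applied to the object returned by readAAStringSet(f) above.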
For questions or bug reports, either post on the Bioconductor support forum or open a GitHub issue.
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] rpx_2.14.0 Biostrings_2.74.0 GenomeInfoDb_1.42.0
## [4] XVector_0.46.0 IRanges_2.40.0 S4Vectors_0.44.0
## [7] BiocGenerics_0.52.0 BiocStyle_2.34.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3
## [4] bitops_1.0-9 xml2_1.3.6 RSQLite_2.3.7
## [7] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.1
## [10] bookdown_0.41 fastmap_1.2.0 blob_1.2.4
## [13] jsonlite_1.8.9 DBI_1.2.3 BiocManager_1.30.25
## [16] httr_1.4.7 purrr_1.0.2 fansi_1.0.6
## [19] UCSC.utils_1.2.0 jquerylib_0.1.4 cli_3.6.3
## [22] rlang_1.1.4 crayon_1.5.3 dbplyr_2.5.0
## [25] bit64_4.5.2 withr_3.0.2 cachem_1.1.0
## [28] yaml_2.3.10 tools_4.4.1 memoise_2.0.1
## [31] dplyr_1.1.4 GenomeInfoDbData_1.2.13 filelock_1.0.3
## [34] curl_5.2.3 vctrs_0.6.5 R6_2.5.1
## [37] BiocFileCache_2.14.0 lifecycle_1.0.4 zlibbioc_1.52.0
## [40] bit_4.5.0 pkgconfig_2.0.3 bslib_0.8.0
## [43] pillar_1.9.0 glue_1.8.0 xfun_0.48
## [46] tibble_3.2.1 tidyselect_1.2.1 knitr_1.48
## [49] htmltools_0.5.8.1 rmarkdown_2.28 compiler_4.4.1
## [52] RCurl_1.98-1.16