The BREW3R.r package has been written to be part of the BREW3R workflow. Today, the package contains a single function which enable to extend three prime of gene annotations using another gene annotation as template. This is very helpful when you are using a technique that only sequence three-prime end of genes like 10X scRNA-seq or BRB-seq.
To install from Bioconductor use:
if (!require("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("BREW3R.r")
To install from github use:
if (!require("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("lldelisle/BREW3R.r")
library(rtracklayer)
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
## lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff, table,
## tapply, union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
library(GenomicRanges)
In this example, I will extend the transcripts from gencode using RefSeq on mm10. In order to decrease the size of the input files, the input files of this vignette have been subsetted to the chromosome 19. Original gtf for gencode is available here and gtf for RefSeq is available here.
input_gtf_file_to_extend <-
system.file(
"extdata/chr19.gencode.vM25.annotation.gtf.gz",
package = "BREW3R.r",
mustWork = TRUE
)
input_gtf_file_template <-
system.file(
"extdata/chr19.mm10.ncbiRefSeq.gtf.gz",
package = "BREW3R.r",
mustWork = TRUE
)
We will use the rtracklayer package to import gtf:
input_gr_to_extend <- rtracklayer::import(input_gtf_file_to_extend)
input_gr_template <- rtracklayer::import(input_gtf_file_template)
The package only use exon information. It may be interesting to save the other annotations like ‘CDS’, ‘start_codon’, ‘end_codon’.
You should not save the ‘gene’ and ‘transcript’ annotations as they will be out of date. Same for three prime UTR.
input_gr_CDS <- subset(input_gr_to_extend, type == "CDS")
Now we can run the main function of the package:
library(BREW3R.r)
new_gr_exons <- extend_granges(
input_gr_to_extend = input_gr_to_extend,
input_gr_to_overlap = input_gr_template
)
## Found 4343 last exons to potentially extend.
## Compute overlap between 4343 exons and 38576 exons.
## 2331 exons may be extended.
## 2263 exons have been extended while preventing collision with other genes.
## Found 32563 exons that may be included into 723 transcripts.
## Stay 68 candidate exons that may be included into 26 transcripts.
## Finally 29 combined exons will be included into 26 transcripts.
By default, you get few statistics. You can change the verbosity with
options(rlib_message_verbosity = "quiet")
to mute it or on the contrary
you can set options(BREW3R.r.verbose = "progression")
to get messages with
all steps.
Among them, you can read that you extended about half of last exons,
then you could add 29 exons to 26 transcripts.
Here is an example for the Btrc gene that have been extended:
Here is an example for the Mrpl21 gene that have a new exon on the 3’ end of one of its transcript:
We can put back annotations that have been stored:
new_gr <- c(new_gr_exons, input_gr_CDS)
rtracklayer::export.gff(sort(new_gr), "my_new.gtf")
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] BREW3R.r_1.2.0 rtracklayer_1.66.0 GenomicRanges_1.58.0
## [4] GenomeInfoDb_1.42.0 IRanges_2.40.0 S4Vectors_0.44.0
## [7] BiocGenerics_0.52.0 BiocStyle_2.34.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 SparseArray_1.6.0
## [3] bitops_1.0-9 lattice_0.22-6
## [5] magrittr_2.0.3 digest_0.6.37
## [7] evaluate_1.0.1 grid_4.4.1
## [9] bookdown_0.41 fastmap_1.2.0
## [11] jsonlite_1.8.9 Matrix_1.7-1
## [13] restfulr_0.0.15 tinytex_0.53
## [15] BiocManager_1.30.25 httr_1.4.7
## [17] UCSC.utils_1.2.0 XML_3.99-0.17
## [19] Biostrings_2.74.0 codetools_0.2-20
## [21] jquerylib_0.1.4 abind_1.4-8
## [23] cli_3.6.3 rlang_1.1.4
## [25] crayon_1.5.3 XVector_0.46.0
## [27] Biobase_2.66.0 cachem_1.1.0
## [29] DelayedArray_0.32.0 yaml_2.3.10
## [31] S4Arrays_1.6.0 tools_4.4.1
## [33] parallel_4.4.1 BiocParallel_1.40.0
## [35] GenomeInfoDbData_1.2.13 Rsamtools_2.22.0
## [37] SummarizedExperiment_1.36.0 curl_5.2.3
## [39] R6_2.5.1 magick_2.8.5
## [41] BiocIO_1.16.0 matrixStats_1.4.1
## [43] lifecycle_1.0.4 zlibbioc_1.52.0
## [45] bslib_0.8.0 Rcpp_1.0.13
## [47] highr_0.11 xfun_0.48
## [49] GenomicAlignments_1.42.0 MatrixGenerics_1.18.0
## [51] knitr_1.48 rjson_0.2.23
## [53] htmltools_0.5.8.1 rmarkdown_2.28
## [55] compiler_4.4.1 RCurl_1.98-1.16