This package was originally written when the kallisto | bustools concept was still experimental, to test a new and fast way to generate the gene count matrix from fastq files for scRNA-seq. In the past year, kallisto | bustools has matured. There is now a wrapper, kb-python, that can download a prebuilt kallisto index for human and mouse and call kallisto bus and bustools to get the gene count matrix. As a result, the old way of calling kallisto bus and bustools, and some functionalities of BUSpaRse, such as getting the transcript to gene mapping, are largely obsolete.
The focus of BUSpaRse has therefore shifted to finer control of the transcripts that go into the transcriptome, with more options. All tr2g_* functions (except tr2g_ensembl) can now filter transcripts by gene and transcript biotypes, keep only standard chromosomes (so no scaffolds or haplotypes), and extract the filtered transcripts from the transcriptome. GTF files from Ensembl, Ensembl fasta files, GFF3 files from Ensembl and RefSeq, TxDb, and EnsDb can all be used here.
library(BUSpaRse)
library(BSgenome.Hsapiens.UCSC.hg38)
#> Loading required package: GenomeInfoDb
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#>
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:Matrix':
#>
#> expand, unname
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: BSgenome
#> Loading required package: GenomicRanges
#> Loading required package: Biostrings
#> Loading required package: XVector
#>
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#>
#> strsplit
#> Loading required package: BiocIO
#> Loading required package: rtracklayer
#>
#> Attaching package: 'rtracklayer'
#> The following object is masked from 'package:BiocIO':
#>
#> FileForFormat
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
#> Loading required package: GenomicFeatures
#> Loading required package: AnnotationDbi
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
library(EnsDb.Hsapiens.v86)
#> Loading required package: ensembldb
#> Loading required package: AnnotationFilter
#>
#> Attaching package: 'ensembldb'
#> The following object is masked from 'package:stats':
#>
#> filter
The transcriptome can be downloaded from a specified version of Ensembl and filtered for biotypes and standard chromosomes, not only from the vertebrate database (www.ensembl.org and its mirrors), but also from the other Ensembl sites for plants, fungi, protists, and metazoa. Setting gene_biotype_use = "cellranger" means that the same gene biotypes Cell Ranger uses for its reference package are used here. By default, only standard chromosomes are kept. The dl_transcriptome function not only downloads the transcriptome and filters it, it also outputs the tr2g.tsv file for all transcripts in the filtered transcriptome, without column names, so that it can be used directly with bustools.
Wonder which biotypes are available? The lists of all gene and transcript biotypes from Ensembl are now provided in this package, and can be queried with data("ensembl_gene_biotypes") and data("ensembl_tx_biotypes").
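Conceptually, filtering by biotype and standard chromosomes amounts to subsetting a transcript annotation table. The base R sketch below illustrates the idea on a made-up table. The column names, the transcript IDs, and the short biotypes_keep vector are assumptions for illustration only; this is neither the full Cell Ranger biotype list nor the package's internal implementation.

```r
# Toy annotation table (all values are made up for illustration)
tx <- data.frame(
  transcript   = c("tx1", "tx2", "tx3", "tx4"),
  gene_biotype = c("protein_coding", "rRNA", "lncRNA", "protein_coding"),
  seqnames     = c("1", "1", "X", "KI270728.1"),
  stringsAsFactors = FALSE
)

# Hypothetical subset of biotypes to keep (NOT the full Cell Ranger list)
biotypes_keep <- c("protein_coding", "lncRNA")

# Standard human chromosomes in Ensembl naming; scaffolds are excluded
standard_chr <- c(as.character(1:22), "X", "Y", "MT")

# Keep transcripts that pass both the biotype and the chromosome filter
keep <- tx$gene_biotype %in% biotypes_keep & tx$seqnames %in% standard_chr
tx_filtered <- tx[keep, ]
tx_filtered
```

Here tx2 is dropped because of its biotype and tx4 because it sits on a scaffold.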
Resources for common invertebrate model organisms such as Drosophila melanogaster and C. elegans are actually available on the vertebrate site (www.ensembl.org).
# For Drosophila
dl_transcriptome("Drosophila melanogaster", out_path = "fly",
                 gene_biotype_use = "cellranger", verbose = FALSE)
#> Version is not applicable to IDs not of the form ENS[species prefix][feature type prefix][a unique eleven digit number].
list.files("fly")
#> [1] "Drosophila_melanogaster.BDGP6.46.cdna.all.fa.gz"
#> [2] "tr2g.tsv"
#> [3] "tx_filtered.fa"
The first file is the original fasta file. The second is the tr2g file without column names. The third is the filtered fasta file.
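Because the tr2g file has no column names, reading it back into R requires supplying them. The following is a minimal base R sketch that uses a small made-up file in a temporary location rather than the downloaded fly/tr2g.tsv; the IDs and the column names chosen here are assumptions for illustration.

```r
# Write a toy headerless tr2g file to a temporary location (IDs are made up)
tr2g_path <- tempfile(fileext = ".tsv")
writeLines(c("tx1\tgeneA", "tx2\tgeneA", "tx3\tgeneB"), tr2g_path)

# header = FALSE because the file has no column names; supply them ourselves
tr2g_in <- read.delim(tr2g_path, header = FALSE,
                      col.names = c("transcript", "gene"),
                      stringsAsFactors = FALSE)
tr2g_in
```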
The same can be done for C. elegans, here from an archived version of Ensembl. Note that archives older than version 98 might not work.
dl_transcriptome("Caenorhabditis elegans", out_path = "worm", verbose = FALSE,
                 gene_biotype_use = "cellranger", ensembl_version = 98)
#> Version is not applicable to IDs not of the form ENS[species prefix][feature type prefix][a unique eleven digit number].
list.files("worm")
#> [1] "Caenorhabditis_elegans.WBcel235.cdna.all.fa.gz"
#> [2] "tr2g.tsv"
#> [3] "tx_filtered.fa"
And for Saccharomyces cerevisiae. Note that the Ensembl sites for plants, fungi, and so on, which are actually part of www.ensemblgenomes.org, use a different versioning scheme from the vertebrate site.
dl_transcriptome("Saccharomyces cerevisiae", out_path = "yeast",
                 type = "fungus", gene_biotype_use = "cellranger",
                 verbose = FALSE)
#> Version is not applicable to IDs not of the form ENS[species prefix][feature type prefix][a unique eleven digit number].
list.files("yeast")
#> [1] "Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz"
#> [2] "tr2g.tsv"
#> [3] "tx_filtered.fa"
The transcript to gene data frame can be generated by directly querying Ensembl with biomart. This can query not only the vertebrate database (www.ensembl.org), but also the Ensembl databases for other organisms, such as plants (plants.ensembl.org) and fungi (fungi.ensembl.org). By default, the most recent version of Ensembl is used, but older versions can also be used. By default, the Ensembl transcript ID (with version number), gene ID (with version number), and gene symbol are downloaded, but other attributes available on Ensembl can be downloaded as well. Make sure that the Ensembl version matches that of the transcriptome used to build the kallisto index.
# Specify other attributes
tr2g_mm <- tr2g_ensembl("Mus musculus", ensembl_version = 99,
                        other_attrs = "description",
                        gene_biotype_use = "cellranger")
#> Querying biomart for transcript and gene IDs of Mus musculus
head(tr2g_mm)
# Plants
tr2g_at <- tr2g_ensembl("Arabidopsis thaliana", type = "plant")
#> Version is only available to vertebrates.
#> Querying biomart for transcript and gene IDs of Arabidopsis thaliana
#> File /home/biocbuild/bbs-3.20-bioc/tmpdir/RtmpcZCkAh/Rbuild3a5a3fd2baae9/BUSpaRse/vignettes/tr2g.tsv already exists.
head(tr2g_at)
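One way to sanity-check that the tr2g data frame and the transcriptome come from matching Ensembl versions is to compare the versioned transcript IDs on both sides. The base R sketch below does this on made-up ID vectors; in practice the first vector would come from the transcript column of the tr2g data frame and the second from the sequence names of the FASTA file. This check is a suggestion, not a feature of the package.

```r
# Toy inputs (made-up IDs): transcript IDs in the tr2g data frame and
# sequence names taken from the FASTA file used to build the index
tr2g_ids  <- c("tx1.1", "tx2.3", "tx3.1")
fasta_ids <- c("tx1.1", "tx2.2", "tx3.1")

# IDs present on one side but not the other hint at a version mismatch
missing_from_fasta <- setdiff(tr2g_ids, fasta_ids)
missing_from_tr2g  <- setdiff(fasta_ids, tr2g_ids)

versions_match <- length(missing_from_fasta) == 0 &&
  length(missing_from_tr2g) == 0
versions_match
```

In this toy case the version suffix of tx2 differs between the two sources, so the check reports a mismatch.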
We need a FASTA file for the transcriptome used to build the kallisto index. Transcriptome FASTA files from Ensembl contain gene annotation in the sequence name of each transcript, so transcript and gene information can be extracted from the sequence name. At present, only Ensembl FASTA files, or FASTA files with sequence names formatted as in Ensembl, are accepted.
By default, the tr2g.tsv file and the filtered fasta file (if filtering for biotypes and chromosomes) are written to disk, but these can be turned off so that only the tr2g data frame is returned into the R session.
# Subset of a real Ensembl FASTA file
toy_fasta <- system.file("testdata/fasta_test.fasta", package = "BUSpaRse")
tr2g_fa <- tr2g_fasta(file = toy_fasta, write_tr2g = FALSE, save_filtered = FALSE)
head(tr2g_fa)
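To show what extracting annotation from the sequence name involves, here is a minimal base R sketch that parses headers written in the Ensembl cDNA style. The header values are illustrative, the helper get_field is hypothetical, and this is not how tr2g_fasta is implemented internally.

```r
# Two headers in the Ensembl cDNA style (values are illustrative);
# the second one deliberately lacks a gene_symbol field
headers <- c(
  paste(">ENST00000456328.2 cdna chromosome:GRCh38:1:11869:14409:1",
        "gene:ENSG00000223972.5",
        "gene_biotype:transcribed_unprocessed_pseudogene",
        "transcript_biotype:processed_transcript",
        "gene_symbol:DDX11L1"),
  paste(">ENST00000450305.2 cdna chromosome:GRCh38:1:12010:13670:1",
        "gene:ENSG00000223972.5",
        "gene_biotype:transcribed_unprocessed_pseudogene",
        "transcript_biotype:transcribed_unprocessed_pseudogene")
)

# Hypothetical helper: return the value after " key:" or NA if it is absent
get_field <- function(x, key) {
  m <- regexpr(paste0("(?<=\\s", key, ":)\\S+"), x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

tr2g_parsed <- data.frame(
  transcript = sub("^>(\\S+).*$", "\\1", headers),  # first token after ">"
  gene       = get_field(headers, "gene"),
  gene_name  = get_field(headers, "gene_symbol"),
  stringsAsFactors = FALSE
)
tr2g_parsed
```

Headers without a given field yield NA rather than shifting values between rows.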
If you have GTF or GFF3 files for other purposes, these can also be used to generate the transcript to gene file. Now tr2g_gtf and tr2g_gff3 can extract the transcriptome from a genome that is either a BSgenome or a DNAStringSet.
# Subset of a real GTF file from Ensembl
toy_gtf <- system.file("testdata/gtf_test.gtf", package = "BUSpaRse")
tr2g_tg <- tr2g_gtf(toy_gtf, Genome = BSgenome.Hsapiens.UCSC.hg38,
                    gene_biotype_use = "cellranger",
                    out_path = "gtf")
#> 706 sequences in the genome are absent from the annotation.
head(tr2g_tg)
A new GTF or GFF3 file after filtering biotypes and chromosomes is also written; this can be turned off by setting save_filtered_gtf = FALSE or save_filtered_gff = FALSE. The transcriptome, with biotypes filtered and only standard chromosomes kept, is in transcriptome.fa. Use compress_fa = TRUE to gzip it.
list.files("gtf")
#> [1] "gtf_filtered.gtf" "tr2g.tsv" "transcriptome.fa"
TxDb is a class from the Bioconductor package GenomicFeatures for storing transcript annotations. Unfortunately, TxDb.Hsapiens.UCSC.hg38.knownGene does not have biotype information or gene symbols.
tr2g_hs <- tr2g_TxDb(TxDb.Hsapiens.UCSC.hg38.knownGene, get_transcriptome = FALSE,
                     write_tr2g = FALSE)
#> 'select()' returned 1:1 mapping between keys and columns
head(tr2g_hs)
EnsDb is a class for Ensembl gene annotations, from the Bioconductor package ensembldb. Ensembl annotations as EnsDb are available on AnnotationHub (since version 87), and some older versions are standalone packages (e.g. EnsDb.Hsapiens.v86).
tr2g_hs86 <- tr2g_EnsDb(EnsDb.Hsapiens.v86, get_transcriptome = FALSE,
                        write_tr2g = FALSE, gene_biotype_use = "cellranger",
                        use_gene_version = FALSE, use_transcript_version = FALSE)
head(tr2g_hs86)
There used to be sections about sort_tr2g and save_tr2g_bustools, but these functions have been superseded by the new versions of the tr2g functions and dl_transcriptome, which sort the transcriptome after extracting it so that the tr2g and the transcriptome are in the same order. By default, the new versions of the tr2g functions and dl_transcriptome also write tr2g.tsv to disk without column names, with the first column as transcript and the second as gene.
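For readers who still need to do this by hand, the idea of putting a tr2g data frame in the same order as the transcriptome and writing it without column names can be sketched in base R as follows. The data are made up, and this is an illustration of the idea rather than the package's own code.

```r
# Toy tr2g data frame in arbitrary order (IDs are made up)
tr2g_toy <- data.frame(transcript = c("tx3", "tx1", "tx2"),
                       gene       = c("geneB", "geneA", "geneA"),
                       stringsAsFactors = FALSE)

# Order of the sequences in the (hypothetical) transcriptome FASTA file
fasta_order <- c("tx1", "tx2", "tx3")

# Reorder the rows so that they follow the transcriptome
tr2g_sorted <- tr2g_toy[match(fasta_order, tr2g_toy$transcript), ]

# Write a tab-separated file without column names:
# transcript in the first column, gene in the second
out_file <- tempfile(fileext = ".tsv")
write.table(tr2g_sorted, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
readLines(out_file)
```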
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] EnsDb.Hsapiens.v86_2.99.0
#> [2] ensembldb_2.30.0
#> [3] AnnotationFilter_1.30.0
#> [4] TxDb.Hsapiens.UCSC.hg38.knownGene_3.20.0
#> [5] GenomicFeatures_1.58.0
#> [6] AnnotationDbi_1.68.0
#> [7] Biobase_2.66.0
#> [8] BSgenome.Hsapiens.UCSC.hg38_1.4.5
#> [9] BSgenome_1.74.0
#> [10] rtracklayer_1.66.0
#> [11] BiocIO_1.16.0
#> [12] Biostrings_2.74.0
#> [13] XVector_0.46.0
#> [14] GenomicRanges_1.58.0
#> [15] GenomeInfoDb_1.42.0
#> [16] IRanges_2.40.0
#> [17] S4Vectors_0.44.0
#> [18] BiocGenerics_0.52.0
#> [19] ggplot2_3.5.1
#> [20] zeallot_0.1.0
#> [21] Matrix_1.7-1
#> [22] BUSpaRse_1.20.0
#> [23] TENxBUSData_1.19.0
#> [24] BiocStyle_2.34.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.3 bitops_1.0-9
#> [3] httr2_1.0.5 biomaRt_2.62.0
#> [5] rlang_1.1.4 magrittr_2.0.3
#> [7] matrixStats_1.4.1 compiler_4.4.1
#> [9] RSQLite_2.3.7 png_0.1-8
#> [11] vctrs_0.6.5 stringr_1.5.1
#> [13] ProtGenerics_1.38.0 pkgconfig_2.0.3
#> [15] crayon_1.5.3 fastmap_1.2.0
#> [17] magick_2.8.5 dbplyr_2.5.0
#> [19] utf8_1.2.4 Rsamtools_2.22.0
#> [21] rmarkdown_2.28 UCSC.utils_1.2.0
#> [23] tinytex_0.53 purrr_1.0.2
#> [25] bit_4.5.0 xfun_0.48
#> [27] zlibbioc_1.52.0 cachem_1.1.0
#> [29] jsonlite_1.8.9 progress_1.2.3
#> [31] blob_1.2.4 highr_0.11
#> [33] DelayedArray_0.32.0 BiocParallel_1.40.0
#> [35] parallel_4.4.1 prettyunits_1.2.0
#> [37] plyranges_1.26.0 R6_2.5.1
#> [39] bslib_0.8.0 stringi_1.8.4
#> [41] jquerylib_0.1.4 Rcpp_1.0.13
#> [43] bookdown_0.41 SummarizedExperiment_1.36.0
#> [45] knitr_1.48 tidyselect_1.2.1
#> [47] abind_1.4-8 yaml_2.3.10
#> [49] codetools_0.2-20 curl_5.2.3
#> [51] lattice_0.22-6 tibble_3.2.1
#> [53] withr_3.0.2 KEGGREST_1.46.0
#> [55] evaluate_1.0.1 BiocFileCache_2.14.0
#> [57] xml2_1.3.6 ExperimentHub_2.14.0
#> [59] pillar_1.9.0 BiocManager_1.30.25
#> [61] filelock_1.0.3 MatrixGenerics_1.18.0
#> [63] generics_0.1.3 RCurl_1.98-1.16
#> [65] BiocVersion_3.20.0 hms_1.1.3
#> [67] munsell_0.5.1 scales_1.3.0
#> [69] glue_1.8.0 lazyeval_0.2.2
#> [71] tools_4.4.1 AnnotationHub_3.14.0
#> [73] GenomicAlignments_1.42.0 XML_3.99-0.17
#> [75] grid_4.4.1 tidyr_1.3.1
#> [77] colorspace_2.1-1 GenomeInfoDbData_1.2.13
#> [79] restfulr_0.0.15 cli_3.6.3
#> [81] rappdirs_0.3.3 fansi_1.0.6
#> [83] S4Arrays_1.6.0 dplyr_1.1.4
#> [85] gtable_0.3.6 sass_0.4.9
#> [87] digest_0.6.37 SparseArray_1.6.0
#> [89] farver_2.1.2 rjson_0.2.23
#> [91] memoise_2.0.1 htmltools_0.5.8.1
#> [93] lifecycle_1.0.4 httr_1.4.7
#> [95] mime_0.12 bit64_4.5.2