Analysis starts with sample table

The first step using tximeta is to read in the sample table, which will become the column data, colData, of the final object, a SummarizedExperiment. The sample table should contain all the information we need to identify the Salmon quantification directories. Here we will use a Salmon quantification file in the tximportData package to demonstrate the usage of tximeta. We do not have a sample table, so we construct one in R. It is recommended to keep a sample table as a CSV or TSV file while working on an RNA-seq project with multiple samples.

dir <- system.file("extdata/salmon_dm", package="tximportData")
# here gzipped, normally these are not
files <- file.path(dir, "SRR1197474_cdna", "quant.sf.gz") 
file.exists(files)

## [1] TRUE

coldata <- data.frame(files, names="SRR1197474", condition="A", stringsAsFactors=FALSE)
coldata

##                                                                                               files
## 1 /home/biocbuild/bbs-3.8-bioc/R/library/tximportData/extdata/salmon_dm/SRR1197474_cdna/quant.sf.gz
##        names condition
## 1 SRR1197474         A

tximeta expects at least two columns in coldata:

files - a pointer to the quant.sf files
names - the unique names that should be used to identify samples

Running tximeta from a sample table

Normally, we would just run tximeta like so:

library(tximeta)
se <- tximeta(coldata)

However, to avoid downloading remote GTF files during this vignette, we will point to a GTF file saved locally (in the tximportData package). We link the transcriptome of the Salmon index to its locally saved GTF. The standard recommended usage of tximeta would be the code chunk above, or to specify a remote GTF source, not a local one. This following code is therefore not recommended for a typically workflow, but is particular to the vignette code.

dir <- system.file("extdata", package="tximeta")
indexDir <- file.path(dir, "Drosophila_melanogaster.BDGP6.cdna.v92_salmon_0.10.2")
fastaFTP <- "ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz"
dir2 <- system.file("extdata/salmon_dm", package="tximportData")
gtfPath <- file.path(dir2,"Drosophila_melanogaster.BDGP6.92.gtf.gz")
suppressPackageStartupMessages(library(tximeta))
makeLinkedTxome(indexDir=indexDir,
                source="Ensembl",
                organism="Drosophila melanogaster",
                release="92",
                genome="BDGP6",
                fasta=fastaFTP,
                gtf=gtfPath,
                write=FALSE)

## saving linkedTxome in bfc (first time)

library(tximeta)
se <- tximeta(coldata)

## importing quantifications

## reading in files with read_tsv

## 1 
## found matching linked transcriptome:
## [ Ensembl - Drosophila melanogaster - release 92 ]
## building EnsDb with 'ensembldb' package
## Importing GTF file ... OK
## Processing metadata ... OK
## Processing genes ... 
##  Attribute availability:
##   o gene_id ... OK
##   o gene_name ... OK
##   o entrezid ... Nope
##   o gene_biotype ... OK
## OK
## Processing transcripts ... 
##  Attribute availability:
##   o transcript_id ... OK
##   o gene_id ... OK
##   o transcript_biotype ... OK
## OK
## Processing exons ... OK
## Processing chromosomes ... Fetch seqlengths from ensembl ... OK
## Generating index ... OK
##   -------------
## Verifying validity of the information in the database:
## Checking transcripts ... OK
## Checking exons ... OK
## generating transcript ranges

## Warning in checkAssays2Txps(assays, txps): missing some transcripts!
##  9 out of 30061 are missing from the GTF and dropped from SummarizedExperiment output

What happened?

tximeta recognized the signature of the transcriptome that the files were quantified against, it accessed the GTF file of the transcriptome source, found and attached the transcript ranges, and added the appropriate transcriptome and genome metadata. A remote GTF is only downloaded once, and a local or remote GTF is only parsed to build a TxDb once: if tximeta recognizes that it has seen this Salmon index before, it will use a cached version of the metadata and transcript ranges.

Note the warning above that 9 of the transcripts are missing from the GTF file and so are dropped from the final output. This is a problem coming from the annotation source, and not easily avoided by tximeta.

We plan to create and maintain a large table of signatures for as many sources, organisms, releases of transcriptomes as possible. tximeta also has functions to support for linked transcriptomes, where one or more sources for transcript sequences have been combined or filtered. See the Linked transcriptome section below for a demonstration. (The makeLinkedTxome function was used above to avoid downloading the GTF during the vignette building process.)

Examining SummarizedExperiment output

We, of course, have our coldata from before. Note that we’ve removed files.

suppressPackageStartupMessages(library(SummarizedExperiment))
colData(se)

## DataFrame with 1 row and 2 columns
##                  names   condition
##            <character> <character>
## SRR1197474  SRR1197474           A

Here we show the three matrices that were imported. tximeta does not yet support import of inferential replicates (Gibbs samples or bootstrap samples), but this functionality will be added in a future version.

assayNames(se)

## [1] "counts"    "abundance" "length"

tximeta has imported the correct ranges for the transcripts:

rowRanges(se)

## GRanges object with 30052 ranges and 6 metadata columns:
##               seqnames            ranges strand |       tx_id
##                  <Rle>         <IRanges>  <Rle> | <character>
##   FBtr0075502       3L 15808322-15808883      + | FBtr0075502
##   FBtr0300738       3R   5783105-5787336      + | FBtr0300738
##   FBtr0300739       3R   5781762-5787336      + | FBtr0300739
##   FBtr0300737       3R   5781762-5787336      + | FBtr0300737
##   FBtr0300736       3R   5783105-5787336      + | FBtr0300736
##           ...      ...               ...    ... .         ...
##   FBtr0347432        Y   3074924-3075180      + | FBtr0347432
##   FBtr0307579        X 21156259-21156621      - | FBtr0307579
##   FBtr0089614       3R 30212903-30213142      + | FBtr0089614
##   FBtr0299927        X   7897250-7897987      - | FBtr0299927
##   FBtr0303313       3R 12893972-12894529      - | FBtr0303313
##                   tx_biotype tx_cds_seq_start tx_cds_seq_end     gene_id
##                  <character>        <integer>      <integer> <character>
##   FBtr0075502 protein_coding         15808418       15808716 FBgn0036531
##   FBtr0300738 protein_coding          5783217        5787117 FBgn0037375
##   FBtr0300739 protein_coding          5781900        5787117 FBgn0037375
##   FBtr0300737 protein_coding          5781900        5787117 FBgn0037375
##   FBtr0300736 protein_coding          5783217        5787117 FBgn0037375
##           ...            ...              ...            ...         ...
##   FBtr0347432     pseudogene             <NA>           <NA> FBgn0267873
##   FBtr0307579     pseudogene             <NA>           <NA> FBgn0052511
##   FBtr0089614     pseudogene             <NA>           <NA> FBgn0000281
##   FBtr0299927     pseudogene             <NA>           <NA> FBgn0260447
##   FBtr0303313     pseudogene             <NA>           <NA> FBgn0053929
##                   tx_name
##               <character>
##   FBtr0075502 FBtr0075502
##   FBtr0300738 FBtr0300738
##   FBtr0300739 FBtr0300739
##   FBtr0300737 FBtr0300737
##   FBtr0300736 FBtr0300736
##           ...         ...
##   FBtr0347432 FBtr0347432
##   FBtr0307579 FBtr0307579
##   FBtr0089614 FBtr0089614
##   FBtr0299927 FBtr0299927
##   FBtr0303313 FBtr0303313
##   -------
##   seqinfo: 25 sequences from BDGP6 genome

We have appropriate genome information, which prevents us from making bioinformatic mistakes:

seqinfo(se)

## Seqinfo object with 25 sequences from BDGP6 genome:
##   seqnames             seqlengths isCircular genome
##   211000022278279           12714       <NA>  BDGP6
##   211000022278436            2815       <NA>  BDGP6
##   211000022278449            1947       <NA>  BDGP6
##   211000022278760            1144       <NA>  BDGP6
##   211000022279165            1118       <NA>  BDGP6
##   ...                         ...        ...    ...
##   Unmapped_Scaffold_8       88768       <NA>  BDGP6
##   X                      23542271       <NA>  BDGP6
##   Y                       3667352       <NA>  BDGP6
##   mitochondrion_genome      19517       <NA>  BDGP6
##   rDNA                      76973       <NA>  BDGP6

Easy summarization to gene-level

Because the SummarizedExperiment maintains all the metadata of its creation, it also keeps a pointer to the necessary database for summarizing transcript-level quantifications and bias corrections to the gene-level. If necessary, summarizeToGene can pull down the remote source for summarization, but given that we’ve already built a TxDb once, it simply loads the stashed version. In order to remove the stashed TxDb and regenerate, one can remove the relevant entry from the tximeta file cache that resides at the location given by getTximetaBFC().

gse <- summarizeToGene(se)

## loading existing EnsDb created: 2019-01-12 03:17:46

## obtaining transcript-to-gene mapping from TxDb

## summarizing abundance

## summarizing counts

## summarizing length

rowRanges(gse)

## GRanges object with 14026 ranges and 6 metadata columns:
##               seqnames            ranges strand |     gene_id   gene_name
##                  <Rle>         <IRanges>  <Rle> | <character> <character>
##   FBgn0000008       2R 22136968-22172834      + | FBgn0000008           a
##   FBgn0000014       3R 16807214-16830049      - | FBgn0000014       abd-A
##   FBgn0000015       3R 16927212-16972236      - | FBgn0000015       Abd-B
##   FBgn0000017       3L 16615866-16647882      - | FBgn0000017         Abl
##   FBgn0000018       2L 10973443-10975293      - | FBgn0000018         abo
##           ...      ...               ...    ... .         ...         ...
##   FBgn0285958       3L 11518798-11522713      - | FBgn0285958        Fuca
##   FBgn0285962       2R   9864510-9875072      - | FBgn0285962     CG46338
##   FBgn0285963       3R 26915761-26945309      + | FBgn0285963     CG46339
##   FBgn0285970        X 21621541-21623750      - | FBgn0285970     CG32500
##   FBgn0285971       2L   8464488-8466694      + | FBgn0285971         prg
##                entrezid   gene_biotype seq_coord_system      symbol
##               <integer>    <character>        <integer> <character>
##   FBgn0000008      <NA> protein_coding             <NA>           a
##   FBgn0000014      <NA> protein_coding             <NA>       abd-A
##   FBgn0000015      <NA> protein_coding             <NA>       Abd-B
##   FBgn0000017      <NA> protein_coding             <NA>         Abl
##   FBgn0000018      <NA> protein_coding             <NA>         abo
##           ...       ...            ...              ...         ...
##   FBgn0285958      <NA> protein_coding             <NA>        Fuca
##   FBgn0285962      <NA> protein_coding             <NA>     CG46338
##   FBgn0285963      <NA> protein_coding             <NA>     CG46339
##   FBgn0285970      <NA> protein_coding             <NA>     CG32500
##   FBgn0285971      <NA> protein_coding             <NA>         prg
##   -------
##   seqinfo: 25 sequences from BDGP6 genome

Add different identifiers

We would like to add support to easily map transcript or gene identifiers from one annotation to another. This is just a prototype function, but we show how we can easily add alternate IDs given that we know the organism and the source of the transcriptome. (This function currently only works for Gencode and Ensembl gene or transcript IDs but could be extended to work for arbitrary sources.)

library(org.Dm.eg.db)

## Loading required package: AnnotationDbi

##

gse <- addIds(gse, "REFSEQ", gene=TRUE)

## mapping to new IDs using 'org.Dm.eg.db' data package

mcols(gse)

## DataFrame with 14026 rows and 7 columns
##                 gene_id   gene_name  entrezid   gene_biotype
##             <character> <character> <integer>    <character>
## FBgn0000008 FBgn0000008           a        NA protein_coding
## FBgn0000014 FBgn0000014       abd-A        NA protein_coding
## FBgn0000015 FBgn0000015       Abd-B        NA protein_coding
## FBgn0000017 FBgn0000017         Abl        NA protein_coding
## FBgn0000018 FBgn0000018         abo        NA protein_coding
## ...                 ...         ...       ...            ...
## FBgn0285958 FBgn0285958        Fuca        NA protein_coding
## FBgn0285962 FBgn0285962     CG46338        NA protein_coding
## FBgn0285963 FBgn0285963     CG46339        NA protein_coding
## FBgn0285970 FBgn0285970     CG32500        NA protein_coding
## FBgn0285971 FBgn0285971         prg        NA protein_coding
##             seq_coord_system      symbol       REFSEQ
##                    <integer> <character>  <character>
## FBgn0000008               NA           a NM_001014543
## FBgn0000014               NA       abd-A NM_001170161
## FBgn0000015               NA       Abd-B NM_001275719
## FBgn0000017               NA         Abl NM_001104153
## FBgn0000018               NA         abo    NM_080045
## ...                      ...         ...          ...
## FBgn0285958               NA        Fuca NM_001316434
## FBgn0285962               NA     CG46338 NM_001273908
## FBgn0285963               NA     CG46339 NM_001104469
## FBgn0285970               NA     CG32500    NM_167766
## FBgn0285971               NA         prg NM_001273324

Run a differential expression analysis

The following code chunk demonstrates how to build a DESeqDataSet and begin a differential expression analysis.

suppressPackageStartupMessages(library(DESeq2))
# here there is a single sample so we use ~1.
# expect a warning that there is only a single sample...
suppressWarnings({dds <- DESeqDataSet(gse, ~1)})

## using counts and average transcript lengths from tximeta

dds <- estimateSizeFactors(dds)

## using 'avgTxLength' from assays(dds), correcting for library size

# ... and so on

Metadata galore

The following information is attached to the SummarizedExperiment by tximeta:

names(metadata(se))

## [1] "tximetaInfo"         "quantInfo"           "countsFromAbundance"
## [4] "txomeInfo"           "txdbInfo"

str(metadata(se)[["quantInfo"]])

## List of 24
##  $ salmon_version       : chr "0.10.2"
##  $ samp_type            : chr "none"
##  $ quant_errors         :List of 1
##   ..$ : list()
##  $ num_libraries        : int 1
##  $ library_types        : chr "ISR"
##  $ frag_dist_length     : int 1001
##  $ seq_bias_correct     : logi TRUE
##  $ gc_bias_correct      : logi TRUE
##  $ num_bias_bins        : int 4096
##  $ mapping_type         : chr "mapping"
##  $ num_targets          : int 30061
##  $ serialized_eq_classes: logi FALSE
##  $ length_classes       : int [1:5, 1] 1071 1736 2594 4068 71382
##  $ index_seq_hash       : chr "b41ea9ba9c81e2cad7cfa49e4bf9ee67dd297dc0b9ff40bdb1142699f00c8f7d"
##  $ index_name_hash      : chr "6aba201931d0fa4c6cebd3c1d7dd6350bf65cc1c968e88a308fe147f8a1c7083"
##  $ index_seq_hash512    : chr "365e58ceacde84989cb2bcc01e5b5c3320345ef23b23f1b49456f2ae429b5be5c418e233729d7a065e59645b5f26e25defbc1df40a601e6"| __truncated__
##  $ index_name_hash512   : chr "df81eababb8186637181132cf98221f9fa6fe77cb45ed771cc81bcfdda281cea5b89877ba233d81a06a1caa5aeedbe1b3161bd19c29bd6a"| __truncated__
##  $ num_bootstraps       : int 0
##  $ num_processed        : int 42422337
##  $ num_mapped           : int 29341160
##  $ percent_mapped       : num 69.2
##  $ call                 : chr "quant"
##  $ start_time           : chr "Fri Jul 13 08:45:38 2018"
##  $ end_time             : chr "Fri Jul 13 08:57:39 2018"

str(metadata(se)[["txomeInfo"]])

## List of 8
##  $ index   : chr "Drosophila_melanogaster.BDGP6.cdna.v92_salmon_0.10.2"
##  $ source  : chr "Ensembl"
##  $ organism: chr "Drosophila melanogaster"
##  $ release : chr "92"
##  $ genome  : chr "BDGP6"
##  $ fasta   :List of 1
##   ..$ : chr "ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz"
##  $ gtf     : chr "/home/biocbuild/bbs-3.8-bioc/R/library/tximportData/extdata/salmon_dm/Drosophila_melanogaster.BDGP6.92.gtf.gz"
##  $ sha256  : chr "b41ea9ba9c81e2cad7cfa49e4bf9ee67dd297dc0b9ff40bdb1142699f00c8f7d"

str(metadata(se)[["tximetaInfo"]])

## List of 2
##  $ version   :Classes 'package_version', 'numeric_version'  hidden list of 1
##   ..$ : int [1:3] 1 0 3
##  $ importTime: POSIXct[1:1], format: "2019-01-11 22:17:45"

str(metadata(se)[["txdbInfo"]])

##  Named chr [1:12] "EnsDb" "Ensembl Gene ID" "ensembldb" ...
##  - attr(*, "names")= chr [1:12] "Db type" "Type of Gene ID" "Supporting package" "Db created by" ...

Quantification files with an unknown transcriptome

tximeta automatically imports relevant metadata when the transcriptome matches a known source, but also facilitates the linking of transcriptomes used as for a Salmon index with relevant public sources. The linking is important in the case that the transcript sequence no longer matches a known source (combined or filtered FASTA files), or if the source is not known to tximeta. Below we demonstrate how to make a linkedTxome and how to share and load a linkedTxome.

Here we point to Salmon quantification files which were quantified against a transcriptome combining two Ensembl FASTA files: the cDNA and the non-coding transcripts for Drosophila melanogaster.

dir <- system.file("extdata/salmon_dm/SRR1197474", package="tximportData")
file <- file.path(dir, "quant.sf.gz")
file.exists(file)

## [1] TRUE

coldata <- data.frame(files=file, names="SRR1197474", sample="1",
                      stringsAsFactors=FALSE)

Trying to import the files gives a message that tximeta couldn’t find a matching transcriptome, so it returns an un-ranged SummarizedExperiment.

se <- tximeta(coldata)

## importing quantifications

## reading in files with read_tsv

## 1 
## couldn't find matching transcriptome, returning un-ranged SummarizedExperiment

Linked transcriptome for reproducible analysis

If the transcriptome used to generate the Salmon index does not match any transcriptomes from known sources (e.g. from combining or filtering known transcriptome files), there is not much that can be done to automatically populate the metadata during quantification import. However, we can facilitate the following two cases:

the transcriptome was created locally and has been linked to its public source(s)
the transcriptome was produced by another group, and they have produced and shared a file that links the transcriptome to public source(s)

tximeta offers functionality to assist reproducible analysis in both of these cases.

In the case of the quantification file above, the transcriptome was generated locally by downloading and combining the Ensembl cDNA and non-coding FASTA files Drosophila melanogaster, release 92. The following un-evaluated command line code chunk reproduces the production of the transcriptome from publicly available sources.

wget ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz 
wget ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/ncrna/Drosophila_melanogaster.BDGP6.ncrna.fa.gz
cat Drosophila_melanogaster.BDGP6.cdna.all.fa.gz Drosophila_melanogaster.BDGP6.ncrna.fa.gz > Drosophila_melanogaster.BDGP6.v92.fa.gz

To make this quantification reproducible, we make a linkedTxome which records key information about the sources of the transcript FASTA files, and the location of the relevant GTF file. It also records the signature of the transcriptome that was computed by Salmon during the index step.

By default, linkedTxome will write out a JSON file which can be shared with others, linking the signature of the index with the other metadata, including FASTA and GTF sources. By default, it will write out to a file with the same name as the indexDir, but with a .json extension added. This can be prevented with write=FALSE, and the file location can be changed with jsonFile.

First we specify the path where the Salmon index is located.

Typically you would not use system.file to find this directory, but simply define indexDir to be the path of the Salmon directory on your machine. Here we use system.file because we have included parts of a Salmon index directory in the tximeta package itself for demonstration of functionality in this vignette.

dir <- system.file("extdata", package="tximeta")
indexDir <- file.path(dir, "Drosophila_melanogaster.BDGP6.v92_salmon_0.10.2")

Now we provide the location of the FASTA files and the GTF file for this transcriptome. The recommended usage of tximeta would be to specify a remote GTF source, as seen in the commented-out line below:

fastaFTP <- c("ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.cdna.all.fa.gz",
              "ftp://ftp.ensembl.org/pub/release-92/fasta/drosophila_melanogaster/ncrna/Drosophila_melanogaster.BDGP6.ncrna.fa.gz")
#gtfFTP <- "ftp://ftp.ensembl.org/pub/release-92/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.92.gtf.gz"

Instead of the above commented-out FTP location for the GTF file, we specify a location within an R package. This step is just to avoid downloading from a remote FTP during vignette building. This use of system.file to point to a file in an R package is specific to this vignette and would not be used in a typical workflow.

dir2 <- system.file("extdata/salmon_dm", package="tximportData")
gtfPath <- file.path(dir2,"Drosophila_melanogaster.BDGP6.92.gtf.gz")

Finally, we create a linkedTxome. In this vignette, we point to a temporary directory for the JSON file, but a more typical workflow would write the JSON file to the same location as the Salmon index by not specifying jsonFile.

makeLinkedTxome performs two operation: (1) it creates a new entry in an internal table that links the transcriptome used in the Salmon index to its sources, and (2) it creates a JSON file such that this linkedTxome can be shared.

tmp <- tempdir()
jsonFile <- file.path(tmp, paste0(basename(indexDir), ".json"))
makeLinkedTxome(indexDir=indexDir,
                source="Ensembl", organism="Drosophila melanogaster",
                release="92", genome="BDGP6",
                fasta=fastaFTP, gtf=gtfPath,
                jsonFile=jsonFile)

## writing linkedTxome to /tmp/RtmphgPDpH/Drosophila_melanogaster.BDGP6.v92_salmon_0.10.2.json

## saving linkedTxome in bfc

After running makeLinkedTxome, the connection between this Salmon index (and its signature) with the sources is saved for persistent usage.

With use of tximeta and a linkedTxome – as with tximeta on a known, un-filtered, un-combined transcriptome – the software figures out if the remote GTF has been accessed and compiled into a TxDb before, and on future calls, it will simply load the pre-computed metadata and transcript ranges.

Note the warning that 9 of the transcripts are missing from the GTF file and so are dropped from the final output. This is a problem coming from the annotation source, and not easily avoided by tximeta.

se <- tximeta(coldata)

## importing quantifications

## reading in files with read_tsv

## 1 
## found matching linked transcriptome:
## [ Ensembl - Drosophila melanogaster - release 92 ]
## loading existing EnsDb created: 2019-01-12 03:17:46
## generating transcript ranges

## Warning in checkAssays2Txps(assays, txps): missing some transcripts!
##  9 out of 33681 are missing from the GTF and dropped from SummarizedExperiment output

We can see that the appropriate metadata and transcript ranges are attached.

rowRanges(se)

## GRanges object with 33672 ranges and 6 metadata columns:
##               seqnames            ranges strand |       tx_id
##                  <Rle>         <IRanges>  <Rle> | <character>
##   FBtr0075502       3L 15808322-15808883      + | FBtr0075502
##   FBtr0300738       3R   5783105-5787336      + | FBtr0300738
##   FBtr0300739       3R   5781762-5787336      + | FBtr0300739
##   FBtr0300737       3R   5781762-5787336      + | FBtr0300737
##   FBtr0300736       3R   5783105-5787336      + | FBtr0300736
##           ...      ...               ...    ... .         ...
##   FBtr0086850       2R 17701229-17701297      + | FBtr0086850
##   FBtr0113576       3R   5596201-5596340      - | FBtr0113576
##   FBtr0076635       3L   8601948-8602031      + | FBtr0076635
##   FBtr0309760       3L     891250-891475      + | FBtr0309760
##   FBtr0113549       2L 20419932-20420065      + | FBtr0113549
##                   tx_biotype tx_cds_seq_start tx_cds_seq_end     gene_id
##                  <character>        <integer>      <integer> <character>
##   FBtr0075502 protein_coding         15808418       15808716 FBgn0036531
##   FBtr0300738 protein_coding          5783217        5787117 FBgn0037375
##   FBtr0300739 protein_coding          5781900        5787117 FBgn0037375
##   FBtr0300737 protein_coding          5781900        5787117 FBgn0037375
##   FBtr0300736 protein_coding          5783217        5787117 FBgn0037375
##           ...            ...              ...            ...         ...
##   FBtr0086850         snoRNA             <NA>           <NA> FBgn0063388
##   FBtr0113576         snoRNA             <NA>           <NA> FBgn0082961
##   FBtr0076635         snoRNA             <NA>           <NA> FBgn0060291
##   FBtr0309760         snoRNA             <NA>           <NA> FBgn0263461
##   FBtr0113549         snoRNA             <NA>           <NA> FBgn0083032
##                   tx_name
##               <character>
##   FBtr0075502 FBtr0075502
##   FBtr0300738 FBtr0300738
##   FBtr0300739 FBtr0300739
##   FBtr0300737 FBtr0300737
##   FBtr0300736 FBtr0300736
##           ...         ...
##   FBtr0086850 FBtr0086850
##   FBtr0113576 FBtr0113576
##   FBtr0076635 FBtr0076635
##   FBtr0309760 FBtr0309760
##   FBtr0113549 FBtr0113549
##   -------
##   seqinfo: 25 sequences from BDGP6 genome

seqinfo(se)

## Seqinfo object with 25 sequences from BDGP6 genome:
##   seqnames             seqlengths isCircular genome
##   211000022278279           12714       <NA>  BDGP6
##   211000022278436            2815       <NA>  BDGP6
##   211000022278449            1947       <NA>  BDGP6
##   211000022278760            1144       <NA>  BDGP6
##   211000022279165            1118       <NA>  BDGP6
##   ...                         ...        ...    ...
##   Unmapped_Scaffold_8       88768       <NA>  BDGP6
##   X                      23542271       <NA>  BDGP6
##   Y                       3667352       <NA>  BDGP6
##   mitochondrion_genome      19517       <NA>  BDGP6
##   rDNA                      76973       <NA>  BDGP6

Clear linkedTxomes

The following code removes the entire table with information about the linkedTxomes. This is just for demonstration, so that we can show how to load a JSON file below.

Note: Running this code will clear any information about linkedTxomes. Don’t run this unless you really want to clear this table!

library(BiocFileCache)

## Loading required package: dbplyr

if (interactive()) {
  bfcloc <- getTximetaBFC()
} else {
  bfcloc <- tempdir()
}
bfc <- BiocFileCache(bfcloc)
bfcinfo(bfc)

## # A tibble: 2 x 10
##   rid   rname create_time access_time rpath rtype fpath last_modified_t…
##   <chr> <chr> <chr>       <chr>       <chr> <chr> <chr>            <dbl>
## 1 BFC1  link… 2019-01-12… 2019-01-12… /tmp… rela… 3665…               NA
## 2 BFC2  Dros… 2019-01-12… 2019-01-12… /tmp… rela… 3665…               NA
## # … with 2 more variables: etag <chr>, expires <dbl>

bfcremove(bfc, bfcquery(bfc, "linkedTxomeTbl")$rid)
bfcinfo(bfc)

## # A tibble: 1 x 10
##   rid   rname create_time access_time rpath rtype fpath last_modified_t…
##   <chr> <chr> <chr>       <chr>       <chr> <chr> <chr>            <dbl>
## 1 BFC2  Dros… 2019-01-12… 2019-01-12… /tmp… rela… 3665…               NA
## # … with 2 more variables: etag <chr>, expires <dbl>

Loading linkedTxome JSON files

If a collaborator or the Suppmentary Files for a publication shares a linkedTxome JSON file, we can likewise use tximeta to automatically assemble the relevant metadata and transcript ranges. This implies that the other person has used tximeta with the function makeLinkedTxome demonstrated above, pointing to their Salmon index and to the FASTA and GTF source(s).

We point to the JSON file and use loadLinkedTxome and then the relevant metadata is saved for persistent usage. In this case, we saved the JSON file in a temporary directory.

jsonFile <- file.path(tmp, paste0(basename(indexDir), ".json"))
loadLinkedTxome(jsonFile)

## saving linkedTxome in bfc (first time)

Again, using tximeta figures out whether it needs to access the remote GTF or not, and assembles the appropriate object on the user’s behalf.

se <- tximeta(coldata)

## importing quantifications

## reading in files with read_tsv

## 1 
## found matching linked transcriptome:
## [ Ensembl - Drosophila melanogaster - release 92 ]
## loading existing EnsDb created: 2019-01-12 03:17:46
## generating transcript ranges

## Warning in checkAssays2Txps(assays, txps): missing some transcripts!
##  9 out of 33681 are missing from the GTF and dropped from SummarizedExperiment output

Clear linkedTxomes

Finally, we clear the linkedTxomes table again so that the above examples will work. This is just for the vignette code and not part of a typical workflow.

Note: Running this code will clear any information about linkedTxomes. Don’t run this unless you really want to clear this table!

if (interactive()) {
  bfcloc <- getTximetaBFC()
} else {
  bfcloc <- tempdir()
}
bfc <- BiocFileCache(bfcloc)
bfcinfo(bfc)

## # A tibble: 2 x 10
##   rid   rname create_time access_time rpath rtype fpath last_modified_t…
##   <chr> <chr> <chr>       <chr>       <chr> <chr> <chr>            <dbl>
## 1 BFC2  Dros… 2019-01-12… 2019-01-12… /tmp… rela… 3665…               NA
## 2 BFC3  link… 2019-01-12… 2019-01-12… /tmp… rela… 3665…               NA
## # … with 2 more variables: etag <chr>, expires <dbl>

bfcremove(bfc, bfcquery(bfc, "linkedTxomeTbl")$rid)
bfcinfo(bfc)

## # A tibble: 1 x 10
##   rid   rname create_time access_time rpath rtype fpath last_modified_t…
##   <chr> <chr> <chr>       <chr>       <chr> <chr> <chr>            <dbl>
## 1 BFC2  Dros… 2019-01-12… 2019-01-12… /tmp… rela… 3665…               NA
## # … with 2 more variables: etag <chr>, expires <dbl>

Other quantifiers

tximeta can import the output from any quantifiers that are supported by tximport, and if these are not Salmon or Sailfish output, it will simply return a un-ranged SummarizedExperiment. We are working to allow manually passing of the hash value of the transcriptome, the cDNA sequences of which can be hashed with FastaDigest (can be installed with pip install fasta_digest).

Next steps

Basic functionality

Switching rowRanges from transcript ranges to exons-by-transcript ranges list, or from gene ranges to exons-by-gene ranges list.
As is already supported in tximport, also import inferential variance matrices (Gibbs samples or bootstrap samples)

Facilitate plots and summaries

Basic plots across samples: abundances, mapping rates, rich bias model parameters
Time summaries: when quantified? when imported? I would love to know when the library was prepared and sequenced but this seems hopeless.

Challenges

Building out actual, sustainable plan for supporting as many organisms and sources as possible. We can define rules which determine where the FASTA and GTF files will be based on source and release (also here we ignored something like “type”, e.g. CHR or ALL gene files from Gencode)
Some support already for linked transcriptomes, see linkedTxomes vignette. Need to work more on combining multiple sources (potentially meta-transcriptomes from different organisms?), and also on how to approach de novo transcriptomes, and how to support reproducibility there.
Facilitate functional annotation, either with vignettes/workflow or with additional functionality. E.g.: housekeeping genes, arbitrary gene sets, genes expressed in GTEx tissues
liftOver is clunky and doesn’t integrate with GenomeInfoDb. It requires user input and there’s a chance to mis-annotate. Ideally this should all be automated.

Session info

library(devtools)
session_info()

## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.2 (2018-12-20)
##  os       Ubuntu 16.04.5 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  C                           
##  ctype    en_US.UTF-8                 
##  tz       America/New_York            
##  date     2019-01-11                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package              * version   date       lib source        
##  acepack                1.4.1     2016-10-29 [2] CRAN (R 3.5.2)
##  annotate               1.60.0    2019-01-11 [2] Bioconductor  
##  AnnotationDbi        * 1.44.0    2019-01-11 [2] Bioconductor  
##  AnnotationFilter       1.6.0     2019-01-11 [2] Bioconductor  
##  assertthat             0.2.0     2017-04-11 [2] CRAN (R 3.5.2)
##  backports              1.1.3     2018-12-14 [2] CRAN (R 3.5.2)
##  base64enc              0.1-3     2015-07-28 [2] CRAN (R 3.5.2)
##  bindr                  0.1.1     2018-03-13 [2] CRAN (R 3.5.2)
##  bindrcpp             * 0.2.2     2018-03-29 [2] CRAN (R 3.5.2)
##  Biobase              * 2.42.0    2019-01-11 [2] Bioconductor  
##  BiocFileCache        * 1.6.0     2019-01-11 [2] Bioconductor  
##  BiocGenerics         * 0.28.0    2019-01-11 [2] Bioconductor  
##  BiocParallel         * 1.16.5    2019-01-11 [2] Bioconductor  
##  biomaRt                2.38.0    2019-01-11 [2] Bioconductor  
##  Biostrings             2.50.2    2019-01-11 [2] Bioconductor  
##  bit                    1.1-14    2018-05-29 [2] CRAN (R 3.5.2)
##  bit64                  0.9-7     2017-05-08 [2] CRAN (R 3.5.2)
##  bitops                 1.0-6     2013-08-17 [2] CRAN (R 3.5.2)
##  blob                   1.1.1     2018-03-25 [2] CRAN (R 3.5.2)
##  callr                  3.1.1     2018-12-21 [2] CRAN (R 3.5.2)
##  checkmate              1.9.0     2019-01-09 [2] CRAN (R 3.5.2)
##  cli                    1.0.1     2018-09-25 [2] CRAN (R 3.5.2)
##  cluster                2.0.7-1   2018-04-13 [2] CRAN (R 3.5.2)
##  colorspace             1.3-2     2016-12-14 [2] CRAN (R 3.5.2)
##  crayon                 1.3.4     2017-09-16 [2] CRAN (R 3.5.2)
##  curl                   3.3       2019-01-10 [2] CRAN (R 3.5.2)
##  data.table             1.11.8    2018-09-30 [2] CRAN (R 3.5.2)
##  DBI                    1.0.0     2018-05-02 [2] CRAN (R 3.5.2)
##  dbplyr               * 1.3.0     2019-01-09 [2] CRAN (R 3.5.2)
##  DelayedArray         * 0.8.0     2019-01-11 [2] Bioconductor  
##  desc                   1.2.0     2018-05-01 [2] CRAN (R 3.5.2)
##  DESeq2               * 1.22.2    2019-01-11 [2] Bioconductor  
##  devtools             * 2.0.1     2018-10-26 [2] CRAN (R 3.5.2)
##  digest                 0.6.18    2018-10-10 [2] CRAN (R 3.5.2)
##  dplyr                  0.7.8     2018-11-10 [2] CRAN (R 3.5.2)
##  ensembldb              2.6.3     2019-01-11 [2] Bioconductor  
##  evaluate               0.12      2018-10-09 [2] CRAN (R 3.5.2)
##  fansi                  0.4.0     2018-10-05 [2] CRAN (R 3.5.2)
##  foreign                0.8-71    2018-07-20 [2] CRAN (R 3.5.2)
##  Formula                1.2-3     2018-05-03 [2] CRAN (R 3.5.2)
##  fs                     1.2.6     2018-08-23 [2] CRAN (R 3.5.2)
##  genefilter             1.64.0    2019-01-11 [2] Bioconductor  
##  geneplotter            1.60.0    2019-01-11 [2] Bioconductor  
##  GenomeInfoDb         * 1.18.1    2019-01-11 [2] Bioconductor  
##  GenomeInfoDbData       1.2.0     2018-12-21 [2] Bioconductor  
##  GenomicAlignments      1.18.1    2019-01-11 [2] Bioconductor  
##  GenomicFeatures        1.34.1    2019-01-11 [2] Bioconductor  
##  GenomicRanges        * 1.34.0    2019-01-11 [2] Bioconductor  
##  ggplot2                3.1.0     2018-10-25 [2] CRAN (R 3.5.2)
##  glue                   1.3.0     2018-07-17 [2] CRAN (R 3.5.2)
##  gridExtra              2.3       2017-09-09 [2] CRAN (R 3.5.2)
##  gtable                 0.2.0     2016-02-26 [2] CRAN (R 3.5.2)
##  Hmisc                  4.1-1     2018-01-03 [2] CRAN (R 3.5.2)
##  hms                    0.4.2     2018-03-10 [2] CRAN (R 3.5.2)
##  htmlTable              1.13.1    2019-01-07 [2] CRAN (R 3.5.2)
##  htmltools              0.3.6     2017-04-28 [2] CRAN (R 3.5.2)
##  htmlwidgets            1.3       2018-09-30 [2] CRAN (R 3.5.2)
##  httr                   1.4.0     2018-12-11 [2] CRAN (R 3.5.2)
##  IRanges              * 2.16.0    2019-01-11 [2] Bioconductor  
##  jsonlite               1.6       2018-12-07 [2] CRAN (R 3.5.2)
##  knitr                  1.21      2018-12-10 [2] CRAN (R 3.5.2)
##  lattice                0.20-38   2018-11-04 [2] CRAN (R 3.5.2)
##  latticeExtra           0.6-28    2016-02-09 [2] CRAN (R 3.5.2)
##  lazyeval               0.2.1     2017-10-29 [2] CRAN (R 3.5.2)
##  locfit                 1.5-9.1   2013-04-20 [2] CRAN (R 3.5.2)
##  magrittr               1.5       2014-11-22 [2] CRAN (R 3.5.2)
##  Matrix                 1.2-15    2018-11-01 [2] CRAN (R 3.5.2)
##  matrixStats          * 0.54.0    2018-07-23 [2] CRAN (R 3.5.2)
##  memoise                1.1.0     2017-04-21 [2] CRAN (R 3.5.2)
##  munsell                0.5.0     2018-06-12 [2] CRAN (R 3.5.2)
##  nnet                   7.3-12    2016-02-02 [2] CRAN (R 3.5.2)
##  org.Dm.eg.db         * 3.7.0     2018-12-21 [2] Bioconductor  
##  pillar                 1.3.1     2018-12-15 [2] CRAN (R 3.5.2)
##  pkgbuild               1.0.2     2018-10-16 [2] CRAN (R 3.5.2)
##  pkgconfig              2.0.2     2018-08-16 [2] CRAN (R 3.5.2)
##  pkgload                1.0.2     2018-10-29 [2] CRAN (R 3.5.2)
##  plyr                   1.8.4     2016-06-08 [2] CRAN (R 3.5.2)
##  prettyunits            1.0.2     2015-07-13 [2] CRAN (R 3.5.2)
##  processx               3.2.1     2018-12-05 [2] CRAN (R 3.5.2)
##  progress               1.2.0     2018-06-14 [2] CRAN (R 3.5.2)
##  ProtGenerics           1.14.0    2019-01-11 [2] Bioconductor  
##  ps                     1.3.0     2018-12-21 [2] CRAN (R 3.5.2)
##  purrr                  0.2.5     2018-05-29 [2] CRAN (R 3.5.2)
##  R6                     2.3.0     2018-10-04 [2] CRAN (R 3.5.2)
##  rappdirs               0.3.1     2016-03-28 [2] CRAN (R 3.5.2)
##  RColorBrewer           1.1-2     2014-12-07 [2] CRAN (R 3.5.2)
##  Rcpp                   1.0.0     2018-11-07 [2] CRAN (R 3.5.2)
##  RCurl                  1.95-4.11 2018-07-15 [2] CRAN (R 3.5.2)
##  readr                  1.3.1     2018-12-21 [2] CRAN (R 3.5.2)
##  remotes                2.0.2     2018-10-30 [2] CRAN (R 3.5.2)
##  rlang                  0.3.1     2019-01-08 [2] CRAN (R 3.5.2)
##  rmarkdown              1.11      2018-12-08 [2] CRAN (R 3.5.2)
##  rpart                  4.1-13    2018-02-23 [2] CRAN (R 3.5.2)
##  rprojroot              1.3-2     2018-01-03 [2] CRAN (R 3.5.2)
##  Rsamtools              1.34.0    2019-01-11 [2] Bioconductor  
##  RSQLite                2.1.1     2018-05-06 [2] CRAN (R 3.5.2)
##  rstudioapi             0.9.0     2019-01-09 [2] CRAN (R 3.5.2)
##  rtracklayer            1.42.1    2019-01-11 [2] Bioconductor  
##  S4Vectors            * 0.20.1    2019-01-11 [2] Bioconductor  
##  scales                 1.0.0     2018-08-09 [2] CRAN (R 3.5.2)
##  sessioninfo            1.1.1     2018-11-05 [2] CRAN (R 3.5.2)
##  stringi                1.2.4     2018-07-20 [2] CRAN (R 3.5.2)
##  stringr                1.3.1     2018-05-10 [2] CRAN (R 3.5.2)
##  SummarizedExperiment * 1.12.0    2019-01-11 [2] Bioconductor  
##  survival               2.43-3    2018-11-26 [2] CRAN (R 3.5.2)
##  testthat               2.0.1     2018-10-13 [2] CRAN (R 3.5.2)
##  tibble                 2.0.0     2019-01-04 [2] CRAN (R 3.5.2)
##  tidyselect             0.2.5     2018-10-11 [2] CRAN (R 3.5.2)
##  tximeta              * 1.0.3     2019-01-11 [1] Bioconductor  
##  tximport               1.10.1    2019-01-11 [2] Bioconductor  
##  usethis              * 1.4.0     2018-08-14 [2] CRAN (R 3.5.2)
##  utf8                   1.1.4     2018-05-24 [2] CRAN (R 3.5.2)
##  withr                  2.1.2     2018-03-15 [2] CRAN (R 3.5.2)
##  xfun                   0.4       2018-10-23 [2] CRAN (R 3.5.2)
##  XML                    3.98-1.16 2018-08-19 [2] CRAN (R 3.5.2)
##  xtable                 1.8-3     2018-08-29 [2] CRAN (R 3.5.2)
##  XVector                0.22.0    2019-01-11 [2] Bioconductor  
##  yaml                   2.2.0     2018-07-25 [2] CRAN (R 3.5.2)
##  zlibbioc               1.28.0    2019-01-11 [2] Bioconductor  
## 
## [1] /tmp/Rtmp0xZqIJ/Rinst307b3ac45736
## [2] /home/biocbuild/bbs-3.8-bioc/R/library

tximeta: transcript quantification import with automatic metadata

Michael Love, Rob Patro, Charlotte Soneson, Peter Hickey

01/11/2019

Analysis starts with sample table

Running tximeta from a sample table

What happened?

Examining SummarizedExperiment output

Easy summarization to gene-level

Add different identifiers

Run a differential expression analysis

Metadata galore

Quantification files with an unknown transcriptome

Linked transcriptome for reproducible analysis

Clear linkedTxomes

Loading linkedTxome JSON files

Clear linkedTxomes

Other quantifiers

Next steps

Basic functionality

Facilitate plots and summaries

Challenges

Session info