R version: R version 3.5.2 (2018-12-20)
Bioconductor version: 3.8
Package: 1.0.1
Publication DOI: 10.12688/f1000research.15398.3
If you have questions about this workflow or any Bioconductor
software, please post these to the
Bioconductor support site.
If the questions concern a specific package, you can tag the post with
the name of the package, or for general questions about the workflow,
tag the post with rnaseqdtu. Note the posting guide for crafting an optimal question for the support site.
RNA-seq experiments can be analyzed to detect differences across groups of samples in total gene expression – the total expression produced by all isoforms of a gene – and additionally differences in transcript isoform usage within a gene. If expression switches among two or more isoforms of a gene, the total gene expression may not change by a detectable amount, but the differential transcript usage (DTU) is nevertheless biologically relevant. DTU is common when comparing expression across cell types: a recent analysis of the Genotype-Tissue Expression (GTEx) Project dataset (GTEx Consortium et al. 2017) demonstrated that half of all expressed genes contained tissue-specific isoforms (Reyes and Huber 2018). DTU may produce functionally different gene products through alternative splicing and changes to the coding sequence of the transcript, and it may result in transcripts with different untranslated regions on the 5’ or 3’ end, which may affect binding sites of RNA binding proteins. Reyes and Huber (2018) found that alternative usage of transcription start and termination sites was a more common event than alternative splicing when examining DTU events across tissues in GTEx. Specific patterns of DTU have been identified in a number of diseases, including cancer, retinal diseases, and neurological disorders, among others (Scotti and Swanson 2018). Large-scale analyses of cancer transcriptomic data, comparing tumor to normal samples, have identified that protein domain losses are a common feature of DTU in cancer, including domains involved in protein-protein interactions (Vitting-Seerup and Sandelin 2017; Climente-González et al. 2017).
While many tutorials and workflows in the
Bioconductor project address differential gene expression, there are
fewer workflows for performing a differential transcript usage analysis,
which provides critical and complementary information to a
gene-level analysis. Some of the existing Bioconductor packages and
functions that can be used for statistical analysis of DTU include
DEXSeq (originally designed for differential exon usage) (Anders, Reyes, and Huber 2012),
diffSpliceDGE
from the edgeR package (Robinson, McCarthy, and Smyth 2009; McCarthy, Chen, and Smyth 2012),
diffSplice
from the limma package (Smyth 2004; Law et al. 2014), and
DRIMSeq (Nowicka and Robinson 2016).
Other Bioconductor packages which are designed around a DTU analysis include
stageR (Van den Berge et al. 2017), SGSeq (Goldstein et al. 2016), and
IsoformSwitchAnalyzeR (Vitting-Seerup and Sandelin 2018).
stageR can be used for post-processing of transcript-
and gene-level p-values from DTU detection methods, and will be discussed in this workflow.
SGSeq can be used to visualize
splice events, and leverages DEXSeq or limma
for differential testing of splice variant usage.
The Bioconductor package IsoformSwitchAnalyzeR is
well documented and the vignette available from the
IsoformSwitchAnalyzeR landing page
can be seen as an alternative to this workflow.
IsoformSwitchAnalyzeR is designed for the downstream analysis of
functional consequences of identified isoform switches.
It allows for import of data from various
quantification methods, including Salmon, and allows for statistical
inference using DRIMSeq. In addition, IsoformSwitchAnalyzeR includes
functions for obtaining the nucleotide and amino acid sequence
consequences of isoform switching, which is not covered in this workflow.
Other packages related to splicing can be found at the BiocViews links for
DifferentialSplicing and
AlternativeSplicing.
For more information about the Bioconductor project and its core
infrastructure, please refer to the overview by Huber et al. (2015).
We note that there are numerous other methods for detecting differential transcript usage outside of the Bioconductor project. The DRIMSeq publication is a good reference for these, having descriptions and comparisons with many current methods (Nowicka and Robinson 2016). This workflow will build on the methods and vignettes from three Bioconductor packages: DRIMSeq, DEXSeq, and stageR.
Previously, some of the developers of the Bioconductor packages edgeR and DESeq2 collaborated to develop the tximport package (Soneson, Love, and Robinson 2016) for summarizing the output of fast transcript-level quantifiers, such as Salmon (Patro et al. 2017), Sailfish (Patro, Mount, and Kingsford 2014), and kallisto (Bray et al. 2016). The tximport package focuses on preparing estimated transcript-level counts, abundances, and effective transcript lengths for gene-level statistical analysis using edgeR (Robinson, McCarthy, and Smyth 2009), DESeq2 (Love, Huber, and Anders 2014), or limma-voom (Law et al. 2014). tximport produces an offset matrix to accompany gene-level counts, which accounts for a number of RNA-seq biases as well as differences in transcript usage among transcripts of different length that would bias an estimator of gene fold change based on the gene-level counts (Trapnell et al. 2013). tximport can alternatively produce a matrix of data that is roughly on the scale of counts, by scaling transcript-per-million (TPM) abundances to add up to the total number of mapped reads. This counts-from-abundance approach directly corrects for technical biases and differential transcript usage across samples, obviating the need for the accompanying offset matrix.
Complementary to an analysis of differential gene expression, one can use tximport to import transcript-level estimated counts, and then pass these counts to packages such as DRIMSeq or DEXSeq for statistical analysis of differential transcript usage. Following a transcript-level analysis, one can aggregate evidence of differential transcript usage to the gene level. The stageR package in Bioconductor provides a statistical framework to screen at the gene-level for differential transcript usage with gene-level adjusted p-values, followed by confirmation of which transcripts within the significant genes show differential usage with transcript-level adjusted p-values (Van den Berge et al. 2017). The method controls the overall false discovery rate (OFDR) (Heller et al. 2009) for such a two-stage procedure, which will be discussed in more detail later in the workflow. We believe that stageR represents a principled approach to analyzing transcript usage changes, as the methods can be evaluated against a target error rate in a manner that mimics how the methods will be used in practice. That is, following rejection of the null hypothesis at the gene-level, investigators would likely desire to know which transcripts within a gene participated in the differential usage.
Here we provide a basic workflow for detecting differential transcript usage using Bioconductor packages, following quantification of transcript abundance using the Salmon method (Figure 1). This workflow includes live, runnable code chunks for analysis using DRIMSeq and DEXSeq, as well as for performing stage-wise testing of differential transcript usage using the stageR package. For the workflow, we use data that is simulated, so that we can also evaluate the performance of methods for differential transcript usage, as well as differential gene and transcript expression. The simulation was constructed using distributional parameters estimated from the GEUVADIS project RNA-seq dataset (Lappalainen et al. 2013) quantified by the recount2 project (Collado-Torres et al. 2017), including the expression levels of the transcripts, the amount of biological variability of gene expression levels across samples, and realistic coverage of reads along the transcript.
Note that a published version of this workflow exists in the Bioconductor channel of the journal F1000Research. The published version additionally contains evaluations of the methods shown here alongside other popular methods for DTU, DGE, and DTE.
First we describe details of the simulated data, which will be used in the following workflow. Understanding the details of the simulation will be useful for assessing the methods in the later sections. All of the code used to simulate RNA-seq experiments and write paired-end reads to FASTQ files can be found at an associated GitHub repository for the simulation code (Love 2018a), and the reads and quantification files can be downloaded from Zenodo (Love 2018b). Salmon (Patro et al. 2017) was used to estimate transcript-level abundances for a single sample (ERR188297) of the GEUVADIS project (Lappalainen et al. 2013), and this was used as a baseline for transcript abundances in the simulation. Transcripts with estimated counts less than 10 had their abundance thresholded to 0; all other transcripts were considered “expressed” (n=46,933). alpine (Love, Hogenesch, and Irizarry 2016) was used to estimate realistic fragment GC bias from 12 samples from the GEUVADIS project, all from the same sequencing center (the first 12 samples from CNAG-CRG in Supplementary Table 2 from Love, Hogenesch, and Irizarry (2016)). DESeq2 (Love, Huber, and Anders 2014) was used to estimate mean and dispersion parameters for a Negative Binomial distribution for gene-level counts for 458 GEUVADIS samples provided by the recount2 project (Collado-Torres et al. 2017). Note that, while gene-level dispersion estimates were used to generate underlying transcript-level counts, additional uncertainty on the transcript-level data is a natural consequence of the simulation, as the transcript-level counts must be estimated (the underlying transcript counts are not provided to the methods).
polyester (Frazee et al. 2015) was used to simulate paired-end RNA-seq reads for two groups of 12 samples each, with realistic fragment GC bias, and with dispersion on transcript-level counts drawn from the joint distribution of mean and dispersion values estimated from the GEUVADIS samples. To compare DRIMSeq and DEXSeq in further detail, we generated an additional simulation in which dispersion parameters were assigned to genes via matching on the gene-level count, and then all transcripts of a gene had counts generated using the same per-gene dispersion. The first sample for group 1 and the first sample for group 2 followed the realistic GC bias profile of the same GEUVADIS sample, and so on for all 12 samples. This pairing of the samples was used to generate balanced data, but not used in the statistical analysis. countsimQC (Soneson and Robinson 2017) was used to examine the properties of the simulation relative to the dataset used for parameter estimation, and the full report can be accessed at the associated GitHub repository for simulation code (Love 2018a).
Differential expression across two groups was generated as follows: The 46,933 expressed transcripts as defined above belonged to 15,017 genes. 70% (n=10,514) of these genes with expressed transcripts were set as null genes, where abundance was not changed across the two groups. For 10% (n=1,501) of genes, all isoforms were differentially expressed at a log fold change between 1 and 2.58 (fold change between 2 and 6). The set of transcripts in these genes was classified as DGE (differential gene expression) by construction, and the expressed transcripts were also DTE (differential transcript expression), but they did not count as DTU (differential transcript usage), as the proportions within the gene remained constant. To simulate balanced differential expression, one of the two groups was randomly chosen to be the baseline, and the other group would have its counts multiplied by the fold change. For 10% (n=1,501) of genes, a single expressed isoform was differentially expressed at a log fold change between 1 and 2.58. This set of transcripts was DTE by construction. If the chosen transcript was the only expressed isoform of a gene, this counted also as DGE and not as DTU, but if there were other isoforms that were expressed, this counted for both DGE and DTU, as the proportion of expression among the isoforms was affected. For 10% (n=1,501) of genes, differential transcript usage was constructed by exchanging the TPM abundance of two expressed isoforms, or, if only one isoform was expressed, exchanging the abundance of the expressed isoform with a non-expressed one. This counted for DTU and DTE, but not for DGE. An MA plot of the simulated transcript abundances for the two groups is shown in Figure 2.
This workflow was designed to work with R 3.5 or higher, and the DRIMSeq, DEXSeq, stageR, and tximport packages for Bioconductor version 3.7 or higher. Bioconductor packages should always be installed following the official instructions. The workflow uses a subset of all genes to speed up the analysis, but the Bioconductor packages can easily be run for this dataset on all human genes on a laptop in less than an hour. Timing for the various packages is included within each section.
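For reference, installation following the current official instructions typically looks like the following sketch (the exact package list is our assumption, matching the packages used in this workflow):
if (!requireNamespace("BiocManager", quietly=TRUE))
  install.packages("BiocManager")
BiocManager::install(c("DRIMSeq", "DEXSeq", "stageR", "tximport"))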
As described in the Introduction, this workflow uses transcript-level quantification estimates produced by Salmon (Patro et al. 2017) and imported into R/Bioconductor with tximport (Soneson, Love, and Robinson 2016). Details about how to run Salmon, and which type of transcript-level estimated counts should be imported, is covered in the Workflow, with the exact code used to run the DTU analysis. Salmon estimates the relative abundance of each annotated transcript for each sample in units of transcripts-per-million (TPM); the estimated TPM values should be proportional to the abundance of the transcripts in the population of cells that were assayed. One critical point is that Salmon only considers the transcripts that are provided in the annotation; it is not able to detect expression of any novel transcripts. If many un-annotated transcripts are expressed in the particular set of samples, successful application of this workflow may require first building out a more representative set of annotated transcripts via transcriptome assembly or full transcript sequencing.
In addition to relative abundance, Salmon estimates the effective length of each transcript, which can take into account a number of sample-specific technical biases including fragment size distribution (default), fragment GC content, random hexamer priming bias, and positional bias along the transcript. If a transcript had certain properties, related to its length and its sequence content, that made it difficult to produce cDNA fragments and sequence reads from these fragments, then its effective length should be lower for that sample, than if these technical biases were absent. The estimates of TPM and the effective lengths can be used to estimate the number of fragments that each transcript produced, which will be called estimated counts in this workflow. Different types of estimated counts may be correlated with effective transcript length, and so we will discuss in the Workflow our recommended type to use for DTU and DGE analysis (also diagrammed in Figure 1).
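Schematically (a simplified sketch that ignores the details of Salmon’s inference), the estimated count for transcript \(i\) in a sample with \(N\) mapped fragments relates to its TPM and effective length \(\tilde{\ell}_i\) as
\[ \widehat{\textrm{count}}_i \approx \frac{\textrm{TPM}_i \cdot \tilde{\ell}_i}{\sum_j \textrm{TPM}_j \cdot \tilde{\ell}_j} \cdot N, \]
which makes explicit why estimated counts, unlike TPM, tend to scale with effective transcript length.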
We focus on two statistical models for DTU testing in this workflow, DRIMSeq (Nowicka and Robinson 2016) and DEXSeq (Anders, Reyes, and Huber 2012). DEXSeq was published first, as a statistical model for detecting differences in exon usage across samples in different conditions. Most of the DEXSeq functions and documentation refer specifically to exons or exonic parts within a gene, while the final results table refers more generally to these as features within a group. It is this more general usage that we will employ in this workflow, substituting estimated transcript counts in place of exonic part counts.
DEXSeq assumes a Negative Binomial (NB) distribution for the feature counts, and considers the counts for each feature (originally, the exonic parts) relative to counts for all other features of the group (the gene), using an interaction term in a generalized linear model (GLM). The GLM framework is an extension of the linear model (LM), but shares with LM the usage of a design matrix, typically represented by \(\mathbf{X}\), which is made up of columns of covariates that are multiplied by scalar coefficients, typically represented by \(\mathbf{\beta}\). The design matrix with its multiple coefficients is useful for extending statistical models beyond simple group comparisons, allowing for more complex situations, such as within-patient comparisons, batch correction, or testing of ratios.
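As a small illustration of the design matrix idea (a sketch only; the batch variable here is hypothetical and not part of this workflow’s simulated data), a two-group comparison with a batch covariate could be encoded as follows:
# hypothetical sample table: two conditions, two batches
coldata <- data.frame(condition=factor(rep(c("1","2"), each=3)),
                      batch=factor(rep(c("a","b"), times=3)))
# columns: (Intercept), batchb, condition2 -- each multiplied by a coefficient
model.matrix(~batch + condition, data=coldata)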
DEXSeq models each feature independently in the GLM fitting stage, and so does not take into account any correlation among the counts across features in a group (e.g. transcript counts within a gene), except insofar as such correlation is accounted for by columns in the design matrix. This last point is important, as correlation induced by DTU across condition groups, or even DTU that can be associated with cell-type heterogeneity, can be modeled in the DEXSeq framework by interaction terms with relevant covariates introduced into the design matrix. In the current workflow, we provide an example of capturing DTU across condition using DEXSeq, but future iterations of this workflow may also include controlling for additional covariates, such as estimated cell type proportions. DEXSeq was evaluated on transcript counts by Soneson et al. (2016) and then later by Nowicka and Robinson (2016), where it was shown in both cases that DEXSeq could be used to detect DTU in this configuration, though isoform pre-filtering greatly improved its performance in reducing the observed false discovery rate (FDR) closer to its nominal level.
In contrast to the NB model, DRIMSeq assumes a Dirichlet Multinomial (DM) model for each gene, where the total count for the gene is considered fixed, and the quantity of interest is the proportion of each transcript within the gene for each sample. If the proportion for one transcript increases, it must result in a decrease in the proportions of the other transcripts of the gene. Genes that are detected as having statistically significant DTU are those in which the proportion changes observed across condition were large, considering the variation in proportions seen within condition. The variation in proportions across biological replicates is modeled using a single precision parameter per gene, which will be discussed in the workflow below. DRIMSeq also uses a design matrix, and so can be used to analyze DTU for complex experimental designs, including within-patient comparisons, batch correction, or testing of ratios.
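To make the quantity of interest concrete, the following toy sketch (hypothetical counts, not drawn from the workflow data) shows a usage switch in which the per-sample proportions change while the gene total stays constant:
# two transcripts of one gene, counts for two samples (toy values)
cts.toy <- matrix(c(80, 20,
                    30, 70),
                  nrow=2, dimnames=list(c("txA","txB"), c("s1","s2")))
# per-sample proportions sum to 1 within the gene
prop.table(cts.toy, margin=2)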
A critical difference between the two models is that DRIMSeq directly models the correlations among transcript counts within a gene via the DM distribution, and so can capture these correlations across exchangeable samples within a condition. DEXSeq with the NB distribution can only capture correlations among transcript counts within a gene when the DTU is across sample groups defined by covariates in the design matrix. On the other hand, DRIMSeq is limited to modeling a single precision parameter per gene. If there are two transcripts within a gene with very different biological variability – perhaps they have separate promoters under different regulatory control – then DEXSeq may do a better job modeling such heterogeneity in the biological variability of transcript expression, as it estimates a separate dispersion parameter for each transcript.
We used Salmon version 0.10.0 to quantify abundance and effective transcript lengths for all of the 24 simulated samples. For this workflow, we will use the first six samples from each group. We quantified against the GENCODE human annotation version 28, which was the same reference used to generate the simulated reads. We used the transcript sequences FASTA file that contains “Nucleotide sequences of all transcripts on the reference chromosomes”. When downloading the FASTA file, it is useful to download the corresponding GTF file, as this will be used in later sections.
To build the Salmon index, we used the following command. Recent versions of Salmon will discard transcripts that are identical sequence duplicates, and keep a log of these within the index directory. The --gencode flag trims the GENCODE FASTA headers to include only the transcript identifier.
salmon index --gencode -t gencode.v28.transcripts.fa -i gencode.v28_salmon-0.10.0
To quantify each sample, we used the following command, which specifies quantification with six threads using the GENCODE index, with inward and unstranded paired-end reads, using fragment GC bias correction, writing output to the directory sample, and using as input these two read files. The library type is specified by -l IU (inward and unstranded), and the options are discussed in the Salmon documentation. Recent versions of Salmon can automatically detect the library type by setting -l A. Such a command can be automated in a bash loop using bash variables, or one can use a more advanced workflow management system such as Snakemake (Köster and Rahmann 2012) or Nextflow (Di Tommaso et al. 2017).
salmon quant -p 6 -i gencode.v28_salmon-0.10.0 -l IU \
--gcBias -o sample -1 sample_1.fa.gz -2 sample_2.fa.gz
We can use tximport to import the estimated counts, abundances and effective transcript lengths into R. The tximport package offers three methods for producing count matrices from transcript-level quantification files, as described in detail in Soneson, Love, and Robinson (2016), and already mentioned in the introduction. To recap, these are: (1) original estimated counts with effective transcript length information used as a statistical offset, (2) “counts from abundance” generated by scaling transcript-per-million (TPM) abundance estimates per sample such that they sum to the total number of mapped reads (scaledTPM), or (3) “counts from abundance” generated by scaling TPM abundance estimates first by the average effective transcript length over samples, and then per sample such that they sum to the total number of mapped reads (lengthScaledTPM). We will use scaledTPM for DTU detection, with the statistical motivation described below, and the original estimated counts with length offset for DGE detection.
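For reference, the three methods differ only in the countsFromAbundance argument of tximport; the following unevaluated sketch assumes files is the named vector of paths to quant.sf files constructed below:
library(tximport)
# (1) original estimated counts (pair with length offsets for gene-level DGE)
txi.no <- tximport(files, type="salmon", txOut=TRUE, countsFromAbundance="no")
# (2) scaledTPM: abundances scaled to library size (used for DTU below)
txi.stpm <- tximport(files, type="salmon", txOut=TRUE, countsFromAbundance="scaledTPM")
# (3) lengthScaledTPM: scaled by average transcript length, then library size
txi.lstpm <- tximport(files, type="salmon", txOut=TRUE, countsFromAbundance="lengthScaledTPM")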
We recommend constructing a CSV file that keeps track of the sample identifiers and any relevant variables, e.g. condition, time point, batch, and so on. Here we have made a sample CSV file and provided it along with this workflow’s R package, rnaseqDTU. If a user is applying the code in this workflow with her own RNA-seq data, she does not need to load the rnaseqDTU package. If a user is running through the code in this workflow with the workflow simulated data, she does need to load the rnaseqDTU package.
In order to find this CSV file, we first need to know where on the
machine this workflow package lives, so we can point to the extdata
directory where the CSV file is located. These two lines of code load
the workflow package and find this directory on the machine. Again,
these two lines of code would therefore not be part of a typical
analysis using one’s own RNA-seq data.
library(rnaseqDTU)
csv.dir <- system.file("extdata", package="rnaseqDTU")
The CSV file records which samples are condition 1 and which are
condition 2. The columns of this CSV file can have any names, although
sample_id
will be used later by DRIMSeq, and so using this column
name allows us to pass this data.frame directly to DRIMSeq at a
later step.
samps <- read.csv(file.path(csv.dir, "samples.csv"))
head(samps)
## sample_id condition
## 1 s1_1 1
## 2 s2_1 1
## 3 s3_1 1
## 4 s4_1 1
## 5 s5_1 1
## 6 s6_1 1
samps$condition <- factor(samps$condition)
table(samps$condition)
##
## 1 2
## 6 6
files <- file.path("/path/to/dir", samps$sample_id, "quant.sf")
names(files) <- samps$sample_id
head(files)
## s1_1 s2_1
## "/path/to/dir/s1_1/quant.sf" "/path/to/dir/s2_1/quant.sf"
## s3_1 s4_1
## "/path/to/dir/s3_1/quant.sf" "/path/to/dir/s4_1/quant.sf"
## s5_1 s6_1
## "/path/to/dir/s5_1/quant.sf" "/path/to/dir/s6_1/quant.sf"
We can then import transcript-level counts using tximport.
For DTU analysis, we suggest generating counts from abundance, using the
scaledTPM
method described by Soneson, Love, and Robinson (2016).
The countsFromAbundance
option of tximport uses estimated
abundances to generate roughly count-scaled data, such that each
column will sum to the number of reads mapped for that library.
By using scaledTPM
counts, the estimated proportions
fit by DRIMSeq, which are generated from counts, will be
equivalent to proportions of the abundance of the isoforms.
If instead of scaledTPM, we used the original estimated transcript counts (countsFromAbundance="no"), or if we used lengthScaledTPM
transcript counts, then a change in transcript usage among
transcripts of different length could result in a changed total count
for the gene, even if there is no change in total gene
expression. For more detail and a diagram of this
effect, we refer the reader to Figure 1 of Trapnell et al. (2013).
Briefly, this is because the original transcript counts and
lengthScaledTPM
transcript counts scale with transcript
length, while scaledTPM
transcript counts do not.
A change in the total count for the gene, not corrected by an offset, could then
bias the calculation of proportions and therefore confound DTU analysis.
For testing DTU using DRIMSeq and DEXSeq, it is convenient if the count-scale data
do not scale with transcript length within a gene.
This could in principle be corrected by an offset, but such a correction is not
easily implemented in the current DTU analysis packages.
For gene-level analysis (DGE), we can use gene-level original
counts with an average transcript length offset,
gene-level scaledTPM
counts, or gene-level lengthScaledTPM
counts,
as all of these three methods control for the length bias
described by Trapnell et al. (2013) and Soneson, Love, and Robinson (2016).
Our DTU and DGE countsFromAbundance
recommendations are
summarized in Figure 1.
A final note is that the motivation for using scaledTPM
counts
hinges on the fact that estimated fragment counts scale with
transcript length in fragmented RNA-seq data. If a different experiment
is performed and a different quantification method used to produce
counts per transcript which do not scale with transcript length,
then the recommendation would be to use these counts per transcript directly.
Examples of experiments producing counts per transcript that would potentially
not scale with transcript length include
counts of full-transcript-length or nearly-full-transcript-length reads,
or counts of 3’ tagged RNA-seq reads aggregated to transcript groups.
In either case, the statistical methods for DTU could be provided directly
with the transcript counts.
The following code chunk is what one would use in a typical analysis, but is not evaluated in this workflow because the quantification files are not provided in the rnaseqDTU package due to size restrictions. Instead we will load a pre-constructed matrix of counts below. In a typical workflow, the code below would be used to generate the matrix of counts from the quantification files. We have also trimmed the counts matrix to two decimal places per count (which is more than enough precision on estimated counts). All of the quantification files and simulated reads for this dataset have been made publicly available on Zenodo; see the Data availability section at the end of this workflow.
library(tximport)
txi <- tximport(files, type="salmon", txOut=TRUE,
countsFromAbundance="scaledTPM")
cts <- txi$counts
cts <- cts[rowSums(cts) > 0,]
Bioconductor offers numerous approaches for building a TxDb object,
a transcript database that can be used to link transcripts to genes
(among other uses).
The following code chunks were used to generate a TxDb,
and then used the select
function with the TxDb to produce
a corresponding data.frame called txdf
which links transcript IDs
to gene IDs. In this TxDb, the transcript IDs are called TXNAME and the gene IDs are called GENEID. The version 28 human GTF file was
downloaded from the GENCODE website when downloading the transcripts
FASTA file.
Due to size restrictions, neither the gencode.v28.annotation.gtf.gz file nor the generated .sqlite file is included in the rnaseqDTU package.
library(GenomicFeatures)
gtf <- "gencode.v28.annotation.gtf.gz"
txdb.filename <- "gencode.v28.annotation.sqlite"
txdb <- makeTxDbFromGFF(gtf)
saveDb(txdb, txdb.filename)
Once the TxDb database has been generated and saved, it can be quickly reloaded:
txdb <- loadDb(txdb.filename)
txdf <- select(txdb, keys(txdb, "GENEID"), "TXNAME", "GENEID")
tab <- table(txdf$GENEID)
txdf$ntx <- tab[match(txdf$GENEID, names(tab))]
We load the cts
object as created in the tximport code chunks. This
contains count-scale data, generated from abundance using the
scaledTPM
method. The column sums are equal to the number of mapped
paired-end reads per experiment. The experiments have between 31 and
38 million paired-end reads that were mapped to the transcriptome
using Salmon.
data(salmon_cts)
cts[1:3,1:3]
## s1_1 s2_1 s3_1
## ENST00000488147.1 179.8 184.44 229.05
## ENST00000469289.1 0.0 0.00 0.00
## ENST00000466430.5 5.0 3.63 9.46
range(colSums(cts)/1e6)
## [1] 31.37738 38.47173
We also have the txdf
object giving the transcript-to-gene
mappings (for construction, see previous section).
This is contained in a file called simulate.rda, which contains a number of R objects with information about the simulation that we will use later to assess the methods’ performance.
data(simulate)
head(txdf)
## GENEID TXNAME ntx
## 1 ENSG00000000003.14 ENST00000612152.4 5
## 2 ENSG00000000003.14 ENST00000373020.8 5
## 3 ENSG00000000003.14 ENST00000614008.4 5
## 4 ENSG00000000003.14 ENST00000496771.5 5
## 5 ENSG00000000003.14 ENST00000494424.1 5
## 6 ENSG00000000005.5 ENST00000373031.4 2
all(rownames(cts) %in% txdf$TXNAME)
## [1] TRUE
txdf <- txdf[match(rownames(cts),txdf$TXNAME),]
all(rownames(cts) == txdf$TXNAME)
## [1] TRUE
In order to run DRIMSeq, we build a data.frame with the gene ID, the feature (transcript) ID, and then columns for each of the samples:
counts <- data.frame(gene_id=txdf$GENEID,
feature_id=txdf$TXNAME,
cts)
We can now load the DRIMSeq package and create a dmDSdata object,
with our counts
and samps
data.frames. Typing in the object name
and pressing return will give information about the number of genes:
library(DRIMSeq)
d <- dmDSdata(counts=counts, samples=samps)
d
## An object of class dmDSdata
## with 16612 genes and 12 samples
## * data accessors: counts(), samples()
The dmDSdata object has a number of specific methods. Note that the rows of the object are gene-oriented, so pulling out the first row corresponds to all of the transcripts of the first gene:
methods(class=class(d))
## [1] [ counts dmFilter dmPrecision length names
## [7] plotData show
## see '?methods' for accessing help and source code
counts(d[1,])[,1:4]
## gene_id feature_id s1_1 s2_1
## 1 ENSG00000000419.12 ENST00000371588.9 1394.71 1210.13
## 2 ENSG00000000419.12 ENST00000466152.5 135.16 18.20
## 3 ENSG00000000419.12 ENST00000371582.8 154.78 35.39
## 4 ENSG00000000419.12 ENST00000371584.8 42.86 86.05
## 5 ENSG00000000419.12 ENST00000413082.1 0.00 0.00
It will be useful to first filter the object, before running
procedures to estimate model parameters. This greatly speeds up the
fitting and removes transcripts that may be troublesome for
parameter estimation, e.g. estimating the proportion of expression
among the transcripts of a gene when the total count is very low. We first
define n
to be the total number of samples, and n.small
to be the
sample size of the smallest group. We use all
three of the possible filters: for a transcript to be retained in the dataset,
we require that (1) it has a count of at least 10 in at least n.small
samples, (2) it has a relative abundance proportion of at least 0.1 in
at least n.small
samples, and (3) the total count of the corresponding gene
is at least 10 in all n
samples. We used all three possible filters,
whereas only the two count
filters are used in the DRIMSeq vignette example code.
It is important to consider what types of transcripts may be removed
by the filters, and potentially adjust depending on the dataset. If
n
was large, it would make sense to allow perhaps a few samples to
have very low counts, so lowering min_samps_gene_expr
to some
factor multiple (\(< 1\)) of n
, and likewise for the first two filters for
n.small
. The second filter means that if a transcript does not make
up more than 10% of the gene’s expression for at least n.small
samples, it will be removed. If this proportion seems too high, for
example, if very lowly expressed isoforms are of particular interest,
then the filter can be omitted or the min_feature_prop
lowered. For
a concrete example, if a transcript goes from a proportion of 0% in the
control group to a proportion of 9% in the treatment group, this would
be removed by the above 10% filter. After filtering, this dataset has
7,764 genes.
n <- 12
n.small <- 6
d <- dmFilter(d,
min_samps_feature_expr=n.small, min_feature_expr=10,
min_samps_feature_prop=n.small, min_feature_prop=0.1,
min_samps_gene_expr=n, min_gene_expr=10)
d
## An object of class dmDSdata
## with 7764 genes and 12 samples
## * data accessors: counts(), samples()
The dmDSdata object only contains genes that have more than one isoform, which makes sense as we are testing for differential transcript usage. We can find out how many of the remaining genes have \(N\) isoforms by tabulating the number of times we see a gene ID, then tabulating the output again:
table(table(counts(d)$gene_id))
##
## 2 3 4 5 6 7
## 4062 2514 931 222 34 1
We create a design matrix, using a design formula and the sample information contained in the object, accessed via samples. Here we use a simple design with just two groups, but more complex designs are possible. For some discussion of complex designs, one can refer to the vignettes of the limma, edgeR, or DESeq2 packages.
design_full <- model.matrix(~condition, data=DRIMSeq::samples(d))
colnames(design_full)
## [1] "(Intercept)" "condition2"
Only for speeding up running the live code chunks in this workflow, we subset to the first 250 genes, representing about one thirtieth of the dataset. This step would not be run in a typical workflow.
d <- d[1:250,]
7764 / 250
## [1] 31.056
We then use the following three functions to estimate the model
parameters and test for DTU. We first estimate the precision, which is related to the
dispersion in the Dirichlet Multinomial model via the formula below. Because
precision is in the denominator of the right hand side of the
equation, they are inversely related. Higher dispersion – counts more
variable around their expected value – is associated with lower
precision. For full details about the DRIMSeq model, one should read
both the detailed software vignette and the publication
(Nowicka and Robinson 2016). After estimating the precision, we fit regression coefficients
and perform null hypothesis testing on the coefficient of
interest. Because we have a simple two-group model, we test the
coefficient associated with the difference between condition 2 and
condition 1, called condition2
. The following code takes about half
a minute, and so a full analysis on this dataset takes about 15
minutes on a laptop.
\[ \textrm{dispersion} = \frac{1}{1 + \textrm{precision}} \]
set.seed(1)
system.time({
d <- dmPrecision(d, design=design_full)
d <- dmFit(d, design=design_full)
d <- dmTest(d, coef="condition2")
})
## ! Using a subset of 0.1 genes to estimate common precision !
## ! Using common_precision = 21.2862 as prec_init !
## ! Using 0 as a shrinkage factor !
## user system elapsed
## 31.120 0.396 31.581
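As an optional check (a small sketch; common_precision and genewise_precision are DRIMSeq accessors), the estimated precisions can be inspected and converted to the dispersion scale using the formula above:
# single common precision estimate
common_precision(d)
# data.frame of per-gene precision estimates
prec <- genewise_precision(d)
# implied per-gene dispersions
head(1 / (1 + prec$genewise_precision))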
To build a results table, we run the results
function. We can
generate a single p-value per gene, which tests whether there is any
differential transcript usage within the gene, or a single p-value per transcript,
which tests whether the proportions for this transcript changed within
the gene:
res <- DRIMSeq::results(d)
head(res)
## gene_id lr df pvalue adj_pvalue
## 1 ENSG00000000457.13 1.491799 4 8.280932e-01 9.123682e-01
## 2 ENSG00000000460.16 1.068576 3 7.846649e-01 9.102145e-01
## 3 ENSG00000000938.12 4.366944 2 1.126497e-01 2.749979e-01
## 4 ENSG00000001084.11 1.630030 3 6.526000e-01 8.643478e-01
## 5 ENSG00000001167.14 28.402814 1 9.852198e-08 5.006525e-07
## 6 ENSG00000001461.16 9.810597 1 1.735092e-03 6.750590e-03
res.txp <- DRIMSeq::results(d, level="feature")
head(res.txp)
## gene_id feature_id lr df pvalue adj_pvalue
## 1 ENSG00000000457.13 ENST00000367771.10 0.16576698 1 0.6839016 0.9158606
## 2 ENSG00000000457.13 ENST00000367770.5 0.01672234 1 0.8971085 0.9786638
## 3 ENSG00000000457.13 ENST00000367772.8 1.02499211 1 0.3113378 0.6669254
## 4 ENSG00000000457.13 ENST00000423670.1 0.06046783 1 0.8057580 0.9324041
## 5 ENSG00000000457.13 ENST00000470238.1 0.28912004 1 0.5907850 0.8712111
## 6 ENSG00000000460.16 ENST00000496973.5 0.83448692 1 0.3609783 0.7239390
Because the pvalue
column may contain NA
values, we use the
following function to turn these into 1’s. The NA
values would
otherwise cause problems for the stage-wise analysis.
From investigating these NA
p-value cases for DRIMSeq,
they all occur when one condition group has all zero counts for a
transcript, but sufficient counts from the other condition group, and
sufficient counts for the gene. DRIMSeq will not estimate a
precision for such a gene. These all happen to be true positive genes
for DTU in the simulation, where the isoform switch is total or nearly
total. DEXSeq, shown in a later section, does not produce NA
p-values for these genes. A potential fix would be to use a
plug-in common or trended precision for such genes, but this is not
implemented in the current version of DRIMSeq.
no.na <- function(x) ifelse(is.na(x), 1, x)
res$pvalue <- no.na(res$pvalue)
res.txp$pvalue <- no.na(res.txp$pvalue)
We can plot the estimated proportions for one of the significant genes, where we can see evidence of switching (Figure 3).
idx <- which(res$adj_pvalue < 0.05)[1]
res[idx,]
## gene_id lr df pvalue adj_pvalue
## 5 ENSG00000001167.14 28.40281 1 9.852198e-08 5.006525e-07
plotProportions(d, res$gene_id[idx], "condition")
Because we have been working with only a subset of the data, we now load the results tables that would have been generated by running DRIMSeq functions on the entire dataset.
data(drim_tables)
nrow(res)
## [1] 7764
nrow(res.txp)
## [1] 20711
A typical analysis of differential transcript usage would involve asking first: “which genes contain any evidence of DTU?”, and secondly, “which transcripts in the genes that contain some evidence may be participating in the DTU?” Note that a gene may pass the first stage without exhibiting enough evidence to identify one or more transcripts that are participating in the DTU. The stageR package is designed to allow for such two-stage testing procedures, where the first stage is called a screening stage and the second stage a confirmation stage (Van den Berge et al. 2017). The methods are general, and can also be applied to testing, for example, changes across a time series followed by investigation of individual time points, as shown in the stageR package vignette. We show below how stageR is used to detect DTU and how to interpret its output.
We first construct a vector of p-values for the screening stage. Because of how the stageR package will combine transcript and gene names, we need to strip the gene and transcript version numbers from their Ensembl IDs (this is done by keeping only the first 15 characters of the gene and transcript IDs).
pScreen <- res$pvalue
strp <- function(x) substr(x,1,15)
names(pScreen) <- strp(res$gene_id)
We construct a one column matrix of the confirmation p-values:
pConfirmation <- matrix(res.txp$pvalue, ncol=1)
rownames(pConfirmation) <- strp(res.txp$feature_id)
We arrange a two column data.frame with the transcript and gene identifiers.
tx2gene <- res.txp[,c("feature_id", "gene_id")]
for (i in 1:2) tx2gene[,i] <- strp(tx2gene[,i])
The following functions then perform the stageR analysis. We must specify an
alpha
, which will be the overall false discovery rate target for the
analysis, defined below. Unlike typical adjusted p-values or
q-values, we cannot choose an arbitrary threshold later: after
specifying alpha=0.05
, we need to use 5% as the target in downstream
steps. There are also convenience functions getSignificantGenes and
getSignificantTx, which are demonstrated in the stageR vignette.
library(stageR)
stageRObj <- stageRTx(pScreen=pScreen, pConfirmation=pConfirmation,
pScreenAdjusted=FALSE, tx2gene=tx2gene)
stageRObj <- stageWiseAdjustment(stageRObj, method="dtu", alpha=0.05)
suppressWarnings({
drim.padj <- getAdjustedPValues(stageRObj, order=FALSE,
onlySignificantGenes=TRUE)
})
head(drim.padj)
## geneID txID gene transcript
## 1 ENSG00000001167 ENST00000341376 1.446731e-05 0.000000
## 2 ENSG00000001167 ENST00000353205 1.446731e-05 0.000000
## 3 ENSG00000001461 ENST00000003912 8.263160e-03 0.000000
## 4 ENSG00000001461 ENST00000339255 8.263160e-03 0.000000
## 5 ENSG00000001631 ENST00000394507 1.287012e-04 0.060474
## 6 ENSG00000001631 ENST00000475770 1.287012e-04 1.000000
The final table with adjusted p-values summarizes the information from
the two-stage analysis. Only genes that passed the screening stage are included in the table, so the table already represents screened genes.
Transcripts with a value less than 0.05 in the transcript column pass the confirmation stage at a target 5% overall false
discovery rate, or OFDR. This means that, in expectation, no more
than 5% of the genes that pass screening will either (1) not contain
any DTU, so be falsely screened genes, or (2) contain a falsely confirmed transcript.
A falsely confirmed transcript is a transcript with an adjusted p-value
less than 0.05 which does not exhibit differential usage across condition.
The stageR procedure allows us to look at both the genes that passed the
screening stage and the transcripts with adjusted p-values less than
our target alpha
, and understand what kind of overall error rate
this procedure entails. This cannot be said for an arbitrary procedure
of looking at standard gene adjusted p-values and transcript adjusted
p-values, where the adjustment was performed independently.
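Stated compactly (a schematic formulation of the error rate described above, following Heller et al. (2009) and Van den Berge et al. (2017)), if \(R\) is the number of genes passing screening and \(V\) is the number of those genes that either contain no DTU or contain at least one falsely confirmed transcript, then the procedure controls
\[ \textrm{OFDR} = \textrm{E}\left[ \frac{V}{\max(R, 1)} \right] \]
at the target level \(\alpha\) (here 0.05).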
We found that DRIMSeq was sensitive in detecting DTU, but could exceed its false discovery rate (FDR) bounds, particularly on the transcript-level tests, and that a post-hoc, non-specific filtering of the DRIMSeq transcript p-values and adjusted p-values improved the FDR and OFDR control. We considered the standard deviation (SD) of the per-sample proportions as a filtering statistic. This statistic does not use the information about which samples belong to which condition group. We set the p-values and adjusted p-values for transcripts with small per-sample proportion SD to 1. We do not recompute adjusted p-values, although we will provide the filtered p-values to the stageR procedure.
We note that the p-values are no longer necessarily uniform after filtering out small effect size transcripts and genes, although we find that in this simulation at least, the filtering made the procedure more conservative: excluding transcripts with small SD of the per-sample proportions brought both the observed FDR and the observed OFDR closer to their nominal targets, as will be shown in the evaluations below.
res.txp.filt <- DRIMSeq::results(d, level="feature")
smallProportionSD <- function(d, filter=0.1) {
cts <- as.matrix(subset(counts(d), select=-c(gene_id, feature_id)))
gene.cts <- rowsum(cts, counts(d)$gene_id)
total.cts <- gene.cts[match(counts(d)$gene_id, rownames(gene.cts)),]
props <- cts/total.cts
propSD <- sqrt(rowVars(props))
propSD < filter
}
filt <- smallProportionSD(d)
res.txp.filt$pvalue[filt] <- 1
res.txp.filt$adj_pvalue[filt] <- 1
The above post-hoc filter is not part of the DRIMSeq modeling steps, and to avoid interfering with the modeling, we run it after DRIMSeq. The other three filters used before have been tested by the DRIMSeq package authors and are therefore a recommended part of an analysis before the modeling begins. We do not apply this post-hoc filter to DEXSeq in this workflow, as DEXSeq seemed to be closer to controlling its FDR and OFDR in the evaluations, after using the DRIMSeq filters recommended in this workflow.
The DEXSeq package was originally designed for detecting differential exon usage (Anders, Reyes, and Huber 2012), but can also be adapted to run on estimated transcript counts, in order to detect DTU. Using DEXSeq on transcript counts was evaluated by Soneson et al. (2016), showing the benefits in FDR control from filtering lowly expressed transcripts for a transcript-level analysis. We benchmarked DEXSeq here, beginning with the DRIMSeq filtered object, as these filters are intuitive, they greatly speed up the analysis, and such filtering was shown to be beneficial in FDR control.
The two factors of (1) working on isoform counts rather than individual exons and (2) using the DRIMSeq filtering procedure dramatically increase the speed of DEXSeq, compared to running an exon-level analysis. Another advantage is that we benefit from the sophisticated bias models of Salmon, which account for drops in coverage on alternative exons that can otherwise throw off estimates of transcript abundance (Love, Hogenesch, and Irizarry 2016). A disadvantage over the exon-level analysis is that we must know in advance all of the possible isoforms that can be generated from a gene locus, all of which are assumed to be contained in the annotation files (FASTA and GTF).
We first load the DEXSeq package and then build a DEXSeqDataSet from the data contained in the dmDStest object (the class of the DRIMSeq object changes as the results are added). The design formula of the DEXSeqDataSet here uses the language “exon” but this should be read as “transcript” for our analysis. DEXSeq will test – after accounting for total gene expression for this sample and for the proportion of this transcript relative to the others – whether there is a condition-specific difference in the transcript proportion relative to the others.
library(DEXSeq)
sample.data <- DRIMSeq::samples(d)
count.data <- round(as.matrix(counts(d)[,-c(1:2)]))
dxd <- DEXSeqDataSet(countData=count.data,
sampleData=sample.data,
design=~sample + exon + condition:exon,
featureID=counts(d)$feature_id,
groupID=counts(d)$gene_id)
The following functions run the DEXSeq analysis. While we are only working on a subset of the data, the full analysis for this dataset took less than 3 minutes on a laptop.
system.time({
dxd <- estimateSizeFactors(dxd)
dxd <- estimateDispersions(dxd, quiet=TRUE)
dxd <- testForDEU(dxd, reducedModel=~sample + exon)
})
## user system elapsed
## 6.716 0.008 6.732
We then extract the results table, not filtering on mean counts (as we have already conducted filtering via DRIMSeq functions). We compute a per-gene adjusted p-value, using the perGeneQValue function, which aggregates evidence from multiple tests within a gene to a single p-value for the gene and then corrects for multiple testing across genes (Anders, Reyes, and Huber 2012). Other methods for aggregating evidence from the multiple tests within genes have been discussed in a recent publication and may be substituted at this step (Yi et al. 2018). Finally, we build a simple results table with the per-gene adjusted p-values.
dxr <- DEXSeqResults(dxd, independentFiltering=FALSE)
qval <- perGeneQValue(dxr)
dxr.g <- data.frame(gene=names(qval),qval)
To limit the size of the workflow R package, we also reduce the transcript-level results table to a simple data.frame:
columns <- c("featureID","groupID","pvalue")
dxr <- as.data.frame(dxr[,columns])
head(dxr)
## featureID groupID
## ENSG00000000457.13:ENST00000367771.10 ENST00000367771.10 ENSG00000000457.13
## ENSG00000000457.13:ENST00000367770.5 ENST00000367770.5 ENSG00000000457.13
## ENSG00000000457.13:ENST00000367772.8 ENST00000367772.8 ENSG00000000457.13
## ENSG00000000457.13:ENST00000423670.1 ENST00000423670.1 ENSG00000000457.13
## ENSG00000000457.13:ENST00000470238.1 ENST00000470238.1 ENSG00000000457.13
## ENSG00000000460.16:ENST00000496973.5 ENST00000496973.5 ENSG00000000460.16
## pvalue
## ENSG00000000457.13:ENST00000367771.10 0.5620081
## ENSG00000000457.13:ENST00000367770.5 0.8399434
## ENSG00000000457.13:ENST00000367772.8 0.5675043
## ENSG00000000457.13:ENST00000423670.1 0.7032904
## ENSG00000000457.13:ENST00000470238.1 0.8476920
## ENSG00000000460.16:ENST00000496973.5 0.2108527
Again, as we have been working with only a subset of the data, we now load the results tables that would have been generated by running DEXSeq functions on the entire dataset.
data(dex_tables)
If the stageR package has not already been loaded, we make sure to
load it, and run code very similar to that used above for DRIMSeq
two-stage testing, with a target alpha=0.05
.
library(stageR)
strp <- function(x) substr(x,1,15)
pConfirmation <- matrix(dxr$pvalue,ncol=1)
dimnames(pConfirmation) <- list(strp(dxr$featureID),"transcript")
pScreen <- qval
names(pScreen) <- strp(names(pScreen))
tx2gene <- as.data.frame(dxr[,c("featureID", "groupID")])
for (i in 1:2) tx2gene[,i] <- strp(tx2gene[,i])
The following three functions provide a table with the OFDR control described above. To repeat: of the genes passing the screening stage, in expectation no more than 5% will either contain no DTU or contain a transcript with an adjusted p-value less than 0.05 that does not participate in DTU.
stageRObj <- stageRTx(pScreen=pScreen, pConfirmation=pConfirmation,
pScreenAdjusted=TRUE, tx2gene=tx2gene)
stageRObj <- stageWiseAdjustment(stageRObj, method="dtu", alpha=0.05)
suppressWarnings({
dex.padj <- getAdjustedPValues(stageRObj, order=FALSE,
onlySignificantGenes=TRUE)
})
## The returned adjusted p-values are based on a stage-wise testing approach and are only valid for the provided target OFDR level of 5%. If a different target OFDR level is of interest,the entire adjustment should be re-run.
head(dex.padj)
## geneID txID gene transcript
## 1 ENSG00000001167 ENST00000341376 1.379695e-05 0
## 2 ENSG00000001167 ENST00000353205 1.379695e-05 0
## 3 ENSG00000001461 ENST00000003912 1.011322e-03 0
## 4 ENSG00000001461 ENST00000339255 1.011322e-03 0
## 5 ENSG00000001630 ENST00000003100 4.979296e-03 0
## 6 ENSG00000001630 ENST00000450723 4.979296e-03 0
This concludes the DTU section of the workflow. If you use DRIMSeq (Nowicka and Robinson 2016), DEXSeq (Anders, Reyes, and Huber 2012), stageR (Van den Berge et al. 2017), tximport (Soneson, Love, and Robinson 2016), or Salmon (Patro et al. 2017) in published research, please cite the relevant methods publications, which can be found in the References section of this workflow.
In the final section of the workflow containing live code examples, we demonstrate how differential transcript usage, summarized to the gene level, can be visualized with respect to differential gene expression analysis results. We use tximport to summarize counts to the gene level and to compute an average transcript length offset for count-based methods (Soneson, Love, and Robinson 2016). We will then show code for using DESeq2 and edgeR to assess differential gene expression. Because we have simulated the genes according to three different categories, we can color the final plot by the true simulated state of the genes. We note that we will pair DEXSeq with DESeq2 results in the following plot, and DRIMSeq with edgeR results. However, this pairing is arbitrary, and any DTU method can reasonably be paired with any DGE method.
The following line of code is unevaluated, but was used to generate an
object txi.g
which contains the gene-level counts, abundances and
average transcript lengths.
txi.g <- tximport(files, type="salmon", tx2gene=txdf[,2:1])
For the workflow, we load the txi.g
object which is saved in a file
salmon_gene_txi.rda
. We then load the DESeq2 package and build a
DESeqDataSet from txi.g
, providing also the sample information and
a design formula.
data(salmon_gene_txi)
library(DESeq2)
dds <- DESeqDataSetFromTximport(txi.g, samps, ~condition)
## using counts and average transcript lengths from tximport
The following two lines of code run the DESeq2 analysis (Love, Huber, and Anders 2014).
dds <- DESeq(dds)
dres <- DESeq2::results(dds)
Because we happen to know the true status of each of the genes, we can make a scatterplot of the results, coloring the genes by their status (whether DGE, DTE, or DTU by construction).
all(dxr.g$gene %in% rownames(dres))
## [1] TRUE
dres <- dres[dxr.g$gene,]
# we can only color because we simulated...
col <- rep(8, nrow(dres))
col[rownames(dres) %in% dge.genes] <- 1
col[rownames(dres) %in% dte.genes] <- 2
col[rownames(dres) %in% dtu.genes] <- 3
Figure 4 displays the evidence for differential transcript usage over that for differential gene expression. We can see that the DTU genes cluster on the y-axis (mostly not captured in the DGE analysis), and the DGE genes cluster on the x-axis (mostly not captured in the DTU analysis). The DTE genes fall in the middle, as all of them represent DGE, and some of them additionally represent DTU (if the gene had other expressed transcripts). Because DEXSeq outputs an adjusted p-value of 0 for some of the genes, we set these instead to a jittered value around \(10^{-20}\), so that their number and location on the x-axis could be visualized. These jittered values should only be used for visualization.
bigpar()
# here cap the smallest DESeq2 adj p-value
cap.padj <- pmin(-log10(dres$padj), 100)
# this vector only used for plotting
jitter.padj <- -log10(dxr.g$qval + 1e-20)
jp.idx <- jitter.padj == 20
jitter.padj[jp.idx] <- rnorm(sum(jp.idx),20,.25)
plot(cap.padj, jitter.padj, col=col,
xlab="Gene expression",
ylab="Transcript usage")
legend("topright",
c("DGE","DTE","DTU","null"),
col=c(1:3,8), pch=20, bty="n")
We can repeat the same analysis using edgeR as the inference engine (Robinson, McCarthy, and Smyth 2009). The following code incorporates the average transcript length matrix as an offset for an edgeR analysis.
library(edgeR)
cts.g <- txi.g$counts
normMat <- txi.g$length
normMat <- normMat / exp(rowMeans(log(normMat)))
o <- log(calcNormFactors(cts.g/normMat)) + log(colSums(cts.g/normMat))
y <- DGEList(cts.g)
y <- scaleOffset(y, t(t(log(normMat)) + o))
keep <- filterByExpr(y)
y <- y[keep,]
The basic edgeR model fitting and results extraction can be accomplished with the following lines:
y <- estimateDisp(y, design_full)
fit <- glmFit(y, design_full)
lrt <- glmLRT(fit)
tt <- topTags(lrt, n=nrow(y), sort="none")[[1]]
Again, we can color the genes by their true status in the simulation:
common <- intersect(res$gene_id, rownames(tt))
tt <- tt[common,]
res.sub <- res[match(common, res$gene_id),]
# we can only color because we simulated...
col <- rep(8, nrow(tt))
col[rownames(tt) %in% dge.genes] <- 1
col[rownames(tt) %in% dte.genes] <- 2
col[rownames(tt) %in% dtu.genes] <- 3
Figure 5 displays the evidence for differential transcript usage over that for differential gene expression, now using DRIMSeq and edgeR. One obvious contrast with Figure 4 is that DRIMSeq outputs lower non-zero adjusted p-values than DEXSeq does, where DEXSeq instead outputs 0 for many genes. The plots look more similar when zooming in on the DRIMSeq y-axis, as can be seen in Figure 6.
bigpar()
plot(-log10(tt$FDR), -log10(res.sub$adj_pvalue), col=col,
xlab="Gene expression",
ylab="Transcript usage")
legend("topright",
c("DGE","DTE","DTU","null"),
col=c(1:3,8), pch=20, bty="n")
bigpar()
plot(-log10(tt$FDR), -log10(res.sub$adj_pvalue), col=col,
xlab="Gene expression",
ylab="Transcript usage", ylim=c(0,20))
legend("topright",
c("DGE","DTE","DTU","null"),
col=c(1:3,8), pch=20, bty="n")
This marks the end of the DTU workflow. Please consult the published version of this workflow for a further section on Evaluation, including comparison of the Bioconductor packages and methods shown in the workflow with other popular packages and methods.
Here we presented a workflow for analyzing RNA-seq experiments for differential transcript usage across groups of samples. The Bioconductor packages used, DRIMSeq, DEXSeq, and stageR, are simple to use and fast when run on transcript-level data. We show how these can be used downstream of transcript abundance quantification with Salmon.
We recommend the use of stageR for its formal statistical procedure involving a screening and confirmation stage, as this fits closely to what we expect a typical analysis to entail. It is likely that an investigator would want both a list of statistically significant genes and transcripts participating in DTU, and stageR provides error control on this pair of lists, assuming that the underlying tests are well calibrated.
One potential limitation of this workflow is that it considers and detects expression switching only among annotated transcripts, in contrast to methods such as the exon-based DEXSeq analysis, SUPPA2, or LeafCutter (Li et al. 2018). Those methods perform statistical analysis directly on exon usage or splice events, which may benefit power and interpretability, and the exon-based DEXSeq analysis and LeafCutter can additionally detect unannotated events. The workflow presented here would require further processing to attribute transcript usage changes to specific splice events, and is limited to the estimated abundances of annotated transcripts.
The following provides the session information used when compiling this document.
devtools::session_info()
## ─ Session info ────────────────────────────────────────────────────────────
## setting value
## version R version 3.5.2 (2018-12-20)
## os Ubuntu 16.04.5 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate C
## ctype en_US.UTF-8
## tz America/New_York
## date 2019-01-09
##
## ─ Packages ────────────────────────────────────────────────────────────────
## package * version date lib source
## acepack 1.4.1 2016-10-29 [2] CRAN (R 3.5.2)
## annotate 1.60.0 2019-01-08 [2] Bioconductor
## AnnotationDbi * 1.44.0 2019-01-08 [2] Bioconductor
## assertthat 0.2.0 2017-04-11 [2] CRAN (R 3.5.2)
## backports 1.1.3 2018-12-14 [2] CRAN (R 3.5.2)
## base64enc 0.1-3 2015-07-28 [2] CRAN (R 3.5.2)
## bindr 0.1.1 2018-03-13 [2] CRAN (R 3.5.2)
## bindrcpp 0.2.2 2018-03-29 [2] CRAN (R 3.5.2)
## Biobase * 2.42.0 2019-01-08 [2] Bioconductor
## BiocGenerics * 0.28.0 2019-01-08 [2] Bioconductor
## BiocManager 1.30.4 2018-11-13 [2] CRAN (R 3.5.2)
## BiocParallel * 1.16.5 2019-01-08 [2] Bioconductor
## BiocStyle * 2.10.0 2019-01-08 [2] Bioconductor
## biomaRt 2.38.0 2019-01-08 [2] Bioconductor
## Biostrings 2.50.2 2019-01-08 [2] Bioconductor
## bit 1.1-14 2018-05-29 [2] CRAN (R 3.5.2)
## bit64 0.9-7 2017-05-08 [2] CRAN (R 3.5.2)
## bitops 1.0-6 2013-08-17 [2] CRAN (R 3.5.2)
## blob 1.1.1 2018-03-25 [2] CRAN (R 3.5.2)
## bookdown 0.9 2018-12-21 [2] CRAN (R 3.5.2)
## callr 3.1.1 2018-12-21 [2] CRAN (R 3.5.2)
## checkmate 1.8.5 2017-10-24 [2] CRAN (R 3.5.2)
## cli 1.0.1 2018-09-25 [2] CRAN (R 3.5.2)
## cluster 2.0.7-1 2018-04-13 [2] CRAN (R 3.5.2)
## codetools 0.2-16 2018-12-24 [2] CRAN (R 3.5.2)
## colorspace 1.3-2 2016-12-14 [2] CRAN (R 3.5.2)
## crayon 1.3.4 2017-09-16 [2] CRAN (R 3.5.2)
## data.table 1.11.8 2018-09-30 [2] CRAN (R 3.5.2)
## DBI 1.0.0 2018-05-02 [2] CRAN (R 3.5.2)
## DelayedArray * 0.8.0 2019-01-08 [2] Bioconductor
## desc 1.2.0 2018-05-01 [2] CRAN (R 3.5.2)
## DESeq2 * 1.22.2 2019-01-08 [2] Bioconductor
## devtools * 2.0.1 2018-10-26 [2] CRAN (R 3.5.2)
## DEXSeq * 1.28.1 2019-01-08 [2] Bioconductor
## digest 0.6.18 2018-10-10 [2] CRAN (R 3.5.2)
## dplyr 0.7.8 2018-11-10 [2] CRAN (R 3.5.2)
## DRIMSeq * 1.10.1 2019-01-08 [2] Bioconductor
## edgeR * 3.24.3 2019-01-08 [2] Bioconductor
## evaluate 0.12 2018-10-09 [2] CRAN (R 3.5.2)
## foreign 0.8-71 2018-07-20 [2] CRAN (R 3.5.2)
## Formula 1.2-3 2018-05-03 [2] CRAN (R 3.5.2)
## fs 1.2.6 2018-08-23 [2] CRAN (R 3.5.2)
## genefilter 1.64.0 2019-01-08 [2] Bioconductor
## geneplotter 1.60.0 2019-01-08 [2] Bioconductor
## GenomeInfoDb * 1.18.1 2019-01-08 [2] Bioconductor
## GenomeInfoDbData 1.2.0 2018-12-21 [2] Bioconductor
## GenomicRanges * 1.34.0 2019-01-08 [2] Bioconductor
## ggplot2 3.1.0 2018-10-25 [2] CRAN (R 3.5.2)
## glue 1.3.0 2018-07-17 [2] CRAN (R 3.5.2)
## gridExtra 2.3 2017-09-09 [2] CRAN (R 3.5.2)
## gtable 0.2.0 2016-02-26 [2] CRAN (R 3.5.2)
## highr 0.7 2018-06-09 [2] CRAN (R 3.5.2)
## Hmisc 4.1-1 2018-01-03 [2] CRAN (R 3.5.2)
## hms 0.4.2 2018-03-10 [2] CRAN (R 3.5.2)
## htmlTable 1.13.1 2019-01-07 [2] CRAN (R 3.5.2)
## htmltools 0.3.6 2017-04-28 [2] CRAN (R 3.5.2)
## htmlwidgets 1.3 2018-09-30 [2] CRAN (R 3.5.2)
## httr 1.4.0 2018-12-11 [2] CRAN (R 3.5.2)
## hwriter 1.3.2 2014-09-10 [2] CRAN (R 3.5.2)
## IRanges * 2.16.0 2019-01-08 [2] Bioconductor
## knitr * 1.21 2018-12-10 [2] CRAN (R 3.5.2)
## labeling 0.3 2014-08-23 [2] CRAN (R 3.5.2)
## lattice 0.20-38 2018-11-04 [2] CRAN (R 3.5.2)
## latticeExtra 0.6-28 2016-02-09 [2] CRAN (R 3.5.2)
## lazyeval 0.2.1 2017-10-29 [2] CRAN (R 3.5.2)
## limma * 3.38.3 2019-01-08 [2] Bioconductor
## locfit 1.5-9.1 2013-04-20 [2] CRAN (R 3.5.2)
## magrittr 1.5 2014-11-22 [2] CRAN (R 3.5.2)
## Matrix 1.2-15 2018-11-01 [2] CRAN (R 3.5.2)
## matrixStats * 0.54.0 2018-07-23 [2] CRAN (R 3.5.2)
## memoise 1.1.0 2017-04-21 [2] CRAN (R 3.5.2)
## munsell 0.5.0 2018-06-12 [2] CRAN (R 3.5.2)
## nnet 7.3-12 2016-02-02 [2] CRAN (R 3.5.2)
## pillar 1.3.1 2018-12-15 [2] CRAN (R 3.5.2)
## pkgbuild 1.0.2 2018-10-16 [2] CRAN (R 3.5.2)
## pkgconfig 2.0.2 2018-08-16 [2] CRAN (R 3.5.2)
## pkgload 1.0.2 2018-10-29 [2] CRAN (R 3.5.2)
## plyr 1.8.4 2016-06-08 [2] CRAN (R 3.5.2)
## prettyunits 1.0.2 2015-07-13 [2] CRAN (R 3.5.2)
## processx 3.2.1 2018-12-05 [2] CRAN (R 3.5.2)
## progress 1.2.0 2018-06-14 [2] CRAN (R 3.5.2)
## ps 1.3.0 2018-12-21 [2] CRAN (R 3.5.2)
## purrr 0.2.5 2018-05-29 [2] CRAN (R 3.5.2)
## R6 2.3.0 2018-10-04 [2] CRAN (R 3.5.2)
## rafalib * 1.0.0 2015-08-09 [2] CRAN (R 3.5.2)
## RColorBrewer * 1.1-2 2014-12-07 [2] CRAN (R 3.5.2)
## Rcpp 1.0.0 2018-11-07 [2] CRAN (R 3.5.2)
## RCurl 1.95-4.11 2018-07-15 [2] CRAN (R 3.5.2)
## remotes 2.0.2 2018-10-30 [2] CRAN (R 3.5.2)
## reshape2 1.4.3 2017-12-11 [2] CRAN (R 3.5.2)
## rlang 0.3.1 2019-01-08 [2] CRAN (R 3.5.2)
## rmarkdown * 1.11 2018-12-08 [2] CRAN (R 3.5.2)
## rnaseqDTU * 1.0.1 2019-01-09 [1] Bioconductor
## rpart 4.1-13 2018-02-23 [2] CRAN (R 3.5.2)
## rprojroot 1.3-2 2018-01-03 [2] CRAN (R 3.5.2)
## Rsamtools 1.34.0 2019-01-08 [2] Bioconductor
## RSQLite 2.1.1 2018-05-06 [2] CRAN (R 3.5.2)
## rstudioapi 0.8 2018-10-02 [2] CRAN (R 3.5.2)
## S4Vectors * 0.20.1 2019-01-08 [2] Bioconductor
## scales 1.0.0 2018-08-09 [2] CRAN (R 3.5.2)
## sessioninfo 1.1.1 2018-11-05 [2] CRAN (R 3.5.2)
## stageR * 1.4.0 2019-01-08 [2] Bioconductor
## statmod 1.4.30 2017-06-18 [2] CRAN (R 3.5.2)
## stringi 1.2.4 2018-07-20 [2] CRAN (R 3.5.2)
## stringr 1.3.1 2018-05-10 [2] CRAN (R 3.5.2)
## SummarizedExperiment * 1.12.0 2019-01-08 [2] Bioconductor
## survival 2.43-3 2018-11-26 [2] CRAN (R 3.5.2)
## testthat 2.0.1 2018-10-13 [2] CRAN (R 3.5.2)
## tibble 2.0.0 2019-01-04 [2] CRAN (R 3.5.2)
## tidyselect 0.2.5 2018-10-11 [2] CRAN (R 3.5.2)
## usethis * 1.4.0 2018-08-14 [2] CRAN (R 3.5.2)
## withr 2.1.2 2018-03-15 [2] CRAN (R 3.5.2)
## xfun 0.4 2018-10-23 [2] CRAN (R 3.5.2)
## XML 3.98-1.16 2018-08-19 [2] CRAN (R 3.5.2)
## xtable 1.8-3 2018-08-29 [2] CRAN (R 3.5.2)
## XVector 0.22.0 2019-01-08 [2] Bioconductor
## yaml 2.2.0 2018-07-25 [2] CRAN (R 3.5.2)
## zlibbioc 1.28.0 2019-01-08 [2] Bioconductor
##
## [1] /tmp/Rtmpy8bZv3/Rinst4b5c2b1799cb
## [2] /home/biocbuild/bbs-3.8-bioc/R/library
The samples were quantified with Salmon version 0.10.0 and kallisto version 0.44.0. The polyester (version 1.16.0) and alpine (version 1.6.0) packages were used in generating the simulated dataset.
The simulated paired-end read FASTQ files have been uploaded in three batches of eight samples each to Zenodo, and the quantification files are also available as a separate Zenodo dataset (Love 2018b). The scripts used to generate the simulated dataset are available at the simulation GitHub repository (Love 2018a).
The authors thank Koen Van den Berge, Malgorzata Nowicka, and Botond Sipos for helpful comments on the workflow.
Anders, Simon, Alejandro Reyes, and Wolfgang Huber. 2012. “Detecting Differential Usage of Exons from RNA-Seq Data.” Genome Research 22 (10):2008–17.
Bray, Nicolas, Harold Pimentel, Páll Melsted, and Lior Pachter. 2016. “Near-Optimal Probabilistic RNA-Seq Quantification.” Nature Biotechnology 34:525–27.
Climente-González, Héctor, Eduard Porta-Pardo, Adam Godzik, and Eduardo Eyras. 2017. “The Functional Impact of Alternative Splicing in Cancer.” Cell Reports 20 (9):2215–26.
Collado-Torres, Leonardo, Abhinav Nellore, Kai Kammers, Shannon E Ellis, Margaret A Taub, Kasper D Hansen, Andrew E Jaffe, Ben Langmead, and Jeffrey T Leek. 2017. “Reproducible RNA-seq analysis using recount2.” Nature Biotechnology 35:319–21.
Di Tommaso, Paolo, Maria Chatzou, Evan W Floden, Pablo Prieto Barja, Emilio Palumbo, and Cedric Notredame. 2017. “Nextflow Enables Reproducible Computational Workflows.” Nature Biotechnology 35 (4). Nature Research:316–19.
Frazee, Alyssa C., Andrew E. Jaffe, Ben Langmead, and Jeffrey T. Leek. 2015. “Polyester: Simulating RNA-Seq Datasets with Differential Transcript Expression.” Bioinformatics 31 (17):2778–84.
Goldstein, LD, Y Cao, G Pau, M Lawrence, TD Wu, S Seshagiri, and R Gentleman. 2016. “Prediction and Quantification of Splice Events from RNA-Seq Data.” PLoS ONE 11 (5).
GTEx Consortium, François Aguet, Andrew A. Brown, Stephane E. Castel, Joe R. Davis, Yuan He, Brian Jo, et al. 2017. “Genetic Effects on Gene Expression Across Human Tissues.” Nature 550:204 EP.
Heller, R, E Manduchi, GR Grant, and WJ Ewens. 2009. “A Flexible Two-Stage Procedure for Identifying Gene Sets That Are Differentially Expressed.” Bioinformatics 25 (8):1019–25.
Huber, Wolfgang, Vincent J. Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S. Carvalho, Hector Corrada C. Bravo, et al. 2015. “Orchestrating high-throughput genomic analysis with Bioconductor.” Nature Methods 12 (2). Nature Publishing Group:115–21.
Köster, Johannes, and Sven Rahmann. 2012. “Snakemake—a Scalable Bioinformatics Workflow Engine.” Bioinformatics 28 (19). Oxford University Press:2520–2.
Lappalainen, Tuuli, Michael Sammeth, Marc R. Friedländer, Peter A. C. ’t Hoen, Jean Monlong, Manuel A. Rivas, Mar Gonzàlez-Porta, et al. 2013. “Transcriptome and Genome Sequencing Uncovers Functional Variation in Humans.” Nature 501 (7468):506–11.
Law, Charity W., Yunshun Chen, Wei Shi, and Gordon K. Smyth. 2014. “Voom: precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biology 15 (2):R29+.
Li, Yang I, David A Knowles, Jack Humphrey, Alvaro N. Barbeira, Scott P. Dickinson, Hae Kyung Im, and Jonathan K Pritchard. 2018. “Annotation-free quantification of RNA splicing using LeafCutter.” Nature Genetics 50 (1):151–58.
Love, Michael I. 2018a. Scripts Used in Constructing and Evaluating the Simulated Data for Swimming Downstream. https://doi.org/10.5281/zenodo.1410443.
———. 2018b. Simulation Data and Quantification Files for Swimming Downstream.
Love, Michael I., John B. Hogenesch, and Rafael A. Irizarry. 2016. “Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation.” Nature Biotechnology 34 (12). Nature Publishing Group:1287.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology 15 (12):550+.
McCarthy, D. J., Y. Chen, and G. K. Smyth. 2012. “Differential Expression Analysis of Multifactor RNA-Seq Experiments with Respect to Biological Variation.” Nucleic Acids Research 40 (10):4288–97.
Nowicka, M, and MD Robinson. 2016. “DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics [version 2; referees: 2 approved].” F1000Research 5 (1356).
Patro, Rob, Geet Duggal, Michael I. Love, Rafael A. Irizarry, and Carl Kingsford. 2017. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nature Methods 14 (4):417–19.
Patro, Rob, Stephen M. Mount, and Carl Kingsford. 2014. “Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms.” Nature Biotechnology 32:462–64.
Reyes, Alejandro, and Wolfgang Huber. 2018. “Alternative Start and Termination Sites of Transcription Drive Most Transcript Isoform Differences Across Human Tissues.” Nucleic Acids Research 46 (2):582–92.
Robinson, M. D., D. J. McCarthy, and G. K. Smyth. 2009. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26 (1):139–40.
Scotti, Marina M., and Maurice S. Swanson. 2018. “RNA mis-splicing in disease.” Nature Reviews Genetics 17 (1):19–32.
Smyth, Gordon K. 2004. “Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments.” Statistical Applications in Genetics and Molecular Biology 3 (1).
Soneson, Charlotte, Michael I. Love, and Mark Robinson. 2016. “Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences [version 2; referees: 2 approved].” F1000Research 4 (1521).
Soneson, Charlotte, Katarina L. Matthes, Malgorzata Nowicka, Charity W. Law, and Mark D. Robinson. 2016. “Isoform Prefiltering Improves Performance of Count-Based Methods for Analysis of Differential Transcript Usage.” Genome Biology 17 (1):12.
Soneson, Charlotte, and Mark D Robinson. 2017. “Towards unified quality verification of synthetic count data with countsimQC.” Bioinformatics 34 (4).
Trapnell, Cole, David G Hendrickson, Martin Sauvageau, Loyal Goff, John L Rinn, and Lior Pachter. 2013. “Differential analysis of gene regulation at transcript resolution with RNA-seq.” Nature Biotechnology 31:46–53.
Van den Berge, Koen, Charlotte Soneson, Mark D Robinson, and Lieven Clement. 2017. “stageR: a general stage-wise method for controlling the gene-level false discovery rate in differential expression and differential transcript usage.” Genome Biology 18 (1). BioMed Central:151.
Vitting-Seerup, Kristoffer, and Albin Sandelin. 2017. “The Landscape of Isoform Switches in Human Cancers.” Molecular Cancer Research 15 (9).
———. 2018. “IsoformSwitchAnalyzeR: Analysis of changes in genome-wide patterns of alternative splicing and its functional consequences.” bioRxiv.
Yi, Lynn, Harold Pimentel, Nicolas L Bray, and Lior Pachter. 2018. “Gene-level differential analysis at transcript-level resolution.” Genome Biology 19 (1). BioMed Central:53.