consensusDE aims to make differential expression (DE) analysis, with reporting of significance scores from multiple methods easy. It implements wrappers for Voom, DEseq2 and EdgeR and reports differential expression results seperately for each algorithm, as well as merging the results into a single table for determining consensus. The results of the merged table, are ordered by the summed ranks of the p-values for each algorithm and the intersect at minimum p-value threshols accross all methods is provided as the p_max (see below).
consensusDE is simplified into two functions:
buildSummarized()
generate a summarized experiment that counts reads mapped (from bam files) against a transcriptomemulti_de_pairs()
perform DE analysis (all possible pairwise comparisons)Below, we demonstrate the core functionality of consensusDE as well as how to plot results from obtained results using the diag_plots
function.
Begin by first installing and then loading the consensusDE
library. To illustrate functionality of consensusDE
, we will utilise data from the airway
and annotation libraries as follows. Begin by installing and attaching data from these libraries as follows:
A summarized experiment is an object that stores all relevant information for performing differential expression analysis. buildSummarized()
allows users to build a summarized experiment object by simply providing 1) a table of bam files (more below on format), 2) a directory of where to locate the bam files and 3) a transcript database to map the reads to (either a gtf file or txdb). We will use bam files attached to this package (from GenomicAlignments) as an example:
# build a design table that lists the files and their grouping
file_list <- list.files(system.file("extdata", package="GenomicAlignments"),
recursive = TRUE,
pattern = "*bam$",
full = TRUE)
# Prepare a sample table to be used with buildSummarized()
# must be comprised of a minimum of two columns, named "file" and "group",
# with one additional column: "pairs" if the data is paired
sample_table <- data.frame("file" = basename(file_list),
"group" = c("treat", "untreat"))
# extract the path to the bam directory - where to search for files listed in "sample_table"
bam_dir <- as.character(gsub(basename(file_list)[1], "", file_list[1]))
The minimum information is now ready to build a summarized experiment:
# NB. force_build = TRUE, is set to allow the Summarized Experiment to be built.
# This will report a Warning message that less than two replicates are present
# in the sample_table.
summarized_dm3 <- buildSummarized(sample_table = sample_table,
bam_dir = bam_dir,
tx_db = TxDb.Dmelanogaster.UCSC.dm3.ensGene,
read_format = "paired",
force_build = TRUE)
This will output a summarized object that has mapped the reads for the bam files that are listed in sample_table
, located in bam_dir
, against the transcript database provided: TxDb.Dmelanogaster.UCSC.dm3.ensGene
. Bam file format, whether “paired” or “single” end (the type of sequencing technology used) must be specified using the read_format
parameter. gtf formatted transcript databases can also be used instead of a txdb, by providing the full path to the gtf file using the gtf
parameter. To save the summarized experiment externally, for future use, specify the path to save the summarized experiment using output_log
To see details of all parameters see ?buildSummarized
.
Overview of the summarized experiment:
## class: RangedSummarizedExperiment
## dim: 15682 2
## metadata(0):
## assays(1): counts
## rownames(15682): FBgn0000003 FBgn0000008 ... FBgn0264726
## FBgn0264727
## rowData names(0):
## colnames(2): sm_treated1.bam sm_untreated1.bam
## colData names(2): file group
buildSummarized()
also allows users to filter out low read counts. This can be done when building the summarized experiment, or re-running with the summarized experiment output using buildSummarized()
. See “Performing Differential Expresssion” below with filter example.
Sometimes it will be convenient to first build a txdb
object and then pass this txdb
object to buildSummarized using the tx_db parameter. This can be done as follows:
txdb <- makeTxDbFromGFF("/path/to/my.gtf", format="gtf", circ_seqs=character())
For differential expression (DE) analysis we will use the airway
data for demonstration. See ?airway
for more details for this experiment. NOTE: the summarized meta-data must include the columns “group” and “file” to build the correct models. For illustration, we sample 1000 genes from this dataset.
# for compatability, add "group" and "file" columns
colData(airway)$group <- colData(airway)$dex
colData(airway)$file <- rownames(colData(airway))
# filter low count data
airway_filter <- buildSummarized(summarized = airway,
filter = TRUE)
# for illustration, we only use sa random sample of 1000 transcripts
set.seed(1234)
airway_filter <- sample(airway_filter, 1000)
# call multi_de_pairs()
all_pairs_airway <- multi_de_pairs(summarized = airway_filter,
paired = "unpaired",
ruv_correct = FALSE)
Running multi_de_pairs()
will perform DE analysis on all possible pairs of “groups” and save these results as a simple list of “merged” - being the merged results of “deseq”, “voom” and “edger” into one table, as well as the latter three as objects independently. The data frame is sorted by the rank_sum
. The following columns are now included:
ID
- IdentifierAveExpr
- Average Expression (taken from EdgeR)LogFC
- Log2 Fold-Change, also known as a log-ratio (taken from EdgeR)edger_adj_p
- EdgeR p-value adjusted for multiple hypothesesdeseq_adj_p
- DESeq2 p-value adjusted for multiple hypothesesvoom_adj_p
- Limma/voom p-value adjusted for multiple hypothesesedger_rank
- rank of the p-value obtained by EdgeRdeseq_rank
- rank of the p-value obtained by DESeq2voom_rank
- rank of the p-value obtained by Limma/voomrank_sum
- sum of the ranks from edger_rank, voom_rank, rank_sump_max
- the largest p-value observed from all methods tested.
To access the merged results:
## [1] "untrt-trt"
# [1] "untrt-trt"
# to access data of a particular comparison
head(all_pairs_airway$merged[["untrt-trt"]])
## ID AveExpr LogFC edger_adj_p deseq_adj_p
## 1 ENSG00000152583 7.848478 -4.579048 4.555352e-81 1.689976e-105
## 2 ENSG00000196517 6.707906 2.238648 7.174030e-29 2.375309e-40
## 3 ENSG00000123562 11.471330 -1.008375 9.124547e-14 8.201734e-31
## 4 ENSG00000103485 8.891566 1.231192 1.586426e-14 6.843112e-19
## 5 ENSG00000250978 2.929392 -6.159170 3.441794e-28 1.985784e-18
## 6 ENSG00000072958 10.214318 -1.068407 2.420887e-12 2.204612e-18
## voom_adj_p edger_rank deseq_rank voom_rank rank_sum p_max
## 1 2.434237e-05 1 1 1.0 3.0 2.434237e-05
## 2 7.700210e-05 2 2 2.0 6.0 7.700210e-05
## 3 3.724219e-04 6 3 3.0 12.0 3.724219e-04
## 4 5.406701e-04 5 5 4.0 14.0 5.406701e-04
## 5 7.780662e-04 3 6 5.0 14.0 7.780662e-04
## 6 8.197765e-04 9 7 6.5 22.5 8.197765e-04
It is often useful to add additional annotated information to the output tables. This can be acheived by providing a database for annotations via ensembl_annotate
. Annotations needs to be a Genome Wide Annotation object, e.g. org.Mm.eg.db
for mouse or org.Hs.eg.db
for human from BioConductor. For example, to install the database for the mouse annotation, go to http://bioconductor.org/packages/org.Mm.eg.db and follow the instructions. Ensure that after installing the database package that the library is loaded using library(org.Mm.eg.db)
. When running, “‘select()’ returned 1:many mapping between keys and columns” will appear on the command line. This is the result of multiple mapped transcript ID to Annotations. Only the first annotation is reported. See ?multi_de_pairs
for additional documentation.
An example of annotating the above filtered airway data is provided below:
# first ensure annotation database in installed
# load annotation library:
library(org.Hs.eg.db)
# call multi_de_pairs(),
# set ensembl_annotate argument to org.Hs.eg.db
all_pairs_airway <- multi_de_pairs(summarized = airway_filter,
paired = "unpaired",
ruv_correct = FALSE,
ensembl_annotate = org.Hs.eg.db)
# to access data of a particular comparison
head(all_pairs_airway$merged[["untrt-trt"]])
## ID AveExpr LogFC edger_adj_p deseq_adj_p
## 1 ENSG00000152583 7.848478 -4.579048 4.555352e-81 1.689976e-105
## 2 ENSG00000196517 6.707906 2.238648 7.174030e-29 2.375309e-40
## 3 ENSG00000123562 11.471330 -1.008375 9.124547e-14 8.201734e-31
## 4 ENSG00000103485 8.891566 1.231192 1.586426e-14 6.843112e-19
## 5 ENSG00000250978 2.929392 -6.159170 3.441794e-28 1.985784e-18
## 6 ENSG00000072958 10.214318 -1.068407 2.420887e-12 2.204612e-18
## voom_adj_p edger_rank deseq_rank voom_rank rank_sum p_max
## 1 2.434237e-05 1 1 1.0 3.0 2.434237e-05
## 2 7.700210e-05 2 2 2.0 6.0 7.700210e-05
## 3 3.724219e-04 6 3 3.0 12.0 3.724219e-04
## 4 5.406701e-04 5 5 4.0 14.0 5.406701e-04
## 5 7.780662e-04 3 6 5.0 14.0 7.780662e-04
## 6 8.197765e-04 9 7 6.5 22.5 8.197765e-04
## genename symbol kegg
## 1 SPARC like 1 SPARCL1 <NA>
## 2 solute carrier family 6 member 9 SLC6A9 <NA>
## 3 mortality factor 4 like 2 MORF4L2 <NA>
## 4 quinolinate phosphoribosyltransferase QPRT 00760
## 5 uncharacterized LOC101928819 LOC101928819 <NA>
## 6 adaptor related protein complex 1 subunit mu 1 AP1M1 04142
multi_de_pairs
provides options to automatically write all results to output directories when a full path is provided. Which results are output depends on which directories are provided. Full paths provided to the parameters of output_voom
, output_edger
, output_deseq
and output_combined
will output Voom, EdgeR, DEseq and the merged results to the directories provided, respectively.
consensusDE also provides the option to remove batch effects through RUVseq functionality. consensusDE currently implements RUVr which models a first pass generalised linear model (GLM) using EdgeR and obtaining residuals for incorporation into the SummarizedExperiment object for inclusion in the models for DE analysis. The following example, uses RUV to identify these residuals. To view the residuals in the model see the resisuals section below in the plotting functions. Note, that if ruv_correct = TRUE
and a path to a plot_dir
is provided, diagnostic plots before and after RUV correction will be produced. The residuals can also be accessed in the summarizedExperiment as below. These are present in the “W_1” column. At present only one factor of variation is determined.
# call multi_de_pairs()
all_pairs_airway_ruv <- multi_de_pairs(summarized = airway_filter,
paired = "unpaired",
ruv_correct = TRUE)
# access the summarized experiment (now including the residuals under the "W_1" column)
colData(all_pairs_airway_ruv$summarized)
## DataFrame with 8 rows and 3 columns
## file group W_1
## <factor> <factor> <numeric>
## SRR1039508 SRR1039508 untrt -0.142570931417509
## SRR1039509 SRR1039509 trt -0.178946892818302
## SRR1039512 SRR1039512 untrt -0.0505852352305622
## SRR1039513 SRR1039513 trt -0.199848364862254
## SRR1039516 SRR1039516 untrt 0.472530975797468
## SRR1039517 SRR1039517 trt 0.703116609742485
## SRR1039520 SRR1039520 untrt -0.250514433796523
## SRR1039521 SRR1039521 trt -0.353181727414802
# view the results, now with RUV correction applied
head(all_pairs_airway_ruv$merged[["untrt-trt"]])
## ID AveExpr LogFC edger_adj_p deseq_adj_p
## 1 ENSG00000152583 7.848478 -4.544730 5.570866e-120 7.572664e-137
## 2 ENSG00000100242 10.044047 -1.819899 8.219712e-58 8.109859e-92
## 3 ENSG00000196517 6.707906 2.238471 1.022200e-31 6.177842e-37
## 4 ENSG00000250978 2.929392 -6.119005 2.135570e-52 1.046472e-20
## 5 ENSG00000123562 11.471330 -1.009971 2.445718e-19 2.384673e-42
## 6 ENSG00000100784 5.213175 2.497069 1.925716e-19 3.711088e-20
## voom_adj_p edger_rank deseq_rank voom_rank rank_sum p_max
## 1 4.377780e-06 1 1 1.5 3.5 4.377780e-06
## 2 4.377780e-06 2 2 1.5 5.5 4.377780e-06
## 3 8.480127e-05 4 4 4.0 12.0 8.480127e-05
## 4 5.668476e-05 3 7 3.0 13.0 5.668476e-05
## 5 9.475577e-05 6 3 5.0 14.0 9.475577e-05
## 6 3.452733e-04 5 9 7.5 21.5 3.452733e-04
multi_de_pairs
supports DE with paired samples. Paired samples may include, for example, the same patient observed before and after a treatment. For demonstration purposes, we assume that each untreated and treated sample is a pair.
NB. paired analysis with more than two groups is not currently supported. If there are more than two groups, consider testing each of the groups and their pairs seperately, or see the edgeR, limma/voom or DESeq2 vignettes for establishing a multi-variate model with blocking factors.
First we will update the summarized experiment object to include a “pairs” column and set paired = "paired"
in multi_de_pairs
.
# add "pairs" column to airway_filter summarized object
colData(airway_filter)$pairs <- as.factor(c("pair1", "pair1", "pair2", "pair2", "pair3", "pair3", "pair4", "pair4"))
# run multi_de_pairs in "paired" mode
all_pairs_airway_paired <- multi_de_pairs(summarized = airway_filter,
paired = "paired",
ruv_correct = TRUE)
head(all_pairs_airway_paired$merged[["untrt-trt"]])
## ID AveExpr LogFC edger_adj_p deseq_adj_p
## 1 ENSG00000152583 7.848478 -4.544966 1.792611e-290 5.392982e-204
## 2 ENSG00000163394 8.581289 2.232018 1.908774e-108 1.148966e-74
## 3 ENSG00000197381 9.588311 -2.297167 1.118845e-93 2.446339e-81
## 4 ENSG00000100242 10.044047 -1.832420 2.158772e-93 1.353246e-74
## 5 ENSG00000108960 9.759728 -1.646724 6.591573e-73 5.359287e-55
## 6 ENSG00000196517 6.707906 2.248395 9.743677e-86 1.411187e-32
## voom_adj_p edger_rank deseq_rank voom_rank rank_sum p_max
## 1 2.275040e-11 1 1 1.0 3.0 2.275040e-11
## 2 9.356984e-10 2 3 2.5 7.5 9.356984e-10
## 3 1.194899e-09 3 2 4.0 9.0 1.194899e-09
## 4 9.356984e-10 4 4 2.5 10.5 9.356984e-10
## 5 2.919886e-09 7 5 5.0 17.0 2.919886e-09
## 6 8.646645e-09 5 6 6.0 17.0 8.646645e-09
The design matrix can be retrieved as follows (from e.g. the voom model fit)
## Intercept W_1 pair2 pair3 pair4 untrt
## SRR1039508 1 -0.2610130 0 0 0 1
## SRR1039509 1 0.2589440 0 0 0 0
## SRR1039512 1 -0.2155299 1 0 0 1
## SRR1039513 1 0.2216438 1 0 0 0
## SRR1039516 1 0.6054848 0 1 0 1
## SRR1039517 1 -0.5962055 0 1 0 0
## SRR1039520 1 -0.1600887 0 0 1 1
## SRR1039521 1 0.1467646 0 0 1 0
## attr(,"assign")
## [1] 0 1 2 2 2 3
## attr(,"contrasts")
## attr(,"contrasts")$pairs
## [1] "contr.treatment"
##
## attr(,"contrasts")$group
## [1] "contr.treatment"
When performing DE analysis, a series of plots (currently 10) can be generated and saved as .pdf files in a plot directory provided to multi_de_pairs()
with the parameter: plot_dir = "/path/to/save/pdfs/
. See ?multi_de_pairs
for description.
In addition, each of the 10 plots can be plotted individually using the diag_plots
function. See ?diag_plots
for description, which provides wrappers for 10 different plots. Next we will plot each of these using the example data.
Plot the number of reads that mapped to the transcriptome of each sample. The sample numbers on the x-axis correspond to the sample row number in the summarizedExperiment built, accessible using colData(airway)
. Samples are coloured by their “group”.
Residuals for the RUV model can be plotted as follows:
This will perform an MA plot given a dataset of the appropriate structure. This will plot the Log-fold change (M) versus the average expression level (A). To use independently of multi_de_pairs()
and plot to only one comparison, constructing a list with one data.frame with the columns labelled “ID”, “AveExpr”, and “Adj_PVal” is required. The following illustrates an example for using the merged data, which needs to be put into a list and labelled appropriately. Note that this is done automatically with multi_de_pairs()
.
## [1] "untrt-trt"
# 2. Extract the data.frame of interest of a particular comparison
comparison <- all_pairs_airway$merged[["untrt-trt"]]
# this will not work unless in a list and will stop, producing an error. E.g.
diag_plots(merged_in = comparison,
name = "untrt-trt",
ma = TRUE)
# Error message:
merged_in is not a list. If you want to plot with one comparison only,
put the single dataframe into a list as follows. my_list <- list("name"=
merged_in)
# 3. Put into a new list as instructed by the error
comparison_list <- list("untrt-trt" = comparison)
# this will not work unless the appropriate columns are labelled
# "ID", "AveExpr", and "Adj_PVal"
# 4. Relabel the columns for plotting
# inspecting the column names reveals that the "Adj_PVal" column needs to be specified.
colnames(comparison_list[["untrt-trt"]])
## [1] "ID" "AveExpr" "LogFC" "edger_adj_p" "deseq_adj_p"
## [6] "voom_adj_p" "edger_rank" "deseq_rank" "voom_rank" "rank_sum"
## [11] "p_max" "genename" "symbol" "kegg"
# Here, we will relabel "edger_adj_p" with "Adj_PVal" to use this p-value, using
# the "gsub" command as follows (however, we could also use one of the others or
# the p_max column)
colnames(comparison_list[["untrt-trt"]]) <- gsub("edger_adj_p", "Adj_PVal",
colnames(comparison_list[["untrt-trt"]]))
# after label
colnames(comparison_list[["untrt-trt"]])
## [1] "ID" "AveExpr" "LogFC" "Adj_PVal" "deseq_adj_p"
## [6] "voom_adj_p" "edger_rank" "deseq_rank" "voom_rank" "rank_sum"
## [11] "p_max" "genename" "symbol" "kegg"
This plot a volcano plot, which compares the Log-fold change versus significance of change -log transformed score. As above and described in the MA plot section, to use independently of multi_de_pairs()
and plot to only one comparison, constructing a list with one data.frame with the columns labelled “ID”, “AveExpr”, and “Adj_PVal” is required.
This plot the distribution of p-values for diagnostic analyses. As above and described in the MA plot section, to use independently of multi_de_pairs()
and plot to only one comparison, constructing a list with one data.frame with the columns labelled “ID”, “AveExpr”, and “Adj_PVal” is required.
The legend and labels can be turned off using legend = FALSE
and label = TRUE
for diag_plots()
. See ?diag_plots
for more details of these parameters.
When performing DE analysis, data is stored in simple list object that can be accessed. Below are the levels of data available from the output of a DE analysis. We use the all_pairs_airway
results from the above analysis to demonstrate how to locate these tables.
all_pairs_airway$merged
In addition to the list with the combined results of DESeq2, Voom and EdgeR, the full results can be accessed for each method, as well as fit tables and the contrasts performed.
all_pairs_airway$deseq
(list of the DEseq2 results)all_pairs_airway$voom
(list of the Voom results)all_pairs_airway$edger
(list of the edgeR results)Within each list the following data is accessible. Each object is list of all the comparisons performed.
all_pairs_airway$deseq$short_results
all_pairs_airway$deseq$short_results[[1]]
all_pairs_airway$deseq$full_results
all_pairs_airway$deseq$fitted
all_pairs_airway$deseq$contrasts
consensusDE
When using this package, please cite consensusDE as follows and all methods used in your analysis.
For consensus DE:
##
## To cite package 'consensusDE' in publications use:
##
## Ashley J. Waardenberg (2019). consensusDE: RNA-seq analysis
## using multiple algorithms. R package version 1.0.6.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {consensusDE: RNA-seq analysis using multiple algorithms},
## author = {Ashley J. Waardenberg},
## year = {2019},
## note = {R package version 1.0.6},
## }
##
## ATTENTION: This citation information has been auto-generated from
## the package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.