DEqMS 1.0.1
DEqMS
builds on top of Limma
, a widely-used R package for microarray data
analysis (Smyth G. et al 2004), and improves it with mass spectrometry specific
data properties, accounting for variance dependence on the number of quantified
peptides or PSMs for statistical testing of differential protein expression.
library(DEqMS)
## Loading required package: ggplot2
## Loading required package: limma
We analyzed a protemoics dataset (TMT10plex labelled) in which A431 cells (human epidermoid carcinoma cell line) were treated with three different miRNA mimics (Yan Z. et al. Oncogene 2016). The raw MS data was searched with MS-GF+ (Kim et al Nat Communications 2016) and post processed with Percolator (Kall L. et al Nat Method 2007). A tabular text output of PSM intensity table filtered at 1% peptide level FDR is used as input for downstream analysis.
Read a tabular input in which PSMs are in rows and samples are in columns. Intenisty values were log transformed since systematic effects and variance components are usually assumed to be additive on log scale (Oberg AL. et al JPR 2008; Hill EG. et al JPR 2008).
### retrieve example dataset
library(ExperimentHub)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following object is masked from 'package:limma':
##
## plotMA
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, append,
## as.data.frame, basename, cbind, colMeans, colSums, colnames,
## dirname, do.call, duplicated, eval, evalq, get, grep, grepl,
## intersect, is.unsorted, lapply, lengths, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, rank, rbind,
## rowMeans, rowSums, rownames, sapply, setdiff, sort, table,
## tapply, union, unique, unsplit, which, which.max, which.min
## Loading required package: AnnotationHub
eh = ExperimentHub()
## snapshotDate(): 2018-10-30
query(eh, "DEqMS")
## ExperimentHub with 1 record
## # snapshotDate(): 2018-10-30
## # names(): EH1663
## # package(): DEqMS
## # $dataprovider: Swedish BioMS infrastructure
## # $species: Homo sapiens
## # $rdataclass: data.frame
## # $rdatadateadded: 2018-09-11
## # $title: microRNA treated A431 cell proteomics data
## # $description: High resolution isoelectric focusing LC-MS/MS data for A4...
## # $taxonomyid: 9606
## # $genome: hg38
## # $sourcetype: TXT
## # $sourceurl: ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2016/06/PXD004...
## # $sourcesize: NA
## # $tags: c("Proteomics", "MassSpectrometry", "Preprocessing",
## # "DifferentialExpression", "MultipleComparison", "Normalization",
## # "Bayesian")
## # retrieve record with 'object[["EH1663"]]'
dat.psm = eh[["EH1663"]]
## see ?DEqMS and browseVignettes('DEqMS') for documentation
## downloading 0 resources
## loading from cache
## '/home/biocbuild//.ExperimentHub/1663'
dat.psm.log = dat.psm
dat.psm.log[,3:12] = log2(dat.psm[,3:12])
head(dat.psm.log)
## Peptide gene tmt10plex_126
## 1 +229.163EK+229.163EDDEEEEDEDASGGDQDQEER RAD21 16.75237
## 2 +229.163LGLGIDEDEVAAEEPNAAVPDEIPPLEGDEDASR HSP90AB1 10.83812
## 3 +229.163TEGDEEAEEEQEENLEASGDYK+229.163 XRCC6 14.50514
## 4 +229.163GDAEK+229.163PEEELEEDDDEELDETLSER TOMM22 15.03117
## 8 +229.163APLATGEDDDDEVPDLVENFDEASK+229.163 BTF3 12.91406
## 9 +229.163LEEEDEDEEDGESGC+57.021TFLVGLIQK+229.163 CAPN2 14.98558
## tmt10plex_127N tmt10plex_127C tmt10plex_128N tmt10plex_128C tmt10plex_129N
## 1 16.58542 17.26731 16.89528 17.01872 17.57275
## 2 10.13673 11.11384 11.07480 10.94694 10.79556
## 3 14.24282 15.23424 14.89867 14.74940 14.97737
## 4 14.91910 15.41189 15.28130 15.28605 15.41345
## 8 12.95097 13.00558 13.42184 12.63930 13.62308
## 9 14.97605 15.30197 15.27980 15.10410 15.31982
## tmt10plex_129C tmt10plex_130N tmt10plex_130C tmt10plex_131
## 1 17.17815 16.86259 17.10233 17.75614
## 2 11.12758 11.14692 10.82071 11.21737
## 3 15.15130 15.09598 15.01059 15.46618
## 4 15.25668 15.39181 15.26238 15.79845
## 8 13.12886 12.19316 12.90018 13.52949
## 9 15.25234 15.51071 15.72660 16.06220
Generate sample annotation table.
cond = c("ctrl","miR191","miR372","miR519","ctrl",
"miR372","miR519","ctrl","miR191","miR372")
sampleTable <- data.frame(
row.names = colnames(dat.psm)[3:12],
cond = as.factor(cond))
sampleTable
## cond
## tmt10plex_126 ctrl
## tmt10plex_127N miR191
## tmt10plex_127C miR372
## tmt10plex_128N miR519
## tmt10plex_128C ctrl
## tmt10plex_129N miR372
## tmt10plex_129C miR519
## tmt10plex_130N ctrl
## tmt10plex_130C miR191
## tmt10plex_131 miR372
Since each row is one PSM, to count how many PSMs per protein, we simply use
function table
in R to count the number of occurance that each protein ID
appears.
psm.count.table = as.data.frame(table(dat.psm$gene))
rownames(psm.count.table)=psm.count.table$Var1
plot(sort(log2(psm.count.table$Freq)),pch=".",
xlab="Proteins ordered by PSM count",ylab="log2(PSM count)")
Here, median sweeping is used to summarize PSMs intensities to protein log2
ratios. In this procedure, we substract the spectrum log2 intensity from the
median log2 intensities of all samples. The relative abundance estimate for
each protein is calculated as the median over all PSMs belonging to this
protein.
Assume the log2 intensity of PSM i
in sample j
is \(y_{i,j}\), its relative
log2 intensity of PSM i
in sample j
is \(y'_{i,j}\):
\[y'_{i,j} = y_{i,j} - median_{j'\in ctrl}\ y_{i,j'} \]
Relative abundance of protein k
in sample j
\(Y_{k,j}\) is calculated as:
\[Y_{k,j} = median_{i\in protein\ k}\ y'_{i,j} \]
Correction for differences in amounts of material loaded in the channels is then done by subtracting the channel median from the relative abundance (log2 ratio), normalizing all channels to have median zero.
dat.gene.nm = medianSweeping(dat.psm.log,group_col = 2)
boxplot(dat.gene.nm,xlab="TMT 10 channels",
ylab="log2 relative protein abundance")
We first apply t.test to detect significant protein changes between ctrl samples and miR372 treated samples, both have three replicates.
pval.372 = apply(dat.gene.nm, 1, function(x)
t.test(as.numeric(x[c(1,5,8)]), as.numeric(x[c(3,6,10)]))$p.value)
logFC.372 = rowMeans(dat.gene.nm[,c(3,6,10)])-rowMeans(dat.gene.nm[,c(1,5,8)])
Generate a data.frame of t.test results, add PSM count values and order the table by p-value.
ttest.results = data.frame(gene=rownames(dat.gene.nm),
logFC=logFC.372,P.Value = pval.372,
adj.pval = p.adjust(pval.372,method = "BH"))
ttest.results$PSMcount = psm.count.table[ttest.results$gene,]$Freq
ttest.results = ttest.results[with(ttest.results, order(P.Value)), ]
head(ttest.results)
## gene logFC P.Value adj.pval PSMcount
## CCNE2 CCNE2 0.5386427 6.522987e-07 0.00599593 1
## CPSF2 CPSF2 0.1077977 7.633799e-06 0.03508494 53
## PPOX PPOX -0.2464418 2.546510e-05 0.07802507 6
## RELA RELA -0.5617739 5.078761e-05 0.11670992 31
## IFIT1 IFIT1 0.6375060 7.300925e-05 0.13422020 33
## MAGEA6 MAGEA6 0.4625733 1.093648e-04 0.16178031 6
Anova analysis is equivalent to linear model analysis. The difference to Limma analysis is that estimated variance is not moderated using empirical bayesian approach as it is done in Limma. The purpose here is to compare the statistical power to that when Limma is applied.
We first make a design matrix and then use lmFit
function in Limma for
linear model analysis. The input for lmFit
function is a matrix of relative
protein abundance (log2 ratios) and a design matrix containing sample
annotation. Use head(fit$coefficients)
to see effect size(fold change) of
genes for different conditions.
library(limma)
gene.matrix = as.matrix(dat.gene.nm)
colnames(gene.matrix) = as.character(sampleTable$cond)
design = model.matrix(~cond,sampleTable)
fit1 <- lmFit(gene.matrix,design)
ord.t = fit1$coefficients[, 3]/fit1$sigma/fit1$stdev.unscaled[, 3]
ord.p = 2*pt(-abs(ord.t), fit1$df.residual)
ord.q = p.adjust(ord.p,method = "BH")
anova.results = data.frame(gene=names(fit1$sigma),
logFC=fit1$coefficients[,3],
t=ord.t,
P.Value=ord.p,
adj.P.Val = ord.q)
anova.results$PSMcount = psm.count.table[anova.results$gene,]$Freq
anova.results = anova.results[with(anova.results,order(P.Value)),]
head(anova.results)
## gene logFC t P.Value adj.P.Val PSMcount
## SH3BP1 SH3BP1 0.3426263 18.84397 1.442660e-06 0.005882983 20
## ANKRD52 ANKRD52 -1.1917809 -18.66205 1.527812e-06 0.005882983 17
## TGFBR2 TGFBR2 -1.3474739 -17.38307 2.323406e-06 0.005882983 7
## TBC1D2 TBC1D2 -0.5272092 -16.85506 2.786690e-06 0.005882983 26
## PHLPP2 PHLPP2 -0.7969800 -16.46379 3.200056e-06 0.005882983 8
## CROT CROT -1.1997155 -15.92937 3.885625e-06 0.005952777 20
Limma is essentially a combination of linear model analysis and empirical bayeisan estimation of variance, the latter is applied to increase the statistical power by borrowing informations across genes to shrink the variance.
fit2 <- eBayes(lmFit(gene.matrix,design))
head(fit2$coefficients)
## (Intercept) condmiR191 condmiR372 condmiR519
## A2M 0.235296527 -0.29793777 -0.51680730 -0.368015917
## A2ML1 -0.151670026 0.29957250 0.10781670 0.373261252
## AAAS 0.082921445 -0.11106003 -0.07255767 -0.156424479
## AACS 0.001687942 -0.06661920 -0.04126341 0.004051871
## AAED1 -0.117523764 0.09149186 0.25393152 0.076939663
## AAGAB -0.022929404 0.18415988 0.11458149 0.002356680
Extract limma results using topTable
function, coef = 3
allows you to
extract the contrast of specific condition (miR372), option n= Inf
output
all rows.
Add PSM count values in the table.
limma.results = topTable(fit2,coef = 3,n= Inf)
limma.results$gene = rownames(limma.results)
limma.results$PSMcount = psm.count.table[limma.results$gene,]$Freq
head(limma.results)
## logFC AveExpr t P.Value adj.P.Val B
## ANKRD52 -1.1917809 -0.093887723 -18.26692 2.608305e-08 0.0001326236 9.113399
## TGFBR2 -1.3474739 0.083901815 -18.05371 2.885630e-08 0.0001326236 9.041434
## CROT -1.1997155 -0.060459201 -16.41503 6.530463e-08 0.0002000934 8.438659
## PHLPP2 -0.7969800 -0.001069459 -14.37066 2.029693e-07 0.0004664235 7.543032
## PDCD4 -0.7800666 0.007360661 -13.07496 4.512500e-07 0.0008295781 6.874678
## ZKSCAN1 -0.8816149 -0.056612793 -11.95070 9.591643e-07 0.0012877234 6.218545
## gene PSMcount
## ANKRD52 ANKRD52 17
## TGFBR2 TGFBR2 7
## CROT CROT 20
## PHLPP2 PHLPP2 8
## PDCD4 PDCD4 40
## ZKSCAN1 ZKSCAN1 10
We observed that the variance of gene across samples gradually decreases as the number of spectra count increases.
dat.temp = data.frame(var = fit2$sigma^2,
PSMcount = psm.count.table[names(fit2$sigma),]$Freq)
dat.temp.filter = dat.temp[dat.temp$PSMcount<21,]
op <- par(mfrow=c(1,2), mar=c(4,4,4,1), oma=c(0.5,0.5,0.5,0))
plot(log2(psm.count.table[rownames(dat.gene.nm),2]),
dat.gene.nm[,1]-dat.gene.nm[,5],pch=20,cex=0.5,
xlab="log2(PSM count)",ylab="log2 ratio between two ctrl replicates")
boxplot(log2(var)~PSMcount,dat.temp.filter,xlab="PSMcount",
ylab="log2(Variance)")
Limma assumes equal prior variance for all genes, the function
spectraCounteBayes
in DEqMS package is able to correct the biase of prior
variance estimate for proteins quantified by different number of PSMs. It works
in a similar way to the intensity-based hierarchical Bayes method
(Maureen A. Sartor et al BMC Bioinformatics 2006).
Outputs of spectraCounteBayes
:
object is augmented form of “fit” object from eBayes
in Limma, with the
additions being:
sca.t
- Spectra Count Adjusted posterior t-value
sca.p
- Spectra Count Adjusted posterior p-value
sca.dfprior
- estimated prior degrees of freedom
sca.priorvar
- estimated prior variance
sca.postvar
- estimated posterior variance
loess.model
- fitted loess model
fit2$count = psm.count.table[rownames(fit2$coefficients),]$Freq
fit3 = spectraCounteBayes(fit2)
DEqMS.results = outputResult(fit3,coef_col = 3)
head(DEqMS.results)
## logFC AveExpr t P.Value adj.P.Val B
## ANKRD52 -1.1917809 -0.093887723 -18.26692 2.608305e-08 0.0001326236 9.113399
## CROT -1.1997155 -0.060459201 -16.41503 6.530463e-08 0.0002000934 8.438659
## TGFBR2 -1.3474739 0.083901815 -18.05371 2.885630e-08 0.0001326236 9.041434
## PDCD4 -0.7800666 0.007360661 -13.07496 4.512500e-07 0.0008295781 6.874678
## TRPS1 -0.8122847 -0.050833888 -11.70281 1.142391e-06 0.0013126073 6.063183
## PHLPP2 -0.7969800 -0.001069459 -14.37066 2.029693e-07 0.0004664235 7.543032
## gene count sca.t sca.P.Value sca.adj.pval
## ANKRD52 ANKRD52 17 -18.97949 3.769163e-10 3.147510e-06
## CROT CROT 20 -17.59956 8.866592e-10 3.147510e-06
## TGFBR2 TGFBR2 7 -17.37181 1.027255e-09 3.147510e-06
## PDCD4 PDCD4 40 -14.25841 9.397591e-09 2.159566e-05
## TRPS1 TRPS1 30 -12.78539 3.133624e-08 4.919434e-05
## PHLPP2 PHLPP2 8 -12.75696 3.211119e-08 4.919434e-05
Check if the fitted variance ~ PMS count model is correct.
plotFitCurve(fit3,type = "boxplot",xlab="PSM count")
Visualize the prior variance fitted by Limma and DEqMS
x = fit3$count
y = fit3$sigma^2
limma.prior = fit3$s2.prior
DEqMS.prior = fit3$sca.priorvar
plot(log2(x),log(y),pch=20, cex=0.5,xlab = "log2(PSMcount)",
ylab="log(Variance)")
abline(h = log(limma.prior),col="green",lwd=3 )
k=order(x)
lines(log2(x[k]),log(DEqMS.prior[k]),col="red",lwd=3 )
legend("topright",legend=c("Limma prior variance","DEqMS prior variance"),
col=c("green","red"),lwd=3)
Visualize the change of posterior variance after PSM count adjustment. The plot here shows posterior variance of proteins “shrink” toward the fitted value to different extent depending on PSM number.
x = fit3$count
y = fit3$s2.post
op <- par(mfrow=c(1,2), mar=c(4,4,4,1), oma=c(0.5,0.5,0.5,0))
plot(log2(x),log(y),pch=20, cex=0.5, xlab = "log2(PSMcount)",
ylab="log(Variance)",
main="Posterior Variance in Limma")
y = fit3$sca.postvar
plot(log2(x),log(y),pch=20,cex=0.5, xlab = "log2(PSMcount)",
ylab="log(Variance)",
main="Posterior Variance in DEqMS")
plotting top 500 genes ranked by p-values.
plot(sort(-log10(limma.results$P.Value),decreasing = TRUE)[1:500],
type="l",lty=2,lwd=2, ylab="-log10(p-value)",
xlab="Proteins ranked by p-values",
col="purple")
lines(sort(-log10(DEqMS.results$sca.P.Value),decreasing = TRUE)[1:500],
lty=1,lwd=2,col="red")
lines(sort(-log10(anova.results$P.Value),decreasing = TRUE)[1:500],
lty=2,lwd=2,col="blue")
lines(sort(-log10(ttest.results$P.Value),decreasing = TRUE)[1:500],
lty=2,lwd=2,col="orange")
legend("topright",legend = c("Limma","DEqMS","Anova","t.test"),
col = c("purple","red","blue","orange"),lty=c(2,1,2,2),lwd=2)
peptideProfilePlot
function will plot log2 intensity of each PSM/peptide of
the protein in the input table.
library(ggplot2)
dat.psm.log.ordered = dat.psm.log[,c("Peptide","gene",
sort(colnames(dat.psm.log)[3:12]))]
peptideProfilePlot(dat=dat.psm.log.ordered,col=2,gene="TGFBR2")
## Using Peptide, gene as id variables
library(ggrepel)
# to plot adjusted p-values by BH method on y-axis
DEqMS.results$log.adj.pval = -log10(DEqMS.results$sca.adj.pval)
ggplot(DEqMS.results, aes(x = logFC, y = log.adj.pval)) +
geom_point(size=0.5 )+
theme_bw(base_size = 16) + # change theme
xlab(expression("log2 miR372/ctrl")) + # x-axis label
ylab(expression(" -log10 adj.pval")) + # y-axis label
geom_vline(xintercept = c(-1,1), colour = "red") + # Add fold change cutoffs
geom_hline(yintercept = 2, colour = "red") + # Add significance cutoffs
geom_vline(xintercept = 0, colour = "black") + # Add 0 lines
scale_colour_gradient(low = "black", high = "black", guide = FALSE)+
geom_text_repel(data=subset(DEqMS.results, abs(logFC)>1&log.adj.pval> 2),
aes( logFC, log.adj.pval ,label=gene)) # add gene label
you can also use volcanoplot
function from Limma. However, it uses p.value
from Limma. If you want to plot sca.pvalue
from DEqMS
, you need to modify
the fit object.
fit3$p.value = fit3$sca.p
volcanoplot(fit3,coef=3, style = "p-value", highlight = 20,
names=rownames(fit3$coefficients))
library( RColorBrewer )
pr <- prcomp(t(gene.matrix))
plot( pr$x[,1:2], asp=1, col=brewer.pal(4,"Set1")[sampleTable$cond], pch=17)
text( pr$x[,1], pr$x[,2]-1, label=as.character(sampleTable$cond))
legend( "bottomright", legend = levels( sampleTable$cond ),
col = brewer.pal(4,"Set1"), pch=17 )
plot sample correlation heatmap
library( pheatmap )
cm <- cor( gene.matrix )
# rearrange columns so that same sample types are together
cm.order = cm[order(colnames(cm)),order(colnames(cm))]
pheatmap(cm.order,
cluster_rows=FALSE, cluster_cols=FALSE,
color = colorRampPalette(c("blue", "white", "red"))(100))
or plot Eucldiean distance heatmap
dm <- as.matrix( dist( t( gene.matrix ) ))
dm.order = dm[order(colnames(cm)),order(colnames(cm))]
pheatmap(dm.order,
cluster_rows=FALSE, cluster_cols=FALSE,
color = colorRampPalette(c("red", "white"))(100))