Abstract
clusterProfiler implements methods to analyze and visualize functional profiles of genomic coordinates (supported by ChIPseeker), gene and gene clusters.
Supported Analysis
- Over-Representation Analysis
- Gene Set Enrichment Analysis
- Biological theme comparison
Supported ontologies/pathways
- Disease Ontology (via DOSE)
- Network of Cancer Gene (via DOSE)
- DisGeNET (via DOSE)
- Gene Ontology (supports many species with GO annotation query online via AnnotationHub)
- KEGG Pathway and Module with latest online data (supports more than 4000 species listed in http://www.genome.jp/kegg/catalog/org_list.html)
- Reactome Pathway (via ReactomePA)
- DAVID (via RDAVIDWebService)
- Molecular Signatures Database
- hallmark gene sets
- positional gene sets
- curated gene sets
- motif gene sets
- computational gene sets
- GO gene sets
- oncogenic signatures
- immunologic signatures
- Other Annotations
- from other sources (e.g. DisGeNET as an example)
- user’s annotation
- customized ontology
- and many others
Visualization
- barplot
- cnetplot
- dotplot
- emapplot
- gseaplot
- goplot
- upsetplot
Citation
If you use clusterProfiler in published research, please cite:
G Yu, LG Wang, Y Han, QY He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 2012, 16(5):284-287. doi:[10.1089/omi.2011.0118](http://dx.doi.org/10.1089/omi.2011.0118)
Introduction
In recently years, high-throughput experimental techniques such as microarray, RNA-Seq and mass spectrometry can detect cellular molecules at systems-level. These kinds of analyses generate huge quantitaties of data, which need to be given a biological interpretation. A commonly used approach is via clustering in the gene dimension for grouping different genes based on their similarities(Yu et al. 2010).
To search for shared functions among genes, a common way is to incorporate the biological knowledge, such as Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG), for identifying predominant biological themes of a collection of genes.
After clustering analysis, researchers not only want to determine whether there is a common theme of a particular gene cluster, but also to compare the biological themes among gene clusters. The manual step to choose interesting clusters followed by enrichment analysis on each selected cluster is slow and tedious. To bridge this gap, we designed clusterProfiler(Yu et al. 2012), for comparing and visualizing functional profiles among gene clusters.
bitr
: Biological Id TranslatoR
clusterProfiler provides bitr
and bitr_kegg
for converting ID types. Both bitr
and bitr_kegg
support many species including model and many non-model organisms.
x <- c("GPX3", "GLRX", "LBP", "CRYAB", "DEFB1", "HCLS1", "SOD2", "HSPA2",
"ORM1", "IGFBP1", "PTHLH", "GPC3", "IGFBP3","TOB1", "MITF", "NDRG1",
"NR1H4", "FGFR3", "PVR", "IL6", "PTPRM", "ERBB2", "NID2", "LAMB1",
"COMP", "PLS3", "MCAM", "SPP1", "LAMC1", "COL4A2", "COL4A1", "MYOC",
"ANXA4", "TFPI2", "CST6", "SLPI", "TIMP2", "CPM", "GGT1", "NNMT",
"MAL", "EEF1A2", "HGD", "TCN2", "CDA", "PCCA", "CRYM", "PDXK",
"STC1", "WARS", "HMOX1", "FXYD2", "RBP4", "SLC6A12", "KDELR3", "ITM2B")
eg = bitr(x, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
head(eg)
## SYMBOL ENTREZID
## 1 GPX3 2878
## 2 GLRX 2745
## 3 LBP 3929
## 4 CRYAB 1410
## 5 DEFB1 1672
## 6 HCLS1 3059
User should provides an annotation package, both fromType and toType can accept any types that supported.
User can use keytypes to list all supporting types.
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
## [5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
## [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL"
## [13] "IPI" "MAP" "OMIM" "ONTOLOGY"
## [17] "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [25] "UNIGENE" "UNIPROT"
We can translate from one type to other types.
## SYMBOL UNIPROT ENSEMBL
## 1 GPX3 P22352 ENSG00000211445
## 2 GLRX A0A024RAM2 ENSG00000173221
## 3 GLRX P35754 ENSG00000173221
## 4 LBP P18428 ENSG00000129988
## 5 LBP Q8TCF0 ENSG00000129988
## 6 CRYAB P02511 ENSG00000109846
For GO analysis, user don’t need to convert ID, all ID type provided by OrgDb
can be used in groupGO
, enrichGO
and gseGO
by specifying keyType
parameter.
bitr_kegg
: converting biological IDs using KEGG API
## [1] "4597" "7111" "5266" "2175" "755" "23046"
## kegg ncbi-proteinid
## 1 10001 NP_005457
## 2 10209 NP_005792
## 3 10232 NP_037536
## 4 10324 NP_006054
## 5 10411 NP_001092002
## 6 10614 NP_006451
The ID type (both fromType
& toType
) should be one of ‘kegg’, ‘ncbi-geneid’, ‘ncbi-proteinid’ or ‘uniprot’. The ‘kegg’ is the primary ID used in KEGG database. The data source of KEGG was from NCBI. A rule of thumb for the ‘kegg’ ID is entrezgene
ID for eukaryote species and Locus
ID for prokaryotes.
Many prokaryote species don’t have entrezgene ID available. For example we can check the gene information of ece:Z5100
in http://www.genome.jp/dbget-bin/www_bget?ece:Z5100, which have NCBI-ProteinID
and UnitProt
links in the Other DBs
Entry, but not NCBI-GeneID
.
If we try to convert Z5100
to ncbi-geneid
, bitr_kegg
will throw error of ncbi-geneid is not supported
.
## Error in KEGG_convert(fromType, toType, organism) :
## ncbi-geneid is not supported for ece ...
We can of course convert it to ncbi-proteinid
and uniprot
:
## kegg ncbi-proteinid
## 1 Z5100 AAG58814
## kegg uniprot
## 1 Z5100 A0A384KFP7
GO Analysis
Supported organisms
GO analyses (groupGO()
, enrichGO()
and gseGO()
) support organisms that have an OrgDb
object available.
Bioconductor have already provide OrgDb
for about 20 species. User can query OrgDb
online by AnnotationHub or build their own by AnnotationForge. An example can be found in the vignette of GOSemSim.
If user have GO annotation data (in data.frame format with first column of gene ID and second column of GO ID), they can use enricher()
and gseGO()
functions to perform over-representation test and gene set enrichment analysis.
If genes are annotated by direction annotation, it should also annotated by its ancestor GO nodes (indirect annation). If user only has direct annotation, they can pass their annotation to buildGOmap
function, which will infer indirection annotation and generate a data.frame
that suitable for both enricher()
and gseGO()
.
GO classification
In clusterProfiler, groupGO
is designed for gene classification based on GO distribution at a specific level. Here we use dataset geneList
provided by DOSE. Please refer to vignette of DOSE for more details.
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
gene.df <- bitr(gene, fromType = "ENTREZID",
toType = c("ENSEMBL", "SYMBOL"),
OrgDb = org.Hs.eg.db)
head(gene.df)
## ENTREZID ENSEMBL SYMBOL
## 1 4312 ENSG00000196611 MMP1
## 2 8318 ENSG00000093009 CDC45
## 3 10874 ENSG00000109255 NMU
## 4 55143 ENSG00000134690 CDCA8
## 5 55388 ENSG00000065328 MCM10
## 6 991 ENSG00000117399 CDC20
## [1] ID Description Count GeneRatio geneID
## <0 rows> (or 0-length row.names)
The input parameters of gene is a vector of gene IDs (can be any ID type that supported by corresponding OrgDb
).
If readable is setting to TRUE, the input gene IDs will be converted to gene symbols.
GO over-representation test
Over-representation test(Boyle et al. 2004) were implemented in clusterProfiler. For calculation details and explanation of paramters, please refer to the vignette of DOSE.
ego <- enrichGO(gene = gene,
universe = names(geneList),
OrgDb = org.Hs.eg.db,
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
head(ego)
## ID Description GeneRatio
## GO:0005819 GO:0005819 spindle 25/202
## GO:0005876 GO:0005876 spindle microtubule 12/202
## GO:0000779 GO:0000779 condensed chromosome, centromeric region 15/202
## GO:0072686 GO:0072686 mitotic spindle 14/202
## GO:0000775 GO:0000775 chromosome, centromeric region 18/202
## GO:0000776 GO:0000776 kinetochore 15/202
## BgRatio pvalue p.adjust qvalue
## GO:0005819 262/11812 2.569986e-12 8.146855e-10 7.520590e-10
## GO:0005876 45/11812 7.912518e-12 1.254134e-09 1.157726e-09
## GO:0000779 91/11812 3.254384e-11 3.438799e-09 3.174452e-09
## GO:0072686 84/11812 1.292642e-10 9.903693e-09 9.142376e-09
## GO:0000775 156/11812 1.562097e-10 9.903693e-09 9.142376e-09
## GO:0000776 106/11812 3.073022e-10 1.623580e-08 1.498772e-08
## geneID
## GO:0005819 CDCA8/CDC20/KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/TACC3/NEK2/CDK1/MAD2L1/KIF18A/BIRC5/KIF11/TTK/AURKB/PRC1/KIFC1/KIF18B/KIF20A/AURKA/CCNB1/KIF4A
## GO:0005876 CENPE/SKA1/NUSAP1/CDK1/KIF18A/BIRC5/KIF11/AURKB/PRC1/KIF18B/AURKA/KIF4A
## GO:0000779 CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/CDT1/BIRC5/NCAPG/AURKB/AURKA/CCNB1
## GO:0072686 KIF23/CENPE/ASPM/NUSAP1/TPX2/TACC3/CDK1/MAD2L1/KIF18A/KIF11/AURKB/KIFC1/KIF18B/AURKA
## GO:0000775 CDCA8/CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/NCAPG/AURKB/AURKA/CCNB1
## GO:0000776 CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/AURKB/CCNB1
## Count
## GO:0005819 25
## GO:0005876 12
## GO:0000779 15
## GO:0072686 14
## GO:0000775 18
## GO:0000776 15
As I mentioned before, any gene ID type that supported in OrgDb
can be directly used in GO analyses. User need to specify the keyType
parameter to specify the input gene ID type.
ego2 <- enrichGO(gene = gene.df$ENSEMBL,
OrgDb = org.Hs.eg.db,
keyType = 'ENSEMBL',
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
Gene ID can be mapped to gene Symbol by using paramter readable=TRUE
or setReadable
function.
drop specific GO terms or level
enrichGO
test the whole GO corpus and enriched result may contains very general terms. With dropGO
function, user can remove specific GO terms or GO level from results obtained from both enrichGO
and compareCluster
.
test GO at sepcific level
enrichGO
doesn’t contain parameter to restrict the test at specific GO level. Instead, we provide a function gofilter
to restrict the result at specific GO level. It works with results obtained from both enrichGO
and compareCluster
.
reduce redundancy of enriched GO terms
According to issue #28, I implement a simplify
method to remove redundant GO terms obtained from enrichGO
. An example can be found in the blog post. It internally call GOSemSim to calculate similarities among GO terms and remove those highly similar terms by keeping one representative term. The simplify
method also works with both outputs from enrichGO
and compareCluster
.
GO Gene Set Enrichment Analysis
A common approach in analyzing gene expression profiles was identifying differential expressed genes that are deemed interesting. The enrichment analysis we demonstrated previous were based on these differential expressed genes. This approach will find genes where the difference is large, but it will not detect a situation where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA)(Subramanian et al. 2005) directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.
For algorithm details, please refer to the vignette of DOSE.
ego3 <- gseGO(geneList = geneList,
OrgDb = org.Hs.eg.db,
ont = "CC",
nPerm = 1000,
minGSSize = 100,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = FALSE)
GSEA use permutation test, user can set nPerm for number of permutations. Only gene Set size in [minGSSize, maxGSSize]
will be tested.
If you have issues in preparing your own geneList
, please refer to the wiki page.
GO Semantic Similarity Analysis
GO semantic similarity can be calculated by GOSemSim(Yu et al. 2010). We can use it to cluster genes/proteins into different clusters based on their functional similarity and can also use it to measure the similarities among GO terms to reduce the redundancy of GO enrichment results.
KEGG analysis
The annotation package, KEGG.db
, is not updated since 2012. It’s now pretty old and in clusterProfiler, enrichKEGG
(for KEGG pathway) and enrichMKEGG
(for KEGG module) supports downloading latest online version of KEGG data for enrichment analysis. Using KEGG.db
is also supported by explicitly setting use_internal_data parameter to TRUE, but it’s not recommended.
With this new feature, organism is not restricted to those supported in previous release, it can be any species that have KEGG annotation data available in KEGG database. User should pass abbreviation of academic name to the organism parameter. The full list of KEGG supported organisms can be accessed via http://www.genome.jp/kegg/catalog/org_list.html.
clusterProfiler provides search_kegg_organism()
function to help searching supported organisms.
## kegg_code scientific_name common_name
## 366 ece Escherichia coli O157:H7 EDL933 (EHEC) <NA>
## [1] 65 3
## kegg_code scientific_name common_name
## 361 eco Escherichia coli K-12 MG1655 <NA>
## 362 ecj Escherichia coli K-12 W3110 <NA>
## 363 ecd Escherichia coli K-12 DH10B <NA>
## 364 ebw Escherichia coli BW2952 <NA>
## 365 ecok Escherichia coli K-12 MDS42 <NA>
## 366 ece Escherichia coli O157:H7 EDL933 (EHEC) <NA>
KEGG over-representation test
## ID Description GeneRatio
## hsa04110 hsa04110 Cell cycle 11/90
## hsa04114 hsa04114 Oocyte meiosis 10/90
## hsa04218 hsa04218 Cellular senescence 10/90
## hsa03320 hsa03320 PPAR signaling pathway 7/90
## hsa04914 hsa04914 Progesterone-mediated oocyte maturation 7/90
## BgRatio pvalue p.adjust qvalue
## hsa04110 124/7470 2.317335e-07 4.472457e-05 0.0000439074
## hsa04114 125/7470 2.214201e-06 2.136704e-04 0.0002097664
## hsa04218 160/7470 2.011989e-05 1.294380e-03 0.0012707299
## hsa03320 74/7470 2.721780e-05 1.313259e-03 0.0012892640
## hsa04914 99/7470 1.765007e-04 6.812927e-03 0.0066884479
## geneID Count
## hsa04110 8318/991/9133/890/983/4085/7272/1111/891/4174/9232 11
## hsa04114 991/9133/983/4085/51806/6790/891/9232/3708/5241 10
## hsa04218 2305/4605/9133/890/983/51806/1111/891/776/3708 10
## hsa03320 4312/9415/9370/5105/2167/3158/5346 7
## hsa04914 9133/890/983/4085/6790/891/5241 7
Input ID type can be kegg
, ncbi-geneid
, ncbi-proteinid
or uniprot
, an example can be found in the post.
KEGG Gene Set Enrichment Analysis
kk2 <- gseKEGG(geneList = geneList,
organism = 'hsa',
nPerm = 1000,
minGSSize = 120,
pvalueCutoff = 0.05,
verbose = FALSE)
head(kk2)
## ID Description setSize
## hsa04151 hsa04151 PI3K-Akt signaling pathway 322
## hsa04510 hsa04510 Focal adhesion 188
## hsa04218 hsa04218 Cellular senescence 143
## hsa05166 hsa05166 Human T-cell leukemia virus 1 infection 200
## hsa05169 hsa05169 Epstein-Barr virus infection 192
## hsa05170 hsa05170 Human immunodeficiency virus 1 infection 189
## enrichmentScore NES pvalue p.adjust qvalues rank
## hsa04151 -0.3482755 -1.503715 0.001328021 0.0276644 0.01814059 1997
## hsa04510 -0.4188582 -1.725829 0.001390821 0.0276644 0.01814059 2183
## hsa04218 0.4153718 1.792654 0.002967359 0.0276644 0.01814059 1155
## hsa05166 0.3926594 1.744570 0.003484321 0.0276644 0.01814059 1955
## hsa05169 0.4343189 1.916668 0.003533569 0.0276644 0.01814059 2820
## hsa05170 0.3711150 1.635182 0.003533569 0.0276644 0.01814059 3178
## leading_edge
## hsa04151 tags=23%, list=16%, signal=20%
## hsa04510 tags=27%, list=17%, signal=23%
## hsa04218 tags=17%, list=9%, signal=16%
## hsa05166 tags=26%, list=16%, signal=23%
## hsa05169 tags=40%, list=23%, signal=31%
## hsa05170 tags=38%, list=25%, signal=29%
## core_enrichment
## hsa04151 2252/7059/92579/5563/5295/6794/1288/7010/3910/3371/3082/1291/4602/3791/1027/90993/3441/3643/1129/2322/1975/7450/596/3685/1942/2149/1280/4804/3675/595/2261/7248/2246/4803/3912/1902/1278/1277/2846/2057/1293/2247/55970/5618/7058/10161/56034/3693/4254/3480/4908/5159/1292/3908/2690/3909/8817/9223/4915/3551/2791/63923/3913/9863/3667/1287/3679/7060/3479/80310/1311/5105/2066/1101
## hsa04510 5228/7424/1499/4636/83660/7059/5295/1288/23396/3910/3371/3082/1291/394/3791/7450/596/3685/1280/3675/595/2318/3912/1793/1278/1277/1293/10398/55742/2317/7058/25759/56034/3693/3480/5159/857/1292/3908/3909/63923/3913/1287/3679/7060/3479/10451/80310/1311/1101
## hsa04218 2305/4605/9133/890/983/51806/1111/891/993/3576/1978/898/9134/4609/1869/1029/22808/3804/1871/5499/91860/292/1019/11200/1875
## hsa05166 991/9133/890/4085/7850/1111/9232/8061/701/9700/898/4316/9134/3932/3559/3126/3112/4609/3561/917/1869/1029/915/114/2005/5902/55697/1871/1031/2224/292/1019/3689/916/3383/11200/706/3600/6513/3601/468/5604/7124/1030/3569/4049/4055/10393/3119/5901/5971/1959/3135
## hsa05169 3627/890/6890/9636/898/9134/6502/6772/3126/3112/4609/917/5709/1869/3654/919/915/4067/4938/864/4940/5713/5336/11047/3066/54205/1871/578/1019/637/916/3383/4939/10213/23586/4793/5603/7979/7128/6891/930/5714/3452/6850/5702/4794/7124/3569/7097/5708/2208/8772/3119/5704/7186/5971/3135/1380/958/5610/4792/10018/8819/3134/10379/9641/1147/5718/6300/3109/811/5606/2923/3108/5707/1432
## hsa05170 9133/9582/983/51806/1111/891/6890/200315/917/3654/919/915/5336/54205/91860/578/995/25939/637/1234/916/5603/2792/5880/6891/6921/3452/5604/7124/9978/7097/7852/8772/1174/7186/3135/164/60489/2787/356/7133/5605/27350/6199/1642/5594/4792/3134/5578/4893/8454/5582/2786/1147/3984/6300/200316/811/5606/2923/4775/162/1432/2784/836/5747/5058/3106/2770/5534/5579/4615
KEGG Module over-representation test
KEGG Module is a collection of manually defined function units. In some situation, KEGG Modules have a more straightforward interpretation.
KEGG Module Gene Set Enrichment Analysis
Disease analysis
DOSE(Yu et al. 2015) supports Disease Ontology (DO) Semantic and Enrichment analysis. The enrichDO
function is very useful for identifying disease association of interesting genes, and function gseDO
function is designed for gene set enrichment analysis of DO.
In addition, DOSE also supports enrichment analysis of Network of Cancer Gene (NCG)(A. et al. 2016) and Disease Gene Network(Janet et al. 2015), please refer to the DOSE vignettes.
Reactome pathway analysis
ReactomePA(Yu and He 2016) uses Reactome as a source of pathway data. The function call of enrichPathway
and gsePathway
in ReactomePA is consistent with enrichKEGG
and gseKEGG
.
DAVID functional analysis
clusterProfiler provides enrichment and GSEA analysis with GO, KEGG, DO and Reactome pathway supported internally, some user may prefer GO and KEGG analysis with DAVID(Huang et al. 2007) and still attracted by the visualization methods provided by clusterProfiler(???). To bridge the gap between DAVID and clusterProfiler, we implemented enrichDAVID
. This function query enrichment analysis result from DAVID webserver via RDAVIDWebService(Fresno and Fernández 2013) and stored the result as an enrichResult
instance, so that we can use all the visualization functions in clusterProfiler to visualize DAVID results. enrichDAVID
is fully compatible with compareCluster
function and comparing enrichment results from different gene clusters is now available with DAVID.
david <- enrichDAVID(gene = gene,
idType = "ENTREZ_GENE_ID",
listType = "Gene",
annotation = "KEGG_PATHWAY",
david.user = "clusterProfiler@hku.hk")
DAVID Web Service has the following limitations:
- A job with more than 3000 genes to generate gene or term cluster report will not be handled by DAVID due to resource limit.
- No more than 200 jobs in a day from one user or computer.
- DAVID Team reserves right to suspend any improper uses of the web service without notice.
For more details, please refer to http://david.abcc.ncifcrf.gov/content.jsp?file=WS.html.
As user has limited usage, please register and use your own user account to run enrichDAVID
.
Universal enrichment analysis
clusterProfiler supports both hypergeometric test and gene set enrichment analyses of many ontology/pathway, but it’s still not enough for users may want to analyze their data with unsupported organisms, slim version of GO, novel functional annotation (e.g. GO via BlastGO or KEGG via KAAS), unsupported ontologies/pathways or customized annotations.
clusterProfiler provides enricher
function for hypergeometric test and GSEA
function for gene set enrichment analysis that are designed to accept user defined annotation. They accept two additional parameters TERM2GENE and TERM2NAME. As indicated in the parameter names, TERM2GENE is a data.frame with first column of term ID and second column of corresponding mapped gene and TERM2NAME is a data.frame with first column of term ID and second column of corresponding term name. TERM2NAME is optional.
An example of using enricher
and GSEA
to analyze DisGeNet annotation is presented in the post, use clusterProfiler as an universal enrichment analysis tool.
Using MSigDB gene set collections
The MSigDB is a collection of annotated gene sets, it include 8 major collections:
- H: hallmark gene sets
- C1: positional gene sets
- C2: curated gene sets
- C3: motif gene sets
- C4: computational gene sets
- C5: GO gene sets
- C6: oncogenic signatures
- C7: immunologic signatures
Users can use enricher
and GSEA
function to analyze gene set collections downloaded from Molecular Signatures Database (MSigDb). clusterProfiler provides a function, read.gmt
, to parse the gmt file into a TERM2GENE data.frame
that is ready for both enricher
and GSEA
functions.
gmtfile <- system.file("extdata", "c5.cc.v5.0.entrez.gmt", package="clusterProfiler")
c5 <- read.gmt(gmtfile)
egmt <- enricher(gene, TERM2GENE=c5)
head(egmt)
## ID Description
## SPINDLE SPINDLE SPINDLE
## MICROTUBULE_CYTOSKELETON MICROTUBULE_CYTOSKELETON MICROTUBULE_CYTOSKELETON
## CYTOSKELETAL_PART CYTOSKELETAL_PART CYTOSKELETAL_PART
## SPINDLE_MICROTUBULE SPINDLE_MICROTUBULE SPINDLE_MICROTUBULE
## MICROTUBULE MICROTUBULE MICROTUBULE
## CYTOSKELETON CYTOSKELETON CYTOSKELETON
## GeneRatio BgRatio pvalue p.adjust
## SPINDLE 11/82 39/5270 7.667674e-12 5.214018e-10
## MICROTUBULE_CYTOSKELETON 16/82 152/5270 8.449298e-10 2.872761e-08
## CYTOSKELETAL_PART 15/82 235/5270 2.414879e-06 5.237096e-05
## SPINDLE_MICROTUBULE 5/82 16/5270 3.080645e-06 5.237096e-05
## MICROTUBULE 6/82 32/5270 7.740446e-06 1.052701e-04
## CYTOSKELETON 16/82 367/5270 1.308357e-04 1.482805e-03
## qvalue
## SPINDLE 4.197043e-10
## MICROTUBULE_CYTOSKELETON 2.312439e-08
## CYTOSKELETAL_PART 4.215619e-05
## SPINDLE_MICROTUBULE 4.215619e-05
## MICROTUBULE 8.473751e-05
## CYTOSKELETON 1.193589e-03
## geneID
## SPINDLE 991/9493/9787/22974/983/332/3832/7272/9055/6790/24137
## MICROTUBULE_CYTOSKELETON 991/9493/9133/7153/9787/22974/4751/983/332/3832/7272/9055/6790/24137/4137/7802
## CYTOSKELETAL_PART 991/9493/7153/9787/22974/4751/983/332/3832/7272/9055/6790/24137/4137/7802
## SPINDLE_MICROTUBULE 983/332/3832/9055/24137
## MICROTUBULE 983/332/3832/9055/24137/4137
## CYTOSKELETON 991/9493/9133/7153/9787/22974/4751/983/332/3832/7272/9055/6790/24137/4137/7802
## Count
## SPINDLE 11
## MICROTUBULE_CYTOSKELETON 16
## CYTOSKELETAL_PART 15
## SPINDLE_MICROTUBULE 5
## MICROTUBULE 6
## CYTOSKELETON 16
## ID
## EXTRACELLULAR_REGION EXTRACELLULAR_REGION
## EXTRACELLULAR_REGION_PART EXTRACELLULAR_REGION_PART
## PROTEINACEOUS_EXTRACELLULAR_MATRIX PROTEINACEOUS_EXTRACELLULAR_MATRIX
## CELL_PROJECTION CELL_PROJECTION
## EXTRACELLULAR_MATRIX EXTRACELLULAR_MATRIX
## EXTRACELLULAR_MATRIX_PART EXTRACELLULAR_MATRIX_PART
## Description
## EXTRACELLULAR_REGION EXTRACELLULAR_REGION
## EXTRACELLULAR_REGION_PART EXTRACELLULAR_REGION_PART
## PROTEINACEOUS_EXTRACELLULAR_MATRIX PROTEINACEOUS_EXTRACELLULAR_MATRIX
## CELL_PROJECTION CELL_PROJECTION
## EXTRACELLULAR_MATRIX EXTRACELLULAR_MATRIX
## EXTRACELLULAR_MATRIX_PART EXTRACELLULAR_MATRIX_PART
## setSize enrichmentScore NES
## EXTRACELLULAR_REGION 401 -0.3860230 -1.690496
## EXTRACELLULAR_REGION_PART 310 -0.4101043 -1.756048
## PROTEINACEOUS_EXTRACELLULAR_MATRIX 93 -0.6355317 -2.360749
## CELL_PROJECTION 87 -0.4729701 -1.744031
## EXTRACELLULAR_MATRIX 95 -0.6229461 -2.318387
## EXTRACELLULAR_MATRIX_PART 54 -0.5908035 -1.973392
## pvalue p.adjust qvalues rank
## EXTRACELLULAR_REGION 0.001221001 0.03665911 0.02864304 1797
## EXTRACELLULAR_REGION_PART 0.001272265 0.03665911 0.02864304 1897
## PROTEINACEOUS_EXTRACELLULAR_MATRIX 0.001430615 0.03665911 0.02864304 1473
## CELL_PROJECTION 0.001440922 0.03665911 0.02864304 2280
## EXTRACELLULAR_MATRIX 0.001445087 0.03665911 0.02864304 1473
## EXTRACELLULAR_MATRIX_PART 0.001515152 0.03665911 0.02864304 1794
## leading_edge
## EXTRACELLULAR_REGION tags=29%, list=14%, signal=26%
## EXTRACELLULAR_REGION_PART tags=32%, list=15%, signal=28%
## PROTEINACEOUS_EXTRACELLULAR_MATRIX tags=49%, list=12%, signal=44%
## CELL_PROJECTION tags=28%, list=18%, signal=23%
## EXTRACELLULAR_MATRIX tags=48%, list=12%, signal=43%
## EXTRACELLULAR_MATRIX_PART tags=59%, list=14%, signal=51%
## core_enrichment
## EXTRACELLULAR_REGION 3910/51162/2878/2717/3373/4153/10406/1301/6750/7474/4925/7450/80781/1490/1306/3931/4314/6586/3964/10272/8425/8082/11005/4256/3483/7482/8910/23037/27122/7042/3912/4322/167/2817/9353/6037/1278/2934/5176/4060/283/30008/5549/5950/22795/727/10516/23452/1293/2247/1295/1012/6469/2192/1281/4023/54360/50509/11167/4319/1290/9365/3952/10879/11096/2202/4313/3625/2199/6444/6320/1294/3075/4653/5764/3991/3263/1462/1289/3908/4016/3909/4053/8817/7033/8292/5125/2162/5744/1842/5654/10631/2331/3730/2487/347/6863/5104/3913/27123/4982/1300/2200/9607/1287/7060/1489/9723/6424/1307/1311/4693/4148/1101/2922/10647
## EXTRACELLULAR_REGION_PART 268/3567/57124/3910/2878/3373/4153/10406/1301/6750/7474/4925/80781/1490/1306/3931/4314/6586/3964/10272/8425/8082/4256/3483/7482/8910/27122/3912/4322/167/2817/1278/4060/283/30008/5549/5950/22795/727/10516/23452/1293/2247/1295/1012/6469/2192/1281/54360/50509/11167/4319/1290/9365/3952/10879/11096/2202/4313/2199/6444/1294/3075/4653/5764/3991/3263/1462/1289/3908/3909/4053/8817/8292/5125/5744/1842/5654/10631/2331/3730/347/6863/3913/27123/1300/2200/9607/1287/7060/9723/6424/1307/1311/4693/4148/1101/2922/10647
## PROTEINACEOUS_EXTRACELLULAR_MATRIX 1490/1306/8425/8082/4256/8910/3912/1278/4060/283/30008/5549/22795/10516/1293/1295/2192/1281/50509/4319/1290/11096/2202/2199/6444/1294/1462/1289/3908/3909/4053/8292/1842/10631/2331/3730/3913/1300/2200/1287/7060/1307/1311/4148/1101
## CELL_PROJECTION 4763/57147/64064/80184/322/54997/7248/5311/7042/323/4747/4744/1012/11346/2191/4741/4646/9576/114327/51466/27124/4137/7802
## EXTRACELLULAR_MATRIX 1490/1306/8425/8082/4256/8910/3912/1278/4060/283/30008/5549/22795/10516/1293/1295/2192/1281/50509/4319/1290/11096/2202/2199/6444/1294/1462/1289/3908/3909/4053/8292/1842/10631/2331/3730/3913/1300/2200/1287/7060/1307/1311/4148/1101
## EXTRACELLULAR_MATRIX_PART 3915/6443/55914/3910/1301/80781/1306/8082/8910/3912/1278/4060/283/30008/22795/1293/1295/1281/50509/1290/6444/1294/1289/3908/3909/8292/3913/1300/2200/1287/1307
WikiPathways analysis
In contrast to KEGG, WikiPathways is a continuously updated pathway database curated by a community of researchers and pathway enthusiasts. WikiPathways produces monthly releases of gmt files for supported organisms at data.wikipathways.org. Download the appropriate gmt file and then generate TERM2GENE
and TERM2NAME
to use enricher
and GSEA
functions.
wpgmtfile <- system.file("extdata", "wikipathways-20180810-gmt-Homo_sapiens.gmt", package="clusterProfiler")
wp2gene <- read.gmt(wpgmtfile)
wp2gene <- wp2gene %>% tidyr::separate(ont, c("name","version","wpid","org"), "%")
wpid2gene <- wp2gene %>% dplyr::select(wpid, gene) #TERM2GENE
wpid2name <- wp2gene %>% dplyr::select(wpid, name) #TERM2NAME
ewp <- enricher(gene, TERM2GENE = wpid2gene, TERM2NAME = wpid2name)
ewp <- setReadable(ewp, org.Hs.eg.db, keytype = "ENTREZID")
head(ewp)
## ID
## WP2446 WP2446
## WP2361 WP2361
## WP179 WP179
## WP3942 WP3942
## WP4240 WP4240
## WP2328 WP2328
## Description
## WP2446 Retinoblastoma (RB) in Cancer
## WP2361 Gastric Cancer Network 1
## WP179 Cell Cycle
## WP3942 PPAR signaling pathway
## WP4240 Regulation of sister chromatid separation at the metaphase-anaphase transition
## WP2328 Allograft Rejection
## GeneRatio BgRatio pvalue p.adjust qvalue
## WP2446 11/95 88/6249 6.801697e-08 1.054263e-05 9.450779e-06
## WP2361 6/95 29/6249 3.772735e-06 2.923870e-04 2.621058e-04
## WP179 10/95 122/6249 1.384549e-05 7.153503e-04 6.412648e-04
## WP3942 7/95 67/6249 6.210513e-05 2.406574e-03 2.157336e-03
## WP4240 4/95 16/6249 7.931988e-05 2.458916e-03 2.204258e-03
## WP2328 7/95 90/6249 4.016758e-04 1.037663e-02 9.301967e-03
## geneID Count
## WP2446 CDC45/CCNB2/TOP2A/RRM2/CCNA2/CDK1/CDT1/TTK/CHEK1/CCNB1/KIF4A 11
## WP2361 MYBL2/TOP2A/UBE2C/TPX2/S100P/AURKA 6
## WP179 CDC45/CDC20/CCNB2/CCNA2/CDK1/TTK/CHEK1/CCNB1/MCM5/PTTG1 10
## WP3942 MMP1/FADS2/ADIPOQ/PCK1/FABP4/HMGCS2/PLIN1 7
## WP4240 CDC20/CENPE/MAD2L1/PTTG1 4
## WP2328 CXCL13/CXCL11/CXCL9/GZMB/GNLY/HLA-DQA1/C7 7
ewp2 <- GSEA(geneList, TERM2GENE = wpid2gene, TERM2NAME = wpid2name, verbose=FALSE)
ewp2 <- setReadable(ewp2, org.Hs.eg.db, keytype = "ENTREZID")
head(ewp2)
## ID Description setSize
## WP3932 WP3932 Focal Adhesion-PI3K-Akt-mTOR-signaling pathway 281
## WP306 WP306 Focal Adhesion 188
## WP236 WP236 Adipogenesis 125
## WP474 WP474 Endochondral Ossification 59
## WP2911 WP2911 miRNA targets in ECM and membrane receptors 21
## WP3967 WP3967 miR-509-3p alteration of YAP1/ECM axis 14
## enrichmentScore NES pvalue p.adjust qvalues rank
## WP3932 -0.3903410 -1.670043 0.001360544 0.0363235 0.03041921 1994
## WP306 -0.4230017 -1.729701 0.001408451 0.0363235 0.03041921 2221
## WP236 -0.4387536 -1.708479 0.001481481 0.0363235 0.03041921 2301
## WP474 -0.5111776 -1.774835 0.001550388 0.0363235 0.03041921 2142
## WP2911 -0.6953475 -1.925629 0.001675042 0.0363235 0.03041921 2629
## WP3967 -0.7302346 -1.848903 0.001677852 0.0363235 0.03041921 929
## leading_edge
## WP3932 tags=26%, list=16%, signal=22%
## WP306 tags=28%, list=18%, signal=23%
## WP236 tags=34%, list=18%, signal=28%
## WP474 tags=47%, list=17%, signal=40%
## WP2911 tags=76%, list=21%, signal=60%
## WP3967 tags=50%, list=7%, signal=46%
## core_enrichment
## WP3932 THBS3/CAB39/IRS2/PRKAA2/PIK3R1/STK11/COL4A6/TEK/LAMA4/TNC/HGF/KDR/COL11A1/CDKN1B/CREB3L1/INSR/CHRM2/EIF4B/VWF/ITGAV/EPAS1/EFNA1/F2R/COL2A1/NGFR/ITGA3/FGFR3/TSC1/FGF1/NGF/FGF14/LAMB1/LPAR1/FOXO1/COL1A2/COL1A1/CAB39L/LPAR4/EPOR/FGF2/COL3A1/COL5A3/COL5A2/GNG12/PRLR/THBS2/LPAR6/PDGFC/ITGB5/KITLG/SREBF1/IGF1R/PDGFRB/LIPE/COL5A1/COL6A2/LAMA2/GHR/LAMA3/FGF18/IKBKB/GNG11/TNN/LAMB2/IRS1/ITGA7/THBS4/IGF1/PDGFD/COMP/CHAD/FOXA1
## WP306 MAPK3/PGF/VEGFC/CTNNB1/MYL5/TLN2/THBS3/PIK3R1/COL4A6/PIP5K1C/LAMA4/TNC/HGF/ARHGAP5/KDR/VWF/BCL2/ITGAV/COL2A1/ITGA3/CCND1/FLNC/LAMB1/DOCK1/COL1A2/PTK6/COL1A1/MYL9/PARVA/COL5A3/COL5A2/FLNB/THBS2/SHC2/PDGFC/ITGB5/IGF1R/PDGFRB/CAV1/COL6A2/LAMA2/LAMA3/TNN/LAMB2/ITGA7/THBS4/IGF1/VAV3/PDGFD/COMP/CHAD
## WP236 FZD1/RB1/KLF7/CTNNB1/NR3C1/SMAD3/IRS2/ID3/STAT6/NCOA1/RXRG/GDF10/RARA/STAT5A/EPAS1/AHR/NRIP1/NDN/MEF2C/FOXO1/PPARG/SPOCK1/LPL/CYP26A1/UCP1/WNT5B/LEP/CFD/PRLR/SREBF1/LIPE/FRZB/IRS1/IL6ST/IGF1/SFRP4/ADIPOQ/PCK1/BMP4/PLIN1/GATA3
## WP474 DDR2/CAB39/RUNX2/PTH1R/CDKN1C/COL2A1/FGFR3/MGP/TGFB2/MEF2C/MMP13/IFT88/TIMP3/FGF2/ADAMTS5/SCIN/IGF1R/GHR/FGF18/ENPP1/GLI3/PTHLH/FRZB/PLAT/COL10A1/CST5/IGF1
## WP2911 ITGA1/THBS1/LAMC1/LAMA4/COL6A1/COL1A2/COL6A3/COL3A1/COL5A3/COL5A2/THBS2/ITGB5/COL5A1/COL6A2/LAMB2
## WP3967 SNAI2/COL3A1/THBS2/COL5A1/EDNRA/SPARC
As an alternative to manually downloading gmt files, install the rWikiPathways package to gain scripting access to the latest gmt files using the downloadPathwayArchive
function.
Functional analysis of NGS data
Functional analysis using NGS data (eg, RNA-Seq and ChIP-Seq) can be performed by linking coding and non-coding regions to coding genes via ChIPseeker(Yu, Wang, and He 2015) package, which can annotates genomic regions to their nearest genes, host genes, and flanking genes respectivly. In addtion, it provides a function, seq2gene
, that simultaneously considering host genes, promoter region and flanking gene from intergenic region that may under control via cis-regulation. This function maps genomic regions to genes in a many-to-many manner and facilitate functional analysis. For more details, please refer to ChIPseeker.
Visualization
The function calls of groupGO
, enrichGO
, enrichKEGG
, enrichDO
, enrichPathway
and enricher
are consistent and all the output can be visualized by bar plot, enrichment map and category-gene-network plot. It is very common to visualize the enrichment result in bar or pie chart. We believe the pie chart is misleading and only provide bar chart.
emapplot
Enrichment map can be viusalized by emapplot
, which also support results obtained from hypergeometric test and gene set enrichment analysis.
cnetplot
In order to consider the potentially biological complexities in which a gene may belong to multiple annotation categories and provide information of numeric changes if available, we developed cnetplot
function to extract the complex association.
goplot
goplot
can accept output of enrichGO
and visualized the enriched GO induced graph.
gseaplot
Running score of gene set enrichment analysis and its association of phenotype can be visualized by gseaplot
.
browseKEGG
To view the KEGG pathway, user can use browseKEGG
function, which will open web browser and highlight enriched genes.
pathview from pathview package
clusterProfiler users can also use pathview
from the pathview(Luo and Brouwer 2013) to visualize KEGG pathway.
The following example illustrate how to visualize “hsa04110” pathway, which was enriched in our previous analysis.
library("pathview")
hsa04110 <- pathview(gene.data = geneList,
pathway.id = "hsa04110",
species = "hsa",
limit = list(gene=max(abs(geneList)), cpd=1))
For further information, please refer to the vignette of pathview(Luo and Brouwer 2013).
Biological theme comparison
clusterProfiler was developed for biological theme comparison(Yu et al. 2012), and it provides a function, compareCluster
, to automatically calculate enriched functional categories of each gene clusters.
## $X1
## [1] "4597" "7111" "5266" "2175" "755" "23046"
##
## $X2
## [1] "23450" "5160" "7126" "26118" "8452" "3675"
##
## $X3
## [1] "894" "7057" "22906" "3339" "10449" "6566"
##
## $X4
## [1] "5573" "7453" "5245" "23450" "6500" "4926"
##
## $X5
## [1] "5982" "7318" "6352" "2101" "8882" "7803"
##
## $X6
## [1] "5337" "9295" "4035" "811" "23365" "4629"
##
## $X7
## [1] "2621" "2665" "5690" "3608" "3550" "533"
##
## $X8
## [1] "2665" "4735" "1327" "3192" "5573" "9528"
The input for geneCluster parameter should be a named list of gene IDs. To speed up the compilation of this document, we set use_internal_data = TRUE
.
## Cluster ID Description GeneRatio
## 1 X2 hsa04110 Cell cycle 18/359
## 2 X2 hsa05169 Epstein-Barr virus infection 23/359
## 3 X2 hsa05340 Primary immunodeficiency 8/359
## 4 X2 hsa04064 NF-kappa B signaling pathway 13/359
## 5 X2 hsa05166 Human T-cell leukemia virus 1 infection 22/359
## 6 X3 hsa04512 ECM-receptor interaction 9/172
## BgRatio pvalue p.adjust qvalue
## 1 124/7470 2.328648e-05 0.006706505 0.006324117
## 2 201/7470 8.911882e-05 0.012833111 0.012101398
## 3 37/7470 2.978732e-04 0.028595832 0.026965368
## 4 95/7470 5.702864e-04 0.041060624 0.038719448
## 5 219/7470 7.976435e-04 0.045944264 0.043324635
## 6 82/7470 1.021339e-04 0.025635606 0.023867077
## geneID
## 1 991/1869/890/1871/701/990/10926/9088/8317/9700/9134/1029/2810/699/11200/23594/8555/4173
## 2 4067/3383/7128/1869/890/1871/578/864/637/9641/6891/355/9134/5971/916/956/6850/7187/3551/919/4734/958/6772
## 3 100/6891/3932/973/916/925/958/64421
## 4 4067/3383/7128/3932/5971/4050/6850/7187/3551/10892/5588/330/958
## 5 3383/991/1869/890/1871/113/701/9700/3932/9134/5971/916/4487/3600/1029/3551/8498/4488/11200/4776/4214/958
## 6 7057/3339/1299/3695/1101/3679/3910/3696/3693
## Count
## 1 18
## 2 23
## 3 8
## 4 13
## 5 22
## 6 9
Formula interface of compareCluster
compareCluster
also supports passing a formula (the code to support formula has been contributed by Giovanni Dall’Olio) of type \(Entrez \sim group\) or \(Entrez \sim group + othergroup\).
mydf <- data.frame(Entrez=names(geneList), FC=geneList)
mydf <- mydf[abs(mydf$FC) > 1,]
mydf$group <- "upregulated"
mydf$group[mydf$FC < 0] <- "downregulated"
mydf$othergroup <- "A"
mydf$othergroup[abs(mydf$FC) > 2] <- "B"
formula_res <- compareCluster(Entrez~group+othergroup, data=mydf, fun="enrichKEGG")
head(as.data.frame(formula_res))
## Cluster group othergroup ID
## 1 downregulated.A downregulated A hsa04974
## 2 downregulated.A downregulated A hsa04510
## 3 downregulated.A downregulated A hsa04151
## 4 downregulated.A downregulated A hsa04512
## 5 downregulated.A downregulated A hsa05414
## 6 downregulated.A downregulated A hsa04010
## Description GeneRatio BgRatio pvalue
## 1 Protein digestion and absorption 15/281 90/7470 1.060231e-06
## 2 Focal adhesion 20/281 199/7470 5.428689e-05
## 3 PI3K-Akt signaling pathway 28/281 354/7470 1.431802e-04
## 4 ECM-receptor interaction 11/281 82/7470 2.277827e-04
## 5 Dilated cardiomyopathy (DCM) 11/281 90/7470 5.184528e-04
## 6 MAPK signaling pathway 23/281 295/7470 6.961804e-04
## p.adjust qvalue
## 1 0.0002756601 0.0002544555
## 2 0.0070572962 0.0065144273
## 3 0.0124089507 0.0114544161
## 4 0.0148058770 0.0136669634
## 5 0.0269595481 0.0248857367
## 6 0.0301678172 0.0278472159
## geneID
## 1 1281/50509/1290/477/1294/1360/1289/1292/23428/1359/1300/1287/6505/2006/7373
## 2 55742/2317/7058/25759/56034/3693/3480/5159/857/1292/3908/3909/63923/3913/1287/3679/7060/3479/10451/80310
## 3 55970/5618/7058/10161/56034/3693/4254/3480/4908/5159/1292/3908/2690/3909/8817/9223/4915/3551/2791/63923/3913/9863/3667/1287/3679/7060/3479/80310
## 4 7058/3693/3339/1292/3908/3909/63923/3913/1287/3679/7060
## 5 55799/27092/6444/3693/775/3908/5350/7043/3679/3479/9254
## 6 55799/3306/2317/4214/55970/56034/27092/4254/3480/4908/3305/5159/775/8817/4915/3551/7786/7043/57551/3479/9254/1846/80310
## Count
## 1 15
## 2 20
## 3 28
## 4 11
## 5 11
## 6 23
Visualization of profile comparison
We can visualize the result using dotplot
method.
By default, only top 5 (most significant) categories of each cluster was plotted. User can changes the parameter showCategory to specify how many categories of each cluster to be plotted, and if showCategory was set to NULL, the whole result will be plotted.
The plot function accepts a parameter by for setting the scale of dot sizes. The default parameter by is setting to “geneRatio”, which corresponding to the “GeneRatio” column of the output. If it was setting to count, the comparison will be based on gene counts, while if setting to rowPercentage, the dot sizes will be normalized by count/(sum of each row)
To provide the full information, we also provide number of identified genes in each category (numbers in parentheses) when by is setting to rowPercentage and number of gene clusters in each cluster label (numbers in parentheses) when by is setting to geneRatio, as shown in Figure 3. If the dot sizes were based on count, the row numbers will not shown.
The p-values indicate that which categories are more likely to have biological meanings. The dots in the plot are color-coded based on their corresponding p-values. Color gradient ranging from red to blue correspond to in order of increasing p-values. That is, red indicate low p-values (high enrichment), and blue indicate high p-values (low enrichment). P-values and adjusted p-values were filtered out by the threshold giving by parameter pvalueCutoff, and FDR can be estimated by qvalue.
User can refer to the example in Yu (2012)(Yu et al. 2012); we analyzed the publicly available expression dataset of breast tumour tissues from 200 patients (GSE11121, Gene Expression Omnibus)(Schmidt et al. 2008). We identified 8 gene clusters from differentially expressed genes, and using compareCluster
to compare these gene clusters by their enriched biological process.
The comparison function was designed as a framework for comparing gene clusters of any kind of ontology associations, not only groupGO
, enrichGO
, enrichKEGG
and enricher
provided in this package, but also other biological and biomedical ontologies, for instance, enrichDO
from DOSE(Yu et al. 2015), enrichMeSH
from meshes and enrichPathway
from ReactomePA work fine with compareCluster
for comparing biological themes in disease and reactome pathway perspective. More details can be found in the vignettes of DOSE(Yu et al. 2015) and ReactomePA.
Need helps?
If you have questions/issues, please visit clusterProfiler homepage first. Your problems are mostly documented. If you think you found a bug, please follow the guide and provide a reproducible example to be posted on github issue tracker. For questions, please post to Bioconductor support site and tag your post with clusterProfiler.
Session Information
Here is the output of sessionInfo()
on the system on which this document was compiled:
## R version 3.5.1 Patched (2018-07-12 r74967)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] clusterProfiler_3.10.1 GSEABase_1.44.0 graph_1.60.0
## [4] annotate_1.60.0 XML_3.98-1.16 org.Hs.eg.db_3.7.0
## [7] GO.db_3.7.0 AnnotationDbi_1.44.0 IRanges_2.16.0
## [10] S4Vectors_0.20.1 Biobase_2.42.0 BiocGenerics_0.28.0
## [13] DOSE_3.8.0 tidyr_0.8.2 dplyr_0.7.8
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 enrichplot_1.2.0 bit64_0.9-7
## [4] RColorBrewer_1.1-2 progress_1.2.0 httr_1.4.0
## [7] UpSetR_1.3.3 tools_3.5.1 R6_2.3.0
## [10] DBI_1.0.0 lazyeval_0.2.1 colorspace_1.3-2
## [13] tidyselect_0.2.5 gridExtra_2.3 prettyunits_1.0.2
## [16] bit_1.1-14 compiler_3.5.1 xml2_1.2.0
## [19] labeling_0.3 triebeard_0.3.0 scales_1.0.0
## [22] ggridges_0.5.1 stringr_1.3.1 digest_0.6.18
## [25] rmarkdown_1.11 pkgconfig_2.0.2 htmltools_0.3.6
## [28] highr_0.7 rlang_0.3.0.1 RSQLite_2.1.1
## [31] prettydoc_0.2.1 gridGraphics_0.3-0 bindr_0.1.1
## [34] farver_1.1.0 jsonlite_1.6 BiocParallel_1.16.2
## [37] GOSemSim_2.8.0 RCurl_1.95-4.11 magrittr_1.5
## [40] ggplotify_0.0.3 Matrix_1.2-15 Rcpp_1.0.0
## [43] munsell_0.5.0 viridis_0.5.1 stringi_1.2.4
## [46] yaml_2.2.0 ggraph_1.0.2 MASS_7.3-51.1
## [49] plyr_1.8.4 qvalue_2.14.0 grid_3.5.1
## [52] blob_1.1.1 ggrepel_0.8.0 DO.db_2.9
## [55] crayon_1.3.4 lattice_0.20-38 cowplot_0.9.3
## [58] splines_3.5.1 hms_0.4.2 knitr_1.21
## [61] pillar_1.3.1 fgsea_1.8.0 igraph_1.2.2
## [64] reshape2_1.4.3 fastmatch_1.1-0 glue_1.3.0
## [67] evaluate_0.12 data.table_1.11.8 urltools_1.7.1
## [70] tweenr_1.0.1 gtable_0.2.0 purrr_0.2.5
## [73] assertthat_0.2.0 ggplot2_3.1.0 xfun_0.4
## [76] ggforce_0.1.3 europepmc_0.3 xtable_1.8-3
## [79] viridisLite_0.3.0 tibble_1.4.2 rvcheck_0.1.3
## [82] memoise_1.1.0 units_0.6-2 bindrcpp_0.2.2
References
A., Omer, Giovanni M. D., Thanos P. M., and Francesca D. C. 2016. “NCG 5.0: Updates of a Manually Curated Repository of Cancer Genes and Associated Properties from Cancer Mutational Screenings.” Nucleic Acids Research 44 (D1):D992–D999. https://doi.org/10.1093/nar/gkv1123.
Boyle, Elizabeth I, Shuai Weng, Jeremy Gollub, Heng Jin, David Botstein, J Michael Cherry, and Gavin Sherlock. 2004. “GO::TermFinder–open Source Software for Accessing Gene Ontology Information and Finding Significantly Enriched Gene Ontology Terms Associated with a List of Genes.” Bioinformatics (Oxford, England) 20 (18):3710–5. https://doi.org/10.1093/bioinformatics/bth456.
Fresno, Cristóbal, and Elmer A. Fernández. 2013. “RDAVIDWebService: A Versatile R Interface to DAVID.” Bioinformatics 29 (21):2810–1. https://doi.org/10.1093/bioinformatics/btt487.
Huang, Da, Brad T Sherman, Qina Tan, Jack R Collins, W Gregory Alvord, Jean Roayaei, Robert Stephens, Michael W Baseler, H Clifford Lane, and Richard A Lempicki. 2007. “The DAVID Gene Functional Classification Tool: A Novel Biological Module-Centric Algorithm to Functionally Analyze Large Gene Lists.” Genome Biology 8 (9):R183. https://doi.org/10.1186/gb-2007-8-9-r183.
Janet, P., Núria Q.R., Àlex B., Jordi D.P., Anna B.M., Martin B., Ferran S., and Laura I. F. 2015. “DisGeNET: A Discovery Platform for the Dynamical Exploration of Human Diseases and Their Genes.” Database 2015 (March):bav028. https://doi.org/10.1093/database/bav028.
Luo, Weijun, and Cory Brouwer. 2013. “Pathview: An R/Bioconductor Package for Pathway-Based Data Integration and Visualization.” Bioinformatics 29 (14):1830–1. https://doi.org/10.1093/bioinformatics/btt285.
Schmidt, Marcus, Daniel B?hm, Christian von T?rne, Eric Steiner, Alexander Puhl, Henryk Pilch, Hans-Anton Lehr, Jan G. Hengstler, Heinz K?lbl, and Mathias Gehrmann. 2008. “The Humoral Immune System Has a Key Prognostic Impact in Node-Negative Breast Cancer.” Cancer Research 68 (13):5405–13. https://doi.org/10.1158/0008-5472.CAN-07-5206.
Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences of the United States of America 102 (43):15545–50. https://doi.org/10.1073/pnas.0506580102.
Yu, Guangchuang, and Qing-Yu He. 2016. “ReactomePA: An R/Bioconductor Package for Reactome Pathway Analysis and Visualization.” Molecular BioSystems 12 (2):477–79. https://doi.org/10.1039/C5MB00663E.
Yu, Guangchuang, Fei Li, Yide Qin, Xiaochen Bo, Yibo Wu, and Shengqi Wang. 2010. “GOSemSim: An R Package for Measuring Semantic Similarity Among Go Terms and Gene Products.” Bioinformatics 26 (7):976–78. https://doi.org/10.1093/bioinformatics/btq064.
Yu, Guangchuang, Le-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “ClusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5):284–87. https://doi.org/10.1089/omi.2011.0118.
Yu, Guangchuang, Li-Gen Wang, and Qing-Yu He. 2015. “ChIPseeker: An R/Bioconductor Package for Chip Peak Annotation, Comparison and Visualization.” Bioinformatics 31 (14):2382–3. https://doi.org/10.1093/bioinformatics/btv145.
Yu, Guangchuang, Li-Gen Wang, Guang-Rong Yan, and Qing-Yu He. 2015. “DOSE: An R/Bioconductor Package for Disease Ontology Semantic and Enrichment Analysis.” Bioinformatics 31 (4):608–9. https://doi.org/10.1093/bioinformatics/btu684.