Summarizing omics data in SummarizedExperiment
format from airway
into KEGG pathway level functional activity scores
Here we illustrate through some examples how to apply the function summarize_pathway_level
to aggregate data in SummarizedExperiment
format into KEGG pathway-level functional activity scores. Note that the function can also be used to summarize other types of omics data into any higher-level functional representations beyond pathways, such as protein complexes or cellular locations.
Let’s first get an example dataset stored as a SummarizedExperiment
from the airway
package. This data represents an actual RNA sequencing experiment on four human airway smooth muscle cell lines.
library(SummarizedExperiment)
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## The following object is masked from 'package:Biobase':
##
## rowMedians
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:NMF':
##
## nrun
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
library(airway)
data(airway)
airway
## class: RangedSummarizedExperiment
## dim: 63677 8
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
## ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
The measurement data can be accessed by assay and assays. Note that SummarizedExperiment
object can contain multiple measurement matrices (all of the same dimension), but in this case airway
contains only one matrix of RNA sequencing data named counts
:
## [1] "counts"
head(assay(airway, "counts"))
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 679 448 873 408 1138
## ENSG00000000005 0 0 0 0 0
## ENSG00000000419 467 515 621 365 587
## ENSG00000000457 260 211 263 164 245
## ENSG00000000460 60 55 40 35 78
## ENSG00000000938 0 0 2 0 1
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 1047 770 572
## ENSG00000000005 0 0 0
## ENSG00000000419 799 417 508
## ENSG00000000457 331 233 229
## ENSG00000000460 63 76 60
## ENSG00000000938 0 0 0
dim(assay(airway, "counts"))
## [1] 63677 8
The data matrix contains 63677 genes (or transcripts) and 8 samples. The features names are Ensembl identifiers, let’s get a list of KEGG gene sets with Ensembl IDs through the function get_kegg_sets
provided by funOmics
package. Note that get_kegg_sets
can be used to retrieve a list of KEGG gene sets from any organism available, given its abbreviation (e.g., “hsa” for Homo sapiens or “ecj” for Escherichia coli).
Since airway
data corresponds to human samples, the parameter geneid_type
in get_kegg_sets
can be used to retrieve the molecular sets with Ensembl IDs, and the organism is set to default (“hsa”):
kegg_sets <- get_kegg_sets(geneid_type = "ensembl")
head(kegg_sets)
## $`2-Oxocarboxylic acid metabolism`
## [1] "ENSG00000114786" "ENSG00000169154" "ENSG00000062485" "ENSG00000161653"
## [5] "ENSG00000137992" "ENSG00000150768" "ENSG00000091140" "ENSG00000119689"
## [9] "ENSG00000172482" "ENSG00000120053" "ENSG00000125166" "ENSG00000167701"
## [13] "ENSG00000138413" "ENSG00000182054" "ENSG00000166411" "ENSG00000101365"
## [17] "ENSG00000067829" "ENSG00000122729" "ENSG00000105953" "ENSG00000100412"
## [21] "ENSG00000109576" "ENSG00000131828" "ENSG00000163114" "ENSG00000168291"
## [25] "ENSG00000181192" "ENSG00000197444" "ENSG00000060982" "ENSG00000105552"
## [29] "ENSG00000248098" "ENSG00000083123" "ENSG00000113492" "ENSG00000166123"
## [33] "ENSG00000243989"
##
## $`ABC transporters`
## [1] "ENSG00000114770" "ENSG00000115657" "ENSG00000069431" "ENSG00000125257"
## [5] "ENSG00000064687" "ENSG00000154263" "ENSG00000154258" "ENSG00000141338"
## [9] "ENSG00000001626" "ENSG00000197150" "ENSG00000023839" "ENSG00000179869"
## [13] "ENSG00000164825" "ENSG00000165029" "ENSG00000107331" "ENSG00000167972"
## [17] "ENSG00000101986" "ENSG00000131269" "ENSG00000173208" "ENSG00000135776"
## [21] "ENSG00000150967" "ENSG00000154262" "ENSG00000154265" "ENSG00000198691"
## [25] "ENSG00000144452" "ENSG00000004846" "ENSG00000091262" "ENSG00000103222"
## [29] "ENSG00000085563" "ENSG00000005471" "ENSG00000117528" "ENSG00000119688"
## [33] "ENSG00000172350" "ENSG00000138075" "ENSG00000143921" "ENSG00000006071"
## [37] "ENSG00000168394" "ENSG00000204267" "ENSG00000121270" "ENSG00000073734"
## [41] "ENSG00000108846" "ENSG00000124574" "ENSG00000140798" "ENSG00000118777"
## [45] "ENSG00000160179"
##
## $`AGE-RAGE signaling pathway in diabetic complications`
## [1] "ENSG00000117020" "ENSG00000135446" "ENSG00000111276" "ENSG00000278139"
## [5] "ENSG00000161714" "ENSG00000108821" "ENSG00000164692" "ENSG00000168542"
## [9] "ENSG00000187498" "ENSG00000134871" "ENSG00000169031" "ENSG00000081052"
## [13] "ENSG00000188153" "ENSG00000197565" "ENSG00000112062" "ENSG00000165168"
## [17] "ENSG00000131504" "ENSG00000204305" "ENSG00000135744" "ENSG00000144891"
## [21] "ENSG00000078401" "ENSG00000120738" "ENSG00000142208" "ENSG00000105221"
## [25] "ENSG00000117525" "ENSG00000165197" "ENSG00000150907" "ENSG00000182621"
## [29] "ENSG00000115414" "ENSG00000007952" "ENSG00000174775" "ENSG00000090339"
## [33] "ENSG00000115008" "ENSG00000125538" "ENSG00000136244" "ENSG00000169429"
## [37] "ENSG00000096968" "ENSG00000177606" "ENSG00000133703" "ENSG00000175387"
## [41] "ENSG00000166949" "ENSG00000141646" "ENSG00000087245" "ENSG00000131196"
## [45] "ENSG00000109320" "ENSG00000164867" "ENSG00000213281" "ENSG00000086991"
## [49] "ENSG00000106366" "ENSG00000138193" "ENSG00000121879" "ENSG00000051382"
## [53] "ENSG00000137193" "ENSG00000171608" "ENSG00000145675" "ENSG00000105647"
## [57] "ENSG00000137841" "ENSG00000149782" "ENSG00000101333" "ENSG00000187091"
## [61] "ENSG00000124181" "ENSG00000197943" "ENSG00000154229" "ENSG00000166501"
## [65] "ENSG00000163932" "ENSG00000171132" "ENSG00000067606" "ENSG00000100030"
## [69] "ENSG00000102882" "ENSG00000107643" "ENSG00000185386" "ENSG00000050748"
## [73] "ENSG00000109339" "ENSG00000156711" "ENSG00000087088" "ENSG00000136238"
## [77] "ENSG00000110092" "ENSG00000171791" "ENSG00000173039" "ENSG00000188130"
## [81] "ENSG00000108691" "ENSG00000007908" "ENSG00000115415" "ENSG00000168610"
## [85] "ENSG00000126561" "ENSG00000173757" "ENSG00000105329" "ENSG00000092969"
## [89] "ENSG00000119699" "ENSG00000106799" "ENSG00000163513" "ENSG00000178726"
## [93] "ENSG00000232810" "ENSG00000162692" "ENSG00000112715" "ENSG00000173511"
## [97] "ENSG00000150630" "ENSG00000164305" "ENSG00000115556" "ENSG00000117461"
## [101] "ENSG00000070831"
##
## $`AMPK signaling pathway`
## [1] "ENSG00000117020" "ENSG00000107175" "ENSG00000110931" "ENSG00000001626"
## [5] "ENSG00000084733" "ENSG00000109819" "ENSG00000278139" "ENSG00000176194"
## [9] "ENSG00000169169" "ENSG00000110090" "ENSG00000205560" "ENSG00000118260"
## [13] "ENSG00000120907" "ENSG00000143578" "ENSG00000167658" "ENSG00000187840"
## [17] "ENSG00000066044" "ENSG00000160741" "ENSG00000142208" "ENSG00000105221"
## [21] "ENSG00000169710" "ENSG00000165140" "ENSG00000150907" "ENSG00000118689"
## [25] "ENSG00000065882" "ENSG00000096717" "ENSG00000103150" "ENSG00000198793"
## [29] "ENSG00000131482" "ENSG00000167393" "ENSG00000103319" "ENSG00000104812"
## [33] "ENSG00000111713" "ENSG00000278540" "ENSG00000113161" "ENSG00000101076"
## [37] "ENSG00000076555" "ENSG00000017427" "ENSG00000140443" "ENSG00000254647"
## [41] "ENSG00000171105" "ENSG00000169047" "ENSG00000174697" "ENSG00000116678"
## [45] "ENSG00000079435" "ENSG00000167461" "ENSG00000124253" "ENSG00000100889"
## [49] "ENSG00000159346" "ENSG00000106617" "ENSG00000119396" "ENSG00000140992"
## [53] "ENSG00000135932" "ENSG00000158571" "ENSG00000123836" "ENSG00000170525"
## [57] "ENSG00000114268" "ENSG00000141959" "ENSG00000152556" "ENSG00000067057"
## [61] "ENSG00000121879" "ENSG00000051382" "ENSG00000171608" "ENSG00000145675"
## [65] "ENSG00000105647" "ENSG00000115592" "ENSG00000132170" "ENSG00000092020"
## [69] "ENSG00000113575" "ENSG00000104695" "ENSG00000105568" "ENSG00000137713"
## [73] "ENSG00000221914" "ENSG00000156475" "ENSG00000074211" "ENSG00000073711"
## [77] "ENSG00000066027" "ENSG00000068971" "ENSG00000078304" "ENSG00000112640"
## [81] "ENSG00000154001" "ENSG00000082146" "ENSG00000132356" "ENSG00000162409"
## [85] "ENSG00000111725" "ENSG00000131791" "ENSG00000181929" "ENSG00000175470"
## [89] "ENSG00000141564" "ENSG00000152254" "ENSG00000104388" "ENSG00000110092"
## [93] "ENSG00000106615" "ENSG00000108443" "ENSG00000175634" "ENSG00000099194"
## [97] "ENSG00000182158" "ENSG00000181856" "ENSG00000072310" "ENSG00000118046"
## [101] "ENSG00000135341" "ENSG00000165699" "ENSG00000103197" "ENSG00000006831"
## [105] "ENSG00000145284" "ENSG00000102547" "ENSG00000177169" "ENSG00000204673"
## [109] "ENSG00000060566" "ENSG00000133124" "ENSG00000117461" "ENSG00000185950"
## [113] "ENSG00000130957" "ENSG00000145386" "ENSG00000133101" "ENSG00000157613"
## [117] "ENSG00000185236" "ENSG00000266173" "ENSG00000141349" "ENSG00000181092"
## [121] "ENSG00000135218" "ENSG00000146592"
##
## $`ATP-dependent chromatin remodeling`
## [1] "ENSG00000167797" "ENSG00000187778" "ENSG00000106400" "ENSG00000172977"
## [5] "ENSG00000080603" "ENSG00000183207" "ENSG00000112983" "ENSG00000185787"
## [9] "ENSG00000170004" "ENSG00000111642" "ENSG00000076108" "ENSG00000198604"
## [13] "ENSG00000229674" "ENSG00000258315" "ENSG00000153391" "ENSG00000189079"
## [17] "ENSG00000171634" "ENSG00000164508" "ENSG00000112624" "ENSG00000184402"
## [21] "ENSG00000135999" "ENSG00000099954" "ENSG00000169592" "ENSG00000166164"
## [25] "ENSG00000105619" "ENSG00000123636" "ENSG00000063169" "ENSG00000277075"
## [29] "ENSG00000196866" "ENSG00000188486" "ENSG00000164032" "ENSG00000116478"
## [33] "ENSG00000196591" "ENSG00000184270" "ENSG00000230797" "ENSG00000277858"
## [37] "ENSG00000274183" "ENSG00000170322" "ENSG00000116750" "ENSG00000077080"
## [41] "ENSG00000048649" "ENSG00000071655" "ENSG00000148229" "ENSG00000104472"
## [45] "ENSG00000071243" "ENSG00000128908" "ENSG00000167491" "ENSG00000114933"
## [49] "ENSG00000163939" "ENSG00000101189" "ENSG00000130024" "ENSG00000099284"
## [53] "ENSG00000246705" "ENSG00000178028" "ENSG00000143614" "ENSG00000049618"
## [57] "ENSG00000057935" "ENSG00000183495" "ENSG00000162521" "ENSG00000102054"
## [61] "ENSG00000133884" "ENSG00000075624" "ENSG00000110987" "ENSG00000075089"
## [65] "ENSG00000163875" "ENSG00000102038" "ENSG00000080503" "ENSG00000127616"
## [69] "ENSG00000099956" "ENSG00000028310" "ENSG00000173473" "ENSG00000139613"
## [73] "ENSG00000066117" "ENSG00000108604" "ENSG00000082014" "ENSG00000073584"
## [77] "ENSG00000141380" "ENSG00000163159" "ENSG00000184009" "ENSG00000288859"
## [81] "ENSG00000100811" "ENSG00000101442" "ENSG00000120616" "ENSG00000127337"
## [85] "ENSG00000111328" "ENSG00000205683" "ENSG00000011332" "ENSG00000117713"
## [89] "ENSG00000196367" "ENSG00000196747" "ENSG00000275221" "ENSG00000276368"
## [93] "ENSG00000276903" "ENSG00000180573" "ENSG00000278463" "ENSG00000278677"
## [97] "ENSG00000288825" "ENSG00000184260" "ENSG00000115274" "ENSG00000277745"
## [101] "ENSG00000156531" "ENSG00000153147" "ENSG00000274997" "ENSG00000136518"
## [105] "ENSG00000175792" "ENSG00000134046" "ENSG00000196787" "ENSG00000009954"
## [109] "ENSG00000182979" "ENSG00000149480" "ENSG00000099385" "ENSG00000106635"
## [113] "ENSG00000181218" "ENSG00000113812" "ENSG00000105968" "ENSG00000113648"
## [117] "ENSG00000123562"
##
## $`Acute myeloid leukemia`
## [1] "ENSG00000117020" "ENSG00000245848" "ENSG00000092067" "ENSG00000278139"
## [5] "ENSG00000102096" "ENSG00000213341" "ENSG00000182578" "ENSG00000164400"
## [9] "ENSG00000139318" "ENSG00000187840" "ENSG00000142208" "ENSG00000105221"
## [13] "ENSG00000150337" "ENSG00000122025" "ENSG00000198793" "ENSG00000177885"
## [17] "ENSG00000174775" "ENSG00000104365" "ENSG00000164399" "ENSG00000169896"
## [21] "ENSG00000078061" "ENSG00000173801" "ENSG00000157404" "ENSG00000133703"
## [25] "ENSG00000005381" "ENSG00000136997" "ENSG00000109320" "ENSG00000213281"
## [29] "ENSG00000138795" "ENSG00000121879" "ENSG00000051382" "ENSG00000137193"
## [33] "ENSG00000171608" "ENSG00000145675" "ENSG00000105647" "ENSG00000140464"
## [37] "ENSG00000112033" "ENSG00000100030" "ENSG00000102882" "ENSG00000169032"
## [41] "ENSG00000126934" "ENSG00000002330" "ENSG00000132155" "ENSG00000131759"
## [45] "ENSG00000110092" "ENSG00000140379" "ENSG00000173039" "ENSG00000108443"
## [49] "ENSG00000175634" "ENSG00000115904" "ENSG00000100485" "ENSG00000066336"
## [53] "ENSG00000157764" "ENSG00000168610" "ENSG00000126561" "ENSG00000173757"
## [57] "ENSG00000081059" "ENSG00000148737" "ENSG00000109906" "ENSG00000152284"
## [61] "ENSG00000117461" "ENSG00000269335" "ENSG00000159216" "ENSG00000079102"
## [65] "ENSG00000132326" "ENSG00000145386" "ENSG00000133101" "ENSG00000170458"
Example usage 1: Summarize airway
omics data into dimension-reduction derived activity scores at KEGG pathway level.
The dimension-reduction operators implemented in funOmics
include PCA (Principal Component Analysis), NMF (Non-Negative Matrix Factorization), MDS (Multidimensional Scaling), and pathifier deregulation scores from the pathifier
package derived from principal curves.
Now, let’s summarize the counts data using PCA. The PCA-aggregated activity scores values represent the projection of the overall expression of all genes in each pathway onto the first principal component. For this example, let’s use the default minimum size of sets (10). Note that when default value for minsize
parameter is used, it is not necessary to assign a value for this parameter in the function call:
pathway_activity_pca <- summarize_pathway_level(assay(airway, "counts"), kegg_sets, type = "pca")
##
## 361 functional sets read.
## iteration 100
## iteration 200
## iteration 300
## 346 successful functional aggregations over minsize
## 15 failed functional aggregations under minsize
## Functional activity score matrix has dimensions: 346 , 8
From the original airway
data containing 63677 genes (transcripts) and 8 samples, the function summarize_pathway_level
has generated a pathway-level activity score for each of the 8 samples, for 343 KEGG pathways containing more than 10 genes (16 failed functional aggregations under minsize
). Let’s see how this matrix looks like:
print(head(pathway_activity_pca))
## SRR1039508 SRR1039509
## 2-Oxocarboxylic acid metabolism -0.06702655 1.980602
## ABC transporters -0.62811819 3.493143
## AGE-RAGE signaling pathway in diabetic complications -0.94405736 -4.372707
## AMPK signaling pathway -2.33237340 -3.446439
## ATP-dependent chromatin remodeling -0.10612468 4.822368
## Acute myeloid leukemia -1.68843459 -2.745095
## SRR1039512 SRR1039513
## 2-Oxocarboxylic acid metabolism -2.253388 6.875470
## ABC transporters -4.375987 5.828548
## AGE-RAGE signaling pathway in diabetic complications 4.514782 -9.753186
## AMPK signaling pathway 2.596147 -10.301870
## ATP-dependent chromatin remodeling -5.045007 13.846549
## Acute myeloid leukemia 1.499643 -7.342374
## SRR1039516 SRR1039517
## 2-Oxocarboxylic acid metabolism -2.670499 -8.032398
## ABC transporters -4.399747 -4.392270
## AGE-RAGE signaling pathway in diabetic complications 4.931905 11.913342
## AMPK signaling pathway 3.463877 16.514465
## ATP-dependent chromatin remodeling -6.676558 -14.120201
## Acute myeloid leukemia 2.162537 11.748711
## SRR1039520 SRR1039521
## 2-Oxocarboxylic acid metabolism 2.713740 1.4534992
## ABC transporters 1.040788 3.4336426
## AGE-RAGE signaling pathway in diabetic complications -3.754588 -2.5354917
## AMPK signaling pathway -6.065388 -0.4284188
## ATP-dependent chromatin remodeling 4.129495 3.1494778
## Acute myeloid leukemia -4.224764 0.5897765
The resulting matrix of higher-level functional representations looks very similar to the original one, except that the original had many more features (63677 instead of 343). This reduction in dimensionality can facilitate the interpretation of the data and the identification of patterns in the samples.
In this illustrative example, the RNA sequencing data in airway
package has been directly summarized or aggregated by the summarize_pathway_level
function of the funOmics
package without intermediate processing. Depending on the type of omics data, you may want to apply corresponding processing steps to the omics abundance matrix prior to the aggregation into higher level functional features. For instance, it is common practice to filter out rows (genes or features) with low counts when analyzing transcriptomics data. This filtering step is often performed to remove genes that are expressed at very low levels and may not be biologically relevant or reliable.
Some aggregation methods, may also have specific assumptions or requirements on the input data. Let’s see another example where some filtering is indeed necessary.
funOmics
allows to generate aggregated representations using dimension-reduction derived scores from the NMF (Non-Negative Matrix Factorization) method. The NMF-aggregated activity scores values represent the weight (or contribution) of a single underlying basis component or latent factor contributing to the pathway activity (or higher level functional structure in use) for each sample in your data set. Rank=1 is used for the basis matrix in the internal NMF dimension-reduction.
Notably, the NMF method does not allow for negative values or null rows in the input matrix. Transcriptomics data in the airway
dataset are measured as counts, hence the matrix presumably does not contain negative values, but it may contain null rows. To avoid errors, we can filter out rows with less than 10 counts across all samples before applying the NMF method:
print(any(assay(airway, "counts")[rowSums(assay(airway, "counts")) <0, ]))
## [1] FALSE
X <- assay(airway, "counts")[rowSums(assay(airway, "counts")) >= 10, ]
Let’s summarize the filtered counts data using NMF method:
pathway_activity_nmf <- summarize_pathway_level(X, kegg_sets, type = "nmf") # note that the NMF operation can take some minutes
##
## 361 functional sets read.
## iteration 100
## iteration 200
## iteration 300
## 342 successful functional aggregations over minsize
## 19 failed functional aggregations under minsize
## Functional activity score matrix has dimensions: 342 , 8
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity_nmf), ",", ncol(pathway_activity_nmf)))
## [1] "Pathway activity score matrix has dimensions: 342 , 8"
Let’s see how this matrix looks like:
head(pathway_activity_nmf)
## SRR1039508 SRR1039509
## 2-Oxocarboxylic acid metabolism 0.31051078 0.30317170
## ABC transporters 0.11382753 0.10991825
## AGE-RAGE signaling pathway in diabetic complications 0.04499232 0.03063683
## AMPK signaling pathway 0.05167808 0.05001888
## ATP-dependent chromatin remodeling 0.02477721 0.02626503
## Acute myeloid leukemia 0.10561350 0.10824467
## SRR1039512 SRR1039513
## 2-Oxocarboxylic acid metabolism 0.36924996 0.22356988
## ABC transporters 0.14166649 0.08893425
## AGE-RAGE signaling pathway in diabetic complications 0.06057488 0.02913379
## AMPK signaling pathway 0.05934801 0.03669854
## ATP-dependent chromatin remodeling 0.03097651 0.02372443
## Acute myeloid leukemia 0.13056264 0.09183131
## SRR1039516 SRR1039517
## 2-Oxocarboxylic acid metabolism 0.38274697 0.47414705
## ABC transporters 0.15634393 0.17779407
## AGE-RAGE signaling pathway in diabetic complications 0.05102791 0.05400864
## AMPK signaling pathway 0.05716401 0.07513838
## ATP-dependent chromatin remodeling 0.02802998 0.05305129
## Acute myeloid leukemia 0.12816255 0.17705361
## SRR1039520 SRR1039521
## 2-Oxocarboxylic acid metabolism 0.27471783 0.31270056
## ABC transporters 0.11912348 0.14072247
## AGE-RAGE signaling pathway in diabetic complications 0.04765292 0.04058921
## AMPK signaling pathway 0.04399951 0.05444846
## ATP-dependent chromatin remodeling 0.02302883 0.03004370
## Acute myeloid leukemia 0.09833833 0.13101524
In this example, summarize_pathway_level
has generated a pathway-level activity score for each of the 8 samples, for 340 KEGG pathways containing more than 10 genes (here 19 failed functional aggregations under minsize
, slightly more than for the PCA since some genes were removed in the filtering). Note that the resulting matrix looks similar to that of the previous example in terms of shape and format, but the values are derived from the NMF dimension-reduction method instead of the PCA method. Same analogies apply for other types of aggregation operator; only the interpretation of the resulting functional activity scores will change.
Note that beyond pre-processing, you can also post-process the resulting summarized matrix as you see appropriate for your analyses and workflows.
Integrating the pathway-level activity scores with the airway
SummarizedExperiment
object in a MultiAssayExperiment
object
The resulting matrix of pathway-level activity scores can be further analyzed as an independent dataset, or can also be integrated with the airway
SummarizedExperiment
object in a MultiAssayExperiment
structure (note that SummarizedExperiment
can simultaneously manage several experimental assays only if they have the same dimensions, which is not the case here, hence the need for a MultiAssayExperiment
object). The MultiAssayExperiment library has to be loaded, and a MultiAssayExperiment (airwayMultiAssay
) can be created and filled with a list of assays-like matrices that may have different dimensions. Here, airwayMultiAssay
contains the counts
and the recently generated KEGG pathway activity scores by NMF and PCA pooling.
library(MultiAssayExperiment)
assays_list <- list( counts = assay(airway, "counts"), kegg_nmf_agg = pathway_activity_nmf, kegg_pca_agg = pathway_activity_pca)
airwayMultiAssay <- MultiAssayExperiment(experiments=assays_list)
colData(airwayMultiAssay) <- colData(airway)
airwayMultiAssay
## A MultiAssayExperiment object of 3 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 3:
## [1] counts: matrix with 63677 rows and 8 columns
## [2] kegg_nmf_agg: matrix with 342 rows and 8 columns
## [3] kegg_pca_agg: matrix with 346 rows and 8 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
Example usage 2: Summarize airway
omics data with summary statistics and a minimum size of the KEGG gene sets
Here we will apply the function summarize_pathway_level
to summarize pathway activity using the mean pooling aggregation for those sets containing at least 12 genes. Remember that you can adjust the parameters minsize
and type
of aggregation as desired.
min <- 12
pathway_activity <- summarize_pathway_level(assay(airway, "counts"), kegg_sets, type = "mean", minsize = min)
##
## 361 functional sets read.
## iteration 100
## iteration 200
## iteration 300
## 344 successful functional aggregations over minsize
## 17 failed functional aggregations under minsize
## Functional activity score matrix has dimensions: 344 , 8
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity), ",", ncol(pathway_activity)))
## [1] "Pathway activity score matrix has dimensions: 344 , 8"
Now from the original airway
data of dimensions 63677 genes (transcripts) x 8 samples, summarize_pathway_level
has generated through mean-pooling a pathway-level activity score matrix of 341 pathways x 8 samples, for gene sets containing more than 12 genes.
print(head(pathway_activity))
## SRR1039508 SRR1039509
## 2-Oxocarboxylic acid metabolism 709.0000 692.2424
## ABC transporters 645.8000 623.6222
## AGE-RAGE signaling pathway in diabetic complications 10087.0200 6868.5900
## AMPK signaling pathway 1594.4750 1543.2833
## ATP-dependent chromatin remodeling 1921.8095 2037.2000
## Acute myeloid leukemia 989.1045 1013.7612
## SRR1039512 SRR1039513
## 2-Oxocarboxylic acid metabolism 843.1818 510.4848
## ABC transporters 803.7556 504.5556
## AGE-RAGE signaling pathway in diabetic complications 13580.5200 6531.6100
## AMPK signaling pathway 1831.0917 1132.3000
## ATP-dependent chromatin remodeling 2402.6381 1840.1524
## Acute myeloid leukemia 1222.8209 860.0448
## SRR1039516 SRR1039517
## 2-Oxocarboxylic acid metabolism 873.9394 1082.636
## ABC transporters 886.9778 1008.644
## AGE-RAGE signaling pathway in diabetic complications 11440.2100 12108.400
## AMPK signaling pathway 1763.6917 2318.308
## ATP-dependent chromatin remodeling 2174.1048 4114.829
## Acute myeloid leukemia 1200.2985 1658.224
## SRR1039520 SRR1039521
## 2-Oxocarboxylic acid metabolism 627.2727 714.000
## ABC transporters 675.8222 798.400
## AGE-RAGE signaling pathway in diabetic complications 10683.4900 9099.840
## AMPK signaling pathway 1357.5417 1679.950
## ATP-dependent chromatin remodeling 1786.1905 2330.286
## Acute myeloid leukemia 921.0448 1227.000
In this example, 18 of the gene sets in kegg_sets
have size < 12. You can then use the function short_sets_detail
to get information about which pathways have been left out, how many genes they had, and which genes are involved in these shorter sets:
short_sets <- short_sets_detail(kegg_sets, min)
print(short_sets$short_sets)
## [1] "Biotin metabolism"
## [2] "Caffeine metabolism"
## [3] "D-Amino acid metabolism"
## [4] "Neomycin, kanamycin and gentamicin biosynthesis"
## [5] "Phenylalanine, tyrosine and tryptophan biosynthesis"
## [6] "Phosphonate and phosphinate metabolism"
## [7] "Riboflavin metabolism"
## [8] "Sulfur metabolism"
## [9] "Sulfur relay system"
## [10] "Ubiquinone and other terpenoid-quinone biosynthesis"
## [11] "Valine, leucine and isoleucine biosynthesis"
## [12] "Virion - Adenovirus"
## [13] "Virion - Flavivirus and Alphavirus"
## [14] "Virion - Herpesvirus"
## [15] "Virion - Human immunodeficiency virus"
## [16] "Virion - Rotavirus"
## [17] "Vitamin B6 metabolism"
print(short_sets$short_sets_lengths)
## Biotin metabolism
## 3
## Caffeine metabolism
## 6
## D-Amino acid metabolism
## 6
## Neomycin, kanamycin and gentamicin biosynthesis
## 5
## Phenylalanine, tyrosine and tryptophan biosynthesis
## 6
## Phosphonate and phosphinate metabolism
## 6
## Riboflavin metabolism
## 8
## Sulfur metabolism
## 10
## Sulfur relay system
## 8
## Ubiquinone and other terpenoid-quinone biosynthesis
## 11
## Valine, leucine and isoleucine biosynthesis
## 4
## Virion - Adenovirus
## 4
## Virion - Flavivirus and Alphavirus
## 6
## Virion - Herpesvirus
## 9
## Virion - Human immunodeficiency virus
## 5
## Virion - Rotavirus
## 2
## Vitamin B6 metabolism
## 6
print(short_sets$short_sets_molecules)
## $`Biotin metabolism`
## [1] "ENSG00000159267" "ENSG00000151093" "ENSG00000169814"
##
## $`Caffeine metabolism`
## [1] "ENSG00000156006" "ENSG00000140505" "ENSG00000255974" "ENSG00000198077"
## [5] "ENSG00000158125" "ENSG00000171428"
##
## $`D-Amino acid metabolism`
## [1] "ENSG00000110887" "ENSG00000135423" "ENSG00000115419" "ENSG00000167720"
## [5] "ENSG00000133943" "ENSG00000203797"
##
## $`Neomycin, kanamycin and gentamicin biosynthesis`
## [1] "ENSG00000106633" "ENSG00000156515" "ENSG00000159399" "ENSG00000160883"
## [5] "ENSG00000156510"
##
## $`Phenylalanine, tyrosine and tryptophan biosynthesis`
## [1] "ENSG00000169154" "ENSG00000104951" "ENSG00000120053" "ENSG00000125166"
## [5] "ENSG00000171759" "ENSG00000198650"
##
## $`Phosphonate and phosphinate metabolism`
## [1] "ENSG00000134255" "ENSG00000161217" "ENSG00000111666" "ENSG00000185813"
## [5] "ENSG00000138018" "ENSG00000102230"
##
## $`Riboflavin metabolism`
## [1] "ENSG00000197594" "ENSG00000154269" "ENSG00000143727" "ENSG00000134575"
## [5] "ENSG00000102575" "ENSG00000135002" "ENSG00000090013" "ENSG00000160688"
##
## $`Sulfur metabolism`
## [1] "ENSG00000162813" "ENSG00000105755" "ENSG00000128309" "ENSG00000104331"
## [5] "ENSG00000137767" "ENSG00000139531" "ENSG00000128311" "ENSG00000143416"
## [9] "ENSG00000198682" "ENSG00000138801"
##
## $`Sulfur relay system`
## [1] "ENSG00000124217" "ENSG00000174177" "ENSG00000164172" "ENSG00000128309"
## [5] "ENSG00000128311" "ENSG00000167118" "ENSG00000142544" "ENSG00000244005"
##
## $`Ubiquinone and other terpenoid-quinone biosynthesis`
## [1] "ENSG00000167186" "ENSG00000196715" "ENSG00000181019" "ENSG00000115486"
## [5] "ENSG00000173085" "ENSG00000158104" "ENSG00000119723" "ENSG00000132423"
## [9] "ENSG00000198650" "ENSG00000167397" "ENSG00000110871"
##
## $`Valine, leucine and isoleucine biosynthesis`
## [1] "ENSG00000135094" "ENSG00000139410" "ENSG00000060982" "ENSG00000105552"
##
## $`Virion - Adenovirus`
## [1] "ENSG00000154639" "ENSG00000117335" "ENSG00000121594" "ENSG00000114013"
##
## $`Virion - Flavivirus and Alphavirus`
## [1] "ENSG00000104938" "ENSG00000113249" "ENSG00000090659" "ENSG00000167085"
## [5] "ENSG00000162576" "ENSG00000092445"
##
## $`Virion - Herpesvirus`
## [1] "ENSG00000121716" "ENSG00000085514" "ENSG00000119912" "ENSG00000197081"
## [5] "ENSG00000161638" "ENSG00000259207" "ENSG00000110400" "ENSG00000130202"
## [9] "ENSG00000157873"
##
## $`Virion - Human immunodeficiency virus`
## [1] "ENSG00000104938" "ENSG00000160791" "ENSG00000090659" "ENSG00000121966"
## [5] "ENSG00000010610"
##
## $`Virion - Rotavirus`
## [1] "ENSG00000164171" "ENSG00000150093"
##
## $`Vitamin B6 metabolism`
## [1] "ENSG00000135069" "ENSG00000138356" "ENSG00000144362" "ENSG00000108439"
## [5] "ENSG00000241360" "ENSG00000160209"
Other summary statistics can be used for the aggregation, such as median, standard deviation, min, or max. See below some more examples with varying number of genes in the gene sets:
min <- 15
pathway_activity <- summarize_pathway_level(assay(airway, "counts"), kegg_sets, type = "sd", minsize = min)
##
## 361 functional sets read.
## iteration 100
## iteration 200
## iteration 300
## 341 successful functional aggregations over minsize
## 20 failed functional aggregations under minsize
## Functional activity score matrix has dimensions: 341 , 8
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity), ",", ncol(pathway_activity)))
## [1] "Pathway activity score matrix has dimensions: 341 , 8"
## SRR1039508 SRR1039509
## 2-Oxocarboxylic acid metabolism 897.8148 933.7587
## ABC transporters 1343.7024 1247.1723
## AGE-RAGE signaling pathway in diabetic complications 41059.3414 27919.9726
## AMPK signaling pathway 6261.0234 5764.0069
## ATP-dependent chromatin remodeling 6221.1335 7862.9219
## Acute myeloid leukemia 1540.3614 1571.4687
## SRR1039512 SRR1039513
## 2-Oxocarboxylic acid metabolism 1176.073 700.9372
## ABC transporters 1700.936 1034.8811
## AGE-RAGE signaling pathway in diabetic complications 61813.278 31134.0064
## AMPK signaling pathway 7173.520 3452.9315
## ATP-dependent chromatin remodeling 8738.562 8199.3336
## Acute myeloid leukemia 2372.710 1613.5125
## SRR1039516 SRR1039517
## 2-Oxocarboxylic acid metabolism 1235.497 1476.768
## ABC transporters 2075.315 2100.908
## AGE-RAGE signaling pathway in diabetic complications 44791.910 50360.551
## AMPK signaling pathway 6498.032 6844.707
## ATP-dependent chromatin remodeling 6647.548 19701.660
## Acute myeloid leukemia 2178.260 3040.191
## SRR1039520 SRR1039521
## 2-Oxocarboxylic acid metabolism 818.0049 958.0687
## ABC transporters 1482.2698 1751.8586
## AGE-RAGE signaling pathway in diabetic complications 46530.1838 42718.1856
## AMPK signaling pathway 5062.8806 5557.3168
## ATP-dependent chromatin remodeling 6034.5251 9437.6834
## Acute myeloid leukemia 1703.6101 2032.9305
min <- 7
pathway_activity <- summarize_pathway_level(assay(airway, "counts"), kegg_sets, type = "median", minsize = min)
##
## 361 functional sets read.
## iteration 100
## iteration 200
## iteration 300
## 349 successful functional aggregations over minsize
## 12 failed functional aggregations under minsize
## Functional activity score matrix has dimensions: 349 , 8
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity), ",", ncol(pathway_activity)))
## [1] "Pathway activity score matrix has dimensions: 349 , 8"
## SRR1039508 SRR1039509
## 2-Oxocarboxylic acid metabolism 424.0 553.0
## ABC transporters 136.0 92.0
## AGE-RAGE signaling pathway in diabetic complications 683.5 581.5
## AMPK signaling pathway 667.0 636.5
## ATP-dependent chromatin remodeling 663.0 597.0
## Acute myeloid leukemia 555.0 566.0
## SRR1039512 SRR1039513
## 2-Oxocarboxylic acid metabolism 444.0 405.0
## ABC transporters 225.0 76.0
## AGE-RAGE signaling pathway in diabetic complications 779.5 435.0
## AMPK signaling pathway 801.5 491.5
## ATP-dependent chromatin remodeling 786.0 471.0
## Acute myeloid leukemia 700.0 393.0
## SRR1039516 SRR1039517
## 2-Oxocarboxylic acid metabolism 518.0 780
## ABC transporters 224.0 138
## AGE-RAGE signaling pathway in diabetic complications 786.5 983
## AMPK signaling pathway 756.5 1084
## ATP-dependent chromatin remodeling 797.0 959
## Acute myeloid leukemia 663.0 767
## SRR1039520 SRR1039521
## 2-Oxocarboxylic acid metabolism 375 614.0
## ABC transporters 155 86.0
## AGE-RAGE signaling pathway in diabetic complications 575 601.5
## AMPK signaling pathway 622 664.0
## ATP-dependent chromatin remodeling 602 666.0
## Acute myeloid leukemia 504 618.0
Example usage 3: Summarize airway
omics data with test statistics
Using the same airway
data and gene sets kegg_sets
from previous examples, let’s generate aggregated representations using test statistics. These operators allow to compare the measurements for each sample between the molecules in each functional set and the molecules not in the given functional set. They may help identify functionally related genes/molecules that exhibit coordinated or significant deviations in their expression patterns across samples. Currently, the implemented available statistical tests in funOmics
are the t-test, Wilcoxon test, and Kolmogorov–Smirnov test.
When using test statistics, one has to be mindful about the assumptions of the test and the distribution of the data. For instance, the t-test assumes that the data is normally distributed and compares the means of two groups; the Wilcoxon test assumes that the data is continuous and symmetric, and compares the medians of two groups; the Kolmogorov–Smirnov test is a non-parametric test that does not assume any distribution and compares the entire cumulative distribution functions (i.e, in this context, the overall shapes of the distributions of values) of two groups. Here we provide an example using the t-test statistic:
pathway_activity <- summarize_pathway_level(assay(airway, "counts"), kegg_sets, type = "ttest", minsize = 15)
##
## 361 functional sets read.
## iteration 100
## iteration 200
## iteration 300
## 341 successful functional aggregations over minsize
## 20 failed functional aggregations under minsize
## Functional activity score matrix has dimensions: 341 , 8
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity), ",", ncol(pathway_activity)))
## [1] "Pathway activity score matrix has dimensions: 341 , 8"
In this case, summarize_pathway_level
has generated a pathway-level activity score for each of the 8 samples, for 339 pathways containing more than 15 genes. The resulting test statistic for each sample represents the difference between the two groups (i.e., the functional set and rest of molecules) for each gene or molecule. These statistics can be then be used as features for further analysis or modeling:
print(head(pathway_activity))
## SRR1039508 SRR1039509
## 2-Oxocarboxylic acid metabolism 2.454453 2.436486
## ABC transporters 1.603352 1.763264
## AGE-RAGE signaling pathway in diabetic complications 2.381487 2.357986
## AMPK signaling pathway 2.226225 2.375512
## ATP-dependent chromatin remodeling 2.635281 2.273416
## Acute myeloid leukemia 3.528012 3.738857
## SRR1039512 SRR1039513
## 2-Oxocarboxylic acid metabolism 2.165767 2.224862
## ABC transporters 1.596487 1.724146
## AGE-RAGE signaling pathway in diabetic complications 2.135969 2.024587
## AMPK signaling pathway 2.191499 2.840510
## ATP-dependent chromatin remodeling 2.353859 2.005228
## Acute myeloid leukemia 2.841968 3.153732
## SRR1039516 SRR1039517
## 2-Oxocarboxylic acid metabolism 2.271840 2.322606
## ABC transporters 1.624509 1.672874
## AGE-RAGE signaling pathway in diabetic complications 2.472229 2.311854
## AMPK signaling pathway 2.329357 2.939678
## ATP-dependent chromatin remodeling 2.763006 1.891442
## Acute myeloid leukemia 3.064161 3.160045
## SRR1039520 SRR1039521
## 2-Oxocarboxylic acid metabolism 2.285466 2.280632
## ABC transporters 1.696842 1.782976
## AGE-RAGE signaling pathway in diabetic complications 2.234984 2.055617
## AMPK signaling pathway 2.290621 2.660227
## ATP-dependent chromatin remodeling 2.526441 2.172547
## Acute myeloid leukemia 2.978309 3.599640
Importantly, this is just an illustrative example. In real-world experiments, you may have to consider that genes or molecules in a certain set might be longer or naturally more abundant and therefore might have higher overall read counts (abundance) compared to genes or molecules outside the set, even if their expression levels (as a proportion of transcripts) are similar. Normalizing the expression counts for all genes (including those within and outside the sets) using a suitable method (e.g., library size normalization, transcript length normalization) can help make the expression levels more comparable before performing the aggregation and set comparisons.
Molecular sets beyond KEGG and omics matrices beyond SummarizedExperiment
The package funOmics
interoperates with KEGGREST to retrieve molecular sets from the KEGG through the function get_kegg_sets
(see description and example above). Other real-world molecular sets can be downloaded from several sources. In terms of gene sets, the Gene Ontology is a versatile resource that covers three domains: cellular components, biological processes and molecular functions. Reactome pathways can also be used to generate higher-level functional representations from omics data. Explore the different releases and download the corresponding gene sets for the different types of GO terms, and reactome pathways here. You can also aggregate genes into protein complexes, which you can find in the CORUM database.
Regarding other omics types, such as metabolomics, the function summarize_pathway_level
can be applied in a similar manner to a metabolomics matrix X
and KEGG metabolic pathways. Metabolite sets from KEGG pathways can also be downloaded with the KEGG API.
After obtaining the molecular sets information, this data has to be formatted as a list of lists (similar to what is obtained from the get_kegg_sets
function). In other words, you need a structure where you have a list of multiple molecular sets names, and each of these sets is represented as a list of molecule identifiers, such as entrez IDs, PubChem CIDs, Uniprot IDs, etc. For instance, let’s retrieve gene sets from GO terms for cellular compartments. The information can be downloaded from the msigdb link or accessed programmatically as follows:
goccdb <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5/c5.go.cc.v7.5.entrez.gmt"
downdb <- sapply(readLines(goccdb), function(x) strsplit(x, "\t")[[1]])
gocc_sets <- sapply(as.matrix(downdb), function(x) x[3:length(x)])
names(gocc_sets) = sapply(as.matrix(downdb), function(x) x[1])
gocc_sets[1:3]
## $GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX
## [1] "1069" "1161" "2067" "2071" "2072" "2073" "5424" "5887" "7507" "7508"
## [11] "7515"
##
## $GOCC_HISTONE_DEACETYLASE_COMPLEX
## [1] "10013" "10014" "10284" "10428" "10467" "10524" "10629" "10847"
## [9] "10856" "10902" "10933" "1107" "1108" "116092" "1457" "221037"
## [17] "23186" "23309" "23314" "23468" "25855" "25942" "26038" "283248"
## [25] "3065" "3066" "3094" "3622" "473" "51317" "51564" "51780"
## [33] "53615" "54556" "54815" "55758" "55806" "55809" "55818" "55869"
## [41] "55929" "57459" "57504" "57634" "57649" "58516" "5928" "5931"
## [49] "64426" "64431" "6907" "7764" "79595" "79685" "79718" "79885"
## [57] "81611" "8204" "8295" "83933" "84215" "84312" "8607" "8819"
## [65] "8841" "90665" "9112" "91748" "9219" "9611" "9734" "9759"
##
## $GOCC_SAGA_COMPLEX
## [1] "100130302" "10474" "10629" "112869" "117143" "170067"
## [7] "23326" "2648" "27097" "55578" "56943" "56970"
## [13] "6871" "6878" "6880" "6881" "6883" "8295"
## [19] "8464" "8850" "9913"
As you can see, the resulting gocc_sets
object is a list of lists where each element represents a GO cellular compartment gene set. Here you can also use the function short_sets_detail
to get information about which and how many pathways contain less than a given number of molecules:
min <- 8
short_sets <- short_sets_detail(gocc_sets, min)
print(head(short_sets$short_sets_molecules))
## $GOCC_TRANSCRIPTION_FACTOR_TFIIIC_COMPLEX
## [1] "112495" "2975" "2976" "9328" "9329" "9330"
##
## $GOCC_COMMITMENT_COMPLEX
## [1] "11338" "55015" "6631" "6632" "6634"
##
## $GOCC_THO_COMPLEX
## [1] "57187" "79228" "80145" "84321" "8563" "9984"
##
## $GOCC_EKC_KEOPS_COMPLEX
## [1] "112858" "51002" "55644" "8270" "84520"
##
## $GOCC_GLYCOSYLPHOSPHATIDYLINOSITOL_N_ACETYLGLUCOSAMINYLTRANSFERASE_GPI_GNT_COMPLEX
## [1] "51227" "5277" "5279" "5283" "84992" "8818" "9091"
##
## $GOCC_HRD1P_UBIQUITIN_LIGASE_ERAD_L_COMPLEX
## [1] "51009" "550" "57414" "6400" "79139" "84447" "91319"
Notably, omics data does not always come from a SummarizedExperiment
object. Some times, it is imported from a csv file, generated through other pre-processing steps and packages, or even generated from a simulation. In these cases, the data has to be formatted as a matrix of dimensions g*s
(g
molecules and s
samples).
Let’s see an example usage of the GO cellular compartments gene sets gocc_sets
where omics data is of type matrix. For this purpose, we will create an expression matrix where the expression values are random positive values sampled from a standard normal distribution. Please, note that funOmics
can be used to aggregate other types of omics data and molecular sets, such as metabolomics or proteomics that may have a similar range of values.
Let’s simulate a gene expression matrix X_expr
, where gene IDs are codes between 1:10000 (to match entrez IDs), and summarize_pathway_level
can be applied:
# Example usage:
set.seed(1)
g <- 10000
s <- 20
X_expr <- matrix(abs(rnorm(g * s)), nrow = g, dimnames = list(1:g, paste0("s", 1:s)))
print(paste("Dimensions of omics matrix X:", dim(X_expr)[1], "*", dim(X_expr)[2]))
## [1] "Dimensions of omics matrix X: 10000 * 20"
## s1 s2 s3 s4 s5 s6 s7
## 1 0.6264538 0.8043316 0.2353485 0.6179223 0.2212571 0.5258908 0.3413341
## 2 0.1836433 1.0565257 0.2448250 0.8935057 0.3517935 0.4875444 0.4136665
## 3 0.8356286 1.0353958 0.6421869 0.4277562 0.1606019 1.1382508 0.1220357
## 4 1.5952808 1.1855604 1.9348085 0.2999012 0.1240523 1.2151344 1.5893806
## 5 0.3295078 0.5004395 1.0386957 0.5319833 0.6598739 0.4248307 0.7874385
## 6 0.8204684 0.5249887 0.2835501 1.7059816 0.5038493 1.4508403 1.5920640
## s8 s9 s10 s11 s12 s13 s14
## 1 1.00203611 1.55915937 0.09504307 0.7914415 0.0145724495 0.8663644 0.1604426
## 2 0.02590761 0.20166217 0.38805939 0.3921679 1.7854043337 0.9476952 0.9241849
## 3 0.44814178 1.04017610 2.13657003 0.4726670 0.0002997544 0.4522428 1.5561751
## 4 0.84323332 0.07195772 0.55661945 0.4579517 0.4356948690 0.2782408 0.8812202
## 5 0.21846310 0.01526544 0.59094164 0.1681319 1.4076452475 1.4175945 0.5263595
## 6 0.47678629 0.33938598 1.52014345 0.5856737 0.6929698698 0.6329981 0.4627372
## s15 s16 s17 s18 s19 s20
## 1 0.249371112 1.2520152 2.3150930 0.4414410 0.82485558 1.37086468
## 2 0.335346796 0.3351313 1.0603800 0.4130862 0.74402087 0.54569610
## 3 0.004405287 0.1080678 0.3970672 0.8660777 0.69009734 1.62446330
## 4 0.986768348 0.4717051 0.4840034 2.2615708 1.76900681 0.06247283
## 5 0.543575705 2.5070607 1.3584146 0.1787018 0.55215640 0.57021255
## 6 0.626142823 1.2451344 0.6370574 0.4713582 0.03257056 0.31350574
Now, let’s summarize the expression data using standard deviation pooling for the GO cellular compartments gene sets. We won’t specify a minimum size of sets in this case, so the default minsize
of 10 is used:
sd_gocc_expr <- summarize_pathway_level(X_expr, gocc_sets, type="sd", minsize=8)
##
## 1006 functional sets read.
## iteration 100
## iteration 200
## iteration 300
## iteration 400
## iteration 500
## iteration 600
## iteration 700
## iteration 800
## iteration 900
## iteration 1000
## 550 successful functional aggregations over minsize
## 456 failed functional aggregations under minsize
## Functional activity score matrix has dimensions: 550 , 20
## s1 s2 s3 s4
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX 0.4045096 0.3841496 0.4876056 0.6158703
## GOCC_HISTONE_DEACETYLASE_COMPLEX 0.5961716 0.5524931 0.4077048 0.6363483
## GOCC_SAGA_COMPLEX 0.6190007 0.4613048 0.4179023 0.5202579
## GOCC_GOLGI_MEMBRANE 0.6050940 0.5866127 0.6168553 0.6264271
## GOCC_UBIQUITIN_LIGASE_COMPLEX 0.7146701 0.7208348 0.6495247 0.6066239
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX 0.5324259 0.6635115 0.6381171 0.6187512
## s5 s6 s7 s8
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX 0.5427526 0.8412107 0.6130915 0.5778133
## GOCC_HISTONE_DEACETYLASE_COMPLEX 0.7490914 0.4977517 0.5737141 0.6572306
## GOCC_SAGA_COMPLEX 0.4176742 0.5104733 0.6437678 0.6215122
## GOCC_GOLGI_MEMBRANE 0.5953969 0.6037806 0.6283209 0.6215994
## GOCC_UBIQUITIN_LIGASE_COMPLEX 0.6213030 0.6611160 0.6128709 0.5919404
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX 0.6536113 0.6025263 0.6968490 0.4874398
## s9 s10 s11 s12
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX 0.8376548 0.6066229 0.6922118 0.9214760
## GOCC_HISTONE_DEACETYLASE_COMPLEX 0.6117108 0.7045937 0.3173755 0.6027702
## GOCC_SAGA_COMPLEX 0.6731533 0.7353114 0.2952001 0.4611380
## GOCC_GOLGI_MEMBRANE 0.5759732 0.6395941 0.6302452 0.6040828
## GOCC_UBIQUITIN_LIGASE_COMPLEX 0.6589884 0.5835952 0.6676812 0.5805518
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX 0.6787391 0.3562284 0.6064547 0.4911328
## s13 s14 s15 s16
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX 0.4879332 0.6815041 0.3723823 0.4737714
## GOCC_HISTONE_DEACETYLASE_COMPLEX 0.5398999 0.5982277 0.4842830 0.7086422
## GOCC_SAGA_COMPLEX 0.6219926 0.4270019 0.9318551 0.4333853
## GOCC_GOLGI_MEMBRANE 0.5775356 0.5659483 0.5885292 0.6093459
## GOCC_UBIQUITIN_LIGASE_COMPLEX 0.4800778 0.5803810 0.5722157 0.6796406
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX 0.5056348 0.5565119 0.4373176 0.6570645
## s17 s18 s19 s20
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX 0.8067935 0.3302759 0.5167987 0.7764892
## GOCC_HISTONE_DEACETYLASE_COMPLEX 0.6092391 0.5663914 0.5499079 0.4634814
## GOCC_SAGA_COMPLEX 0.5610913 0.6366751 0.4932885 0.6815148
## GOCC_GOLGI_MEMBRANE 0.5888242 0.6110807 0.5752147 0.6149827
## GOCC_UBIQUITIN_LIGASE_COMPLEX 0.5656959 0.6670322 0.5867161 0.6975617
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX 0.5784575 0.4706088 0.5487221 0.5346914
GO cellular compartments level expression signatures have been generated via standard deviation aggregation. You can apply similar procedures for other types of molecular sets, aggregation functions and omics types. The package funOmics
is conceived to be flexible across omics types and types of molecular sets, so you can also tailor or directly create your own list of molecular sets based on specific criteria of your experiments (e.g., include only protein complexes involved in ubiquitination, or define ad hoc metabolic routes involving specific metabolites).