Content
- 0.0 Introduction
- 1.0 Basic: Input Preparation and Prediction
- 2.0 Advanced 1: Input Matrix and Statistics
- 2.1 Advanced 2: Helping Functions
- 2.2 Advanced 3: Clustering and Multiple Datasets
- 3.0 List of Functions
- 4.0 Acknowledgment
- 5.0 References
0.0 Introduction
In the last years a class of ncRNAs were described named lincRNAs (large intergenic noncoding RNAs) as untranslated transcripts with a length of over 200 bp. [1, 2] A number of papers focused on their tissue-specific upregulation in cancer cells. [1, 3, 4] The annotation and functional classification of lincRNAs (and ncRNAs) remain challenging. Given a RNAseq or microarray experiment, one possible approach to identify candidate ncRNAs poses “guilty by association”, a principle stating that coding and noncoding genes showing a comparable expression pattern are likely to share functionality. [5] This idea can now be easily applied on arbitrary expression matrices using this R package LINC. Enriched biological terms from gene annotation resources like Reactome PA and Gene Ontology (GO) for the cluster (multiple candidate lincRNA genes) can be identified applying getbio() which will call supported functions from the clusterProfiler package. [6 - 8] This analysis will reveal which functions, pathways or compartments are associated with the lincRNA-co-expressed genes. The basic section will show how to predict the biological functions of lincRNAs based on gene expression data. The sections focusing on the advanced options will provide explanations how to control thresholds used in the computations.
1.0 Basic: Input Preparation and Prediction
This section will demonstrate the steps intended by the methods in the LINC package used to predict the biological functions of lincRNAs by co-expression. As an input a gene expression matrix is required with rows corresponding to genes and columns representing samples. The following piece of code shows an expression matrix termed GTEX_CRBL - CRBL stands for cerebellum. (Please note, that the large gene expression matrix GTEX_CRBL from http://www.gtexportal.org is not provided in this Version of the package.)
str(GTEX_CRBL)
num [1:56318, 1:117] 0.0475 10.7799 0.105 0 0 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:56318] "ENSG00000223972" "ENSG00000227232" "ENSG00000243485" "ENSG00000237613" ...
..$ : chr [1:117] "GTEX-117XS-3126-SM-5GIDP" "GTEX-1192X-3226-SM-5987D" "GTEX-11DXW-1026-SM-5H11K"
"GTEX-11DXY-3126-SM-5N9BT" ...
In order to apply the methods in the LINC package it is recommended to select for the 1000 up to 10.000 high-variance genes. It is also useful also look for the expression levels of genes. In a second step the biotypes need to be identified, for instance by biomaRt:getBM. These lines of code are an example how to choose high-variance genes and how to get the biotypes.
# STEP 1: select the high-variance genes
var_index <- order(apply(GTEX_CRBL, 1, var), decreasing = T)
GTEX_CRBL_HVAR <- GTEX_CRBL[var_index[1:5000], ]
# STEP 2: get the gene biotype
require(biomaRt)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
biotype <- getBM(attributes=c('gene_biotype',
'ensembl_gene_id'),
filters = 'ensembl_gene_id',
values = rownames(GTEX_CRBL_HVAR),
mart = ensembl)
# STEP 3:
index <- match(rownames(GTEX_CRBL_HVAR), biotype$ensembl_gene_id)
GTEX_CRBL_BIOTYPE <- biotype$gene_biotype[index]
The 5000 selected genes contain 4256 protein-coding genes and 74 lincRNA genes.
table(GTEX_CRBL_BIOTYPE)
3prime_overlapping_ncRNA antisense lincRNA
8 279 74
miRNA misc_RNA Mt_rRNA
3 4 2
Mt_tRNA processed_pseudogene processed_transcript
4 119 37
protein_coding rRNA sense_intronic
4256 1 11
sense_overlapping snoRNA snRNA
13 8 1
TR_C_gene transcribed_processed_pseudogene transcribed_unprocessed_pseudogene
1 8 25
unitary_pseudogene unprocessed_pseudogene
2 18
The following example uses a preprocessed gene expression matrix with 7 lincRNA genes as queries. After separating the queries from the protein-coding genes with linc(), the function clusterlinc() computes the co-expression networks and returns the significantly co-expressed genes associated with the queries. In the third step getbio() derives the enriched biological terms.
# the preprocessed expression matrix with 1000 genes
str(cerebellum)
## num [1:1000, 1:50] 73.3 101.4 328.4 244.2 24.4 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:1000] "3337" "3304" "3320" "5730" ...
## ..$ : chr [1:50] "A1" "A2" "A3" "A4" ...
# a TRUE/FALSE vector with TRUE for protein-coding genes
str(pcgenes_crbl)
## logi [1:1000] TRUE TRUE TRUE TRUE TRUE TRUE ...
# STEP 1: Separate the protine-coding genes from the queries (lincRNAs)
crbl_matrix <- linc(cerebellum, codingGenes = pcgenes_crbl)
# STEP 2: Compute the co-expression network with a fixed threshold
crbl_cluster <- clusterlinc(crbl_matrix, pvalCutOff = 0.005)
# STEP 3: Interrogate enriched biological terms for co-expressed genes
crbl_bp <- getbio(crbl_cluster)
# Show the results as a plot!
plotlinc(crbl_bp)
This plot contains plenty of information. The figure in the lower panel depicts the results of the gene set enrichment analysis. In this case terms of the subontology ‘BP’ (Biological Process) from GO (Gene Ontology) are shown. The query with the Entrez id ‘55384’ known as MEG3 (maternally expressed 3) is in this context co-expressed with genes involved in mRNA splicing and processing. This query exhibits also the highest expression level across the 9 analyzed lincRNA genes. Its partner in the cluster is ‘283131’ NEAT1 (nuclear paraspeckle assembly transcript 1) for which no significant co-expression pattern was found in this context.
It can be informative to take a closer look at the co-expressed genes, here for ‘55384’ alias MEG3.
getcoexpr(crbl_cluster, query = "55384")[1:5]
## [1] "6625" "116983" "25789" "6430" "7345"
# The co-expressed genes can also be returned as gene symbols.
getcoexpr(crbl_cluster, query = "55384", keyType = 'SYMBOL')[1:5]
## [1] "SNRNP70" "ACAP3" "TMEM59L" "SRSF5" "UCHL1"
Moreover, one could focus the analysis on a single query like MEG3. For a single query gene they function singlelinc() comes with many options in terms of co-expression selection. One can select for absolute values, negative correlations and a maximum for co-expressed genes. In contrast to clusterlinc(), singlelinc() will directly call a gene annotation resource for the co-expressed genes. For multiple queries the workflow as sequence of function calls would be linc() -> clusterlinc() -> getbio() -> plotlinc() … and linc() -> singlelinc() -> plotlinc() for a single query. If there are more than > 100 queries (lincRNA genes) in the matrix it will be beneficial to subset the expression matrix as mentioned before. Since the applied enrichment functions like enrichGO expects Entrez ids the protein-coding genes should use this gene system. Identifiers of another gene systems will be translated to Entrez.
meg3 <- singlelinc(crbl_matrix, query = "55384",
onlycor = TRUE, underth = FALSE,
threshold = 0.5, ont = 'MF')
plotlinc(meg3)
This yields 22 genes which have a correlation of > 0.5 to the expression profile of MEG3. The molecular function (‘MF’) is related to histone modification.
2.0 Advanced 1: Input Matrix and Statistics
The main function of this package termed linc() computes a correlation matrix and perfoms statistical correction. Spearman’s rank correlations is the default method. As input it requires an a matrix, data.frame or ExpressionSet. It provides options for the removal of principle components and gene outliers. These methods can indicate whether correlation values are influenced by confounding factors. Negative correlation can be computed by providing a user-defined correlation function (example C). Protein-coding genes and queries (lincRNAs and other ncRNAs) are separated by the argument codingGenes, an assignment of protein-coding genes given as a logical vector or a vector of gene biotypes. Every gene that is not a protein-coding one is considered as a query. The output of this function is a LINCmatrix instance. These objects can be plotted applying the plotlinc() function. The resulting plot illustrates the statistics of the input matrix. Longer computation times occur for genes > 10000, samples > 100 and queries > 100!
data(BRAIN_EXPR)
# (A) call 'linc' with no further arguments
# 'cerebellum' is a matrix of expression values; rows correspond to genes
# 'pcgenes_crbl' is a TRUE/FALSE vector; TRUE indicates a protein-coding gene
crbl_matrix <- linc(cerebellum, codingGenes = pcgenes_crbl)
# (B) remove first seven principle components
crbl_matrix_pca <- linc(cerebellum, codingGenes = pcgenes_crbl, rmPC = c(1:7))
# (C) negative correlation by using 'userFun'
crbl_matrix_ncor <- linc(cerebellum, codingGenes = pcgenes_crbl,
userFun = function(x,y){ -cor(x,y) })
# (D) remove outliers using the ESD method
crbl_matrix_esd <- linc(cerebellum, codingGenes = pcgenes_crbl, outlier = "esd")
# (E) plot this object
plotlinc(crbl_matrix_esd)
Roughly 20% of all genes in the dataset contained outliers. Removal of these outliers changed also the histogram of samples in the middle panel. One could compare this to:
plotlinc(crbl_matrix)
What is the effect of removing the first seven PCs from the dataset for the prediction of MEG3 in Section 1.0?
meg3_pca <- singlelinc(crbl_matrix_pca, query = "55384",
onlycor = TRUE, underth = FALSE,
threshold = 0.5, ont = 'MF')
plotlinc(meg3_pca)
Removing PCs changed the prediction! Observations which depend the main variance in the dataset are no longer significant.
intersect(getcoexpr(meg3), getcoexpr(meg3_pca))
## [1] "9881" "8675" "2686" "56654" "81669" "2648" "6625"
## [8] "27352" "3151" "1025" "8473" "222183" "81890" "6430"
## [15] "2145"
15 genes given as Entrez ids are still co-expressed with the query gene MEG3 after PC removal.
2.1 Advanced 2: Helping Functions
The LINC package comes with a number of helping functions. This section describes most of them:
There are getter functions to access the slots in an object of the LINC class.
# results() - results (different for subclasses)
# correlation() - correlation matrices
# assignment() - vector of protein-coding genes
# history() - stored parameters
# express() - original expression matrix
cerebellum <- express(crbl_cluster)
str(cerebellum)
## num [1:1000, 1:50] 73.3 101.4 328.4 244.2 24.4 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:1000] "3337" "3304" "3320" "5730" ...
## ..$ : chr [1:50] "A1" "A2" "A3" "A4" ...
The function call getlinc(…, subject = “queries”) will return the possible queries in the object.
# getlinc() is used to accesss information
getlinc(crbl_cluster, subject = "queries")
## [1] "55384" "7503" "647979" "283131" "338799" "100093630"
## [7] "653635" "441951" "150776"
Converting a LINCcluster into a LINCmatrix with feature() for instance could be useful in case one wants to plot the distribution of correlation values instead of the cluster applying plotlinc().
# feature() can be used to convert objects
crbl_matrix <- crbl_cluster + feature(setLevel = "LINCmatrix", showLevels = FALSE)
By default, the assumed gene annotation in LINC refers to the organism “human” with the associated package org.Hs.eg.db. The function changeOrgDb() can be used to change the annotation platform.
# change the used gene annotation, here from "human" to "mouse"
murine_matrix <- changeOrgDb(crbl_matrix, OrgDb = 'org.Mm.eg.db')
plotlinc(…, showCor =) will output scatterplots of correlations between a query and up to 5 other genes.
# scatterplots, correlations and expression level of query
plotlinc(crbl_cluster, showCor = c("647979", "6726", "3337", "3304" ,"3320"))
## TableGrob (3 x 2) "arrange": 6 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[arrange]
## 3 3 (2-2,1-1) arrange gtable[layout]
## 4 4 (2-2,2-2) arrange gtable[layout]
## 5 5 (3-3,1-1) arrange gtable[layout]
## 6 6 (3-3,2-2) arrange gtable[layout]
It is possible to write the ids of co-expressed genes of a LINCcluster object into a table. The size of the table is restricted to 500 observations for each query, respectively.
linctable(file_name = "crbl_co_expr", input = crbl_cluster)
The function justlinc is a wrapper. It requires a gene expression matrix with columns corresponding to samples and rows to genes. A preprocessing of the matrix is not mandatory. Gene ids need to be given preferably as Ensembl gene ids. Importanly, this function uses predefined thresholds, so it will not give a result in every case! On the other hand it does not need additional information.
The following commands show how justlinc is called. (Please note, that the large gene expression matrix named GTEX_LIVER_CRUDE is not provided in this Version of the package. However, it can be downloaded from: https://github.com/ManuelGoepferich/EXPRESSION_DATA_LINC)
justlinc(GTEX_LIVER_CRUDE) # 'justlinc' will search for the 10 best candidates
my_lincRNAs <- c("ENSG00000197813") # This could also be a vector of ids
res <- justlinc(GTEX_LIVER_CRUDE, targetGenes = my_lincRNAs) # 'justlinc' with one query
The optional argument targetGenes can be either a gene ‘biotype’, a single candidate gene or a vector of genes. For multiple candidate genes it is assumed that they have a connection, for instance comparable co-expression pattern. As distance measure of two lincRNAs the Czekanovski dice distance [9] is used.
2.2 Advanced 3: Clustering and Multiple Datasets
The clusterlinc() functions perfroms a correlation test (Spearman’s rho by default) and clusters the candidate lincRNAs based on their interaction partners or their correlations. This relationship is represented by the dendrogram. As distance between two lincRNAs the Czekanovski dice distance [9] is considered, a measure for the number of shared interaction partners. There are different clustering methods to choose from.
# apply the distance method "correlation" instead of "dicedist", the default
crbl_cluster_cor <- clusterlinc(crbl_matrix, distMethod = "correlation")
plotlinc(crbl_cluster_cor)
Another function querycluster() enables clustering not only across different lincRNAs in one dataset, but also clustering of one lincRNA in multiple datasets. In order to distinguish between different expression matrices the function feature can be applied in order to add user-defined names (customID) and colors (customCol).
# add custom names and colors
gbm_cluster <- gbm_cluster + feature(customID = "CANCER_GBM", customCol = "red")
ctx_cluster <- ctx_cluster + feature(customID = "HEALTHY_CTX", customCol = "blue")
hpc_cluster <- hpc_cluster + feature(customID = "HEALTHY_HPC", customCol = "blue")
crbl_cluster <- crbl_cluster + feature(customID = "HEALTHY_CRBL", customCol = "blue")
# plot the dendrogram
norad <- querycluster('647979', queryTitle = 'NORAD',
gbm_cluster, # Glioblastoma
ctx_cluster, # Cortex
hpc_cluster, # Hippocampus
crbl_cluster) # Cerebellum
neat1 <- querycluster('283131', queryTitle = 'NEAT1',
gbm_cluster, ctx_cluster,
hpc_cluster, crbl_cluster)
grid.arrange(norad, neat1, ncol = 2)
The colors in the heatmap indicate the number of shared interaction partners. There is so far no direct function call to depict expression patterns of one lincRNA in multiple datasets. However, this task can be performed in few steps using lapply(), getcoexpr() and the clusterProfiler [6] package.
# STEP 1: get the co-expressed genes
norad <- lapply(list(gbm_cluster,
ctx_cluster,
hpc_cluster,
crbl_cluster),
function(x){
getcoexpr(x, '647979')})
# STEP 2: name the list
names(norad) <- c("CANCER_GBM",
"HEALTHY_CTX",
"HEALTHY_HPC",
"HEALTHY_CRBL")
# STEP 3: enrichment analysis in 'clusterProfiler'
require(clusterProfiler)
norad_cluster <- compareCluster(norad,
fun = 'enrichGO',
OrgDb = 'org.Hs.eg.db',
ont = 'BP')
plot(norad_cluster)
3.0 List of Functions
# WRAPPER
justlinc() # gene selection, co-expression
# MAIN FUNCTIONS
linc() # cor. matrix and statistics
clusterlinc() # cluster and cor. test
singlelinc() # single query co-expression
# PLOTTING FUNCTIONS
plotlinc() # main plotting function
querycluster() # one query in multiple data sets
# HELPING FUNCTIONS
getcoexpr() # get the co-expressed genes
getbio() # enriched terms for a cluster
object + feature() # level and data labeling
getlinc() # subsetting of 'LINC' objects
changeOrgDb() # change organism
linctable() # write to table
4.0 Acknowledgment
I want to thank my supervisor Dr. Carl Herrmann who supported me during the development of this package. In addition, I would like to mention the members of the eilslabs and the DKFZ (Deutsches Krebsforschungszentrum) at the University of Heidelberg. In particular, Prof. Dr. Benedikt Brors, Mattia Falcone, Dr. Michael Flechter, Calvin Chan, Sebastian Steinhauser and Qi Wang.
5.0 References
[1] Cheetham SW, Gruhl F, Mattick JS and Dinger ME (2013) Long noncoding RNAs and the genetics of cancer. British Journal of Cancer 108, 2419 - 2425.
[2] Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A and Rinn JL (2011) Integrative an-notation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes and Development doi: 10.1101/gad.17446611.
[3] Garzon R, Volinia S, Papaioannou D, Nicolet D, Kohlschmidt J, Yan PS, Mrózek K, Bucci D, Carroll AJ, Baer MR, Wetzler M, Carter TH, Powell BL, Kolitz JE, Moore JO, Eisfeld AK, Blachly JS, Blum W, Caligiuri MA, Stone RM, Marcucci G, Croce CM, Byrd JC and Bloomfield CD (2014) Expression and prognostic impact of lncRNAs in acute myeloid leukemia. PNAS 111 :18679 - 18684.
[4] Zhang J, Zhu N and Chen X (2015) A novel long noncoding RNA LINC01133 is upregulated in lung squamous cell cancer and predicts survival. Tumor Biology doi 10.1007/s13277-015-3460-9.
[5] Gillis J and Pavlidis P (2012) “Guilt by Association” Is the Exception Rather Than the Rule in Gene Networks. PLoS Computational Biology 8: e1002444.
[6] Yu G, Wang LG, Han Y, He QY (2015) clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284 - 287.
[7] Yu G, Wang L, Yan G and He Q (2015). DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis. Bioinformatics 31: 608 - 609.
[8] Yu G and He QY (2016). ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization. Molecular BioSystems 12: 477 - 479.
[9] Christine Brun, Francois Chevenet, David Martin, Jerome Wojcik, Alain Guenoche and Bernard Jacq" Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network" (2003) Genome Biology, 5:R6.