FindMyFriends is a framework for doing microbial comparative genomics. It defines a class system for working with pangenome data that allows transparent access to the underlying sequence data, while being able to handle huge collections of genomes. It also defines a set of novel algorithms that makes it possible to create high quality pangenomes at high speed, leveraging alignment free sequence comparison techniques.
FindMyFriends is an R package for creating pangenomes of large sets of microbial organisms, together with tools to investigate the results. The number of tools to create pangenomes is large and some of the more notable ones are Roary, OrthoMCL, PanOCT, inParanoid and Sybil. Within the R framework micropan (available from CRAN) is currently the only other entry, but the scope of FindMyFriends is much broader.
Within comparative microbiol genomics, a pangenome is defined as a collection of all the genes present in a set of related organisms and grouped together by some sort of homology. This homology measure has mainly been some sort of blast derived similarity, but as you shall see here, this is not the only possibility.
Pangenomes have many uses, both as organisational tools and as a mean to get insight into the collective genomic information present within a group of organisms. An example of the former is to ensure an equal functional annotation of all genes within a set of genomes by annotating pangenome gene groups rather than single genes. An example of the latter is to use a pangenome together with phenotypic information to discover new links between genes and phenotypes.
Usually pangenomes are calculated from the result of blasting all genes in the organisms under investigation against each other. This step is very time-consuming due to the complexity of the BLAST algorithm and the need to compare everything with everything (resulting in a quadratic scaling). FindMyFriends generally takes a different approach by ditching BLAST altogether. In its absence FindMyFriends uses alignment free sequence comparisons, based on decomposition of the sequences into K-mer feature vectors. K-mers are words of equal length derived by sliding a window of size K over the sequence and a K-mer feature vector is simply the count of each unique word, as seen below
GATTCGATTAG -> ATT: 2
CGA: 1
GAT: 2
TAG: 1
TCG: 1
TTA: 1
TTC: 1
With this decomposition the problem of calculating sequence similarity is reduced to a vector similarity problem, for which there are many different solution. The one employed in FindMyFriends is called cosine similarity, which is effectively the cosine to the angle between the two vectors. If the vector space is strictly positive then the value is bound between 0 and 1 with 1 being 100 % similarity.
Apart from the use of K-mers over BLAST, FindMyFriends also advocate a different approach to how the grouping is performed. Normally the final (or close to final) grouping is derived directly from the main similarity comparison step. In FindMyFriends this is turned around, and the main similarity step is very coarse, resulting in very large gene groups, unsuited as a final result. These groups are then investigated in more detail with regards to the sequence similarity, neighborhood similarity and sequence length. This division makes it possible to employ a very fast, but not very precise algorithm for the first step, and only use time to investigate gene pairs with a high likelihood of being equal. In the end this approach results in a linear scaling with regards to the number of genes, as well as a general speedup, and to my knowledge FindMyFriends is currently by far the fastest algorithm for creating pangenomes.
FindMyFriends is flexible and there are many ways to go about creating pangenomes - you can even specify them manually making it possible to import results from other algorithms into the framework. This vignette will only focus on the currently recommended approach.
This vignette will use a collection of 10 different Mycoplasma pneumonia and hyopneumonia strains. These organisms have an appreciable small genome making it easier to distribute and analyse.
# Unpack files
location <- tempdir()
unzip(system.file('extdata', 'Mycoplasma.zip', package='FindMyFriends'),
exdir=location)
genomeFiles <- list.files(location, full.names=TRUE, pattern='*.fasta')
The first step in performing a pangenomic analysis with FindMyFriends is to load gene data into a Pangenome object. This is done with the aptly named pangenome
function, which takes a list of fasta files, a boolean indicating whether the genes are translated and optionally information about the position of each gene on the chromosome and a boolean indicating whether to read all sequences into memory.
library(FindMyFriends)
mycoPan <- pangenome(genomeFiles[1:9],
translated = TRUE,
geneLocation = 'prodigal',
lowMem = FALSE)
mycoPan
## An object of class pgFullLoc
##
## The pangenome consists of 12247 genes from 9 organisms
## Gene groups not yet defined
##
## Genes are translated
The last parameters of the function call warrants some more explanation. Some of the algorithms in FindMyFriends uses positional information and this must be supplied with the geneLocation parameter. This can be done in a number of ways: As a data.frame with one row per gene, as a function that takes the header info of each gene from the fasta file and returns a data.frame with on row per gene, or as a string identifying the annotator used to find the gene (currently only Prodigal is supported). The gene info has the following format (all columns required):
head(geneLocation(mycoPan))
## contig start end strand
## 1 gi|71851486|gb|AE017243.1| 207 395 1
## 2 gi|71851486|gb|AE017243.1| 825 1109 1
## 3 gi|71851486|gb|AE017243.1| 1131 1409 1
## 4 gi|71851486|gb|AE017243.1| 1791 2879 1
## 5 gi|71851486|gb|AE017243.1| 2983 4521 1
## 6 gi|71851486|gb|AE017243.1| 4582 4842 1
The contig column holds information on the contig/chromosome on which the gene is located, the start and stop columns gives the start and stop position of translation and the strand column identifies on which strand the gene is located. Additional columns are allowed, but is currently unused.
The last parameter (lowMem) specifies whether all sequences should be read into memory (the default) or be kept as references to the original fasta files. The latter is useful in cases where the size of the sequence data becomes prohibitively large but is not needed for our little toy example.
Metadata about the organisms can be added and queried using the addOrgInfo
method. For instance we might add some taxonomy data.
library(reutils)
orgMeta <- lapply(orgNames(mycoPan), function(name) {
uid <- esearch(name, 'assembly')
taxuid <- elink(uid, dbTo = 'taxonomy')
reutils::content(esummary(taxuid), as = 'parsed')
})
orgMeta <- lapply(lapply(orgMeta, lapply, unlist), as.data.frame)
orgMeta <- do.call(rbind, orgMeta)
mycoPan <- addOrgInfo(mycoPan, orgMeta)
head(orgInfo(mycoPan))
## nGenes Id Status Rank Division
## AE017243 1336 262719 active NA mycoplasmas
## AE017244 1351 262722 active NA mycoplasmas
## AE017332 1316 295358 active NA mycoplasmas
## AP012303 1407 1112856 active NA mycoplasmas
## CP002077 1392 722438 active NA mycoplasmas
## CP002274 1367 907287 active NA mycoplasmas
## ScientificName CommonName TaxId AkaTaxId
## AE017243 Mycoplasma hyopneumoniae J NA 262719 0
## AE017244 Mycoplasma hyopneumoniae 7448 NA 262722 0
## AE017332 Mycoplasma hyopneumoniae 232 NA 295358 0
## AP012303 Mycoplasma pneumoniae 309 NA 1112856 0
## CP002077 Mycoplasma pneumoniae FH NA 722438 0
## CP002274 Mycoplasma hyopneumoniae 168 NA 907287 0
## Genus Species Subsp ModificationDate GenBankDivision
## AE017243 Mycoplasma hyopneumoniae NA 2010/10/08 00:00 Bacteria
## AE017244 Mycoplasma hyopneumoniae NA 2005/08/28 00:00 Bacteria
## AE017332 Mycoplasma hyopneumoniae NA 2004/10/09 00:00 Bacteria
## AP012303 Mycoplasma pneumoniae NA 2011/12/02 00:00 Bacteria
## CP002077 Mycoplasma pneumoniae NA 2010/07/28 00:00 Bacteria
## CP002274 Mycoplasma hyopneumoniae NA 2010/10/09 00:00 Bacteria
The raw sequences are easily accessible as an XStringSet object, or split by organism into an XStringSetList object:
genes(mycoPan)
## A AAStringSet instance of length 12247
## width seq names
## [1] 63 MQTNKNNLKVRTQQIRQQI...IIIDFTDLIAKQEVISR* gi|71851486|gb|AE...
## [2] 95 MSQVDAFLFDDIQGLANKQ...LRIFKHKLVEEKLEKHI* gi|71851486|gb|AE...
## [3] 93 MSKHFRNSIRELEGALKSI...KSEKRGKNIVHARDIAI* gi|71851486|gb|AE...
## [4] 363 MQSAILNNINSPLSSFFLK...VSPEKPEICQLVGLVLV* gi|71851486|gb|AE...
## [5] 513 MSKKSKNSSIEFDAIVVGG...AKKYNLEKRISKLKLIS* gi|71851486|gb|AE...
## ... ... ...
## [12243] 391 MFSFFKQIFKSLKKFFFLL...FTLDTSKLSHLDKEEFD* gi|440453185|gb|C...
## [12244] 222 MTNGITNQLICDHIDLKIA...IVADYLNQRPKTINEIN* gi|440453185|gb|C...
## [12245] 440 MEQFSAFKLLLKKQYETTL...MIEKDDSLQDIITSLVI* gi|440453185|gb|C...
## [12246] 251 MAISKKKRFFFDLAQDEDD...IPISRILTMILDQVIEE* gi|440453185|gb|C...
## [12247] 271 MIISFVNNKGGVLKTTMAT...YLEITKEILNLVNHGHQ* gi|440453185|gb|C...
genes(mycoPan, split = 'organism')
## AAStringSetList of length 9
## [["AE017243"]] gi|71851486|gb|AE017243.1|_1 # 207 # 395 # 1 # ID=1_1;par...
## [["AE017244"]] gi|71913466|gb|AE017244.1|_1 # 207 # 395 # 1 # ID=1_1;par...
## [["AE017332"]] gi|53987142|gb|AE017332.1|_1 # 1 # 189 # 1 # ID=1_1;parti...
## [["AP012303"]] gi|358640274|dbj|AP012303.1|_1 # 691 # 1833 # 1 # ID=1_1;...
## [["CP002077"]] gi|301633103|gb|CP002077.1|_1 # 691 # 1833 # 1 # ID=1_1;p...
## [["CP002274"]] gi|312600973|gb|CP002274.1|_1 # 1 # 189 # 1 # ID=1_1;part...
## [["CP003131"]] gi|506957411|gb|CP003131.1|_1 # 1 # 189 # 1 # ID=1_1;part...
## [["CP003802"]] gi|523582824|gb|CP003802.1|_1 # 207 # 395 # 1 # ID=1_1;pa...
## [["CP003913"]] gi|440453185|gb|CP003913.1|_1 # 692 # 1834 # 1 # ID=1_1;p...
Apart from that the object really needs some grouping before there is anything interesting to do with it.
A lot of the methods in FindMyFriends contains similar parameters, and a lot of these. To save typing when doing the analyses, these parameters can be set on a per-object manner. At creation a set of defaults is already set, which can be queried and modified:
# Query current defaults
head(defaults(mycoPan))
## $groupPrefix
## [1] "OG"
##
## $coreThreshold
## [1] 1
##
## $kmerSize
## [1] 5
##
## $lowerLimit
## [1] 0.5
##
## $algorithm
## [1] "infomap"
##
## $flankSize
## [1] 4
# Set a new default
defaults(mycoPan)$lowerLimit <- 0.6
Values supplied as arguments to a method will always override object defaults.
Pangenome calculation is generally split into two steps; one focusing on a coarse clustering of genes based on sequence similarity and one focusing on refining this clustering by comparing sequences and chromosomal neighborhood between genes in the clusters. Currently the fastest approach to the first step is based on the CD-Hit algorithm (cdhitGrouping
), but other approaches exists (gpcGrouping
and graphGrouping
). There is no advantages to using the slower algorithms, so there use is generally not encouraged. cdhitGrouping
works by repeatedly combining gene groups based on lower and lower similarity threshold. At each step the longest member of each gene group is chosen as a representative for the group in the following step. You generally want to set the lowest threshold rather low to ensure that genes belonging to the same group are definitely clustered together.
mycoPan <- cdhitGrouping(mycoPan, kmerSize = 5, cdhitIter = TRUE, nrep = 3)
Looking at our mycoPan object now reveals some additional information about the gene groups:
mycoPan
## An object of class pgFullLoc
##
## The pangenome consists of 12247 genes from 9 organisms
## 3141 gene groups defined
##
## Core|
## Accessory|===========================================:
## Singleton|======
##
## Genes are translated
This is not the end of our analysis, as these gene groups needs to be refined. The refinement step is done with the neighborhoodSplit
function (or kmerSplit
if chromosomal position is unavailable). This function investigates each gene group and compares the member genes based on sequence similarity, chromosomal neighborhood similarity, sequence length and genome mebership, and uses this information to build up a weighted graph with the member genes as nodes. Edges reflect similarity and are only created if the two genes passes all thresholds with regards to the different comparisons. From this graph cliques (complete subgraphs i.e. subgraphs were all nodes are connected to each other) are extracted sequentially, starting with the clique containing the highest weighted edges. This ensures that all members of the resulting gene groups share a similarity with all other members and avoids the inclusion of genes being very similar to a few genes but not all. After this main step highly similar gene groups lying in parallel (sharing a gene group either up- or downstream) are merged together to create the final grouping.
mycoPan <- neighborhoodSplit(mycoPan, lowerLimit = 0.8)
As can be seen this step results in an increase in the number of gene groups:
mycoPan
## An object of class pgFullLoc
##
## The pangenome consists of 12247 genes from 9 organisms
## 3414 gene groups defined
##
## Core|
## Accessory|==========================================:
## Singleton|=======:
##
## Genes are translated
In general you should expect that the use of FindMyFriends will result in a higher number of gene groups than other algorithms. This is due to its strict approach to defining similarity and in general you can expect the final result to be of very high quality.
Once the pangenome has been calculated it is possible to perform a range of different post-processing steps and analyses on results.
FindMyFriend forcefully avoids grouping genes from the same genome together. While it is generally a boon to split up paralogue gene groups, as the genes might have different functionality within the cell (and probably be under different regulation), it can be nice to maintain a link between groups with a certain similarity. Such a link can be obtained by calculating a kmer similarity between representatives from each gene group. Once paralogue links have been created they can be used when extracting raw sequences and the information will be taken into account when calculating organism statistics. Furthermore it is possible to merge paralogues into true gene groups (effectively undoing the neighborhood splitting).
mycoPan <- kmerLink(mycoPan, lowerLimit=0.8)
genes(mycoPan, split='paralogue')[[1]]
## A AAStringSet instance of length 40
## width seq names
## [1] 396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEKYQKQ* gi|71851486|gb|AE...
## [2] 396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEKYQKQ* gi|71851486|gb|AE...
## [3] 396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEKYQKQ* gi|71851486|gb|AE...
## [4] 396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEEYQKQ* gi|71851486|gb|AE...
## [5] 396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEKYQKQ* gi|71851486|gb|AE...
## ... ... ...
## [36] 396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEKYQKQ* gi|506957411|gb|C...
## [37] 396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEKYQKQ* gi|506957411|gb|C...
## [38] 396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEKYQKQ* gi|506957411|gb|C...
## [39] 396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEKYQKQ* gi|506957411|gb|C...
## [40] 396 MQTLEKINPNAIQIIKEKLK...ETNILNLNVLDIVEKYQKQ* gi|506957411|gb|C...
mycoPan
## An object of class pgFullLoc
##
## The pangenome consists of 12247 genes from 9 organisms
## 3414 gene groups defined
##
## Core|
## Accessory|==========================================:
## Singleton|=======:
##
## Genes are translated
collapseParalogues(mycoPan, combineInfo='largest')
## An object of class pgFullLoc
##
## The pangenome consists of 12247 genes from 9 organisms
## 3235 gene groups defined
##
## Core|
## Accessory|============================================
## Singleton|======
##
## Genes are translated
Usually genes are detected automatically by programs suchs as Prodigal and Glimmer, but while these programs are generally good, they are not perfect. A pangenome analysis can potentially reveal genes that, while annotated, are no longer functioning due to frameshifts etc. These will be apparant as members of gene groups that have vastly different length than the majority of the members. Removing such genes will in general improve whatever downstream analysis that the data will be used for. Using the removeGene()
function it is possible to remove single or sets of genes from your pangenome object.
# Remove a gene by raw index
removeGene(mycoPan, ind=60)
## An object of class pgFullLoc
##
## The pangenome consists of 12246 genes from 9 organisms
## 3414 gene groups defined
##
## Core|
## Accessory|==========================================:
## Singleton|=======:
##
## Genes are translated
# Remove the first organism by index
removeGene(mycoPan, organism=1)
## An object of class pgFullLoc
##
## The pangenome consists of 10911 genes from 8 organisms
## 3325 gene groups defined
##
## Core|
## Accessory|============================================
## Singleton|======
##
## Genes are translated
# or by name
name <- orgNames(mycoPan)[1]
removeGene(mycoPan, organism=name)
## An object of class pgFullLoc
##
## The pangenome consists of 10911 genes from 8 organisms
## 3325 gene groups defined
##
## Core|
## Accessory|============================================
## Singleton|======
##
## Genes are translated
# Remove the second member of the first gene group
removeGene(mycoPan, group=1, ind=2)
## An object of class pgFullLoc
##
## The pangenome consists of 12246 genes from 9 organisms
## 3414 gene groups defined
##
## Core|
## Accessory|==========================================:
## Singleton|=======:
##
## Genes are translated
Having a nice structure around your pangenome data is just a part of a great framework - having a set of analyses at hand for investigating the results is another part. Luckily, by being part of Bioconductor and using the standard data representations, a lot of the functionality already provided in Bioconductor is at your disposal. The two main data types that you might want to extract in order to do further analysis is the pangenome matrix and raw sequences:
# Get the pangenome matrix as an ExpressionSet object
as(mycoPan, 'ExpressionSet')
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 3414 features, 9 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: AE017243 AE017244 ... CP003913 (9 total)
## varLabels: nGenes Id ... GenBankDivision (14 total)
## varMetadata: labelDescription
## featureData
## featureNames: OG1 OG2 ... OG3414 (3414 total)
## fvarLabels: description group ... nGenes (7 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
# or as a regular matrix
as(mycoPan, 'matrix')[1:6, ]
## AE017243 AE017244 AE017332 AP012303 CP002077 CP002274 CP003131
## OG1 2 2 2 0 0 2 2
## OG2 1 1 1 0 0 2 2
## OG3 2 3 2 0 0 1 1
## OG4 2 3 2 0 0 1 1
## OG5 2 3 2 0 0 0 1
## OG6 2 3 1 0 0 0 1
## CP003802 CP003913
## OG1 2 0
## OG2 2 0
## OG3 0 0
## OG4 0 0
## OG5 0 0
## OG6 0 0
# Get all genes split into gene groups
genes(mycoPan, split='group')
## AAStringSetList of length 3414
## [["OG1"]] gi|71851486|gb|AE017243.1|_197 # 135429 # 135674 # -1 # ID=1_1...
## [["OG2"]] gi|71851486|gb|AE017243.1|_332 # 219147 # 219662 # 1 # ID=1_33...
## [["OG3"]] gi|71851486|gb|AE017243.1|_1081 # 726907 # 727254 # -1 # ID=1_...
## [["OG4"]] gi|71851486|gb|AE017243.1|_1078 # 725485 # 725850 # -1 # ID=1_...
## [["OG5"]] gi|71851486|gb|AE017243.1|_1080 # 726349 # 726723 # -1 # ID=1_...
## [["OG6"]] gi|71851486|gb|AE017243.1|_1083 # 727841 # 727978 # -1 # ID=1_...
## [["OG7"]] gi|71851486|gb|AE017243.1|_636 # 428809 # 428922 # -1 # ID=1_6...
## [["OG8"]] gi|71851486|gb|AE017243.1|_87 # 57229 # 58629 # 1 # ID=1_87;pa...
## [["OG9"]] gi|71851486|gb|AE017243.1|_1004 # 676055 # 677263 # -1 # ID=1_...
## [["OG10"]] gi|71851486|gb|AE017243.1|_384 # 253940 # 254611 # -1 # ID=1_...
## ...
## <3404 more elements>
Apart from what is already available within Bioconductor, FindMyFriends also comes with a set of tools to investigate the results further. Two of these calculates different statistics on the two main gene groupings in the dataset, namely gene groups and organisms. These functions are called groupStat()
and orgStat()
and are very straightforward to use:
groupStat(mycoPan)[[1]]
## $maxOrg
## [1] 2
##
## $minLength
## [1] 82
##
## $maxLength
## [1] 82
##
## $sdLength
## [1] 0
##
## $genes
## [1] 197 331 1539 1679 3026 3160 6997 7137 8363 8501 9732 9903
##
## $backward
## [1] "OG3139" "OG2918" "OG2944" "OG2716" "OG2" "OG765" "OG2"
## [8] "OG765" "OG2" "OG765" "OG2" "OG2716"
##
## $forward
## [1] "OG765" "OG2" "OG765" "OG2" "OG2949" "OG2764" "OG2642"
## [8] "OG2" "OG2642" "OG2" "OG1163" "OG2"
head(orgStat(mycoPan))
## nGenes minLength maxLength sdLength nGeneGroups nParalogues
## AE017243 1336 30 1824 121.1718 1330 14
## AE017244 1351 30 1808 121.5220 1338 14
## AE017332 1316 30 1825 123.5957 1311 11
## AP012303 1407 30 1141 121.8308 1404 18
## CP002077 1392 30 1141 121.6465 1389 18
## CP002274 1367 30 1815 123.5032 1364 7
To get a very broad overview of your result plotStat()
can help you:
plotStat(mycoPan, color='Species', type='qual', palette=6)
Three other plot functions exists to let you asses general properties of the pangenome. plotEvolution()
lets you follow the number of singleton, accessory and core genes as the number of organisms increases. Generally this type of plot is very biased towards the order of organisms, so while it is possible to supply a progression of organisms, the default approach is to create bootstraped values for the plot:
plotEvolution(mycoPan)
plotSimilarity()
creates a heatplot of the similarity matrix of the organisms in the pangenome. The similarity of two organisms is here defined as either the percent of shared genes or the cosine similarity of their total kmer feature vector. The ordering can be done manually or automatically according to a hierachcal clustering:
# Pangenome matrix similarity
plotSimilarity(mycoPan)
# Kmer similarity
plotSimilarity(mycoPan, type='kmer', kmerSize=5)
# No ordering
plotSimilarity(mycoPan, ordering='none')
plotTree()
Plots a dendrogram of the organisms in the pangenome based either on the pangenome matrix or kmer counts. The tree can be augmented with additional information from either orgInfo or supplied manually:
plotTree(mycoPan, clust='ward.D2', dist='minkowski')
plotTree(mycoPan, type='kmer', kmerSize=5, clust='ward.D2',
dist='cosine', circular=TRUE, info='Species') +
ggplot2::scale_color_brewer(type='qual', palette=6)
Apart from these general statistics it is possible to get the neighborhood of any given gene group as a weighted graph to further investigate how the up- and downstream genes are organised.
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
getNeighborhood(mycoPan, group=15, vicinity=5)
## IGRAPH DNW- 15 16 --
## + attr: name (v/c), centerGroup (v/l), weight (e/n)
## + edges (vertex names):
## [1] OG1141->OG1187 OG1141->OG424 OG1187->OG424 OG15 ->OG488
## [5] OG198 ->OG1141 OG201 ->OG198 OG247 ->OG938 OG2856->OG2865
## [9] OG2865->OG1187 OG3153->OG391 OG391 ->OG247 OG424 ->OG547
## [13] OG488 ->OG633 OG547 ->OG15 OG633 ->OG3153 OG633 ->OG391
The resulting graph object can be plotted and handled by the functionality of the igraph package or plotted directly using FindMyFriends plotNeighborhood()
method, which applies appropriate styling to the plot, making it easier to interpret:
plotNeighborhood(mycoPan, group=15, vicinity=5)
While pangenomes are often envisioned as presence/absence matrices, this is not the only way to organise the information. As chromosomal position is available in most circumstances, the information can also be converted into a panchromosomal graph where each gene group is a vertex and chromosomal adjacency is weighted edges. The graph will generally be very sparse, consisting of long strings of gene groups interspersed with regions of higher variability.
pcGraph(mycoPan)
## IGRAPH UNW- 3414 3986 --
## + attr: name (v/c), description (v/l), group (v/c), paralogue
## | (v/n), GO (v/l), EC (v/l), nOrg (v/n), nGenes (v/n), organisms
## | (v/x), upstream (v/x), downstream (v/x), weight (e/n), organisms
## | (e/x)
## + edges (vertex names):
## [1] OG1000--OG2796 OG1001--OG3126 OG1002--OG2642 OG1002--OG2716
## [5] OG1002--OG2940 OG1002--OG2949 OG1002--OG3212 OG1004--OG1133
## [9] OG1004--OG2781 OG1005--OG1094 OG1005--OG2773 OG100 --OG797
## [13] OG1010--OG2617 OG1010--OG3260 OG1011--OG3234 OG1012--OG1740
## [17] OG1012--OG1946 OG1013--OG1192 OG1013--OG1253 OG1013--OG2785
## + ... omitted several edges
This graph structure can be used for a range of things and actually is the basis for the last part of the neighborhood splitting step (merging parallel gene groups). Often local regions of high variability can point to insertion/deletion events, frameshifts, problems with the gene grouping (god forbid) or general regions of high chromosomal plasticity. These regions can of course be detected in FindMyFriends and investigated accordingly.
localVar <- variableRegions(mycoPan, flankSize=6)
localVar[[1]]
## $type
## [1] "plastic"
##
## $members
## [1] "OG1025" "OG1190" "OG266" "OG3254" "OG343"
##
## $flank
## [1] "OG266" "OG343"
##
## $connectsTo
## $connectsTo[[1]]
## [1] "OG373"
##
## $connectsTo[[2]]
## [1] "OG414"
##
##
## $graph
## IGRAPH UN-- 5 6 --
## + attr: name (v/c), flank (v/l)
## + edges (vertex names):
## [1] OG343 --OG1190 OG343 --OG1025 OG1190--OG1025 OG1190--OG3254
## [5] OG1025--OG266 OG266 --OG3254
plot(localVar[[1]]$graph)
What is explained above is just a small part of what is possible with FindMyFriends. There are tie-ins with other packages such as PanVizGenerator, and support for functional annotation of gene groups, to name a few. Go explore!
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] igraph_1.0.1 reutils_0.2.2 FindMyFriends_1.2.2
## [4] BiocStyle_2.0.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.5 RColorBrewer_1.1-2 formatR_1.4
## [4] plyr_1.8.3 XVector_0.12.0 bitops_1.0-6
## [7] class_7.3-14 tools_3.3.0 zlibbioc_1.18.0
## [10] digest_0.6.9 evaluate_0.9 gtable_0.2.0
## [13] lattice_0.20-33 Matrix_1.2-6 filehash_2.3
## [16] DBI_0.4-1 yaml_2.1.13 parallel_3.3.0
## [19] ggdendro_0.1-20 e1071_1.6-7 apcluster_1.4.3
## [22] LiblineaR_1.94-2 dplyr_0.4.3 stringr_1.0.0
## [25] knitr_1.13 Biostrings_2.40.0 S4Vectors_0.10.0
## [28] kebabs_1.6.2 IRanges_2.6.0 stats4_3.3.0
## [31] grid_3.3.0 Biobase_2.32.0 R6_2.1.2
## [34] XML_3.98-1.4 BiocParallel_1.6.2 rmarkdown_0.9.6
## [37] reshape2_1.4.1 kernlab_0.9-24 ggplot2_2.1.0
## [40] magrittr_1.5 scales_0.4.0 htmltools_0.3.5
## [43] BiocGenerics_0.18.0 MASS_7.3-45 assertthat_0.1
## [46] colorspace_1.2-6 labeling_0.3 stringi_1.0-1
## [49] lazyeval_0.1.10 RCurl_1.95-4.8 munsell_0.4.3