rSWeeP: Alignment-free method for vectorising biological sequences

Camila Pereira Perico1, Danrley Rafael Fernandes2, Julia Formighieri Varaschin1, Camilla Reginatto De Pierri1,3, Mariane Gonçalves Kulik1, Ricardo Assunção Vialle4 and Roberto Tadeu Raittz1,3

1Federal University of Paraná, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.
2Federal University of Paraná, Graduate in Biological Sciences, Curitiba, Paraná, Brazil.
3Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil.
4Rush Alzheimer's Disease Center, Chicago, IL, US.

29 October 2024

Abstract

The rSWeeP package is an R implementation of the SWeeP method. The main function of this package is to provide a vector representation of biological sequences (nucleotides or amino acids), and thus favor alignment-free phylogenetic studies. Each sequence provided is represented by a compact numerical vector which is easier to analyze. The method is based on k-mers counting and random projection of high-dimensional vectors into low-dimensional vectors. In addition, the package allows general dimensionality reduction of RNAseq data and generic matrices.

Package

rSWeeP 1.18.0

1 Overview
- 1.1 Functions
2 Quick Start
3 Session information
4 References
Appendix

1 Overview

The “Spaced Words Projection (SWeeP)” is a method for representing biological sequences in compact vectors. The sequences are scanned by a reading mask that corresponds to a given k-mer. This reading generates a high-dimensional vector (HDV) which, by random projection, is reduced to a low-dimensional vector (LDV), preserving the relative distances between the high-dimensional vectors (preserving the information).

The SWeeP function is the original implementation of the method as in Pierri et al. (2019), which uses a random orthonormal basis generated by the orthBase function. The SWeePlite function is optimized for processing with larger masks (larger k-mer) and for larger volumes of data (long sequences or large numbers of sequences).

These functions are suitable for making high quality comparisons between sequences allowing analyzes that are not possible due to the computational limitation of the traditional techniques. The MATLAB version of the method is available at sWeeP (PIERRI, 2019). This tool has it’s main speed gain in constanci processing time. The response time grows linear to the number of inputs, while in other methods it grow is exponencial.

Tutorials and more information are available at https://aibialab.github.io/rSWeeP.

1.1 Functions

The package has six functions:

SWeeP: conventional SWeeP function, in which it is necessary to generate or load an orthonormal basis from a file.
orthBase: function to generate the orthonormal matrix to be used by SWeeP in the projection.
SWeePlite: SWeeP function with built-in orthonormal basis. Slower function but low RAM usage.
extractHDV: extracts only the High Dimensional Vector (HDV) from biological sequences.
PCCI: PhyloTaxonomic Consistency Cophenetic Index, estimates the grouping rate of the taxa in the phylogenetic tree.
PMPG: percentage of Monophyletic and Paraphyletic taxa in the phylogenetic tree.

The SWeeP and SWeePlite functions are responsible for vectorizing the sequences using the referenced SWeeP method (PIERRI, 2019).

2 Quick Start

The page https://aibialab.github.io/rSWeeP provides basic tutorials on using and parameterizing the rSWeeP package. Below you can find a quickstart of the package.

Consider a set of 13 mitochondrial proteomes (translated CDSs) deposited in the folder at the address path in FASTA format. These sequences can be vectorized as below:

library(rSWeeP)

## Loading required package: foreach

## Loading required package: doParallel

## Loading required package: iterators

## Loading required package: parallel

## Loading required package: Biostrings

## Loading required package: BiocGenerics

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff, table,
##     tapply, union, unique, unsplit, which.max, which.min

## Loading required package: S4Vectors

## Loading required package: stats4

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:utils':
## 
##     findMatches

## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname

## Loading required package: IRanges

## Loading required package: XVector

## Loading required package: GenomeInfoDb

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:base':
## 
##     strsplit

path = paste (system.file("examples/aaMitochondrial/",package = "rSWeeP"),'/', sep = '')
sw = SWeePlite(path,seqtype='AA',mask=c(4),psz=1000)

## Starting projection. Please wait.
## starting sequence  1 of 14 - complete
## starting sequence  2 of 14 - complete
## starting sequence  3 of 14 - complete
## starting sequence  4 of 14 - complete
## starting sequence  5 of 14 - complete
## starting sequence  6 of 14 - complete
## starting sequence  7 of 14 - complete
## starting sequence  8 of 14 - complete
## starting sequence  9 of 14 - complete
## starting sequence  10 of 14 - complete
## starting sequence  11 of 14 - complete
## starting sequence  12 of 14 - complete
## starting sequence  13 of 14 - complete
## starting sequence  14 of 14 - complete

In sw$proj you will find the SWeeP vectors in matrix format with 13 rows (each row a sample) and 1000 columns (coordinates). In sw$info other processing information is stored, which may be important in subsequent steps.

sw$info

## $headers
##  [1] "01_Pan_troglodytes"               "02_Capra_aegagrus"               
##  [3] "03_Homo_sapiens"                  "04_Bos_taurus"                   
##  [5] "05_Ara_ararauna"                  "06_Mus_musculus"                 
##  [7] "07_Brotogeris_cyanoptera"         "08_Homo_sapiens_neanderthalensis"
##  [9] "09_Gazella_gazella"               "10_Rattus_norvegicus"            
## [11] "11_Pan_paniscus"                  "12_Psittacara_rubritorquis"      
## [13] "13_Apodemus_sylvaticus"           "14_Rhazya_stricta"               
## 
## $ProjectionSize
## [1] 1000
## 
## $bin
## [1] "counting (FALSE)"
## 
## $mask
## [1] 1 1 1 1
## 
## $SequenceType
## [1] "AA"
## 
## $extension
## [1] ""
## 
## $version
## [1] '1.18.0'
## 
## $norm
## [1] "none"
## 
## $timeElapsed
## [1] 7.484

$ProjectionSize - length of the output vector (length of the projection)
$mask - mask used (provided in extended binary format)
$SequenceType - data type (amino acid or nucleotide)
$concatenate - if the files were concatenated into one
$bin - whether binary (TRUE) or count (FALSE)
$version - version of rSWeeP used
$norm - type of normalization used (‘none’, ‘log’ or ‘logNeg’)
$timeElapsed - time spent in processing
$headers - name of the vectorized files/samples

We can obtain the phylogenetic relationship between the vectorized organisms using the Neighbour Joining (NJ) method.

library(ape)

## 
## Attaching package: 'ape'

## The following object is masked from 'package:Biostrings':
## 
##     complement

# get the distance matrix
mdist = dist(sw$proj,method='euclidean')

# use the NJ algorithm to build the tree
tr = nj(mdist)
# root the tree in the plant sample
tr = root(tr,outgroup='14_Rhazya_stricta')

# plot
plot(tr)

To visualize the vectorized data graphically and evaluate the phylogenetic tree, we provide the metadata with the classes at the family taxonomic level.

pathmetadata <- system.file(package = "rSWeeP" , "examples" , "metadata_mitochondrial.csv")
mt = read.csv(pathmetadata,header=TRUE)

We evaluated the phylogenetic tree in terms of taxon groupings with the PCCI metric, and checked the percentage of consistent taxa, mono or paraphyletic with the PMPG metric.

data = data.frame(sp=mt$fileName,family=mt$family) 

PCCI(tr,data) # PhyloTaxonomic Consistency Cophenetic Index

## $tab
##          taxa      cost
## 1   Hominidae 1.0000000
## 2     Bovidae 1.0000000
## 3 Psittacidae 0.8888889
## 4     Muridae 1.0000000
## 5 Apocynaceae 1.0000000
## 
## $mean
## [1] 0.9777778

PMPG(tr,data) # Percentage of Mono or Paraphyletic Groups

## $tab
##          taxa mono  para
## 1   Hominidae TRUE FALSE
## 2     Bovidae TRUE FALSE
## 3 Psittacidae TRUE FALSE
## 4     Muridae TRUE FALSE
## 5 Apocynaceae TRUE FALSE
## 
## $percMono
## [1] 1
## 
## $percPara
## [1] 0
## 
## $metric
## [1] 1

We obtain the PCA and visualize the first components.

pca_output <- prcomp (sw$proj , scale = FALSE)
par(mfrow=c(1,2))
plot(pca_output$x[,1],pca_output$x[,2],xlab = 'PC-1' , ylab = 'PC-2' , pch =20 , col = mt$id)
legend("bottomright",unique(mt$family),col=as.character(c(1:length(unique(mt$family)))),pch=20)
plot(pca_output$x[,3],pca_output$x[,4],xlab = 'PC-3' , ylab = 'PC-4' , pch =20 , col = mt$id)

3 Session information

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ape_5.8             rSWeeP_1.18.0       Biostrings_2.74.0  
##  [4] GenomeInfoDb_1.42.0 XVector_0.46.0      IRanges_2.40.0     
##  [7] S4Vectors_0.44.0    BiocGenerics_0.52.0 doParallel_1.0.17  
## [10] iterators_1.0.14    foreach_1.5.2       BiocStyle_2.34.0   
## 
## loaded via a namespace (and not attached):
##  [1] jsonlite_1.8.9          highr_0.11              compiler_4.4.1         
##  [4] BiocManager_1.30.25     crayon_1.5.3            tinytex_0.53           
##  [7] Rcpp_1.0.13             magick_2.8.5            jquerylib_0.1.4        
## [10] yaml_2.3.10             fastmap_1.2.0           lattice_0.22-6         
## [13] R6_2.5.1                knitr_1.48              bookdown_0.41          
## [16] GenomeInfoDbData_1.2.13 bslib_0.8.0             rlang_1.1.4            
## [19] cachem_1.1.0            stringi_1.8.4           xfun_0.48              
## [22] sass_0.4.9              cli_3.6.3               magrittr_2.0.3         
## [25] zlibbioc_1.52.0         grid_4.4.1              digest_0.6.37          
## [28] nlme_3.1-166            lifecycle_1.0.4         evaluate_1.0.1         
## [31] codetools_0.2-20        rmarkdown_2.28          httr_1.4.7             
## [34] tools_4.4.1             htmltools_0.5.8.1       UCSC.utils_1.2.0

4 References

Appendix

Pierri,C. R. et al. SWeeP: Representing large biological sequences data sets in compact vectors. Scientific Reports, accepted in December 2019.doi: 10.1038/s41598-019-55627-4.