1 Abstract

Wrench is a normalization technique for metagenomic count data. Although it was developed principally for sparse 16S count data from metagenomic experiments, it can also be used to normalize count data from other sparse technologies, such as single-cell RNA-seq and functional microbiome data.

Given (a) count data organized as features (OTUs, genes, etc.) x samples, and (b) experimental group labels associated with the samples, Wrench outputs a normalization factor for each sample. The data are normalized by dividing each sample’s counts by its normalization factor.
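As a minimal sketch of that final division step, the base R snippet below scales each column of a small made-up count matrix by a per-sample factor. The counts and the factors in `nf` are hypothetical values chosen for illustration; they are not Wrench output.

```r
# Toy count matrix: 3 features (rows) x 2 samples (columns). Values are made up.
counts <- matrix(c(10, 0, 30,
                   40, 8, 12),
                 nrow = 3,
                 dimnames = list(c("otu1", "otu2", "otu3"), c("s1", "s2")))

# Hypothetical normalization factors, one per sample (not real Wrench output).
nf <- c(s1 = 0.5, s2 = 2)

# Divide every column (sample) by its own normalization factor.
normalized <- sweep(counts, MARGIN = 2, STATS = nf, FUN = "/")
normalized
```

`sweep()` with `MARGIN = 2` applies the division column by column, so the order of `nf` has to match the column order of `counts`.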

The manuscript can be accessed here: https://www.biorxiv.org/content/early/2018/01/31/142851

2 Introduction

An unwanted side effect of DNA sequencing is that the observed counts retain only relative abundance/expression information. Comparing such relative abundances between experimental conditions/groups (e.g., with differential abundance analysis) can be misleading. Specifically, when some features are differentially abundant in absolute terms, truly unperturbed features can be identified as differentially abundant. Commonly used techniques such as rarefaction/subsampling, dividing by the total count, and variants of these approaches do not correct for this issue. Wrench was developed to address this problem of reconstructing absolute from relative abundances, based on some commonly exploited assumptions in genomics.
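The effect can be seen in a small base R sketch. The abundances below are invented for illustration: only feature `A` changes in absolute terms, yet after conversion to proportions every other feature appears to change as well.

```r
# Hypothetical absolute abundances of four features in two conditions.
# Only feature "A" truly changes (10-fold increase); B, C and D are unperturbed.
absolute <- cbind(control = c(A = 100, B = 100, C = 100, D = 100),
                  treated = c(A = 1000, B = 100, C = 100, D = 100))

# Sequencing only reports each feature as a proportion of its sample total.
relative <- sweep(absolute, 2, colSums(absolute), "/")

# Fold change (treated / control) on each scale.
fold_absolute <- absolute[, "treated"] / absolute[, "control"]
fold_relative <- relative[, "treated"] / relative[, "control"]

round(rbind(fold_absolute, fold_relative), 2)
```

On the relative scale, B, C and D all appear to drop to about 0.31 of their control level, and the increase in A is understated. Every relative fold change is off by the same factor (here 400/1300, the ratio of the sample totals), which is the kind of per-sample compositional factor a normalization method has to recover.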

The Introduction section of the manuscript (https://www.biorxiv.org/content/early/2018/01/31/142851) provides some perspective on various commonly used normalization techniques from the above standpoint, and we recommend reading through it.

3 Installation

Install the package from Bioconductor.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Wrench")

Or install the development version of the package from GitHub.

BiocManager::install("HCBravoLab/Wrench")

Load the package.

library(Wrench)

4 Running Wrench

Below, we present a quick tutorial in which we pass count data and group information to Wrench to generate compositional and normalization factors. Details on the optional parameters are available by typing “?wrench” in the R terminal window.

#extract count and group information from the mouse microbiome data in the metagenomeSeq package
library(metagenomeSeq)
data(mouseData)
mouseData
## MRexperiment (storageMode: environment)
## assayData: 10172 features, 139 samples 
##   element names: counts 
## protocolData: none
## phenoData
##   sampleNames: PM1:20080107 PM1:20080108 ... PM9:20080303 (139
##     total)
##   varLabels: mouseID date ... status (5 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: Prevotellaceae:1 Lachnospiraceae:1 ...
##     Parabacteroides:956 (10172 total)
##   fvarLabels: superkingdom phylum ... OTU (7 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
counts <- MRcounts( mouseData, norm=FALSE )  #get the counts
counts[1:10,1:2]
##                                      PM1:20080107 PM1:20080108
## Prevotellaceae:1                                0            0
## Lachnospiraceae:1                               0            0
## Unclassified-Screened:1                         0            0
## Clostridiales:1                                 0            0
## Clostridiales:2                                 0            0
## Firmicutes:1                                    0            0
## PeptostreptococcaceaeIncertaeSedis:1            0            0
## Clostridiales:3                                 0            0
## Lachnospiraceae:7                               0            0
## Lachnospiraceae:8                               0            0
group <- pData(mouseData)$diet #get the group/condition vector
head(group)
## [1] "BK" "BK" "BK" "BK" "BK" "BK"
#Running wrench with defaults
W <- wrench( counts, condition=group  )
compositionalFactors <- W$ccf
normalizationFactors <- W$nf

head( compositionalFactors ) #one factor for each sample
## PM1:20080107 PM1:20080108 PM1:20080114 PM1:20071211 PM1:20080121 
##    0.7540966    0.8186825    1.0423187    1.2690286    0.7860119 
## PM1:20071217 
##    1.3059309
head( normalizationFactors)  #one factor for each sample
## PM1:20080107 PM1:20080108 PM1:20080114 PM1:20071211 PM1:20080121 
##    0.3364660    0.7051424    1.3295084    0.8530978    0.7545386 
## PM1:20071217 
##    2.1273695

4.1 Usage with differential abundance pipelines

The code below shows how to pass the factors computed above to the most commonly used differential abundance tools.

# -- If using metagenomeSeq
normalizedObject <- mouseData  #mouseData is already a metagenomeSeq object 
normFactors(normalizedObject) <- normalizationFactors

# -- If using edgeR, we must pass in the compositional factors
edgerobj <- edgeR::DGEList( counts=counts,
                     group = as.matrix(group),
                     norm.factors=compositionalFactors )

# -- If using DESeq/DESeq2
deseq.obj <- DESeq2::DESeqDataSetFromMatrix(countData = counts,
                                   DataFrame(group),
                                   ~ group )
## converting counts to integer mode
deseq.obj
## class: DESeqDataSet 
## dim: 10172 139 
## metadata(1): version
## assays(1): counts
## rownames(10172): Prevotellaceae:1 Lachnospiraceae:1 ...
##   Bryantella:103 Parabacteroides:956
## rowData names(0):
## colnames(139): PM1:20080107 PM1:20080108 ... PM9:20080225
##   PM9:20080303
## colData names(1): group
sizeFactors(deseq.obj) <- normalizationFactors

5 Some caveats / work in development

Wrench currently implements strategies for categorical group labels only. Extension to continuous covariates is still in development; in the meantime, you can create factors/levels out of your continuous covariates by discretizing them (cutting them into intervals in whatever way you think is reasonable).

time <- as.numeric(as.character(pData(mouseData)$relativeTime))
time.levs <- cut( time, breaks = c(0, 6, 28, 42, 56, 70) )
overall_group <- paste( group, time.levs ) #merge the time information and the group information together
W <- wrench( counts, condition = overall_group )
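Before passing merged labels like these to Wrench, it may be worth checking how the discretization behaved. The base R sketch below uses made-up `group` and `time` vectors (stand-ins, not the mouse data) to show two things that can go unnoticed: `cut()` returns NA for values outside its intervals, including a value equal to the lowest break, and `paste()` turns NA into the literal string "NA".

```r
# Toy stand-ins for the diet labels and time points (values are made up).
group <- c("BK", "BK", "BK", "Western", "Western", "Western")
time  <- c(0, 10, 30, 5, 45, 60)

# cut() uses right-closed intervals by default, so a value equal to the lowest
# break (0 here) falls outside (0, 6] and becomes NA.
time.levs <- cut(time, breaks = c(0, 6, 28, 42, 56, 70))
sum(is.na(time.levs))

# include.lowest = TRUE closes the first interval on the left: [0, 6].
time.levs <- cut(time, breaks = c(0, 6, 28, 42, 56, 70), include.lowest = TRUE)

# paste() would silently turn any remaining NA into the string "NA", so check first.
stopifnot(!anyNA(time.levs))
overall_group <- paste(group, time.levs)

# Number of samples per combined group.
table(overall_group)
```

The `table()` output gives the number of samples in each combined group. Whether very small groups should be merged is a judgment call that depends on the experiment.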

6 The “detrend” option

In cases of very low sample depths and high sparsity, one might find a roughly linear trend between the reconstructed compositional factors (the “ccf” entry in the list returned by Wrench) and the sample depths (the total count of each sample) within each experimental group. This can potentially be caused by a large number of zeros pulling the average estimate of the sample-wise ratios of proportions downward. Existing approaches that exploit zeros during estimation also suffer from this issue (for instance, varying scran’s abundance filtering by changing the “min.mean” parameter will reveal the same issue, although in general we have found its pooling approach to be slightly less sensitive with the default abundance filtering).

If you find this happening with the Wrench reconstructed compositional factors, and if it is reasonable to assume that the trend is an artifact, you can use the detrend=TRUE option (a work in progress) in Wrench to remove such linear trends within groups. It is also worth mentioning that even though the compositional factors of low-depth samples can show this behavior, in our experience the group-wise averages of compositional factors are often still robust.
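One way to look for such a trend is to fit a regression of the compositional factors on the sample depths separately within each group. The sketch below is only an illustration under stated assumptions: `depth`, `group` and `ccf` are simulated stand-ins (not Wrench output), the trend is built into the simulation on purpose, and fitting on the log scale is a choice made here for convenience rather than something Wrench prescribes.

```r
set.seed(1)

# Simulated stand-ins (not real Wrench output): sample depths, group labels, and
# compositional factors that were deliberately built to rise with depth.
depth <- round(exp(runif(40, log(500), log(20000))))
group <- rep(c("A", "B"), each = 20)
ccf   <- exp(0.3 * (log(depth) - mean(log(depth))) + rnorm(40, sd = 0.1))

# Within each group, regress log(ccf) on log(depth) and keep the slope.
slopes <- sapply(split(data.frame(ccf, depth), group), function(d) {
  coef(lm(log(ccf) ~ log(depth), data = d))[["log(depth)"]]
})
slopes
```

With real data, `ccf` would be the “ccf” entry of the list returned by Wrench and `depth` the column sums of the count matrix. Slopes that are clearly different from zero within groups suggest the kind of trend described above.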

7 Session Info

sessionInfo()
## R version 3.5.1 Patched (2018-07-12 r74967)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] DESeq2_1.22.0               SummarizedExperiment_1.12.0
##  [3] DelayedArray_0.8.0          BiocParallel_1.16.0        
##  [5] matrixStats_0.54.0          GenomicRanges_1.34.0       
##  [7] GenomeInfoDb_1.18.0         IRanges_2.16.0             
##  [9] S4Vectors_0.20.0            Wrench_1.0.0               
## [11] edgeR_3.24.0                metagenomeSeq_1.24.0       
## [13] RColorBrewer_1.1-2          glmnet_2.0-16              
## [15] foreach_1.4.4               Matrix_1.2-14              
## [17] limma_3.38.0                Biobase_2.42.0             
## [19] BiocGenerics_0.28.0        
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6           bit64_0.9-7            rprojroot_1.3-2       
##  [4] tools_3.5.1            backports_1.1.2        R6_2.3.0              
##  [7] rpart_4.1-13           KernSmooth_2.23-15     Hmisc_4.1-1           
## [10] DBI_1.0.0              lazyeval_0.2.1         colorspace_1.3-2      
## [13] nnet_7.3-12            tidyselect_0.2.5       gridExtra_2.3         
## [16] bit_1.1-14             compiler_3.5.1         htmlTable_1.12        
## [19] caTools_1.17.1.1       scales_1.0.0           checkmate_1.8.5       
## [22] genefilter_1.64.0      stringr_1.3.1          digest_0.6.18         
## [25] foreign_0.8-71         rmarkdown_1.10         XVector_0.22.0        
## [28] base64enc_0.1-3        pkgconfig_2.0.2        htmltools_0.3.6       
## [31] htmlwidgets_1.3        rlang_0.3.0.1          rstudioapi_0.8        
## [34] RSQLite_2.1.1          bindr_0.1.1            gtools_3.8.1          
## [37] acepack_1.4.1          dplyr_0.7.7            RCurl_1.95-4.11       
## [40] magrittr_1.5           GenomeInfoDbData_1.2.0 Formula_1.2-3         
## [43] Rcpp_0.12.19           munsell_0.5.0          stringi_1.2.4         
## [46] yaml_2.2.0             zlibbioc_1.28.0        gplots_3.0.1          
## [49] plyr_1.8.4             grid_3.5.1             blob_1.1.1            
## [52] gdata_2.18.0           crayon_1.3.4           lattice_0.20-35       
## [55] splines_3.5.1          annotate_1.60.0        locfit_1.5-9.1        
## [58] knitr_1.20             pillar_1.3.0           geneplotter_1.60.0    
## [61] codetools_0.2-15       XML_3.98-1.16          glue_1.3.0            
## [64] evaluate_0.12          latticeExtra_0.6-28    data.table_1.11.8     
## [67] gtable_0.2.0           purrr_0.2.5            assertthat_0.2.0      
## [70] ggplot2_3.1.0          xtable_1.8-3           survival_2.43-1       
## [73] tibble_1.4.2           iterators_1.0.10       AnnotationDbi_1.44.0  
## [76] memoise_1.1.0          bindrcpp_0.2.2         cluster_2.0.7-1