Analysis of Compositions of Microbiomes with Bias Correction (ANCOM-BC) (Lin and Peddada 2020) is a methodology of differential abundance (DA) analysis for microbial absolute abundance data. ANCOM-BC estimates the unknown sampling fractions, corrects the bias induced by their differences among samples, and identifies taxa that are differentially abundant according to the covariate of interest.
For more details, please refer to the ANCOM-BC paper.
Download package.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ANCOMBC")
Load the package.
library(ANCOMBC)
The HITChip Atlas data set (Lahti et al. 2014) is available via the microbiome R package (Lahti et al. 2017) in phyloseq (McMurdie and Holmes 2013) format, and via Data Dryad in tabular format. This data set comes with 130 genus-like taxonomic groups across 1006 western adults with no reported health complications. Some subjects have also short time series.
data(atlas1006)
# Subset to baseline
pseq = subset_samples(atlas1006, time == 0)
# Re-code the bmi group
sample_data(pseq)$bmi_group = recode(sample_data(pseq)$bmi_group,
underweight = "lean",
lean = "lean",
overweight = "overweight",
obese = "obese",
severeobese = "obese",
morbidobese = "obese")
# Re-code the nationality group
sample_data(pseq)$nation = recode(sample_data(pseq)$nationality,
Scandinavia = "NE",
UKIE = "NE",
SouthEurope = "SE",
CentralEurope = "CE",
EasternEurope = "EE")
# Aggregate to phylum level
phylum_data = aggregate_taxa(pseq, "Phylum")
options(qwraps2_markup = "markdown")
summary_template =
list("Age" =
list("min" = ~ min(age, na.rm = TRUE),
"max" = ~ max(age, na.rm = TRUE),
"mean (sd)" = ~ mean_sd(age, na_rm = TRUE,
show_n = "never")),
"Gender" =
list("F" = ~ n_perc0(sex == "female", na_rm = TRUE),
"M" = ~ n_perc0(sex == "male", na_rm = TRUE),
"NA" = ~ n_perc0(is.na(sex))),
"Nationality" =
list("Central Europe" = ~ n_perc0(nation == "CE", na_rm = TRUE),
"Eastern Europe" = ~ n_perc0(nation == "EE", na_rm = TRUE),
"Northern Europe" = ~ n_perc0(nation == "NE", na_rm = TRUE),
"Southern Europe" = ~ n_perc0(nation == "SE", na_rm = TRUE),
"US" = ~ n_perc0(nation == "US", na_rm = TRUE),
"NA" = ~ n_perc0(is.na(nation))),
"BMI" =
list("Lean" = ~ n_perc0(bmi_group == "lean", na_rm = TRUE),
"Overweight" = ~ n_perc0(bmi_group == "overweight",
na_rm = TRUE),
"Obese" = ~ n_perc0(bmi_group == "obese", na_rm = TRUE),
"NA" = ~ n_perc0(is.na(bmi_group)))
)
data_summary = summary_table(meta(pseq), summary_template)
data_summary
meta(pseq) (N = 1,006) | |
---|---|
Age | |
min | 18 |
max | 77 |
mean (sd) | 44.71 ± 14.44 |
Gender | |
F | 553 (57) |
M | 416 (43) |
NA | 37 (4) |
Nationality | |
Central Europe | 617 (63) |
Eastern Europe | 15 (2) |
Northern Europe | 212 (22) |
Southern Europe | 86 (9) |
US | 44 (5) |
NA | 32 (3) |
BMI | |
Lean | 461 (51) |
Overweight | 154 (17) |
Obese | 285 (32) |
NA | 106 (11) |
The number of samples: 1006.
The number of phyla: 8.
out = ancombc(phyloseq = phylum_data, formula = "age + nation + bmi_group",
p_adj_method = "holm", zero_cut = 0.90, lib_cut = 1000,
group = "nation", struc_zero = TRUE, neg_lb = TRUE, tol = 1e-5,
max_iter = 100, conserve = TRUE, alpha = 0.05, global = TRUE)
res = out$res
res_global = out$res_global
Result from the ANCOM-BC log-linear model to determine taxa that are differentially abundant according to the covariate of interest. It contains: 1) coefficients; 2) standard errors; 3) test statistics; 4) p-values; 5) adjusted p-values; 6) indicators whether the taxon is differentially abundant (TRUE) or not (FALSE).
tab_coef = res$beta
col_name = c("Age", "EE - CE", "NE - CE", "SE - CE", "US - CE",
"Oerweight - Lean", "Obese - Lean")
colnames(tab_coef) = col_name
tab_coef %>% datatable(caption = "Coefficients from the Primary Result") %>%
formatRound(col_name, digits = 2)
tab_se = res$se
colnames(tab_se) = col_name
tab_se %>% datatable(caption = "SEs from the Primary Result") %>%
formatRound(col_name, digits = 2)
tab_w = res$W
colnames(tab_w) = col_name
tab_w %>% datatable(caption = "Test Statistics from the Primary Result") %>%
formatRound(col_name, digits = 2)
tab_p = res$p_val
colnames(tab_p) = col_name
tab_p %>% datatable(caption = "P-values from the Primary Result") %>%
formatRound(col_name, digits = 2)
tab_q = res$q
colnames(tab_q) = col_name
tab_q %>% datatable(caption = "Adjusted p-values from the Primary Result") %>%
formatRound(col_name, digits = 2)
tab_diff = res$diff_abn
colnames(tab_diff) = col_name
tab_diff %>%
datatable(caption = "Differentially Abundant Taxa
from the Primary Result")
Step 1: estimate sample-specific sampling fractions (log scale). Note that for each sample, if it contains missing values for any variable specified in the formula, the corresponding sampling fraction estimate for this sample will return NA since the sampling fraction is not estimable with the presence of missing values.
Step 2: adjust the log observed abundances by subtracting the estimated sampling fraction from log observed abundances of each sample.
samp_frac = out$samp_frac
# Replace NA with 0
samp_frac[is.na(samp_frac)] = 0
# Add pesudo-count (1) to avoid taking the log of 0
log_obs_abn = log(abundances(phylum_data) + 1)
# Adjust the log observed abundances
log_obs_abn_adj = t(t(log_obs_abn) - samp_frac)
# Show the first 6 samples
round(log_obs_abn_adj[, 1:6], 2) %>%
datatable(caption = "Bias-adjusted log observed abundances")
“Age” is a continuous variable.
df_fig1 = data.frame(res$beta * res$diff_abn, check.names = FALSE) %>%
rownames_to_column("taxon_id")
df_fig2 = data.frame(res$se * res$diff_abn, check.names = FALSE) %>%
rownames_to_column("taxon_id")
colnames(df_fig2)[-1] = paste0(colnames(df_fig2)[-1], "SD")
df_fig = df_fig1 %>% left_join(df_fig2, by = "taxon_id") %>%
transmute(taxon_id, age, ageSD) %>%
filter(age != 0) %>% arrange(desc(age)) %>%
mutate(group = ifelse(age > 0, "g1", "g2"))
df_fig$taxon_id = factor(df_fig$taxon_id, levels = df_fig$taxon_id)
p = ggplot(data = df_fig,
aes(x = taxon_id, y = age, fill = group, color = group)) +
geom_bar(stat = "identity", width = 0.7,
position = position_dodge(width = 0.4)) +
geom_errorbar(aes(ymin = age - ageSD, ymax = age + ageSD), width = 0.2,
position = position_dodge(0.05), color = "black") +
labs(x = NULL, y = "Log fold change",
title = "Waterfall Plot for the Age Effect") +
theme_bw() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5),
panel.grid.minor.y = element_blank(),
axis.text.x = element_text(angle = 60, hjust = 1))
p
Result from the ANCOM-BC global test to determine taxa that are differentially abundant between three or more groups of multiple samples. It contains: 1) test statistics; 2) p-values; 3) adjusted p-values; 4) indicators whether the taxon is differentially abundant (TRUE) or not (FALSE).
tab_w = res_global[, "W", drop = FALSE]
colnames(tab_w) = "Nationality"
tab_w %>% datatable(caption = "Test Statistics
from the Global Test Result") %>%
formatRound(c("Nationality"), digits = 2)
tab_p = res_global[, "p_val", drop = FALSE]
colnames(tab_p) = "Nationality"
tab_p %>% datatable(caption = "P-values
from the Global Test Result") %>%
formatRound(c("Nationality"), digits = 2)
tab_q = res_global[, "q_val", drop = FALSE]
colnames(tab_q) = "Nationality"
tab_q %>% datatable(caption = "Adjusted p-values
from the Global Test Result") %>%
formatRound(c("Nationality"), digits = 2)
tab_diff = res_global[, "diff_abn", drop = FALSE]
colnames(tab_diff) = "Nationality"
tab_diff %>% datatable(caption = "Differentially Abundant Taxa
from the Global Test Result")
sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] DT_0.19 ANCOMBC_1.4.0 qwraps2_0.5.2 magrittr_2.0.1
[5] microbiome_1.16.0 phyloseq_1.38.0 forcats_0.5.1 stringr_1.4.0
[9] dplyr_1.0.7 purrr_0.3.4 readr_2.0.2 tidyr_1.1.4
[13] tibble_3.1.5 ggplot2_3.3.5 tidyverse_1.3.1
loaded via a namespace (and not attached):
[1] Rtsne_0.15 colorspace_2.0-2 ellipsis_0.3.2
[4] XVector_0.34.0 fs_1.5.0 rstudioapi_0.13
[7] farver_2.1.0 fansi_0.5.0 lubridate_1.8.0
[10] xml2_1.3.2 codetools_0.2-18 splines_4.1.1
[13] knitr_1.36 ade4_1.7-18 jsonlite_1.7.2
[16] nloptr_1.2.2.2 broom_0.7.9 cluster_2.1.2
[19] dbplyr_2.1.1 compiler_4.1.1 httr_1.4.2
[22] backports_1.2.1 assertthat_0.2.1 Matrix_1.3-4
[25] fastmap_1.1.0 cli_3.0.1 htmltools_0.5.2
[28] tools_4.1.1 igraph_1.2.7 gtable_0.3.0
[31] glue_1.4.2 GenomeInfoDbData_1.2.7 reshape2_1.4.4
[34] Rcpp_1.0.7 Biobase_2.54.0 cellranger_1.1.0
[37] jquerylib_0.1.4 vctrs_0.3.8 Biostrings_2.62.0
[40] rhdf5filters_1.6.0 multtest_2.50.0 ape_5.5
[43] nlme_3.1-153 iterators_1.0.13 crosstalk_1.1.1
[46] xfun_0.27 rbibutils_2.2.4 rvest_1.0.2
[49] lifecycle_1.0.1 zlibbioc_1.40.0 MASS_7.3-54
[52] scales_1.1.1 hms_1.1.1 parallel_4.1.1
[55] biomformat_1.22.0 rhdf5_2.38.0 yaml_2.2.1
[58] sass_0.4.0 stringi_1.7.5 highr_0.9
[61] S4Vectors_0.32.0 foreach_1.5.1 permute_0.9-5
[64] BiocGenerics_0.40.0 GenomeInfoDb_1.30.0 Rdpack_2.1.2
[67] rlang_0.4.12 pkgconfig_2.0.3 bitops_1.0-7
[70] evaluate_0.14 lattice_0.20-45 Rhdf5lib_1.16.0
[73] htmlwidgets_1.5.4 labeling_0.4.2 tidyselect_1.1.1
[76] plyr_1.8.6 R6_2.5.1 IRanges_2.28.0
[79] generics_0.1.1 DBI_1.1.1 pillar_1.6.4
[82] haven_2.4.3 withr_2.4.2 mgcv_1.8-38
[85] survival_3.2-13 RCurl_1.98-1.5 modelr_0.1.8
[88] crayon_1.4.1 utf8_1.2.2 tzdb_0.1.2
[91] rmarkdown_2.11 grid_4.1.1 readxl_1.3.1
[94] data.table_1.14.2 vegan_2.5-7 reprex_2.0.1
[97] digest_0.6.28 stats4_4.1.1 munsell_0.5.0
[100] bslib_0.3.1
Lahti, Leo, Jarkko Salojärvi, Anne Salonen, Marten Scheffer, and Willem M De Vos. 2014. “Tipping Elements in the Human Intestinal Ecosystem.” Nature Communications 5 (1): 1–10.
Lahti, Leo, Sudarshan Shetty, T Blake, J Salojarvi, and others. 2017. “Tools for Microbiome Analysis in R.” Version 1: 10013.
Lin, Huang, and Shyamal Das Peddada. 2020. “Analysis of Compositions of Microbiomes with Bias Correction.” Nature Communications 11 (1): 1–11.
McMurdie, Paul J, and Susan Holmes. 2013. “Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.” PloS One 8 (4): e61217.