RcwlPipelines is a Bioconductor package that manages a collection of commonly used bioinformatics tools and pipelines based on Rcwl. These pre-built and pre-tested tools and pipelines are highly modularized and easy to customize to meet different bioinformatics data analysis needs.
Rcwl and RcwlPipelines together form a Bioconductor toolchain for the use and development of reproducible bioinformatics pipelines in Common Workflow Language (CWL). The project also aims to develop a community-driven platform for open source, open development, and open review of best-practice CWL bioinformatics pipelines.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("RcwlPipelines")
The development version is also available to download from GitHub.
BiocManager::install("rworkflow/RcwlPipelines")
library(RcwlPipelines)
The project website https://rcwl.org/ serves as a central hub for all related resources. It provides guidance for new users and tutorials for both users and developers. Specific resources are listed below.
The R scripts that build the CWL tools and pipelines now reside in a dedicated GitHub repository, which is intended to be a community effort to collect and contribute bioinformatics tools and pipelines using Rcwl and CWL.
The tutorial book provides detailed instructions for developing Rcwl tools and pipelines, and also includes examples of some commonly used tools and pipelines that cover a wide range of bioinformatics data analysis needs.
RcwlPipelines core functions
Here we show the usage of three core functions, cwlUpdate, cwlSearch and cwlLoad, for updating, searching, and loading the needed tools or pipelines in R.
cwlUpdate
The cwlUpdate function syncs the current Rcwl recipes and returns a cwlHub object that contains the most up-to-date Rcwl recipes. The mcols() function returns all related information about each available tool or pipeline.
The recipes are cached locally, so users don’t need to call cwlUpdate every time unless they want to use a tool or pipeline that was newly added to RcwlPipelines. Here we use the recipes from the Bioconductor devel version.
## For vignette use only. Users don't need to do this step.
Sys.setenv(cachePath = tempdir())
atls <- cwlUpdate(branch = "dev") ## sync the tools/pipelines.
atls
#> cwlHub with 177 records
#> cache path: /tmp/Rtmp8Oqxd2/Rcwl
#> # last modified date: 2021-08-02
#> # cwlSearch() to query scripts
#> # cwlLoad('title') to load the script
#> # additional mcols(): rid, rpath, Type, Container, mtime, ...
#>
#> title
#> BFC1 | pl_AnnPhaseVcf
#> BFC2 | pl_BaseRecal
#> BFC3 | pl_CombineGenotypeGVCFs
#> BFC4 | pl_GAlign
#> BFC5 | pl_GPoN
#> ... ...
#> BFC173 | tl_vcf2bed
#> BFC174 | tl_vcf_expression_annotator
#> BFC175 | tl_vcf_readcount_annotator
#> BFC176 | tl_vep
#> BFC177 | tl_vt_decompose
#> Command
#> BFC1 VCFvep+dVCFcoverage+rVCFcoverage+VCFexpression+PhaseVcf
#> BFC2 BaseRecalibrator+ApplyBQSR+samtools_index+samtools_flagstat+samt...
#> BFC3 CombineGVCFs+GenotypeGVCFs
#> BFC4 fqJson+fq2ubam+ubam2bamJson+align+mvOut
#> BFC5 GenomicsDB+PoN
#> ... ...
#> BFC173 R function
#> BFC174 vcf-expression-annotator
#> BFC175 vcf-readcount-annotator
#> BFC176 vep
#> BFC177 vt decompose
table(mcols(atls)$Type)
#>
#> pipeline tool
#> 37 139
Currently, we have integrated 139 command line tools and 37 pipelines.
cwlSearch
We can use one or more keywords to search for specific tools or pipelines of interest. The search internally queries the mcols fields “rname”, “rpath”, “fpath”, “Command” and “Container”. Here we show how to search for the alignment tool bwa mem.
t1 <- cwlSearch(c("bwa", "mem"))
t1
#> cwlHub with 1 records
#> cache path: /tmp/Rtmp8Oqxd2/Rcwl
#> # last modified date: 2021-05-20
#> # cwlSearch() to query scripts
#> # cwlLoad('title') to load the script
#> # additional mcols(): rid, rpath, Type, Container, mtime, ...
#>
#> title Command
#> BFC105 | tl_bwa bwa mem
mcols(t1)
#> DataFrame with 1 row and 14 columns
#> rid rname create_time access_time
#> <character> <character> <character> <character>
#> 1 BFC105 tl_bwa 2021-10-26 22:30:30 2021-10-26 22:30:30
#> rpath rtype fpath last_modified_time
#> <character> <character> <character> <numeric>
#> 1 /tmp/Rtmp8Oqxd2/Rcwl.. local /tmp/Rtmp8Oqxd2/Rcwl.. NA
#> etag expires Type Command Container
#> <character> <numeric> <character> <character> <character>
#> 1 NA NA tool bwa mem biocontainers/bwa:v0..
#> mtime
#> <character>
#> 1 2021-05-20 12:15:10
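The keyword matching described above can be illustrated on a plain data frame. The sketch below is not the package implementation: the helper search_all_keywords and the mock metadata rows are hypothetical, and it assumes that every keyword must match at least one of the searched fields, ignoring case.

```r
# Mock metadata table; the rows are illustrative, not real hub records.
meta <- data.frame(
  rname = c("tl_bwa", "tl_bwa_index", "tl_samtools_index"),
  Command = c("bwa mem", "bwa index", "samtools index"),
  Container = c("biocontainers/bwa", "biocontainers/bwa",
                "biocontainers/samtools"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the rows in which every keyword matches
# at least one of the given fields (AND across keywords, OR across fields).
search_all_keywords <- function(meta, keywords, fields = names(meta)) {
  keep <- rep(TRUE, nrow(meta))
  for (kw in keywords) {
    hit_any_field <- Reduce(`|`, lapply(fields, function(f) {
      grepl(kw, meta[[f]], ignore.case = TRUE)
    }))
    keep <- keep & hit_any_field
  }
  meta[keep, , drop = FALSE]
}

hits <- search_all_keywords(meta, c("bwa", "mem"))
hits$rname
```

Under these assumptions, adding a second keyword narrows the result: "bwa" alone matches two mock rows, while c("bwa", "mem") leaves only tl_bwa.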
cwlLoad
The last core function, cwlLoad, loads the Rcwl tool or pipeline into the R working environment. The code below loads the tool under a user-defined name, bwa, to do the read alignment.
bwa <- cwlLoad(title(t1)[1]) ## "tl_bwa"
bwa <- cwlLoad(mcols(t1)$fpath[1]) ## equivalent to the above.
bwa
#> class: cwlProcess
#> cwlClass: CommandLineTool
#> cwlVersion: v1.0
#> baseCommand: bwa mem
#> requirements:
#> - class: DockerRequirement
#> dockerPull: biocontainers/bwa:v0.7.17-3-deb_cv1
#> inputs:
#> threads (int): -t
#> RG (string?): -R
#> Ref (File):
#> FQ1 (File):
#> FQ2 (File?):
#> outputs:
#> sam:
#> type: File
#> outputBinding:
#> glob: '*.sam'
#> stdout: bwaOutput.sam
The bwa tool is now ready to use in R.
To fit users’ specific needs, an existing tool or pipeline can be easily customized. Here we use the rnaseq_Sf pipeline to demonstrate how to access and change the arguments of a specific tool inside a pipeline. This pipeline covers RNA-seq read quality summary by fastQC, alignment by STAR, quantification by featureCounts and quality control by RSeQC.
rnaseq_Sf <- cwlLoad("pl_rnaseq_Sf")
#> fastqc loaded
#> STAR loaded
#> sortBam loaded
#> samtools_index loaded
#> samtools_flagstat loaded
#> featureCounts loaded
#> gtfToGenePred loaded
#> genePredToBed loaded
#> read_distribution loaded
#> geneBody_coverage loaded
#> STAR loaded
#> gCoverage loaded
plotCWL(rnaseq_Sf)
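Many default arguments are defined for the STAR tool inside the pipeline, and users might want to change some of them. For example, we can change the value of the --outFilterMismatchNmax argument from 2 to 5 for longer reads. The positions of arguments in the list depend on the recipe version, so confirm that the printed flag is the one you intend to change before assigning a new value.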
arguments(rnaseq_Sf, "STAR")[5:6]
#> [[1]]
#> [1] "--readFilesCommand"
#>
#> [[2]]
#> [1] "zcat"
arguments(rnaseq_Sf, "STAR")[[6]] <- 5
arguments(rnaseq_Sf, "STAR")[5:6]
#> [[1]]
#> [1] "--readFilesCommand"
#>
#> [[2]]
#> [1] "5"
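Because positional indices can shift between recipe versions, it may be safer to locate a flag by name before changing its value. The following is a minimal sketch on a mock argument list. The helper set_flag_value is hypothetical and not part of Rcwl, and it assumes the argument list stores each flag as a character string immediately followed by its value.

```r
# Hypothetical helper: replace the value that follows a named flag.
set_flag_value <- function(args, flag, value) {
  # Find the position of the flag itself; non-character elements never match.
  pos <- which(vapply(args, function(a) identical(a, flag), logical(1)))
  if (length(pos) != 1L) {
    stop("expected exactly one occurrence of ", flag,
         ", found ", length(pos))
  }
  if (pos == length(args)) {
    stop(flag, " has no value after it")
  }
  # The value is stored as a string in the element after the flag.
  args[[pos + 1L]] <- as.character(value)
  args
}

# Mock list standing in for the result of arguments(rnaseq_Sf, "STAR").
star_args <- list("--readFilesCommand", "zcat",
                  "--outFilterMismatchNmax", "2")
star_args <- set_flag_value(star_args, "--outFilterMismatchNmax", 5)
```

With a loaded pipeline, the same idea would presumably apply by passing arguments(rnaseq_Sf, "STAR") to the helper and assigning the result back with arguments(rnaseq_Sf, "STAR") <- ..., although that round trip is not shown in this vignette.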
We can also change the docker image for a specific tool (e.g., to a specific version). First, we search for all available docker images for STAR in the biocontainers repository. The source server can be quay or dockerhub.
searchContainer("STAR", repo = "biocontainers", source = "quay")
#> DataFrame with 34 rows and 6 columns
#> tool V2 name
#> <character> <character> <character>
#> 2.7.9a--h9ee0642_0 STAR quay.io/biocontainers 2.7.9a--h9ee0642_0
#> 2.6.1d--h9ee0642_1 STAR quay.io/biocontainers 2.6.1d--h9ee0642_1
#> 2.7.8a--h9ee0642_1 STAR quay.io/biocontainers 2.7.8a--h9ee0642_1
#> 2.4.0j--h9ee0642_2 STAR quay.io/biocontainers 2.4.0j--h9ee0642_2
#> 2.6.0c--h9ee0642_3 STAR quay.io/biocontainers 2.6.0c--h9ee0642_3
#> ... ... ... ...
#> 2.4.0j--0 STAR quay.io/biocontainers 2.4.0j--0
#> 2.5.4a--0 STAR quay.io/biocontainers 2.5.4a--0
#> 2.5.3a--0 STAR quay.io/biocontainers 2.5.3a--0
#> 2.5.2b--0 STAR quay.io/biocontainers 2.5.2b--0
#> 2.5.1b--0 STAR quay.io/biocontainers 2.5.1b--0
#> last_modified size container
#> <character> <character> <character>
#> 2.7.9a--h9ee0642_0 Tue, 11 May 2021 19:.. 10134089 quay.io/biocontainer..
#> 2.6.1d--h9ee0642_1 Fri, 26 Mar 2021 15:.. 11646389 quay.io/biocontainer..
#> 2.7.8a--h9ee0642_1 Fri, 26 Mar 2021 15:.. 9956698 quay.io/biocontainer..
#> 2.4.0j--h9ee0642_2 Fri, 26 Mar 2021 15:.. 7066519 quay.io/biocontainer..
#> 2.6.0c--h9ee0642_3 Fri, 26 Mar 2021 15:.. 11634304 quay.io/biocontainer..
#> ... ... ... ...
#> 2.4.0j--0 Tue, 06 Mar 2018 12:.. 4734325 quay.io/biocontainer..
#> 2.5.4a--0 Fri, 26 Jan 2018 21:.. 9225952 quay.io/biocontainer..
#> 2.5.3a--0 Sat, 18 Mar 2017 11:.. 9119736 quay.io/biocontainer..
#> 2.5.2b--0 Tue, 06 Sep 2016 07:.. 9086803 quay.io/biocontainer..
#> 2.5.1b--0 Wed, 11 May 2016 08:.. 11291827 quay.io/biocontainer..
Then, we can change the STAR version to 2.7.8a (tag name: 2.7.8a--0).
requirements(rnaseq_Sf, "STAR")[[1]]
#> $class
#> [1] "DockerRequirement"
#>
#> $dockerPull
#> [1] "quay.io/biocontainers/star:2.7.9a--h9ee0642_0"
requirements(rnaseq_Sf, "STAR")[[1]] <- requireDocker(
    docker = "quay.io/biocontainers/star:2.7.8a--0")
requirements(rnaseq_Sf, "STAR")[[1]]
#> $class
#> [1] "DockerRequirement"
#>
#> $dockerPull
#> [1] "quay.io/biocontainers/star:2.7.8a--0"
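When many tags are listed, the image string can be derived from the search result instead of being typed by hand. The sketch below works on a mock data frame with a name column like the one printed by searchContainer. The helper pick_tags is hypothetical, and the image prefix is copied from the dockerPull value shown above.

```r
# Mock tag listing; tag names are taken from the search output shown earlier.
tags <- data.frame(
  name = c("2.7.9a--h9ee0642_0", "2.6.1d--h9ee0642_1", "2.7.8a--h9ee0642_1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the tags that belong to one tool version.
# Tag names have the form "<version>--<build>".
pick_tags <- function(tags, version) {
  tags[startsWith(tags$name, paste0(version, "--")), , drop = FALSE]
}

sel <- pick_tags(tags, "2.7.8a")
if (nrow(sel) == 0L) stop("no tag found for the requested version")

# Build the full image name from the first matching tag.
image <- paste0("quay.io/biocontainers/star:", sel$name[1])
image
```

The resulting string could then be passed as the docker argument of requireDocker, as in the assignment above.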
Once the tool or pipeline is ready, we only need to assign values to each of the input parameters and then submit the job using one of the functions runCWL, runCWLBatch or cwlShiny. More detailed usage and examples can be found in the Rcwl vignette.
To run the tool or pipeline successfully, users need to either have all required command line tools pre-installed locally or use the docker/singularity runtime by specifying the docker = TRUE or docker = "singularity" argument in the runCWL or runCWLBatch function. Since the Bioconductor build machine doesn’t have all the tools installed, nor does it support the docker runtime, here we use some pseudo-code to demonstrate the tool/pipeline execution.
inputs(rnaseq_Sf)
rnaseq_Sf$in_seqfiles <- list("sample_R1.fq.gz",
                              "sample_R2.fq.gz")
rnaseq_Sf$in_prefix <- "sample"
rnaseq_Sf$in_genomeDir <- "genome_STAR_index_Dir"
rnaseq_Sf$in_GTFfile <- "GENCODE_version.gtf"
runCWL(rnaseq_Sf, outdir = "output/sample", docker = TRUE)
Users can also submit parallel jobs to an HPC cluster for multiple samples using the runCWLBatch function. Different cluster job managers, such as “multicore”, “sge” and “slurm”, are supported through BiocParallel::BatchtoolsParam.
library(BiocParallel)
## Two workers submitted through the SGE job manager.
bpparam <- BatchtoolsParam(workers = 2, cluster = "sge",
                           template = batchtoolsTemplate("sge"))
## Inputs that differ between samples, named by sample.
inputList <- list(in_seqfiles = list(sample1 = list("sample1_R1.fq.gz",
                                                    "sample1_R2.fq.gz"),
                                     sample2 = list("sample2_R1.fq.gz",
                                                    "sample2_R2.fq.gz")),
                  in_prefix = list(sample1 = "sample1",
                                   sample2 = "sample2"))
## Inputs shared by all samples.
paramList <- list(in_genomeDir = "genome_STAR_index_Dir",
                  in_GTFfile = "GENCODE_version.gtf",
                  in_runThreadN = 16)
runCWLBatch(rnaseq_Sf, outdir = "output",
            inputList, paramList,
            BPPARAM = bpparam)
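For more than a handful of samples, the per-sample input list above can be generated from a vector of sample names instead of being written out by hand. This is a minimal sketch in base R. The helper make_input_list is hypothetical, and it assumes the file naming scheme used in the example (sample name followed by _R1.fq.gz and _R2.fq.gz).

```r
# Hypothetical helper: build the per-sample inputs for runCWLBatch
# from sample names, assuming paired files "<sample>_R1.fq.gz" and
# "<sample>_R2.fq.gz".
make_input_list <- function(samples,
                            suffix = c("_R1.fq.gz", "_R2.fq.gz")) {
  # One list of two file names per sample, named by sample.
  seqfiles <- lapply(samples, function(s) as.list(paste0(s, suffix)))
  names(seqfiles) <- samples
  # The output prefix of each sample is simply its name.
  prefixes <- as.list(samples)
  names(prefixes) <- samples
  list(in_seqfiles = seqfiles, in_prefix = prefixes)
}

samples <- c("sample1", "sample2")
inputList <- make_input_list(samples)
str(inputList)
```

For the two sample names used here, the generated object has the same structure as the hand-written inputList in the example above.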
sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] RcwlPipelines_1.10.0 BiocFileCache_2.2.0 dbplyr_2.1.1
#> [4] Rcwl_1.10.0 S4Vectors_0.32.0 BiocGenerics_0.40.0
#> [7] yaml_2.2.1 BiocStyle_2.22.0
#>
#> loaded via a namespace (and not attached):
#> [1] httr_1.4.2 tidyr_1.1.4 sass_0.4.0
#> [4] bit64_4.0.5 jsonlite_1.7.2 R.utils_2.11.0
#> [7] bslib_0.3.1 shiny_1.7.1 assertthat_0.2.1
#> [10] BiocManager_1.30.16 blob_1.2.2 base64url_1.4
#> [13] progress_1.2.2 pillar_1.6.4 RSQLite_2.2.8
#> [16] backports_1.2.1 lattice_0.20-45 glue_1.4.2
#> [19] reticulate_1.22 digest_0.6.28 RColorBrewer_1.1-2
#> [22] promises_1.2.0.1 checkmate_2.0.0 htmltools_0.5.2
#> [25] httpuv_1.6.3 Matrix_1.3-4 R.oo_1.24.0
#> [28] pkgconfig_2.0.3 dir.expiry_1.2.0 bookdown_0.24
#> [31] DiagrammeR_1.0.6.1 purrr_0.3.4 xtable_1.8-4
#> [34] brew_1.0-6 later_1.3.0 BiocParallel_1.28.0
#> [37] git2r_0.28.0 tibble_3.1.5 generics_0.1.1
#> [40] ellipsis_0.3.2 cachem_1.0.6 withr_2.4.2
#> [43] magrittr_2.0.1 crayon_1.4.1 mime_0.12
#> [46] memoise_2.0.0 evaluate_0.14 R.methodsS3_1.8.1
#> [49] fansi_0.5.0 tools_4.1.1 data.table_1.14.2
#> [52] prettyunits_1.1.1 hms_1.1.1 lifecycle_1.0.1
#> [55] basilisk.utils_1.6.0 stringr_1.4.0 compiler_4.1.1
#> [58] jquerylib_0.1.4 rlang_0.4.12 debugme_1.1.0
#> [61] grid_4.1.1 rstudioapi_0.13 rappdirs_0.3.3
#> [64] htmlwidgets_1.5.4 visNetwork_2.1.0 igraph_1.2.7
#> [67] rmarkdown_2.11 basilisk_1.6.0 codetools_0.2-18
#> [70] curl_4.3.2 DBI_1.1.1 R6_2.5.1
#> [73] knitr_1.36 dplyr_1.0.7 bit_4.0.4
#> [76] fastmap_1.1.0 utf8_1.2.2 filelock_1.0.2
#> [79] stringi_1.7.5 parallel_4.1.1 Rcpp_1.0.7
#> [82] vctrs_0.3.8 png_0.1-7 batchtools_0.9.15
#> [85] tidyselect_1.1.1 xfun_0.27