Note: the most recent version of this tutorial can be found here and a short overview slide show here.

1 Introduction

systemPipeR provides utilities for building and running automated end-to-end analysis workflows for a wide range of next-generation sequencing (NGS) applications such as RNA-Seq, ChIP-Seq, VAR-Seq and Ribo-Seq (Girke 2014). Important features include a uniform workflow interface across different NGS applications, automated report generation, and support for running both R and command-line software, such as NGS aligners or peak/variant callers, on local computers or compute clusters. On clusters, both interactive job submissions and batch submissions to queuing systems are supported. For instance, systemPipeR can be used with most command-line aligners such as BWA (Li 2013; Li and Durbin 2009), TopHat2 (Kim et al. 2013) and Bowtie2 (Langmead and Salzberg 2012), as well as the R-based NGS aligners Rsubread (Liao, Smyth, and Shi 2013) and gsnap (gmapR) (Wu and Nacu 2010). Efficient handling of complex sample sets (e.g. FASTQ/BAM files) and experimental designs is facilitated by a well-defined sample annotation infrastructure, which improves the reproducibility and user-friendliness of many typical analysis workflows in the NGS area (Lawrence et al. 2013).

Motivation and advantages of the systemPipeR environment:

Facilitates design of complex NGS workflows involving multiple R/Bioconductor packages
Common workflow interface for different NGS applications
Makes NGS analysis with Bioconductor utilities more accessible to new users
Simplifies usage of command-line software from within R
Reduces complexity of using compute clusters for R and command-line software
Accelerates runtime of workflows via parallelization on computer systems with multiple CPU cores and/or multiple compute nodes
Automates generation of analysis reports to improve reproducibility

A central concept for designing workflows within the systemPipeR environment is the use of workflow management containers called SYSargs (see Figure 1). Instances of this S4 object class are constructed by the systemArgs function from two simple tabular files: a targets file and a param file. The latter is optional for workflow steps lacking command-line software. Typically, a SYSargs instance stores all sample-level inputs as well as the paths to the corresponding outputs generated by command-line or R-based software that produces sample-level output files, such as read preprocessors (trimmed/filtered FASTQ files), aligners (SAM/BAM files), variant callers (VCF/BCF files) or peak callers (BED/WIG files). Each sample-level input/output file operation uses its own SYSargs instance. The outpaths of one SYSargs instance usually define the sample inputs for the next SYSargs instance. This connectivity is established by writing the outpaths with the writeTargetsout function to a new targets file that serves as input to the next systemArgs call. Typically, the user has to provide only the initial targets file; all downstream targets files are generated automatically. By chaining several SYSargs steps together, one can construct complex workflows involving many sample-level input/output file operations with any combination of command-line or R-based software.
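The chaining idea described above can be illustrated with a minimal base-R sketch. It does not call systemPipeR: the helper chain_step, the column names FileName and SampleName, the sample paths, and the file suffixes are all assumptions made for illustration. In the package itself, the systemArgs and writeTargetsout functions fill these roles. The sketch only shows how the output paths of one step, written to a new targets file, become the inputs of the next step.

```r
# Base-R illustration only; chain_step() is a hypothetical helper,
# not part of systemPipeR.

# Initial targets table: the only file the user provides.
# Column names and paths are assumed for this sketch.
targets <- data.frame(
  FileName   = c("data/sampleA.fastq", "data/sampleB.fastq"),
  SampleName = c("A", "B"),
  stringsAsFactors = FALSE
)
workdir <- tempfile("targets_sketch_")
dir.create(workdir)
targets_file <- file.path(workdir, "targets.txt")
write.table(targets, targets_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read a targets file, derive one output path per sample, and write a new
# targets file whose FileName column holds those output paths.
chain_step <- function(targets_in, targets_out, outdir, suffix) {
  tg <- read.delim(targets_in, stringsAsFactors = FALSE)
  base <- sub("\\.[^.]+$", "", basename(tg$FileName))  # drop file extension
  tg$FileName <- file.path(outdir, paste0(base, suffix))
  write.table(tg, targets_out, sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(tg)
}

# Step 1 (e.g. read trimming) consumes the initial targets file.
trim_targets <- file.path(workdir, "targets_trim.txt")
step1 <- chain_step(targets_file, trim_targets, "results", "_trim.fastq")

# Step 2 (e.g. alignment) consumes the targets file written by step 1.
bam_targets <- file.path(workdir, "targets_bam.txt")
step2 <- chain_step(trim_targets, bam_targets, "results", ".bam")

print(step2)
```

Only the first targets file is written by hand here; each later one is derived from the previous step, which mirrors the design in which downstream targets files are generated automatically.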

Figure 1: Workflow design structure of systemPipeR

The intended way of running systemPipeR workflows is via *.Rnw or *.Rmd files, which can be executed either line-wise in interactive mode or with a single command from R or the command-line using a Makefile. This way, comprehensive and reproducible analysis reports can be generated in PDF or HTML format in a fully automated manner by making use of the reporting utilities available for R. Templates for setting up custom project reports are provided as *.Rnw files by the helper package systemPipeRdata and in the vignettes subdirectory of systemPipeR. The corresponding PDFs of these report templates are available here: systemPipeRNAseq, systemPipeRIBOseq, systemPipeChIPseq and systemPipeVARseq. To work with *.Rnw or *.Rmd files efficiently, basic knowledge of Sweave or knitr and LaTeX or R Markdown v2 is required.

systemPipeR: NGS workflow and report generation environment

Last update: 04 January, 2019

Contents

1 Introduction

2 Getting Started

2.1 Installation

2.2 Loading package and documentation

2.3 Load sample data and workflow templates

2.4 Structure of targets file

2.4.1 Structure of targets file for single end (SE) samples

2.4.2 Structure of targets file for paired end (PE) samples

2.4.3 Sample comparisons

2.5 Structure of param file and SYSargs container

3 Workflow overview

3.1 Define environment settings and samples

3.2 Read Preprocessing

3.3 FASTQ quality report

3.4 Alignment with Tophat2

3.5 Read and alignment count stats

3.6 Create symbolic links for viewing BAM files in IGV

3.7 Alternative NGS Aligners

3.7.1 Alignment with Bowtie2 (e.g. for miRNA profiling)

3.7.2 Alignment with BWA-MEM (e.g. for VAR-Seq)

3.7.3 Alignment with Rsubread (e.g. for RNA-Seq)

3.7.4 Alignment with gsnap (e.g. for VAR-Seq and RNA-Seq)

3.8 Read counting for mRNA profiling experiments

3.9 Read counting for miRNA profiling experiments

3.10 Correlation analysis of samples

3.11 DEG analysis with edgeR

3.12 DEG analysis with DESeq2

3.13 Venn Diagrams

3.14 GO term enrichment analysis of DEGs

3.14.1 Obtain gene-to-GO mappings

3.14.2 Batch GO term enrichment analysis

3.14.3 Plot batch GO term results

3.15 Clustering and heat maps

4 Workflow templates

4.1 RNA-Seq sample

4.1.1 Run workflow

4.2 ChIP-Seq sample

4.2.1 Run workflow

4.3 VAR-Seq sample

4.3.1 VAR-Seq workflow for single machine

4.3.2 Run workflow

4.3.3 VAR-Seq workflow for computer cluster

4.4 Ribo-Seq sample

4.4.1 Run workflow

5 Version information

References
