Overview

This guide is organized into sections, each corresponding either to a stage of the pipeline or to a concept related to it. Each section begins with a quick summary of what you need to know to get started and then goes into more detail for readers who want to understand how the pipeline works so they can adapt it to their particular systems.

The gemBS pipeline is designed to shield the user as much as possible from the underlying complexity of the analysis. If the pipeline is configured correctly then issuing a single command, gemBS run, is enough to execute all pipeline steps. If the pipeline is stopped in the middle, then in most cases re-typing gemBS run will complete all pending jobs. In the event of an application crash or abnormal termination, this will normally be detected by gemBS and the potentially incomplete output files will be removed, allowing for a clean restart.
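For example, a minimal end-to-end run might look like the following sketch (file names are illustrative; gemBS prepare registers the configuration and sample metadata and is run once before the pipeline itself):

    # One-time setup: register the configuration and sample metadata
    # (gemBS.conf and metadata.csv are placeholder names)
    gemBS prepare -c gemBS.conf -t metadata.csv

    # Execute all pending pipeline steps in the correct order
    gemBS run

    # After an interruption, the same command completes the pending jobs
    gemBS run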

If, on the other hand, the user wants more control, it is possible to issue precise commands instructing gemBS to perform just one specific pipeline step and then stop. This allows gemBS to be integrated into an external workflow manager without too much difficulty.
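For instance, the main stages can be invoked one at a time (a sketch; each command performs any pending work for that stage and then stops):

    gemBS index      # build the bisulfite index for the reference
    gemBS map        # map any datasets that are not yet mapped
    gemBS call       # methylation and genotype calling on the mapped data
    gemBS extract    # extract methylation values from the call files
    gemBS report     # generate the QC reports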

Control of pipeline steps

gemBS provides a lot of flexibility in running the different pipeline steps, allowing some control of the underlying tools (e.g., the GEM3 mapper) as well as of the computational resources given to each stage (e.g., number of cores or threads, memory usage, etc.). There are two ways that this information can be provided by the user: (a) in a configuration file supplied at the beginning of the pipeline run, or (b) on the command line for each individual pipeline step. If the goal is to run the entire pipeline automatically then clearly all of the options should be specified in the configuration file, but it is possible to override these on the command line.
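As an illustration, a thread count set in the configuration file can be overridden for a single invocation (the exact option names are an assumption here and may differ between gemBS versions; check gemBS map --help):

    # In the configuration file (illustrative):
    #   [mapping]
    #   threads = 8

    # Override the thread count for this mapping run only:
    gemBS map --threads 16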

Computing systems

gemBS can be run on a single workstation or on a distributed cluster. On a single workstation, the command gemBS run will go through all of the pipeline steps in the correct order. If the computing requirements of each step have been set up correctly, gemBS will run independent steps in parallel so as to make optimal use of modern workstations with plenty of cores and memory.
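A hypothetical configuration fragment along these lines might look as follows (section and key names are illustrative assumptions; check the gemBS documentation for the exact names in your version):

    [mapping]
    threads = 8
    memory = 32G

    [calling]
    threads = 4
    memory = 16G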

For a cluster with a shared filesystem, multiple copies of gemBS can be run in parallel on the same project directory without them interfering with each other. For example, if there are 16 datasets to map and 8 compute nodes, an instance of gemBS can be started on each node and the work would be divided up between the nodes so that, all things being equal, each node would map 2 datasets.
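For example, with the project directory on a shared filesystem, one mapping instance per node could be started with something like the following sketch (assuming password-less ssh and a shared path /shared/project, both placeholders):

    # Start one gemBS mapping instance on each of 8 nodes; the instances
    # coordinate through the shared project directory and will not
    # process the same dataset twice
    for node in node1 node2 node3 node4 node5 node6 node7 node8; do
        ssh "$node" "cd /shared/project && gemBS map" &
    done
    wait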

This simple model of cluster computing could be extended to the entire pipeline by running multiple instances of gemBS run, but this is wasteful as it could result in nodes sitting idle while waiting for a job to become available. Most clusters have workflow management systems (such as SLURM or SGE) that allow complex pipelines to be executed, where each pipeline step has a set of computing requirements (cores, memory, execution time) and dependencies between jobs can be specified. gemBS can be used to generate a list of all pending jobs with their requirements (all specified in the configuration file) and the dependencies between them. This list can then be used to generate a submission script for a cluster workflow manager. For the SLURM manager this support is (optionally) already built into gemBS, so the pipeline can be submitted on a SLURM cluster with gemBS --slurm run (see 1.2. Download and install gemBS for how to turn on SLURM support).
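For example (gemBS --slurm run is described above; the dry-run listing is an assumption and the flag name should be checked with gemBS --help for your version):

    # List the pending jobs and their dependencies without executing anything
    gemBS --dry-run run

    # Submit all pending pipeline steps to SLURM, with resource
    # requirements and job dependencies taken from the configuration file
    gemBS --slurm run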