This guide is organized into sections, with each section corresponding either to a stage of the pipeline or to a concept relating to the pipeline. Each section begins with a quick summary of what you need to know to get started, and then goes into more detail for readers who want a deeper understanding of how the pipeline works in order to adapt it to their particular systems.
The gemBS pipeline is designed to shield the user as much as possible from the underlying complexity of the analysis. If the pipeline is configured correctly then issuing a single command, gemBS run, is enough to execute all pipeline steps. If the pipeline is stopped in the middle, then in most cases re-typing gemBS run will complete all pending jobs. In the event of an application crash or abnormal termination, this will normally be detected by gemBS and the potentially incomplete output files will be removed, allowing for a clean restart.

If, on the other hand, the user wants more control, it is possible to issue precise commands instructing gemBS to perform just one specific pipeline step and then stop. This allows gemBS to be integrated into a workflow manager without too much difficulty.
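As a sketch, the step-by-step mode looks like the following. The subcommand names below follow the usual gemBS command-line convention, but the exact set available in your version should be confirmed with gemBS --help:

```shell
# Run individual pipeline stages one at a time instead of "gemBS run".
# Subcommand names are illustrative; consult "gemBS --help" for your version.
gemBS prepare -c config.conf -t metadata.csv   # set up the project configuration
gemBS index                                    # build the reference index
gemBS map                                      # map all pending datasets
gemBS call                                     # methylation and variant calling
gemBS extract                                  # extract methylation results
gemBS report                                   # generate QC reports
```

Each command performs only its own step (and, depending on the version, any prerequisite steps) and then stops, which is what makes it straightforward to wrap each stage as a task in an external workflow manager.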
Control of pipeline steps
gemBS provides a lot of flexibility in running the different pipeline steps, allowing some control of the underlying tools (e.g., the GEM3 mapper) as well as letting the user specify the computational resources to be given to each stage (e.g., number of cores or threads, memory usage, etc.). There are two ways that this information can be provided: (a) in a configuration file that is read at the beginning of the pipeline run, or (b) on the command line for each individual pipeline step. If the goal is to run the entire pipeline automatically then clearly all of the options should be specified in the configuration file, but it is possible to override these on the command line.
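For illustration, a configuration file fragment might look like the following. The section and key names here are indicative of the INI-style format gemBS uses; the exact keys accepted depend on the gemBS version, so treat this as a sketch rather than a reference:

```
# Illustrative gemBS configuration fragment (key names are examples)
reference = /path/to/reference.fa

[mapping]
threads = 8        # cores given to the GEM3 mapper
memory = 16G       # memory request for mapping jobs

[calling]
threads = 4
memory = 8G
```

Per-stage sections like these are what allow gemBS to know the resource requirements of each step, both for scheduling parallel work on a single machine and for generating cluster job requirements.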
gemBS can be run on a single workstation or on a distributed cluster.
On a single workstation, the command
gemBS run will go through all
of the pipeline steps in the correct order. If set up correctly with
the computing requirements of each step, gemBS will allow steps to be
performed in parallel so as to make optimum use of modern workstations
with plenty of cores and memory.
For a cluster with a shared filesystem, multiple copies of gemBS can be run in parallel on the same project directory without them interfering with each other. For example, if there are 16 datasets to map and 8 compute nodes, an instance of gemBS can be started on each node and the work would be divided up between the nodes so that, all things being equal, each node would map 2 datasets.
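As a sketch, assuming eight nodes reachable by ssh (the node names and project path below are placeholders) and a project directory on a shared filesystem, the parallel launch described above could look like:

```shell
# Hypothetical example: start one gemBS instance per compute node.
# Node names and the project path are placeholders for your own setup.
for node in node01 node02 node03 node04 node05 node06 node07 node08; do
    ssh "$node" "cd /shared/project && gemBS run" &
done
wait   # block until every remote gemBS instance has finished
```

Because the instances coordinate through the shared project directory, no explicit partitioning of the datasets between nodes is needed.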
This simple model of cluster computing could be extended to the entire
pipeline by running multiple instances of
gemBS run, but this is
wasteful as it could result in nodes sleeping while waiting for a job
to become available. Most clusters have workflow management systems
(such as SLURM, SGE) that allow complex pipelines to be executed where
each pipeline step has a set of computing requirements (cores, memory,
execution time), and dependencies between jobs can be specified.
gemBS can be used to generate a list of all pending jobs with their
requirements (all specified in the configuration file) and the
dependencies between them. This list can then be used to generate a
submission script for a cluster workflow manager. For the SLURM
manager, this support is (optionally) already built into gemBS, so
submitting the pipeline on a SLURM cluster can be done with
gemBS --slurm run (see 1.2. Download and install gemBS for how to turn on
this support).
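Assuming SLURM support was enabled when gemBS was installed, submission is then a single command; it can also be useful to list the pending jobs first without running anything (the dry-run flag shown below is indicative and may differ between gemBS versions):

```shell
# Optionally inspect the pending jobs and their dependencies first
# (flag name is illustrative; check "gemBS --help" for your version)
gemBS --dry-run run

# Submit the whole pipeline as SLURM jobs with inter-job dependencies
gemBS --slurm run
```

gemBS then hands each pipeline step to SLURM with the resource requests taken from the configuration file, so idle nodes are returned to the scheduler instead of sleeping while they wait for work.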