9. Practical considerations
The best way to configure and run gemBS depends on the computing
systems that are available. We will consider three situations:
1. Single workstation
2. Computer cluster with a shared file system
3. Distributed system without a shared file system
The different characteristics of these systems change the optimal
method to organize the analyses. There are common considerations that apply to all three systems. The
individual computational units (the workstation, a cluster node or a
compute instance) must have sufficient memory and access to enough
disk space in order to perform the calculations. Memory requirements
for gemBS are quite high; this is a design choice as the speed
characteristics derive to a large extent from its large memory
footprint. To run gemBS comfortably on a human-sized genome it is
recommended to have a minimum of 48Gb RAM and at least 0.5 - 1 Tb of disk
space available. These numbers could be reduced somewhat, but at the
risk of some analyses stopping due to lack of memory or disk space.
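Before starting an analysis it can be worth confirming that the machine meets these recommendations; on a Linux system a quick check of the available memory and disk space is enough:

    # Quick check against the recommendations above (~48Gb RAM, 0.5-1Tb free disk)
    free -g        # total and available memory in Gb
    df -h .        # free space on the file system holding the working directory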
9.1. Running gemBS on a single workstation
This is the simplest case: in general the user can do the entire analysis with gemBS without having
to perform additional scripting or to use external workflow management tools. After the configuration step
has been completed, it should be sufficient to simply go through the following commands:
gemBS map
gemBS call
gemBS extract
gemBS report
or, more simply:
gemBS run
The main decision for the user is the number of parallel jobs to run
at each stage (apart from the mapping stage). This will depend on the
amount of memory and the number of computing cores available. As a quick rule of
thumb, allocating 2-3 cores and 6-8Gb RAM per job for the calling and extraction
phases should be sufficient (although this will depend on
characteristics of the experiment such as the coverage). Note that
there is no point in running multiple jobs for the mapping process:
GEM3 can efficiently use all of the cores available, and running a
single process allows GEM3 to share the index across threads so that the
additional memory requirement per thread is minimized.
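As a rough illustration of this rule of thumb, the number of parallel jobs could be derived from the resources of the workstation with something like the following sketch (the figures of 3 cores and 8Gb per job come from the guideline above and should be adjusted to the experiment):

    # Sketch: derive a job count for the calling/extraction stages from the
    # rule of thumb above (2-3 cores and 6-8Gb RAM per job).
    cores=$(nproc)
    mem_gb=$(free -g | awk '/^Mem:/ {print $2}')
    jobs_by_cores=$(( cores / 3 ))
    jobs_by_mem=$(( mem_gb / 8 ))
    jobs=$(( jobs_by_cores < jobs_by_mem ? jobs_by_cores : jobs_by_mem ))
    echo "Run up to ${jobs} parallel gemBS call / extract jobs"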
9.2. Running gemBS on a cluster with shared file system
When running gemBS on a cluster, the main difference from the previous
case is that it becomes useful to run multiple copies of gemBS for
each stage. For example, launching n copies of gemBS map in
parallel can effectively make full use of the cluster resources. A
fully automated pipeline can easily be set up with some additional
scripting. If a workflow manager is available then a fully automated
pipeline can be set up that launches a certain number of gemBS
commands for each sample at each stage (mapping, calling, extraction
etc.), and handles the dependencies so that the pipeline will move to
the next stage as soon as all the processes for a sample have
finished. All the details of sharing the tasks between the separate
gemBS instances in this case are handled automatically by gemBS.
Alternatively, a more efficient solution, albeit one that is more
complicated to set up, is to tell the workflow manager the details of
all the required jobs and the dependencies between them, and to let it
handle the operation of the pipeline. In the case of the workflow
manager SLURM, this support is built into gemBS as described
earlier (the dry-run, json and slurm options).
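As a rough illustration of the scripted approach, the sketch below launches several gemBS instances per stage on a SLURM cluster and lets job dependencies order the stages; the job counts, core and memory requests shown are illustrative assumptions rather than gemBS defaults:

    #!/bin/bash
    # Sketch: several gemBS instances per stage on a shared file system;
    # task sharing between instances is handled by gemBS, stage ordering
    # by SLURM job dependencies.
    map_ids=()
    for i in $(seq 4); do
        map_ids+=("$(sbatch --parsable -c 8 --mem=48G --wrap 'gemBS map')")
    done
    map_dep=$(IFS=:; echo "${map_ids[*]}")

    call_ids=()
    for i in $(seq 3); do
        call_ids+=("$(sbatch --parsable -c 3 --mem=8G \
            --dependency=afterok:${map_dep} --wrap 'gemBS call')")
    done
    call_dep=$(IFS=:; echo "${call_ids[*]}")

    sbatch -c 3 --mem=8G --dependency=afterok:${call_dep} \
        --wrap 'gemBS extract && gemBS report'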
9.3. Running gemBS on a distributed system without a shared file system
For systems with no shared file system, gemBS cannot handle the
sharing of tasks between the multiple jobs, and the user must take
responsibility for this, normally by using a workflow management
system. To help with this organizational task, gemBS provides the
dry-run and json <JSON file> options for all
subcommands, as described earlier.
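For example, the pending commands for each stage could be exported along the following lines and then handed to the workflow manager (the option spellings shown are assumed to follow the dry-run and json options described above; the output file names are arbitrary):

    # Sketch: dump the pending commands for each stage as JSON so that an
    # external workflow manager can schedule them on the compute instances.
    gemBS map     --dry-run --json map_commands.json
    gemBS call    --dry-run --json call_commands.json
    gemBS extract --dry-run --json extract_commands.json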
With such a system it is necessary to transfer the input and output
files to and from the file system accessible by the compute instances.
For this reason it will often be more efficient to perform blocks of
commands specific to a particular sample as one operation. For
example, rather than mapping each dataset individually, it can be more
efficient to map all datasets for a given sample in one go, allowing
gemBS to automatically merge the individual BAMs, and then to transfer
back only the merged BAM (along with the QC JSON files, BAM index
etc.). Similarly for the calling, it will generally be more efficient
to perform all the calling for a sample as one job on the same compute
instance, rather than splitting the calling by contig across different
instances. In this way the user does not have to deal with the
individual contig pools, and only has to transfer back the resulting merged BCF
file (along with the individual QC JSON files).
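A per-sample job on such a system could therefore look roughly like the following sketch; the remote storage paths, the directory layout and the use of rsync for staging are illustrative assumptions, and the gemBS configuration on the instance is assumed to cover only the sample being processed:

    #!/bin/bash
    # Sketch of a single per-sample job on a compute instance with local disk only.
    set -e
    SAMPLE=$1

    # Stage in the configuration, index and FASTQ files for this sample
    # (paths are illustrative).
    rsync -a storage:/project/config/ ./
    rsync -a storage:/project/indexes/ indexes/
    rsync -a storage:/project/fastq/${SAMPLE}/ fastq/${SAMPLE}/

    # Map all datasets for the sample (gemBS merges the BAMs automatically),
    # then perform all the calling for the sample in the same job.
    gemBS map
    gemBS call

    # Copy back the merged BAM and BCF together with indexes and QC JSON files
    rsync -a mapping/${SAMPLE}/ storage:/project/mapping/${SAMPLE}/
    rsync -a calls/${SAMPLE}/   storage:/project/calls/${SAMPLE}/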
Note
It is important that the QC JSON files from the mapping
and calling stages are kept; otherwise the QC reports cannot be generated.