9. Practical considerations

The best way to configure and run gemBS depends on the computing systems available. We will consider three situations:

  1. Single workstation

  2. Computer cluster with a shared file system

  3. Distributed system without a shared file system.

The different characteristics of these systems change the optimal way to organize the analyses, but some common considerations apply to all three. The individual computational units (the workstation, a cluster node or a compute instance) must have sufficient memory and access to enough disk space to perform the calculations. Memory requirements for gemBS are quite high; this is a design choice, as its speed derives to a large extent from its large memory footprint. To run gemBS comfortably on a human-sized genome it is recommended to have a minimum of 48Gb RAM and at least 0.5 - 1 Tb of disk space available. These numbers could be reduced somewhat, but at the risk of some analyses stopping due to lack of memory or disk space.
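As a quick sanity check before launching an analysis, the available memory and disk space can be inspected with standard tools; the path below is only a placeholder for the actual analysis directory:

# Show total and available memory in gigabytes
free -g
# Show free space on the file system holding the working directory
# (replace /path/to/workdir with the actual analysis directory)
df -h /path/to/workdir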

9.1. Running gemBS on a single workstation

This is the simplest case: in general the user can perform the entire analysis with gemBS without additional scripting or external workflow management tools. After the configuration step has been completed, it should be sufficient to go through the following commands:

gemBS map
gemBS call
gemBS extract
gemBS report

or, more simply

gemBS run

The main decision for the user is the number of parallel jobs to run at each stage (apart from the mapping stage). This will depend on the amount of memory and the number of computing cores available. As a quick rule of thumb, allocating 2-3 cores and 6-8Gb RAM per job for the calling and extraction phases should be sufficient (although this will depend on characteristics of the experiment such as the coverage). Note that there is no point in running multiple jobs for the mapping process; GEM3 can efficiently use all of the available cores, and running a single process allows GEM3 to share the index across threads so that additional memory requirements per thread are minimized.
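For example, on a workstation with enough resources for three concurrent calling jobs, the calling stage could be launched as in the sketch below. The division of work between the instances is handled by gemBS itself, so they only need to be started and waited for; the number of instances is illustrative, not a recommendation:

# Start three concurrent callers (illustrative; each needs its own
# 2-3 cores and 6-8Gb of RAM as discussed above)
for i in 1 2 3; do
    gemBS call &
done
wait   # continue only when all calling instances have finished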

9.2. Running gemBS on a cluster with shared file system

When running gemBS on a cluster, the main difference from the previous case is that it becomes useful to run multiple copies of gemBS at each stage. For example, launching n copies of gemBS map in parallel can make full use of the cluster resources. With some additional scripting, a fully automated pipeline can easily be set up. If a workflow manager is available, the pipeline can launch a certain number of gemBS commands for each sample at each stage (mapping, calling, extraction etc.) and handle the dependencies so that it moves to the next stage as soon as all the processes for a sample have finished. All the details of sharing the tasks between the separate gemBS instances are handled automatically by gemBS. Alternatively, a more efficient solution, albeit one that is more complicated to set up, is to tell the workflow manager the details of all the required jobs and the dependencies between them and let it handle the operation of the pipeline. In the case of the workflow manager slurm this support is built into the system as described earlier (the dry-run, json and slurm options).
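As an illustration of the first approach, a simple job array is enough to launch several gemBS instances; the sharing of the pending jobs between them is handled by gemBS through the shared file system. The resource requests below are placeholders and would need to be adapted to the cluster and the experiment:

#!/bin/bash
#SBATCH --job-name=gembs_call
#SBATCH --array=1-4            # four concurrent gemBS instances (illustrative)
#SBATCH --cpus-per-task=3      # 2-3 cores per calling job
#SBATCH --mem=8G               # 6-8Gb RAM per calling job

# Each array task simply starts gemBS; the instances share the remaining
# calling jobs between themselves automatically.
gemBS call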

9.3. Running gemBS on a distributed system without a shared file system

For systems with no shared file system, gemBS cannot handle the sharing of tasks between the multiple jobs, and the user must take responsibility for this, normally by using a workflow management system. To help with this organizational task, gemBS provides the dry-run and json <JSON file> options for all subcommands.
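For example, a description of the outstanding tasks and their dependencies can be written to a JSON file for the workflow manager to consume. The option spelling below follows the description above but should be checked against the help output of the installed gemBS version, and tasks.json is only an example file name:

# List the pending mapping tasks and their dependencies without running
# anything (check "gemBS map --help" for the exact option syntax)
gemBS map --dry-run --json tasks.json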

With such a system it is necessary to transfer the input and output files to and from the file system accessible by the compute instances. For this reason it will often be more efficient to perform blocks of commands specific to a particular sample as one operation. For example, rather than mapping each dataset individually, it can be more efficient to map all datasets for a given sample in one go, allowing gemBS to automatically merge the individual BAMs, and then transfer back only the merged BAM (along with the QC JSON files, BAM index etc.). Similarly for the calling, it will generally be more efficient to perform all the calling for a sample as one job on the same compute instance, rather than splitting the calling by contig across different instances. In this way the user does not have to worry about the individual contig pools, and only has to transfer the resulting merged BCF file (along with the individual QC JSON files).
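A per-sample job on such a system might therefore look like the following sketch, in which stage_in and stage_out stand for whatever transfer mechanism the system provides and all paths are placeholders:

#!/bin/bash
# Sketch of a per-sample job on a compute instance with local disk only.
# "stage_in"/"stage_out" and all paths are placeholders.

stage_in remote:project/sampleA/fastq ./fastq     # raw data for this sample
stage_in remote:project/gemBS_config  .           # configuration and metadata

# Assuming the staged configuration describes only this sample, a plain
# invocation maps all of its datasets and merges the BAMs, then performs
# all of the calling and merges the contig-pool BCFs.
gemBS map
gemBS call

# Transfer back only the merged BAM and BCF, their indexes and the QC
# JSON files needed later for the reports.
stage_out ./mapping remote:project/sampleA/mapping
stage_out ./calls   remote:project/sampleA/calls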

Note

It is important that the QC JSON files from the mapping and calling stages are retained, otherwise the QC reports cannot be generated.