2. Pipeline Steps

2.1. Overview of analysis steps

The gemBS pipelines consists of 6 steps:

  • prepare: This is the step that requires input from the user. Two configuration files need to be prepared (explained in section 3. Pipeline Configuration) that describe the names and locations of the various input files (reference FASTA files, sequence FASTQ files), details about the samples, locations for the generated files (BAM/CRAM/BCF/BED etc.), parameters for the different stages, and information about the computational resources for each step.

  • index: In this stage the GEM3 index(es) required for the analysis are generated if required. The indexes can be used for multiple analyses as long as the same references are used.

  • map: The mapping stage is where the sequence files (normally FASTQ) are aligned to the reference to produce BAM or CRAM files. This is generally the most computationally intensive step, although it is also simple to parallelize and so can take advantage of multiple computer resources (i.e., a cluster). This stage also generates a JSON file with statistics about the mapping process.

  • call: The calling stage takes the BAM/CRAM files produced by mapping and generates BCF files with information about the methylation status and SNVs. This stage also generates a JSON file with statistics about the calling process.

  • extract: The extract stage takes the BCF files from the calling and generates summary Bed, BigBed and BigWig files with methylation and SNV information.

  • report: The report stage takes the JSON files with statistics generated by the mapping and calling stages and generates HTML and LaTeX (and optionally pdf) format quality control (QC) reports.

2.2. The gemBS command

The gemBS command is the starting point for all steps in the gemBS pipeline. The format of the gemBS command is as follows

USAGE:
  gemBS [FLAGS] [OPTIONS] <SUBCOMMAND>

FLAGS:
  -a, --all              Consider all tasks required for the requested command
  -d, --dry-run          Output pending commands without execution
  -h, --help             Prints help information
  -I, --ignore-status    Ignore status of tasks when compiling task list
  -i, --ignore-times     Ignore file modification times when evaluating the status of tasks
  -k, --keep-logs        Don't remove log files after successful completion of task
  -q, --quiet            Silence all output
  -S, --slurm            Submit commands to slurm for execution
  -V, --version          Prints version information

OPTIONS:
  -c, --config-file <CONFIG_FILE>     Location of gemBS config file
  -d, --dir <DIR>                      set working directory
  -r, --gembs-root <DIR>               set root of gemBS installation
  -j, --json <JSON_FILE>               Output JSON file with details of pending commands
  -l, --loglevel <LOGLEVEL>            Set log level [possible values: none, error, warn, info, debug, trace]
  -s, --slurm-script <SCRIPT_FILE>     Generate PERL script to submit commands to slurm for execution
  -t <GRANULARITY>                     Prepend log entries with a timestamp [possible values: none, sec, ms, us, ns]

SUBCOMMANDS:
  call       Methylation and SNP calling
  clear      Clear up incomplete files after aborted run
  extract    Produce methylation and SNP summary files
  help       Prints this message or the help of the given subcommand(s)
  index      Prepare genome indexes
  map        Read mapping
  prepare    Prepare gemBS analysis
  report     Generate QC report
  run        Run all pending pipeline commands

There are many options, but for normal operation it is sufficient to just type gemBS and the name of the relevant subcommand. For example, mapping can be done using the command:

gemBS map

More information on any subcommand can be obtained using the option –help after the subcommand i.e.,:

gemBS map --help

dry-run, json and slurm options

The options and flags to the gemBS command listed above are general and work with all subcommands. There are a set of commands that do not run any commands, but instead list what commands would be run if those options had not been used. For example, the dry-run flag returns a list of pending jobs for the particular command, so the command:

gemBS --dry-run map

would give a list of the pending mapping operations. Note that if this command gives no output this can either be because all mapping operations have been performed, the output BAMs have been produced and everything is up to date or because there are no mapping operations ready (for example, because the GEM3 index need to be built). If the all flag is also used then all pending operations prior to the required command are also printed. For example, the commmand:

gemBS --all --dry-run call

would list all index, mapping and calling operations that were pending, in the order that they should be run. Other modifiers are the ignore-times and ignore-status flags that are described in the next section.

The json option works in a similar way to dry-run but writes out more information on the pending commands (with a list of input and output files to each operation and as well as the prior jobs in the pipeline that the current job depends on. The job list is written to a file that is specified as the argument to the json option. This option is intended for developers who want to integrate gemBS with a pipeline or workflow manager.

Finally, the slurm flag (if slurm support has been enabled) and slurm-script options allow gemBS to use the Slurm workflow manager. the slurm flag causes gemBS to submit the pending jobs directly to slurm using the sbatch commmand. The parameters to sbatch are taken from the configuration file (for example, number of cores, memory, time limnits etc.). More slurm specific options such as the cluster to be used or qos settings can be sepcified using the sbatch specific environment variables (see https://slurm.schedmd.com/sbatch.html for details). The slurm-script option is used when for some reason gemBS can not or should not submit directly to slurm (for example, if gemBS is running in a container - 10. Containers). In this case gemBS creates a perl script that will perform the submission.

2.3. Inside Story

Configuration

When running the prepare subcommand, gemBS checks the configuration files, verifies that the input files to the pipeline are present, and writes out a binary version of the configuration (in MessagePack format) into the directory .gemBS/ in the current working directory. The configuration file is named gemBS.mp. When the any other command is run, gemBS will read the configuration file and make a list of all commands required to produce any missing outputs from the command.

Note

gemBS only reads the user supplied configuration files when the prepare subcommand is run. For all other gemBS commands the configuration is read from gemBS.mp. If it is wished to change the run parameters, for example, it will be necessary to re-run gemBS prepare. Normally this can be done without danger in the middle of an analysis, but large scale changes may not produce the expected results and it might be necessary in some cases to delete the output files and restart the pipeline from the beginning.

Command choice

By default gemBS will only consider commands where the required input files are already present. If there are either no commands that need running for a command (because the output files are already present) or there are no comamnds that are ready to run (because the input files are not yet present or are out of date), then the command will do nothing.

This behaviour can be changed, however, by using the --all option gemBS will consider all commands. For example, if the user issues the command gemBS call, gemBS will only consider running the caller for samples where the indexed BAM file is present in the filesyste,. However, if the --all options is used i.e., gemBS --all call then gemBS will try to run the caller for all samples, running mapping commands if required to generate the BAM files. In this way, ny issuing the command gemBS --all extract, gemBS will run all pending mapping, calling and extraction steps.

Note

The command gemBS --all extract will not generate the QC reports, and gemBS --all report will not run the extract sub command. To make things simpler, therefore, the whole pipeline can be run using the single command gemBS run, which will run all pending commands for all of the stages. For the run sub-command it is not necessary to use the --all flag as it is assumed that the user wishes to run all pending commands. Of course, if the run subcommand is used, there is no possiblity of over-riding the configuration parameters set when the prepare subcommand was run, so it is essential that these are correct for all gemBS stages.

Normally, gemBS uses the dates that files were created to decide if an output file is out of date and needs to be re-made. In the same way as for Unix Makefiles, an output file is considered out of date if its creation date is after the creation date of any of its input files. This behaviour can be changed using the --ignore-times option - if this option is used then a file will not be re-made as long as it is present on the filesystem.

A last gemBS option on the same topic is --ignore-status. If this optiont is set then gemBS will re-run all commands irrespective of whether the output files already exist on the filesystem. This can be used to completely re-run the pipeline from scratch (although it is generally simpler to delete the output file directories).

Scheduling jobs

The gemBS configuration file has multiple options that indicate the gemBS the computational requirements of the different stages in terms of memory requirements, CPU cores to use or numner of jobs to run simultaneously (see 3.2. Pipeline configuration file)

The tasks currently running are stored in a file in the configuration directory (normally .gemBS/ in the working directory) called gemBS_tasks.json. This is how gemBS coordinates tasks across multiple computer nodes - a gemBS process will not launch a task if the task is currently listed in this file.

Single computer node

As described in the previous section, gemBS creates a list of commands or tasks to run. These tasks are placed in a dependency tree that describes how the tasks rely on each other (for example if there are mapping and calling tasks present, the calling tasks will depend on the results of the mapping tasks. If multiple tasks are available to be run at the same time, the total memory and cores available on the machine are divided amongst the tasks. This is done in the following way:

Let max_cores be the number of available cores on the machine
let requested_cores and requested_mem be the cores and memory requested for this
task in the configuration file or on the command line

if requested_cores has been set, let n_cores = min(requested_cores, max_cores)
else if n_jobs has been set, let n_cores = max(max_cores / n_jobs, 1)
else if task is mapping or indexing, let n_cores = max_cores
else n_cores = 1

Let total_mem be the amount of memory on the machine
if requested_mem has been set, let mem = min(requested_mem, total_mem)
else if task is mapping or indexing, let mem = total_mem
else mem = 0

The requirements for the task to run are that n_cores cores and mem memory should be available. The total requirements of all jobs running on the machine are added up, and the task will be started if enough resources are available. Note that by default, mapping and indexing tasks will use the entire machine. This is because (a) these tasks require a lot of memory and (b) they are very parallel and so can take advantage of a very large number of cores. If memory requirements have not been set then memory will not be taken into account (by setting mem to 0), so it is possible that the computer will run out of memory. It is therefore advisable to set some sensible default value for memory requirements in the configuration file.

Multiple computer nodes

Multiple computer nodes can be run from the same working directory and the tasks will be shared between the different nodes. This is achieved by the file gemBS_tasks.json mentioned ealier. To prevent multiple nods modifying this file at the same time, gemBS uses a file based locking system (Posix locking can not be relied on as it is not always available on shared filesystems). When a gemBS wants to access the task file for reading or writing they first try and create a symbolic link called .gemBS_tasks.json#gemBS_lock. If creating the link fails then the link already exists (or the directory is can not be written to, but gemBS checks for this when starting up). In this case the process waits a while and then tries again. If the link creation succeeds then the task file and be accessed and/or modified, and then the link is deleted to allow another process to access the file. The same locking process is also used for the configuration file (in this case the lock link name is .gemBS.mp#gemBS_lock).

Note

The symbolic links used as locks have dummy targets that don’t point to a file (normally), but instead give the host name and pid of the gemBS process that is locking the file, and which can be viewed using ls -l. Normally these files should be automatically removed, but in the case of a crash or filesystem failure etc., it is possible that the files may be left. In this case they should be removed before re-running gemBS.

The task file is also normally removed after all gemBS processes running from the directory have completed. This is more frequent occurence than a lock file being left, although it should still be a rare occurence. The clear subcommand can be used to remove a left over task file. This will fail if the lock is present; in this case using the --force option will remove the lock and the task file. Using this command during normal gemBS operation will have undefined effects, and could result in multiple gemBS processes trying to create the same output files.