2. Pipeline Steps¶
2.1. Overview of analysis steps¶
The gemBS pipelines consists of 6 steps:
prepare: This is the step that requires input from the user. Two configuration files need to be prepared (explained in section 3. Pipeline Configuration) that describe the names and locations of the various input files (reference FASTA files, sequence FASTQ files), details about the samples, locations for the generated files (BAM/CRAM/BCF/BED etc.), parameters for the different stages, and information about the computational resources for each step.
index: In this stage the GEM3 index(es) required for the analysis are generated if required. The indexes can be used for multiple analyses as long as the same references are used.
map: The mapping stage is where the sequence files (normally FASTQ) are aligned to the reference to produce BAM or CRAM files. This is generally the most computationally intensive step, although it is also simple to parallelize and so can take advantage of multiple computer resources (i.e., a cluster). This stage also generates a JSON file with statistics about the mapping process.
call: The calling stage takes the BAM/CRAM files produced by mapping and generates BCF files with information about the methylation status and SNVs. This stage also generates a JSON file with statistics about the calling process.
extract: The extract stage takes the BCF files from the calling and generates summary Bed, BigBed and BigWig files with methylation and SNV information.
report: The report stage takes the JSON files with statistics generated by the mapping and calling stages and generates HTML and LaTeX (and optionally pdf) format quality control (QC) reports.
2.2. The gemBS command¶
The gemBS command is the starting point for all steps in the gemBS pipeline. The format of the gemBS command is as follows
USAGE: gemBS [FLAGS] [OPTIONS] <SUBCOMMAND> FLAGS: -a, --all Consider all tasks required for the requested command -d, --dry-run Output pending commands without execution -h, --help Prints help information -I, --ignore-status Ignore status of tasks when compiling task list -i, --ignore-times Ignore file modification times when evaluating the status of tasks -k, --keep-logs Don't remove log files after successful completion of task -q, --quiet Silence all output -S, --slurm Submit commands to slurm for execution -V, --version Prints version information OPTIONS: -c, --config-file <CONFIG_FILE> Location of gemBS config file -d, --dir <DIR> set working directory -r, --gembs-root <DIR> set root of gemBS installation -j, --json <JSON_FILE> Output JSON file with details of pending commands -l, --loglevel <LOGLEVEL> Set log level [possible values: none, error, warn, info, debug, trace] -s, --slurm-script <SCRIPT_FILE> Generate PERL script to submit commands to slurm for execution -t <GRANULARITY> Prepend log entries with a timestamp [possible values: none, sec, ms, us, ns] SUBCOMMANDS: call Methylation and SNP calling clear Clear up incomplete files after aborted run extract Produce methylation and SNP summary files help Prints this message or the help of the given subcommand(s) index Prepare genome indexes map Read mapping prepare Prepare gemBS analysis report Generate QC report run Run all pending pipeline commands
There are many options, but for normal operation it is sufficient to just type gemBS and the name of the relevant subcommand. For example, mapping can be done using the command:
More information on any subcommand can be obtained using the option –help after the subcommand i.e.,:
gemBS map --help
dry-run, json and slurm options¶
The options and flags to the gemBS command listed above are general and work with all subcommands. There are a set of commands that do not run any commands, but instead list what commands would be run if those options had not been used. For example, the dry-run flag returns a list of pending jobs for the particular command, so the command:
gemBS --dry-run map
would give a list of the pending mapping operations. Note that if this command gives no output this can either be because all mapping operations have been performed, the output BAMs have been produced and everything is up to date or because there are no mapping operations ready (for example, because the GEM3 index need to be built). If the all flag is also used then all pending operations prior to the required command are also printed. For example, the commmand:
gemBS --all --dry-run call
would list all index, mapping and calling operations that were pending, in the order that they should be run. Other modifiers are the ignore-times and ignore-status flags that are described in the next section.
The json option works in a similar way to dry-run but writes out more information on the pending commands (with a list of input and output files to each operation and as well as the prior jobs in the pipeline that the current job depends on. The job list is written to a file that is specified as the argument to the json option. This option is intended for developers who want to integrate gemBS with a pipeline or workflow manager.
Finally, the slurm flag (if slurm support has been enabled) and
slurm-script options allow gemBS to use the Slurm workflow
manager. the slurm flag causes gemBS to submit the pending jobs
directly to slurm using the
sbatch commmand. The parameters to
sbatch are taken from the configuration file (for example, number
of cores, memory, time limnits etc.). More slurm specific options
such as the cluster to be used or qos settings can be sepcified using
the sbatch specific environment variables (see
https://slurm.schedmd.com/sbatch.html for details). The
slurm-script option is used when for some reason gemBS can not or
should not submit directly to slurm (for example, if gemBS is running
in a container - 10. Containers). In this case gemBS creates a perl script that will perform the
2.3. Inside Story¶
When running the prepare subcommand, gemBS checks the
configuration files, verifies that the input files to the pipeline are
present, and writes out a binary version of the configuration (in MessagePack format) into the
.gemBS/ in the current working directory. The
configuration file is named
gemBS.mp. When the any other command
is run, gemBS will read the configuration file and make a list of all
commands required to produce any missing outputs from the command.
gemBS only reads the user supplied configuration files when the prepare subcommand is run. For all other gemBS commands
the configuration is read from
gemBS.mp. If it is wished to change the run parameters, for example, it will be
necessary to re-run
gemBS prepare. Normally this can be done without danger in the middle of an analysis, but
large scale changes may not produce the expected results and it might be necessary in some cases to delete the output files
and restart the pipeline from the beginning.
By default gemBS will only consider commands where the required input files are already present. If there are either no commands that need running for a command (because the output files are already present) or there are no comamnds that are ready to run (because the input files are not yet present or are out of date), then the command will do nothing.
This behaviour can be changed, however, by using the
--all option gemBS will
consider all commands. For example, if the user issues the command
gemBS call, gemBS will only consider running the caller for
samples where the indexed BAM file is present in the filesyste,.
However, if the
--all options is used i.e.,
gemBS --all call
then gemBS will try to run the caller for all samples, running mapping
commands if required to generate the BAM files. In this way, ny
issuing the command
gemBS --all extract, gemBS will run all
pending mapping, calling and extraction steps.
gemBS --all extract will not
generate the QC reports, and
gemBS --all report will not run
the extract sub command. To make things simpler, therefore, the
whole pipeline can be run using the single command
which will run all pending commands for all of the stages. For the
run sub-command it is not necessary to use the
as it is assumed that the user wishes to run all pending commands.
Of course, if the run subcommand is used, there is no possiblity of over-riding
the configuration parameters set when the prepare subcommand was run, so it is
essential that these are correct for all gemBS stages.
Normally, gemBS uses the dates that files were created to decide if an
output file is out of date and needs to be re-made. In the same way
as for Unix Makefiles, an output file is considered out of date if its
creation date is after the creation date of any of its input files. This behaviour can be changed using the
--ignore-times option - if this option is used then a file will not be re-made as long as it is present on the filesystem.
A last gemBS option on the same topic is
--ignore-status. If this
optiont is set then gemBS will re-run all commands irrespective of
whether the output files already exist on the filesystem. This can be
used to completely re-run the pipeline from scratch (although it is
generally simpler to delete the output file directories).
The gemBS configuration file has multiple options that indicate the gemBS the computational requirements of the different stages in terms of memory requirements, CPU cores to use or numner of jobs to run simultaneously (see 3.2. Pipeline configuration file)
The tasks currently running are stored in a file in the configuration
.gemBS/ in the working directory) called
gemBS_tasks.json. This is how gemBS coordinates tasks across
multiple computer nodes - a gemBS process will not launch a task if
the task is currently listed in this file.
Single computer node¶
As described in the previous section, gemBS creates a list of commands or tasks to run. These tasks are placed in a dependency tree that describes how the tasks rely on each other (for example if there are mapping and calling tasks present, the calling tasks will depend on the results of the mapping tasks. If multiple tasks are available to be run at the same time, the total memory and cores available on the machine are divided amongst the tasks. This is done in the following way:
Let max_cores be the number of available cores on the machine let requested_cores and requested_mem be the cores and memory requested for this task in the configuration file or on the command line if requested_cores has been set, let n_cores = min(requested_cores, max_cores) else if n_jobs has been set, let n_cores = max(max_cores / n_jobs, 1) else if task is mapping or indexing, let n_cores = max_cores else n_cores = 1 Let total_mem be the amount of memory on the machine if requested_mem has been set, let mem = min(requested_mem, total_mem) else if task is mapping or indexing, let mem = total_mem else mem = 0
The requirements for the task to run are that n_cores cores and mem memory should be available. The total requirements of all jobs running on the machine are added up, and the task will be started if enough resources are available. Note that by default, mapping and indexing tasks will use the entire machine. This is because (a) these tasks require a lot of memory and (b) they are very parallel and so can take advantage of a very large number of cores. If memory requirements have not been set then memory will not be taken into account (by setting mem to 0), so it is possible that the computer will run out of memory. It is therefore advisable to set some sensible default value for memory requirements in the configuration file.
Multiple computer nodes¶
Multiple computer nodes can be run from the same working directory and
the tasks will be shared between the different nodes. This is
achieved by the file
gemBS_tasks.json mentioned ealier. To
prevent multiple nods modifying this file at the same time, gemBS uses
a file based locking system (Posix locking can not be relied on as it
is not always available on shared filesystems). When a gemBS wants to
access the task file for reading or writing they first try and create
a symbolic link called
.gemBS_tasks.json#gemBS_lock. If creating
the link fails then the link already exists (or the directory is can not be written to,
but gemBS checks for this when starting up). In this case the process
waits a while and then tries again. If the link creation succeeds
then the task file and be accessed and/or modified, and then the link
is deleted to allow another process to access the file. The same locking process is also used
for the configuration file (in this case the lock link name is
The symbolic links used as locks have dummy targets that don’t
point to a file (normally), but instead give the host name and pid
of the gemBS process that is locking the file, and which can be
ls -l. Normally these files should be
automatically removed, but in the case of a crash or filesystem
failure etc., it is possible that the files may be left. In this
case they should be removed before re-running gemBS.
The task file is also normally removed after all gemBS processes
running from the directory have completed. This is more frequent
occurence than a lock file being left, although it should still be
a rare occurence. The clear subcommand can be used to remove a
left over task file. This will fail if the lock is present; in
this case using the
--force option will remove the lock and the
task file. Using this command during normal gemBS operation will
have undefined effects, and could result in multiple gemBS
processes trying to create the same output files.