3. Pipeline Configuration

To run gemBS efficiently and in line with your experimental design, it is important to configure the pipeline correctly. With some care, the same pipeline can be reused for subsequent analyses with little modification, making large-scale analyses simple and standardization of results possible.

There are two aspects of pipeline configuration:

  • Sample metadata (in the metadata CSV file)

  • Pipeline parameters (in the pipeline configuration file)

The first step in using gemBS for an analysis is to set up the configuration files and then run the gemBS prepare command. This will take the information from the sample metadata and configuration files and save it as a MessagePack file. For all subsequent commands, gemBS will take the parameters from the MessagePack file as defaults for the pipeline (although these can be over-ridden for each command). Setting up the configuration files correctly means that the subsequent pipeline steps can be performed supplying little or no additional information.

3.1. Sample metadata file

Introduction

The sample metadata provides the connection between samples and datafiles, so that gemBS knows which datafiles belong to a given sample, which datafiles form pairs etc. It also provides additional information about the sample or the experiment, such as the sequencing library or sequencing platform, that can be added to the outputs as annotations and helps to keep track of samples.

The basic format of the sample metadata file is a CSV (comma separated) text file with a header line describing the data columns. There is a great deal of flexibility in how the file is set up, with only two columns being required: the sample barcode and the dataset id. The sample barcode is a unique identifier for a sample. As it is used as part of the name of the results files, it is advisable to avoid punctuation, spaces, accents etc. in the barcodes. The dataset id refers to a FASTQ file, a pair of FASTQ files or a BAM file containing the sequence data for the mapping stage. As for the sample barcode, the dataset id should be a unique identifier and it is advisable to avoid punctuation etc.
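For example, a minimal metadata file using only the two required columns might look like this (the barcodes and dataset ids here are only illustrative):

```csv
barcode,dataset
A001XS1,C2TG4ACXX_7_3
A001XS1,C2TG4ACXX_7_4
A001XS2,C2TG4ACXX_7_5
```

With this minimal layout, gemBS will search the sequencing data directories for files matching each dataset id (see below).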

A name and a description column can optionally be included to provide a longer name and a description for the sample. These fields will be used in the BAM, BCF and BED output files to identify the samples.

The file path to the data files can also be included in the metadata file. These paths can be absolute, or relative to the working directory or to a directory specified in the parameter configuration file. If the file paths are not given then gemBS will look for files matching patterns based on the dataset id: for example, if the dataset id is AA0C2BZ and the files AA0C2BZ_1.fastq.gz and AA0C2BZ_2.fastq.gz exist in the sequencing data directories (defined in the parameter configuration file), then gemBS will assume that this is a pair of FASTQ files from a paired-end sequencing experiment for that dataset. In general it is simpler and safer to provide the paths in the metadata file. Note that gemBS will transparently handle files compressed with gzip, bgzip and bzip2, so there is no need to uncompress files before analysis.
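As a sketch of the file-matching behaviour described above (this mimics the naming convention only; it is not gemBS's actual implementation, and the single-end fallback shown is an assumption), a dataset id can be resolved to its FASTQ files like this:

```python
from pathlib import Path

def find_fastq_pair(dataset_id: str, seq_dir: str):
    """Illustrative sketch (not gemBS's actual code): look for
    <dataset_id>_1.fastq.gz / <dataset_id>_2.fastq.gz and treat them
    as a paired-end dataset if both exist."""
    r1 = Path(seq_dir) / f"{dataset_id}_1.fastq.gz"
    r2 = Path(seq_dir) / f"{dataset_id}_2.fastq.gz"
    if r1.exists() and r2.exists():
        return "PAIRED", r1, r2
    # Single-end naming below is an assumption for illustration only
    single = Path(seq_dir) / f"{dataset_id}.fastq.gz"
    if single.exists():
        return "SINGLE", single
    return None
```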

The type of the dataset can also be specified: whether the dataset is single or paired end, an interleaved FASTQ file, a BAM file etc. This can often be inferred from the data, but on some occasions it is useful to specify it explicitly.

Paired end data can be presented either as one line in the metadata file (with both reads described if the paths are given) or as two separate lines; in the latter case it is necessary to include a column with the read end information, saying which file is read 1 and which is read 2.

Finally, information on the sequencing library identifier, sequencing platform and sequencing centre can also be present in the metadata file; this information is added to the read group (RG) header in the BAM file.
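Putting the pieces above together, a sketch of how a paired-end dataset might be described on two lines using a readend column, with library and platform annotations (all values here are illustrative):

```csv
barcode,dataset,readend,file,library,platform
A001XS1,C2TG4ACXX_7_3,1,C2TG4ACXX_7_3_1.fastq.gz,419H_BS,ILLUMINA
A001XS1,C2TG4ACXX_7_3,2,C2TG4ACXX_7_3_2.fastq.gz,419H_BS,ILLUMINA
```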

Column descriptions

To make setting up the metadata file a little simpler (and hopefully easier to remember) the case of the headers and any underscores within the headers are ignored, so you are free to use samplebarcode, sampleBarcode, or Sample_barcode as you prefer. There are also multiple versions accepted for most of the headings. The table below gives the possible column headers and the different accepted versions. Note that only the first two headers are required. Unrecognized headers are ignored.

Name         Required?  Alternates                        Comments
-----------  ---------  --------------------------------  ------------------------------------------
barcode      yes        sampleid, samplebarcode
dataset      yes        fileid, fli
samplename   no         name, sample
library      no         lib, libbarcode, librarybarcode   Library identifier
file         no         location                          Single file
file1        no         location1, end1                   Paired files (read 1)
file2        no         location2, end2                   Paired files (read 2)
readend      no         end                               Read end (paired data on two lines)
type         no         filetype                          Paired, Interleaved, Single, BAM
bisulfite    no         bis, bisulphite                   Dataset is bisulfite converted (Yes or No)
description  no         desc
centre       no         center                            Sequencing centre
platform     no                                           Sequencing platform

Example metadata file

Barcode,Library,file_id,end_1,end_2,name
A001XS1,419H_BS,C2TG4ACXX_7_3,C2TG4ACXX_7_3_1.fastq.gz,C2TG4ACXX_7_3_2.fastq.gz,S1
A001XS1,420H_BS,C2TG4ACXX_7_4,C2TG4ACXX_7_4_1.fastq.gz,C2TG4ACXX_7_4_2.fastq.gz,S1
A001XS2,421H_BS,C2TG4ACXX_7_5,C2TG4ACXX_7_5_1.fastq.gz,C2TG4ACXX_7_5_2.fastq.gz,S2
A001XS2,422H_BS,C2TG4ACXX_7_6,C2TG4ACXX_7_6_1.fastq.gz,C2TG4ACXX_7_6_2.fastq.gz,S2
A001XS3,423H_BS,C2TG4ACXX_7_7,C2TG4ACXX_7_7_1.fastq.gz,C2TG4ACXX_7_7_2.fastq.gz,S3
A001XS3,424H_BS,C2TG4ACXX_7_8,C2TG4ACXX_7_8_1.fastq.gz,C2TG4ACXX_7_8_2.fastq.gz,S3
A001XS4,425H_BS,C2TG4ACXX_7_9,C2TG4ACXX_7_9_1.fastq.gz,C2TG4ACXX_7_9_2.fastq.gz,S4
A001XS4,426H_BS,C2TG4ACXX_7_10,C2TG4ACXX_7_10_1.fastq.gz,C2TG4ACXX_7_10_2.fastq.gz,S4
A001XS5,427H_BS,C2TG4ACXX_7_11,C2TG4ACXX_7_11_1.fastq.gz,C2TG4ACXX_7_11_2.fastq.gz,S5
A001XS6,556H_BS,C355EACXX_6_1,C355EACXX_6_1_1.fastq.gz,C355EACXX_6_1_2.fastq.gz,S6

The example file above describes 6 samples with barcodes A001XS1, …, A001XS6, all of which have paired end sequence data available. The first four samples have two pairs of FASTQ files per sample (so two datasets each), whereas the last two have only one dataset each.

While this layout allows most workflows to be accommodated, it can be useful on some occasions to have a more flexible system. Taking advantage of the ability of the GEM3 mapper to read from streams, it is possible to specify shell commands rather than filenames, or to have gemBS read its input from stdin rather than from a file. This is done by providing a pipe character (‘|’) as the last non-space character of a filename, and works for both single and paired input files. This approach could be used to trim reads before mapping, or to enable reading of files compressed with a compressor that gemBS does not handle natively. The example below shows how gemBS could read from xz compressed FASTQ files (assuming that the xz package is installed and available). Note that you will have to provide the full path for the files, as gemBS will not modify the filenames in this mode.

Barcode,Library,file_id,end_1,end_2,name
A001XS1,419H_BS,C2TG4ACXX_7_3,"xzcat fastq/C2TG4ACXX_7_3_1.fastq.xz |","xzcat fastq/C2TG4ACXX_7_3_2.fastq.xz |",S1
A001XS1,420H_BS,C2TG4ACXX_7_4,"xzcat fastq/C2TG4ACXX_7_4_1.fastq.xz |","xzcat fastq/C2TG4ACXX_7_4_2.fastq.xz |",S1
A001XS2,421H_BS,C2TG4ACXX_7_5,"xzcat fastq/C2TG4ACXX_7_5_1.fastq.xz |","xzcat fastq/C2TG4ACXX_7_5_2.fastq.xz |",S2
A001XS2,422H_BS,C2TG4ACXX_7_6,"xzcat fastq/C2TG4ACXX_7_6_1.fastq.xz |","xzcat fastq/C2TG4ACXX_7_6_2.fastq.xz |",S2

Non-bisulfite processing

While gemBS is intended for the analysis of bisulfite sequencing, it can be useful to include non-bisulfite (i.e., normal WGS) sequence data in the analyses. While the non-bisulfite data provides no information on methylation status, it does provide information on the genome sequence, and so can be used to improve the genotype calling. To analyze non-bisulfite data it is necessary to define an additional column bisulfite in the metadata file. If a dataset has a 1, True, Yes or a missing value in this column then the dataset is assumed to be bisulfite converted, while if the column value is 0, False or No then the dataset is assumed to be non-bisulfite converted. The calling process will automatically take account of mixtures of bisulfite and non-bisulfite converted data.
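A sketch of a metadata file mixing bisulfite and non-bisulfite datasets for one sample (the dataset ids and file names here are hypothetical):

```csv
barcode,dataset,file1,file2,bisulfite
A001XS1,BS_run1,BS_run1_1.fastq.gz,BS_run1_2.fastq.gz,Yes
A001XS1,WGS_run1,WGS_run1_1.fastq.gz,WGS_run1_2.fastq.gz,No
```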

3.2. Pipeline configuration file

Introduction

The parameter configuration file is a loosely structured text file in which each line is either a section header (the name of a section enclosed in square brackets, like [mapping]), a key-value pair, or a comment (everything on a line after a # character is treated as a comment). The majority of the keys correspond to parameters in gemBS. Any key specified in the configuration file becomes a default for the analysis, which can be over-ridden by supplying parameters to the gemBS commands. All keys (and section names) are case insensitive. Configuration keys can also represent directory paths, and are used to define the locations of the input data for the pipeline and of the directories where the analysis files are generated.

Note

The order of statements in the configuration files is important. Variables should be defined before they are referenced, and new variable definitions will overwrite existing ones.

Configuration sections

The sections correspond to different parts of the pipeline (the available sections are mapping, calling, extract, report, index, dbsnp and DEFAULT - everything not in one of the other sections is implicitly in DEFAULT). Sections with names not in this list are ignored by gemBS. The purpose of the sections is to allow different parameter settings to be used for different parts of the pipeline. Any key set in the DEFAULT section applies to all sections (unless over-ridden).

Example configuration file

The example below is a section of a configuration file that illustrates many of the features:

# Directory definitions
#
# Note that we can use environment variables ($HOME in this case)
# and previously defined variables can be used in subsequent definitions

base = ${HOME}/data

ref_dir = ${base}/reference
reference = ${ref_dir}/sacCer3.fa.gz
extra_references = ${ref_dir}/conversion_control.fa.gz

index_dir = ${base}/index
sequence_dir = ${base}/fastq/@SAMPLE    # @SAMPLE and @BARCODE are special
bam_dir = ${base}/mapping/@BARCODE      # variables that are replaced with
bcf_dir = ${base}/calls/@BARCODE        # the sample name or barcode being
extract_dir = ${base}/extract/@BARCODE  # worked on during gemBS operation
report_dir = ${base}/report

# For using CRAM
populate_cache = true
make_cram = true

# General project info
project = yeast_test
species = Yeast

# Default parameters
threads = 8
cores = 4
jobs = 4

[mapping] # Begin mapping specific section

memory = 16G
cores = 16
merge_cores = 8
merge_memory = 8G

# Set names of spiked in conversion controls
underconversion_sequence = NC_001416.1
overconversion_sequence = NC_001604.1

# Include a standard configuration file with parameters
# defined for the IHEC WGBS pipeline
include IHEC_standard.conf

[calling] # Begin calling specific section

cores = 4
threads = 4
memory = 6G
left_trim = 5
right_trim = 0

[extract] # extract specific section

cores = 4
threads = 8
memory = 6G

make_cpg = True
make_non_cpg = True
make_bedmethyl = True

Advanced usage

Variable interpolation

Note that values can include previously defined variables using the form ${variable}. This can simplify the file by, as in this example, using the variable base to store the base directory for the analysis. The names of such ‘intermediate variables’ are not important; gemBS only takes account of the keys that it knows about (such as the directory keys discussed below) and the others are ignored. Values can also include environment variables (unless they clash with a previously defined key in the configuration file). For environment variables the case is respected; this is the only exception to the rule that keys are case insensitive.

Including files

One important feature displayed in the above file is that a configuration file can include another. gemBS will look for the included file relative to the current directory. If it is not found there, it will look in the directory gemBS/etc/gemBS_configs/ relative to the installation directory for gemBS. In this directory are located standard configurations for particular types of experiments (WGBS, RRBS, PBAT etc.) or for standardized analysis pipelines (IHEC, ENCODE) that make it simpler to use the correct parameter settings (see 11. IHEC standard analyses).

Analysis directories

The other important feature displayed is the definition of the analysis directories. The six directories listed above control where gemBS looks for input data and where it writes its output files. If these are not set then output files will be generated in the current directory, which is not always convenient! Note that several of the directories have the special variables @SAMPLE or @BARCODE in their names. This (optional) feature allows sample specific directories, which can be useful if a project contains many samples. The @SAMPLE and @BARCODE variables are resolved at run time to the name or barcode of the sample that gemBS is currently processing.
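As an illustration of how the special variables resolve (a sketch of the behaviour described above, not gemBS's actual code):

```python
def resolve_dir(template: str, sample_name: str, barcode: str) -> str:
    """Expand the special @SAMPLE / @BARCODE variables for the sample
    currently being processed (illustrative only)."""
    return template.replace("@SAMPLE", sample_name).replace("@BARCODE", barcode)

# With bam_dir = ${base}/mapping/@BARCODE and barcode A001XS1,
# output for that sample goes under ${base}/mapping/A001XS1
```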

The roles of the six directories are given in the table below.

Directory     Role
------------  -------------------------------------------------------------------------
index_dir     Location where the index files (and other associated files) are generated
sequence_dir  Location for the sequence data files (FASTQ, BAM)
bam_dir       Location where the BAM files (and other associated files) are generated
bcf_dir       Location where the BCF files (and other associated files) are generated
extract_dir   Location where the bed, bigBed and bigWig files are generated
report_dir    Location where the report files are generated

Note that if absolute paths for the sequence data files are given in the metadata file then the sequence_dir variable is ignored. If the input files are not all in the same place but scattered over the file system then this allows gemBS to locate them correctly (although it is often easier to simply create a directory with symbolic links to all the datafiles so that they can all be found in the same place).

FASTA reference description

The last section of the configuration file that must be explained is the definition of the reference, index and contig_sizes files, as this is critical: gemBS will not run until these have been specified. gemBS can create the index and contig_sizes files from the reference, but in this case it must have enough information to know where you want them stored and how you want them named. The short section of configuration file below illustrates the options for defining the reference and index locations:

# Required section
#
# Note that the index and contig_sizes files are generated from the
# reference file if they do not already exist
#
reference = references/sacCer3_control.fa.gz

# The contigs in the extra references file(s) are indexed for mapping
# but are not passed to the later stages of the pipeline (calling,
# extraction etc.). This is where control sequences should be placed.

extra_references = references/conversion_controls.fa.gz

index = $(HOME)/example/references/sacCer3.BS.gem
reference_basename = sacCer3

index_dir = indexes

#dbSNP_index = $(HOME)/dbSNP/dbSNP_gemBS.idx
#dbSNP_files = $(HOME)/dbSNP/bed_chr*.bed.gz

The path to the multi-fasta reference file must be provided explicitly and the file must exist (it can be compressed with gzip, bgzip or bzip2). The index and other associated files will be created (unless they already exist) in index_dir if this has been set. If not, then index_dir will be set to the directory of the index file if that has been set; otherwise it will be set to the working directory. The names of the index files, if not specified in the configuration file, are derived from the reference file name by, for example, removing the extensions and adding .BS.gem for the index. If the variable reference_basename exists then the name for the index files is formed from this instead.

Note the extra_references variable. The file(s) listed here should contain additional FASTA sequences that will be appended to the main reference file. Any contigs in these additional reference files will only be used for mapping - they will not appear in the methylation analyses or the rest of the downstream pipeline. The purpose of the extra_references variable is to list the sequences of spiked-in control sequences (for example under- or over-conversion controls) that should be processed by the mapper to obtain the QC reports, but should not form part of the methylation analysis.

Whatever rule is followed to generate the file names and locations, gemBS will check for the existence of the files. If they are not present then the gemBS index command will need to be run to generate the files.

The example also shows the specification (commented out in this case) of the dbSNP index location and input files. This will be explained more fully in the index section, but if the dbSNP index key is defined in the configuration file then it is treated like the GEM3 index: its existence will be checked, and if it does not exist it will need to be generated (using the gemBS index command) before methylation calling can be performed.

Note

If non-bisulfite converted data is to be processed, then the variable nonbs_index can be used to set the name of the non-bisulfite index. If this variable is not set (and the metadata file indicates that at least one dataset is not bisulfite converted) then the name for the non-bisulfite index will be generated using the same rules as for the regular bisulfite index without the .BS suffix.

CRAM support

gemBS supports either BAM or CRAM files for storing the aligned reads. Both are compressed formats, and tools such as samtools can convert between the two. CRAM is a more recent format that achieves better compression (in large part through the use of reference-based compression), and is to be preferred if possible. Instructing gemBS to use CRAM mapping files requires simply setting the key make_cram to true in either the DEFAULT or index section of the configuration file.

As CRAM makes use of reference-based compression, the sequence data is stored as the differences between the sequence reads and the reference. It follows that to read a CRAM file it is necessary to have the original reference used to generate the file. While using gemBS there is no need to do anything special apart from setting the make_cram key. However, to read the generated CRAM files with external utilities, the reference needs to be supplied. For samtools there are multiple ways that this can be done.

  1. The original reference file can be supplied on the command line (e.g., samtools view -T my_reference.gemBS.ref). Note that in this case the reference file should be the one generated by gemBS (which should have the suffix gemBS.ref).

  2. The cache directory for htslib (which by default is $HOME/.cache/hts-ref) is searched to find the sequence.

  3. An attempt to download the sequence from the European Nucleotide Archive will be made. After a successful download the reference will be stored in the cache.

If option (3) is taken then you will see a delay the first time an attempt is made to view a contig from a CRAM file, as the reference is fetched and stored. Of course, this will only work for an ‘official’ reference sequence. When working with, for example, a draft sequence from a de novo assembly project, it will be necessary to have a local copy of the reference.

gemBS provides a mechanism to make this simpler. If the configuration key populate_cache is set to true in the DEFAULT or index section then the sequences used in the CRAM file will be stored in the local cache if not already present. In this way, samtools view can be used to read the generated CRAM files without the user having to change anything.

Job scheduling

(Almost) every job run by gemBS can be affected by a set of configuration keys or command line options that determine how the job runs and how it is scheduled. Configuration keys should be placed in the appropriate section (i.e., to affect the index subcommand the keys should be in the index section of the configuration file). Some of the options affect how the job runs on multi-core computers and are passed on to the command itself, the most common being threads, which tells the job how many threads to use. The other options determine how many jobs can be run in parallel (jobs, cores, memory) - for details of how these interact see Scheduling jobs. Finally, the keys cores, memory and time are passed on to the pipeline manager (e.g., Slurm) if one is being used.

3.3. Configuring gemBS: the gemBS prepare command

After the configuration files have all been made, the prepare command should be used to check the configurations and generate the MessagePack configuration file.

USAGE:
  gemBS prepare [FLAGS] [OPTIONS] --config <CONFIG> <--text-metadata <TEXT_METADATA>|--lims-cnag-json <JSON_METADATA>>

FLAGS:
  -p, --populate-cache    Populate reference cache if required (for CRAM)
  -h, --help              Prints help information

OPTIONS:
  -c, --config <CONFIG>                   Text config file with gemBS parameters
  -t, --text-metadata <TEXT_METADATA>     Sample data in csv format
  -l, --lims-cnag-json <JSON_METADATA>    Use JSON metadata file from CNAG lims

By default this will create the MessagePack file in the hidden directory .gemBS in the working directory as .gemBS/gemBS.mpn.

In addition, the prepare command creates a bgzipped reference file by concatenating the different reference files together (i.e., the reference file and the extra_references file). Indexes to this file are also created using the samtools faidx command. These are not the indexes for the GEM3 mapper; they allow quick access to individual contig sequences and are used by bs_call. In addition, a contig_sizes file is generated that gives the size of each contig sequence in the reference file (excluding the contigs in the extra_references files), and a contig_md5 file that has the md5 signatures of all contigs in the combined reference sequence file.
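To make the content of the contig_sizes file concrete, the following sketch computes the same information (contig name and length) from a FASTA file; it is an illustration of what the file records, not gemBS's implementation:

```python
import gzip

def contig_sizes(fasta_path: str) -> dict:
    """Return {contig_name: length} for a (possibly gzipped) FASTA file.
    This mirrors what a contig_sizes file records (illustrative only)."""
    opener = gzip.open if fasta_path.endswith(".gz") else open
    sizes = {}
    name = None
    with opener(fasta_path, "rt") as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]  # contig name is the first word of the header
                sizes[name] = 0
            elif name is not None:
                sizes[name] += len(line)
    return sizes
```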

Note

There is nothing to stop the user from re-running the prepare command during an analysis; however, if the sample metadata files or the directory variables have changed, this could result in incomplete analysis results or pipeline failures. The prepare command should normally be run once at the start of the analysis and not repeated once the main parts of the pipeline (mapping, calling etc.) have been started.