3. Pipeline Configuration¶
To run gemBS efficiently and in line with your experimental design, it is important to configure the pipeline correctly. With some care, the same pipeline can be reused for subsequent analyses with little modification, making large-scale analyses simple and standardization of results possible.
There are two aspects of pipeline configuration:
Sample metadata (in the metadata CSV file)
Pipeline parameters (in the pipeline configuration file)
The first step in using gemBS for an analysis is to set up the configuration files and then run the gemBS prepare command. This will take the information from the sample metadata and configuration files and save it as a MessagePack file. For all subsequent commands, gemBS will take the parameters from the MessagePack file as defaults for the pipeline (although these can be overridden for each command). Setting up the configuration files correctly means that the subsequent pipeline steps can be performed supplying little or no additional information.
3.1. Sample metadata file¶
Introduction¶
The sample metadata provides the connection between samples and datafiles, so that gemBS knows which datafiles belong to a given sample, which datafiles form pairs, and so on. It also gives additional information about the sample or the experiment, such as the sequencing library or sequencing platform, that can be added to the outputs as annotations and help to keep track of samples.
The basic format of the sample metadata file is a CSV (comma separated) text file with a header line describing the data columns. There is a great deal of flexibility in how the file is set up, with only two columns being required: the sample barcode and the dataset id. The sample barcode is a unique identifier for a sample. As it is used as part of the name of the results files, it is advised to avoid punctuation, spaces, accents etc. in the barcodes. The dataset id refers to a FASTQ file, a pair of FASTQ files or a BAM file containing the sequence data for the mapping stage. As for the sample barcode, the dataset id should be a unique identifier and it is advisable to avoid punctuation etc.
A name and a description column can optionally be included to provide a longer name and description for the sample. These fields will be used in the BAM, BCF and BED output files to identify the samples.
The file path to the data files can also be included in the metadata file. These paths can be absolute or can be relative to the working directory or to a directory specified in the parameter configuration file. If the file paths are not given then gemBS will look for files matching patterns based on the dataset id (i.e., if the dataset id is AA0C2BZ and the files AA0C2BZ_1.fastq.gz and AA0C2BZ_2.fastq.gz exist in the sequencing data directories (defined in the parameter configuration file) then gemBS will assume that this is a pair of FASTQ files from a paired end sequencing experiment for that dataset). In general it is simpler and safer to provide the paths in the metadata file. Note that gemBS will transparently handle files compressed with gzip, bgzip and bzip2, so there is no need to uncompress files before analysis.
The type of the dataset can also be specified: whether the dataset is single or paired end, an interleaved FASTQ file, a BAM file etc. This can often be inferred from the data, but on occasion it is useful to specify it explicitly.
Paired end data can either be presented as one line in the metadata file (with both reads being described if the paths are given) or as two separate lines; in the latter case it is necessary to include a column with the read end information, to say which file is read 1 and which is read 2.
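As an illustrative sketch, a paired dataset described on two lines might look like the following (the filenames are assumptions, and the accepted values for the read end column should be checked against your gemBS version):

```
barcode,dataset,file,readend
A001XS1,C2TG4ACXX_7_3,C2TG4ACXX_7_3_1.fastq.gz,1
A001XS1,C2TG4ACXX_7_3,C2TG4ACXX_7_3_2.fastq.gz,2
```

Both lines share the same dataset id, so gemBS can associate the two files with a single paired end dataset.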
Finally, information on the sequencing library identifier, sequencing platform and sequencing centre can also be present in the metadata file; this information is added to the read group (RG) header in the BAM file.
Column descriptions¶
To make setting up the metadata file a little simpler (and hopefully easier to remember) the case of the headers and any underscores within the headers are ignored, so you are free to use samplebarcode, sampleBarcode, or Sample_barcode as you prefer. There are also multiple versions accepted for most of the headings. The table below gives the possible column headers and the different accepted versions. Note that only the first two headers are required. Unrecognized headers are ignored.
Name | Required? | Alternates | Comments
---|---|---|---
barcode | yes | sampleid, samplebarcode |
dataset | yes | fileid, fli |
samplename | no | name, sample |
library | no | lib, libbarcode, librarybarcode | Library identifier
file | no | location | Single file
file1 | no | location1, end1 | Paired files (read 1)
file2 | no | location2, end2 | Paired files (read 2)
readend | no | end | Read end (Paired on two lines)
type | no | filetype | Paired, Interleaved, Single, BAM
bisulfite | no | bis, bisulphite | Dataset is bisulfite converted (Yes or No)
description | no | desc |
centre | no | center | Sequencer centre
platform | no | | Sequencer platform
Example metadata file¶
Barcode,Library,file_id,end_1,end_2,name
A001XS1,419H_BS,C2TG4ACXX_7_3,C2TG4ACXX_7_3_1.fastq.gz,C2TG4ACXX_7_3_2.fastq.gz,S1
A001XS1,420H_BS,C2TG4ACXX_7_4,C2TG4ACXX_7_4_1.fastq.gz,C2TG4ACXX_7_4_2.fastq.gz,S1
A001XS2,421H_BS,C2TG4ACXX_7_5,C2TG4ACXX_7_5_1.fastq.gz,C2TG4ACXX_7_5_2.fastq.gz,S2
A001XS2,422H_BS,C2TG4ACXX_7_6,C2TG4ACXX_7_6_1.fastq.gz,C2TG4ACXX_7_6_2.fastq.gz,S2
A001XS3,423H_BS,C2TG4ACXX_7_7,C2TG4ACXX_7_7_1.fastq.gz,C2TG4ACXX_7_7_2.fastq.gz,S3
A001XS3,424H_BS,C2TG4ACXX_7_8,C2TG4ACXX_7_8_1.fastq.gz,C2TG4ACXX_7_8_2.fastq.gz,S3
A001XS4,425H_BS,C2TG4ACXX_7_9,C2TG4ACXX_7_9_1.fastq.gz,C2TG4ACXX_7_9_2.fastq.gz,S4
A001XS4,426H_BS,C2TG4ACXX_7_10,C2TG4ACXX_7_10_1.fastq.gz,C2TG4ACXX_7_10_2.fastq.gz,S4
A001XS5,427H_BS,C2TG4ACXX_7_11,C2TG4ACXX_7_11_1.fastq.gz,C2TG4ACXX_7_11_2.fastq.gz,S5
A001XS6,556H_BS,C355EACXX_6_1,C355EACXX_6_1_1.fastq.gz,C355EACXX_6_1_2.fastq.gz,S6
The example file above describes 6 samples with barcodes A001XS1, …, A001XS6, all of which have paired end sequence data available. The first four samples have two FASTQ pairs per sample (so two datasets), whereas the last two have only one dataset each.
While this layout allows most workflows to be accommodated, on some occasions it can be useful to have a more flexible system. Taking advantage of the ability of the GEM3 mapper to read from streams, it is possible to specify shell commands rather than filenames, or to have gemBS read its input from stdin rather than from a file. This is done by providing a pipe character (‘|’) as the last non-space character of a filename, and it works for both single and paired input files. This approach could be used to trim reads before mapping or to enable reading of files compressed with a compressor that gemBS does not handle natively. The example below shows how gemBS could read from xz compressed FASTQ files (assuming that the xz package is installed and available). Note that you will have to provide the full path for the files as gemBS will not modify the filenames in this mode.
Barcode,Library,file_id,end_1,end_2,name
A001XS1,419H_BS,C2TG4ACXX_7_3,"xzcat fastq/C2TG4ACXX_7_3_1.fastq.xz |","xzcat fastq/C2TG4ACXX_7_3_2.fastq.xz |",S1
A001XS1,420H_BS,C2TG4ACXX_7_4,"xzcat fastq/C2TG4ACXX_7_4_1.fastq.xz |","xzcat fastq/C2TG4ACXX_7_4_2.fastq.xz |",S1
A001XS2,421H_BS,C2TG4ACXX_7_5,"xzcat fastq/C2TG4ACXX_7_5_1.fastq.xz |","xzcat fastq/C2TG4ACXX_7_5_2.fastq.xz |",S2
A001XS2,422H_BS,C2TG4ACXX_7_6,"xzcat fastq/C2TG4ACXX_7_6_1.fastq.xz |","xzcat fastq/C2TG4ACXX_7_6_2.fastq.xz |",S2
Non-bisulfite processing¶
While gemBS is intended for the analysis of bisulfite sequencing, it can be useful to include non-bisulfite (i.e., normal WGS) sequence data in the analyses. While the non-bisulfite data provides no information on methylation status, it does provide information on the genome sequence, and so can be used to improve the genotype calling. To analyze non-bisulfite data it is necessary to define an additional column bisulfite in the metadata file. If a dataset has a 1, True, Yes or a missing value in this column then the dataset is assumed to be bisulfite converted, while if the column value is 0, False or No then the dataset is assumed to be non-bisulfite converted. The calling process will automatically take account of mixtures of bisulfite and non-bisulfite converted data.
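For example, a sample with both a bisulfite library and a conventional WGS library could be described as follows (barcodes and filenames are illustrative):

```
barcode,dataset,file1,file2,bisulfite
A001XS1,A001XS1_BS,A001XS1_BS_1.fastq.gz,A001XS1_BS_2.fastq.gz,Yes
A001XS1,A001XS1_WGS,A001XS1_WGS_1.fastq.gz,A001XS1_WGS_2.fastq.gz,No
```

Both datasets belong to the same sample barcode, so their reads will be combined during calling, with only the bisulfite dataset contributing methylation information.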
3.2. Pipeline configuration file¶
Introduction¶
The parameter configuration file is a loosely structured text file in which each line is either a section header (the name of a section enclosed in square brackets, like [mapping]), a key - value pair or a comment (everything on a line after a # character is treated as a comment). The majority of the keys correspond to parameters in gemBS. Any key specified in the configuration file becomes a default for the analysis, which can be overridden by supplying parameters to the gemBS commands. All keys (and section names) are case insensitive. Configuration keys can also represent directory paths, and are used to define the locations for the input data for the pipeline and for where the analysis files are generated.
Note
The order of statements in the configuration files is important. Variables should be defined before they are referenced, and new variable definitions will overwrite existing ones.
Configuration sections¶
The sections correspond to different parts of the pipeline (available sections are mapping, calling, extract, report, index, dbsnp and DEFAULT - everything not in one of the other sections is implicitly in DEFAULT). Sections with names not in this list will be ignored by gemBS. The utility of the sections is to allow different parameter settings to be used for different parts of the pipeline. Any key set in the DEFAULT section applies to all sections (unless overridden).
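As a minimal sketch (values are illustrative), a key set before any section header lands in DEFAULT and applies to every stage unless a section overrides it:

```
threads = 4      # DEFAULT: applies to all stages

[calling]
threads = 8      # overrides the default for the calling stage only
```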
Example configuration file¶
The example below is a section of a configuration file that illustrates many of the features:
# Directory definitions
#
# Note that we can use environment variables ($HOME in this case)
# and previously defined variables can be used in subsequent definitions
base = ${HOME}/data
ref_dir = ${base}/reference
reference = ${ref_dir}/sacCer3.fa.gz
extra_references = ${ref_dir}/conversion_control.fa.gz
index_dir = ${base}/index
sequence_dir = ${base}/fastq/@SAMPLE # @SAMPLE and @BARCODE are special
bam_dir = ${base}/mapping/@BARCODE # variables that are replaced with
bcf_dir = ${base}/calls/@BARCODE # the sample name or barcode being
extract_dir = ${base}/extract/@BARCODE # worked on during gemBS operation
report_dir = ${base}/report
# For using CRAM
populate_cache = true
make_cram = true
# General project info
project = yeast_test
species = Yeast
# Default parameters
threads = 8
cores = 4
jobs = 4
[mapping] # Begin mapping specific section
memory = 16G
cores = 16
merge_cores = 8
merge_memory = 8G
# Set names of spiked in conversion controls
underconversion_sequence = NC_001416.1
overconversion_sequence = NC_001604.1
# Include a standard configuration file with parameters
# defined for the IHEC WGBS pipeline
include IHEC_standard.conf
[calling] # Begin calling specific section
cores = 4
threads = 4
memory = 6G
left_trim = 5
right_trim = 0
[extract] # extract specific section
cores = 4
threads = 8
memory = 6G
make_cpg = True
make_non_cpg = True
make_bedmethyl = True
Advanced usage¶
Variable interpolation¶
Note that values can include previously defined variables, written as ${variable} or $(variable). This can simplify the file by, as in this example, using the variable base to store the base directory for the analysis. The names of such ‘intermediate variables’ are not important; gemBS only takes account of the keys that it knows about (such as the directory keys discussed below) and the others are ignored. Values can also include environment variables (unless they clash with a previously defined key in the configuration file). For environment variables the case is respected; this is the only exception to the rule that keys are case insensitive.
Including files¶
One important feature displayed in the above file is that a configuration file can include another. gemBS will look for the included file relative to the current directory. If it is not found there, it will look in the directory gemBS/etc/gemBS_configs/ relative to the installation directory for gemBS. In this directory are located standard configurations for particular types of experiments (WGBS, RRBS, PBAT etc.) or for standardized analysis pipelines (IHEC, ENCODE) that make it simpler to use the correct parameter settings (see 11. IHEC standard analyses).
Analysis directories¶
The other important feature displayed is the definition of the analysis directories. The six directories listed above control where gemBS looks for input data and where it writes its output files. If these are not set then output files will be generated in the current directory, which is not always convenient! Note that several of the directories have the special variables @SAMPLE or @BARCODE in their names. This (optional) feature allows sample specific directories, which can be useful if a project contains many samples. The @SAMPLE and @BARCODE variables are resolved while gemBS is running into the sample name or barcode that gemBS is currently processing.
The roles of the six directories are given in the table below.
Directory | Role
---|---
index_dir | Location where the index files (and other associated files) are generated
sequence_dir | Location for the sequence data files (FASTQ, BAM)
bam_dir | Location where the BAM files (and other associated files) are generated
bcf_dir | Location where the BCF files (and other associated files) are generated
extract_dir | Location where the bed, bigBed and bigWig files are generated
report_dir | Location where the report files are generated
Note that if absolute paths for the sequence data files are given in the metadata file then the sequence_dir variable is ignored. This allows gemBS to locate input files correctly even when they are not all in the same place but are scattered over the file system (although it is often easier to simply create a directory with symbolic links to all the datafiles so that they can all be found in the same place).
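One simple way to set this up (the source paths here are purely illustrative) is to gather the scattered files into a single directory of symbolic links and point sequence_dir at that directory:

```shell
# Gather FASTQ files scattered over the file system into one directory.
# ln -s happily creates links even if the targets do not yet exist.
mkdir -p fastq
ln -sf /data/run1/C2TG4ACXX_7_3_1.fastq.gz fastq/
ln -sf /data/run2/C2TG4ACXX_7_3_2.fastq.gz fastq/
# sequence_dir in the configuration file can then simply be set to 'fastq'
```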
FASTA reference description¶
The last section of the configuration file that must be explained is the definition of the reference, index and contig_sizes files, as this is critical and gemBS will not run until these have been specified. gemBS can create the index and contig_sizes file from the reference, but in this case it must have enough information to know where you want them to be stored and how you want them to be named. The short section of configuration file below illustrates the options for defining the reference and index locations:
# Required section
#
# Note that the index and contig_sizes files are generated from the
# reference file if they do not already exist
#
reference = references/sacCer3_control.fa.gz
# The contigs in the extra references file(s) are indexed for mapping
# but are not passed to the later stages of the pipeline (calling,
# extraction etc.). This is where control sequences should be placed.
extra_references = references/conversion_controls.fa.gz
index = $(HOME)/example/references/sacCer3.BS.gem
reference_basename = sacCer3
index_dir = indexes
#dbSNP_index = $(HOME)/dbSNP/dbSNP_gemBS.idx
#dbSNP_files = $(HOME)/dbSNP/bed_chr*.bed.gz
The path to the multi-fasta reference file must be provided explicitly
and the file must exist (it can be compressed with gzip, bgzip or
bzip2). The index and other associated files will be created (unless they
already exist) in index_dir if this has been set. If not, then
index_dir will be set to the base directory of the index file if
that has been set; otherwise it will be set to the working directory.
The names of the index files, if not specified in the configuration
file, are derived from the reference file name by, for example,
removing the extensions and adding _BS.gem for the index. If the
variable reference_basename exists then the name for the index files
is formed from this instead.
Note the extra_references variable. The file(s) listed here should contain additional FASTA sequences that will be appended to the main reference file. Any contigs in these additional reference files will only be used for mapping - they will not appear in the methylation analyses or the remainder of the downstream analysis. The purpose of the extra_references variable is to list spiked-in control sequences (for example under- or over-conversion controls) that should be processed by the mapper to obtain the QC reports, but should not form part of the methylation analysis.
Whatever rule is followed to generate the file names and locations, gemBS will check for the existence of the files. If they are not present then the gemBS index command will need to be run to generate the files.
The example also shows the specification (although commented out in this case) of the dbSNP index location and input files. This will be explained more fully in the index section, but if the dbSNP index key is defined in the configuration file then it will be treated like the GEM3 index: its existence will be checked for, and if it does not exist it will need to be generated (using the gemBS index command) before methylation calling can be performed.
Note
If non-bisulfite converted data is to be processed, then the
variable nonbs_index can be used to set the name of the
non-bisulfite index. If this variable is not set (and the metadata
file indicates that at least one dataset is not bisulfite
converted) then the name for the non-bisulfite index will be
generated using the same rules as for the regular bisulfite index,
without the .BS suffix.
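For instance, following the earlier example, the non-bisulfite index could also be set explicitly (the path is an assumption):

```
# Explicit name for the non-bisulfite index; if omitted, gemBS would
# derive indexes/sacCer3.gem from the bisulfite index name sacCer3.BS.gem
nonbs_index = indexes/sacCer3.gem
```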
CRAM support¶
The use of either BAM or CRAM files for storing the aligned reads is
supported by gemBS. Both BAM and CRAM are compressed formats, and tools
such as samtools can convert between the two. CRAM is a
more recent format and achieves better compression (to a large part
through the use of reference based compression), and is to be
preferred if possible. Instructing gemBS to use CRAM mapping files
requires simply setting the key make_cram to true in either
the DEFAULT or index section of the configuration file.
As CRAM makes use of reference based compression, the sequence data is
stored as the differences between the sequence reads and the
reference. It follows that to read a CRAM file it is necessary to
have the original reference used in the generation of the file. While
using gemBS there is no need to do anything special apart from setting
the make_cram key. However, to read the generated CRAM files with
external utilities, the reference needs to be supplied. For
samtools there are multiple ways that this can be done:

1. The original reference file can be supplied on the command line (i.e., samtools view -T my_reference.gemBS.ref). Note that in this case the reference file should be the one generated by gemBS (which should have the suffix gemBS.ref).
2. The cache directory for htslib (which by default is $HOME/.cache/hts_ref) is searched to find the sequence.
3. An attempt is made to download the sequence from the European Nucleotide Archive. After a successful download the reference will be stored in the cache.
If option (3) is taken then you will see a delay the first time an attempt is made to view a contig from a CRAM file as the reference is fetched and stored. Of course, this will only work for an ‘official’ reference sequence. For working with, for example, a draft sequence from a de novo assembly project, it will be necessary to have a local copy of the reference.
gemBS provides a mechanism to make this simpler. If the configuration
key populate_cache is set to true in the DEFAULT or
index section then the sequences used in the CRAM file will be
stored in the local cache if not already present. In this way,
samtools view
can be used to read the generated CRAM files without
the user having to change anything.
Job scheduling¶
(Almost) every job run by gemBS can be affected by a set of configuration keys or command line options that control how the job runs and how it is scheduled. Configuration keys should be placed in the appropriate section (i.e., to affect the index subcommand the keys should be in the index section of the configuration file). Some of the options affect how the job runs on multi-core computers and are passed on to the command itself, the most common being threads, which tells the job how many threads to use. The other options determine how many jobs can be run in parallel (jobs, cores, memory) - for details of how these interact see Scheduling jobs. Finally, the keys cores, memory and time are passed on to the pipeline manager (i.e., Slurm) if this is being used.
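A sketch of how these keys might be combined for the mapping stage (the values are illustrative, and the time format shown follows the Slurm convention):

```
[mapping]
threads = 16      # passed to the mapper itself
jobs = 2          # at most two mapping jobs run in parallel
cores = 16        # cores requested per job (also passed to Slurm)
memory = 32G      # memory requested per job (also passed to Slurm)
time = 12:00:00   # wall-time limit (passed to Slurm)
```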
3.3. Configuring gemBS: the gemBS prepare command¶
After the configuration files have all been made, the prepare command should be used to check the configurations and generate the JSON configuration file.
USAGE:
gemBS prepare [FLAGS] [OPTIONS] --config <CONFIG> <--text-metadata <TEXT_METADATA>|--lims-cnag-json <JSON_METADATA>>
FLAGS:
-p, --populate-cache Populate reference cache if required (for CRAM)
-h, --help Prints help information
OPTIONS:
-c, --config <CONFIG> Text config file with gemBS parameters
-t, --text-metadata <TEXT_METADATA> Sample data in csv format
-l, --lims-cnag-json <JSON_METADATA> Use JSON metadata file from CNAG lims
By default this will create the MessagePack file in the hidden directory .gemBS in the working directory, as .gemBS/gemBS.mpn.
In addition, the prepare command creates a bgzipped reference file
by concatenating the different reference files together (i.e., the
reference file and the extra_references file). Indexes to
this file are also created using the samtools faidx command.
These are not the indexes for the GEM3 mapper, but allow quick access
to individual contig sequences and are used by bs_call. In
addition, a contig_sizes file is generated that gives the size of
each contig sequence in the reference file (not including the contigs
in the extra_references files), and a contig_md5 file that has
the md5 signatures of all contigs in the combined reference sequence
file.
Note
There is nothing to stop the user from re-running the prepare command during an analysis. If the sample metadata files or the directory variables have been changed then this could result in incomplete analysis results or failures of the pipeline. The prepare command should normally be run once at the start of the analysis and not repeated once the main parts of the pipeline (mapping, calling etc.) have been started.