6. Methylation and Variant Calling

Methylation and sequence variant calling are performed simultaneously in the gemBS pipeline with the bs_call program. Taking account of the sequence quality scores, under and over conversion probabilities and the observed bases, bs_call finds the most likely genotype (maximizing the likelihood over the unknown methylation parameter for each genotype) and then reports the methylation conditional on the called genotype. Standard quality scores for variant calling are provided, and the native output format is a VCF/BCF file with additional custom fields that provide methylation specific information (the specifications of the custom format used by gemBS can be seen here).

The bs_call program generates several format fields in the VCF output files that are related to the quality of the genotype calls:

Field ID

Description

GL

Genotype Likelihood

GQ

Phred scaled conditioanl genotype quality

DP

Read Depth (non converted reads only)

MQ

RMS Mapping Quality

QD

Quality By Depth (Variant quality / read depth (non-converted reads only))

FS

Phred scaled log p-value from Fishers exact test of strand bias

Apart from GOF, these are taken from GATK best practices for variant filtering. By default, bs_call applies a series of filters based on these fields, and the FILTER column in the VCF file is set accordingly. The filters applied are:

Filter ID

Description

q20

Genotype Quality below 20

qd2

Quality By Depth below 2

fs60

Fisher Strand above 60

mq40

RMS Mapping Quality below 40

The default filter cutoffs are also from GATK best practices. Note that only soft filtering is applied - no variants are removed from the VCF file as a result of the filtering. However, the generation of the gemBS style methylation report files and SNP extraction files only use sites that pass all filters. This does not apply to the generation of the ENCODE style methylation report files (see 7. Methylation & SNP Extraction).

6.1. Introduction

Standard pre-processing

By default, bs_call uses only uniquely mapping reads for the analysis (as determined by a minimum threshold on the MAPQ score provided by the mapper). Duplicate reads can be determine by bs_call as reads with the same start and end points (and the same library barcode if this information is available), but bs_call will also take account of the duplicate flag in the BAM files. Reads marked or detected by bs_call as being duplicates are optionally removed prior to calling. For read pairs that overlap, the overlapping section will be collapsed into a single copy. Sequence bases with quality less than a given cutoff will be removed priori to further analysis. Read pairs that do not form proper pairs will normally be discarded. Bases at the extremities of reads can be marked so that they are not used for methylation prediction (this is to avoid methylation artifacts due to end repair during sample preparation. All of the above behaviours can be modified using the program parameters.

Chromosomes, contigs and pools

Separate chromosomes or contigs can be processed by bs_call independently. If multiple computing resources are available then performing the calling in parallel can speed up the processing (although users should look at the section on Practical Considerations where this is discussed more thoroughly). For organisms with well studied genomes, such as humans, most of the reference sequence is organized into chromosomes. However for less well studied organisms, instead of a small number of chromosomes there can be tens of thousands of small contigs. Even in the human GRCh38 build there are almost 200 separate contigs in total. Performing variant calling by contig can therefore be problematic due to the number of contigs, and often it is advisable to either call all contigs together for a single sample, or group small contigs together into pools for analysis purposes.

To get around this, gemBS uses a contig pool strategy that is managed with almost no user intervention. Contigs smaller than a certain cutoff (configuration key contig_pool_limit; set by default to 25Mb) are pooled together to simplify processing. Contigs / chromosomes larger than this are processed separately (technically they are in a pool of one). gemBS generates the pools itself, and performs all calling by pool. By default the user does not have to take care of this. As with the mapping stage, gemBS will normally call all available pools on all available samples, merging the results for completed samples into a single sample BCF file. Calling can be restricted, however, by sample or pool.

Note

Note that by default, gemBS will perform calling on all contigs in the main reference file (any contigs from files in the extra_references list will not be passed to the calling analysis). If the main reference contains additional sequences then these should be excluded from calling by setting the omit_contigs variable.

Conversion rates

The calling process uses the estimated under and over conversion rates in the statistical model of genotype and methylation. These values should be supplied to the caller using the conversion configuration key. The value should be one or two comma separated numbers between 0 and 1 with the under and over conversion rates (in that order). If a single value is present it is assumed to be the under conversion rate, and the over conversion is set to a default (0.05). If the special value auto is used for the conversion key, gemBS will try and obtain sample specific estimates of the conversion rates from the mapping results. For this to work (a) there have to be conversion control sequences in the sequencing library, (b) these have to have been included in the reference sequence and (c) the names of the control sequences have to had been supplied to gemBS when the mapping was performed. If one or both of the values can not be estimated then gemBS will use defaults (0.01 and 0.05) for the missing value(s).

Parallelization

Most computers today are multi-core, and a typical workstation (or even a laptop) can have tens of threads available. The mapping process is inherently parallelizable as each read pair can be processed independently. Running the GEM3 mapper with 24 cores will be close to 24 times faster (ignoring startup times) than running on a single core. Calling is less easy to parallelize; on a single contig bs_call can benefit from 3-4 cores, but more than this has little effect on the running time. Even with 4 cores, the speed up is generally much less than a four fold increase over the single threaded case. For the calling, therefore, it can make sense to run multiple pools on the same compute host. For this purpose gemBS has the jobs configuration key; during the calling gemBS will run jobs separate copies of the caller simultaneously.

As with the mapping process, by default it is possible to run gemBS on multiple compute hosts on the same analysis directory with multiple jobs running on each host, and the multiple instances will cooperate with each other, dividing the available tasks between them.

6.2. The calling command: gemBS call

As with the mapping command, by default the calling requires no parameters and will try and perform all available calling and merging tasks. By default, all of the information about input and output files, calling parameters etc. come from the original configuration files. The command line options for the call command are present to enable restriction of the mapping to a given sample or contig pool, or to perform ad-hoc adjustment of the calling parameters, but it is intended that normally no parameters are required.

The command template is as follows (with all parameters being optional):

USAGE:
 gemBS call [FLAGS] [OPTIONS]

FLAGS:
 -u, --keep-duplicates          Do not merge duplicate reads
 -U, --ignore-duplicate-flag    Ignore duplicate flag from SAM/BAM files
 -k, --keep-unmatched           Do not discard reads that do not form proper pairs
 -r, --remove                   Remove individual BAMs after merging
     --md5                      Perform calculation of md5 sums only
     --no-md5                   Do not automatically calculate md5 sums
     --index                    Perform indexing of final BCF only
     --no-index                 Do not automatically calculate index of final BCF
 -1, --haploid                  Force genotype calls to be haploid
     --auto-conversion          Try to calculate conversion rates from data
     --merge                    Perform merge BCF step only
     --no-merge                 Do not automatically merge BAMs
     --benchmark_mode           Omit dates etc. from output to make comparison simpler
 -h, --help                     Prints help information

OPTIONS:
 -n, --sample <SAMPLE>...                 Name of sample
 -b, --barcode <BARCODE>...               Barcode of sample
 -q, --mapq-threshold <MAPQ_THRESHOLD>    Threshold for MAPQ scores
 -Q, --qual-threshold <QUAL_THRESHOLD>    Threshold for base quality scores
 -g, --right-trim <BASES [,BASES]>...     Bases to trim from right of read pair (give 2 values for read specific
                                          values)
 -f, --left-trim <BASES, [,BASES]>...     Bases to trim from left of read pair (give 2 values for read specific
                                          values)
 -d, --tmp-dir <PATH>                     Temporary directory to perform sorting operations
 -t, --threads <THREADS>                  Number of threads for calling pipeline
     --call-threads <THREADS>             Number of threads for methylation caller
     --merge-threads <THREADS>            Number of threads for merging
 -j, --jobs <JOBS>                        Number of parallel jobs
 -c, --cores <CORES>                      Number of cores for a job (default - available cores / jobs)
 -T, --time <TIME>                        Time required for a job
 -m, --memory <MEMORY>                    Memory required for a job (default - available memory / jobs)
 -e, --species <SPECIES>                  Species name
 -D, --dbsnp-index <FILE>                 dbSNP processed index file
 -C, --conversion <UNDER OVER>...         set conversion rates (under over)
 -R, --reference-bias <BIAS>              set bias to reference homozygote
     --pool <POOL>...                     Contig pool for methylation calling

Possible parameters for the gemBS call command are given below. Where present, the Key column shows the keys that control this parameter in the configuration files.

Section

Key

Description

map

bam_dir

Location of input BAM files

call

bcf_dir

Location of output BCF files

index

reference

Location of FASTA reference file

dbsnp

dbSNP_index

Location of dbSNP index file

call

mapq_threshold

Minimum cutoff for read MAPQ

call

qual_threshold

Minimum cutoff for base quality

call

left_trim

Left trim x bases from reads before calling

call

right_trim

Right trim x bases from reads before calling

call

species

species

call

keep_duplicates

Keep duplicates

call

keep_improper_pairs

Keep unmatched (non-paired) reads

call

remove_individual_bcfs

Remove contig pools BCFs after merge step

call

haploid

Force all genotype calls to haploid

call

reference_bias

Multiplicative weight for homozygous reference genotype probability

call

conversion

under-, over- conversion rates.

call

contig_list

Restrict calling to a list of contigs (actually, to all pools than contain these contigs)

index

contig_sizes

File with list of contigs and sizes

call

contig_pool_limit

Maximum sizes limit for small contigs (to be pooled)

call

call_threads

Number of threads for call stage

call

merge_threads

Number of threads for merge stage

call

merge_jobs

Number of jobs for merge stage

call

merge_cores

Number of core for merge stage

call

merge_memory

Amount of memory for merge stage

call

merge_time

Amount of memory for merge stage

cores

Number of cores

threads

Number of threads

jobs

Number of parallel jobs

memory

Memory usage

time

Time required

The calling process (as with mapping) involves multiple steps: The calling itself using bs_call, combining the BCF files from the individual contig pools, using bcftools concat, indexing the merged bcf (unlike with samtools, bcftools does not have the option to generate the index at the same time as combining the files), and calculating the md5 signature of the BCF. While the combine step can use multiple threads, there is not much benefit with providing more that 2-3 threads. The resources used just for the merge step can thereore be set individually using the merge_threads, merge_cores, merge_jobs, merge_cores and merge_time keys, which can allow for more efficient scheduling of jobs. The indexing of the BCFs with bcftools index and calculation of md5 signatures are not be multithreaded and the number of threads for these steps are set to 1.

As with the map command, normally the call command is run without any parameters so that all necessary calling, combining and indexing is performed. However it is possible to specify the individual steps, This is intended for integration in pipelines to allow, for example, the calling of different samples / contig pools to be performed using different computational nodes.