5. Mapping Bisulfite Converted Sequences

Mapping in gemBS is performed using GEM3 in bisulfite mode. In common with many bisulfite mappers, the aligner processes the reads pre and post-alignment, by default converting C’s to T’s in the first read of a pair or for a single read, and G’s to A’s in the second read of the pair. The converted reads are aligned to a similarly converted reference. In contrast to most other bisulfite pipelines, the pre- and post-mapping conversion steps are performed within the mapper on a read by read basis, almost eliminating the cost penalty normally associated with this step. This, combined with the best in class raw mapping performance of the GEM3 mapping engine, and the collection of QC mapping stats within the mapper avoiding the need to pre-process the output BAMs, provide a rapid and efficient mapping of bisulfite converted reads.

GEM3 allows as much as 10% error rate when mapping. Unwanted adaptor sequences and low quality base pairs are excluded by soft trimming of the 5’ and/or 3’ read ends.Due to this efficient self-trimming it is rarely necessary to perform an adaptor trimming step prior to mapping.

The basic input format is FASTQ, but the pipeline will also accept BAM files (sorted on read name) and can read from input streams (see this section on the sample metadata format for an example of how this works. The output format of the mapping stage is BAM files sorted on genome coordinates.

Note

If pre-alignment trimming is desired then this can easily be accommodated within the pipeline. Either the raw FASTQ files are processed by the trimmer application producing filtered FASTQs that are passed to gemBS or, if the trimmer application can output to stdout, the ability of gemBS to use input from a pipe can be used so that the intermediate trimmed files do not need to be stored on disk but piped directly into gemBS.

5.1. The mapping command: gemBS map

The default mapping mode of gemBS is to try to perform all mapping and merging steps that it knows about that have not already been performed successfully. In this mode all of the information about input and output files, mapping parameters etc. comes from the JSON file generated from the original configuration files. Command line parameters of the gemBS map command are available to enable restriction of the mapping to a given sample or a given dataset, or to perform ad-hoc adjustment of the mapping parameters, but it is intended that normally no parameters are required.

The command template is therefore as follows (with all parameters being optional):

USAGE:
 gemBS map [FLAGS] [OPTIONS]

FLAGS:
 -p, --paired-end            Input data is paired end
 -r, --remove                Remove individual BAMs after merging
 -R, --reverse-conversion    Assume G2A conversion on read 1 and C2T on read 2
 -s, --read-non-stranded     Treat library as non-stranded
     --non-bs                Map as regular (non-bisulfite) data
     --bs                    Map as bisulfite data
     --md5                   Perform calcuation of md5 sums only
     --no-md5                Do not automatically calculate md5 sums
     --merge                 Perform merge BAM step only
     --no-merge              Do not automatically merge BAMs
     --benchmark-mode        Omit dates etc. from output to make comparison simpler
 -h, --help                  Prints help information

OPTIONS:
 -D, --dataset <DATASET>...                        Dataset to be mapped
 -n, --sample <SAMPLE>...                          Name of sample to be mapped
 -b, --barcode <BARCODE>...                        Barcode of sample to be mapped
 -c, --cores <CORES>                               Number of cores for a job (default - all available cores)
 -m, --memory <MEMORY>                             Memory required for a job (default - all available memory)
 -T, --time <TIME>                                 Time required for a job
 -d, --tmp-dir <PATH>                              Temporary directory to perform sorting operations
 -t, --threads <THREADS>                           Number of threads for mapping pipeline
     --map-threads <THREADS>                       Number of threads for GEM3 mapper
     --sort-threads <THREADS>                      Number of threads for sorting
     --sort-memory <MEMORY>                        Amount of memory per sort thread
     --merge-threads <THREADS>                     Number of threads for merging
 -F, --type <FILE_TYPE>
         Type of data file [possible values: PAIRED, SINGLE, INTERLEAVED, STREAM, BAM]

 -u, --underconversion-sequence <SEQUENCE_NAME>    Name of underconversion sequencing control
 -v, --overconversion-sequence <SEQUENCE_NAME>     Name of overconversion sequencing control

In the default operation mode of gemBS, multiple instances of gemBS map can be run simultaneously in the same directory, and they will cooperate in processing the available tasks. These gemBS instances can be running on different compute hosts as long as the job directory is shared (see Scheduling jobs for more details).

Note

To map bisulfite converted sequences the mapper needs to know whether the sequencing library is stranded or not, and if stranded whether read 1 is from the positive bisulfite strand and read 2 from the negative strand or visa versa. The default (as this is the most common) is for the library to be stranded with read 1 being from the positive strand and read 2 from the negative. If read 1 is from the negative strand then this can be specified using the reverse-conversion command line flag of the reverse_conversion configuration key, and if the library is non-stranded this can be set using the read-non-stranded command line flag or the non_stranded configuration key. The non stranded setting can be used for any library, but it does come with a slight performance cost and reduction in mapping efficiency.

If the mapping results are very poor (many unmapped reads) it is possible that the library is not specified correctly. Normally this can be seen by inspection of the FASTQ files. If stranded then the read 1 file should be either C depleted (with only a few % of bases being C) and T enriched or G depleted and A enriched, and similarly for read 2. In the default library configuration, read 1 should be C depleted and read 2 G depleted; if the reverse is seen then the reverse conversion flag should be specified. If, instead, both C and G are depleted to about half the normal value, while A and G are enriched for reads 1 and 2 then the non stranded option should be specified.

Possible configuration file parameters for the gemBS map command are given below.

Section

Key

Description

index

index

Location of GEM3 index

map

sequence_dir

Directory location of input sequence files

map

bam_dir

Directory where BAM files will be created

map

tmp_dir

Directory for temporary files (used for BAM sorting)

map

reverse_conversions

Assume G2A conversion on read 1 and C2T on read 2

map

non-stranded

Automatically selects the proper C->T and G->A read conversions based on the level of Cs and Gs in the read

map

remove_individual_bams

Remove individual BAM files after merging

map

underconversion_sequence

Name of unmethylated bisulfite control sequence

map

overconversion_sequence

Name of methylated bisulfite control sequence

map

map_threads

Threads for mapping step

map

sort_threads

Threads for sort BAMs step

map

sort_memory

Memory for sort BAMs step

map

merge_threads

Threads for merge BAMs step

map

merge_jobs

Number of jobs for merge stage

map

merge_cores

Number of core for merge stage

map

merge_memory

Amount of memory for merge stage

map

merge_time

Amount of memory for merge stage

cores

Number of cores

threads

Number of threads

jobs

Number of parallel jobs

memory

Memory usage

time

Time required

gemBS keeps track of what steps have been completed, so in normal usage if you repeat the gemBS map command it will exit without executing any commands. In some situations (for example, if gemBS is not allowed to terminate normally) gemBS might think that some commands are still running when in fact they are not. See Scheduling jobs for information on how to recognize and deal with this situation.

The mapping steps consists of 3-4 distinct phases: mapping, sorting, merging (if required) and the generation of md5 signatures of the output files. The mapping and sorting occure together; GEM3 generates SAM output that is piped directly into samtools sort. Both GEM3 and samtools sort are multithreaded, and the number of threads can either be set for both with the threads key or individually using map_threads and sort_threads. Note that samtools sort allocates memory for each thread (768MB by default). If a lot of threads are used this can add up to a large amount, so it is possible to set the per thread memory used for the sorting with the sort_memory key. The merging step is required if multiple datasets are present for a single sample. The merging is performed using samtools merge. While the merge step can use multiple threads, there is not much benefit with providing more that 2-3 threads. The resources used just for the merge step can thereore be set individually using the merge_threads, merge_cores, merge_jobs, merge_cores and merge_time keys, which can allow for more efficient scheduling of jobs. The calculation of md5 signatures can not be multithreaded and the number of threads for this step is set to 1. There is no explicit step to index the output BAM files: this is performed during the merge step or the sort step if no merging is required.

Note

The default location for the temporary files used for BAM sorting is set to the same directory where the BAMs are generated. Be careful about changing this. Use of the system /tmp directory (which is the normal place used for temporary sorting files) can give problems due to the large size of some BAM files. There must be at least as much space as the largest BAM file (and if several BAMS are generated in parallel the requirements can be even higher).

Normally the map command is run without any parameters so that all necessary mapping & merging is performed. However it is possible to specify the individual steps, This is intended for integration in pipelines to allow, for example, the mapping of different datasets to be performed using different computational nodes.