12. Worked example

Once gemBS has been installed, the user can work through the example analysis below to ensure that the pipeline is set up correctly and to demonstrate all of the pipeline steps in sequence.

12.1. Example dataset

The test dataset is an extraction of a set of reads from a yeast WGBS experiment. To decrease the running times and size of the dataset, each of the 6 samples only has two million read pairs (2 x 100bp). The dataset can be found here. Once downloaded, the dataset should be unpacked and you should cd into the data directory.

tar -zxvf gemBS_example.tar.gz
cd gemBS_example

The directory has two files: example.conf and example.csv, and four directories: reference, fastq and extract_gold and report_gold. The two files are the parameter configuration file and sample metadata file respectively. The contents of the example.csv are:

Barcode,Name,Dataset,File1,File2,Library
A001XS1,sample1,sample1_data_a,sample1_data_a_1.fastq.gz,sample1_data_a_2.fastq.gz,LB1S23
A001XS1,sample1,sample1_data_b,sample1_data_b_1.fastq.gz,sample1_data_b_2.fastq.gz,LB1S23
A001XS2,sample2,sample2_data_a,sample2_data_a_1.fastq.gz,sample2_data_a_2.fastq.gz,LB2T41
A001XS2,sample2,sample2_data_b,sample2_data_b_1.fastq.gz,sample2_data_b_2.fastq.gz,LB2T41
A001XS3,sample3,sample3_data_a,sample3_data_a_1.fastq.gz,sample3_data_a_2.fastq.gz,LB7X42
A001XS3,sample3,sample3_data_b,sample3_data_b_1.fastq.gz,sample3_data_b_2.fastq.gz,LB7X42
A001XS4,sample4,sample4_data_a,sample4_data_a_1.fastq.gz,sample4_data_a_2.fastq.gz,LB3B63
A001XS4,sample4,sample4_data_b,sample4_data_b_1.fastq.gz,sample4_data_b_2.fastq.gz,LB3B63
A001XS5,sample5,sample5_data,sample5_data_1.fastq.gz,sample5_data_2.fastq.gz,LB7D76
A001XS6,sample6,sample6_data,sample6_data_1.fastq.gz,sample6_data_2.fastq.gz,LB3M18

It can be seen that there are six samples, four with two pairs of fastq files and two with a single pair. For more details of the format of this file see: 3.1. Sample metadata file

The contents of the example.conf file are:

# Required section
#
# Note that the index and contig_sizes files are generated from the
# reference file if they do not already exist
#
reference = reference/sacCer3.fa.gz

#
# This is for the control sequences.  The contigs here will
# be used for mapping, but will not be passed to the caller
#
extra_references = reference/conversion_control.fa.gz

index_dir = indexes

#
# The variables below define the directory structure for the results files
# This structure should not be changed after the analysis has started
#

base = .
sequence_dir = ${base}/fastq/@SAMPLE
bam_dir = ${base}/mapping/@BARCODE
bcf_dir = ${base}/calls/@BARCODE
extract_dir = ${base}/extract/@BARCODE
report_dir = ${base}/report

#
# End of required section
#


# The following are optional

project = yeast_test
species = Yeast

threads = 16
jobs = 1
cores = 16

[index]
populate_cache = true

[mapping]

cores = 16
memory = 30G
merge_cores = 8
merge_memory = 8G
underconversion_sequence = NC_001416.1
overconversion_sequence = NC_001604.1
make_cram = true

include IHEC_standard.conf

[calling]

# Contigs smaller than contig_pool_limit are pooled.
# This would normally be higher (25Mb is the default)
# but for the example we are setting this lower so the
# pooling strategy can be observed (otherwise all the contigs
# for yeast fit in one pool!)
contig_pool_limit = 5000000
auto_conversion = true
threads = 16
cores = 4
memory = 5G
left_trim = 5

[extract]
threads = 8
cores = 4
memory = 6G

[report]
threads = 8
cores = 8
memory = 8G
# pdf = yes

[md5sum]
threads = 1
cores = 1
memory = 1G

The values for threads and jobs should be adjusted according to the computing resources available. For more details of the format of this file see: 3.2. Pipeline configuration file.

12.2. Preparing the pipeline and creating the genome index

After the parameter files are ready, the pipeline can be set up and the genome index generated (this only needs to be done the first time the pipeline is run; re-running the index command when the index is already created has no effect).

gemBS prepare -c example.conf -t example.csv
gemBS index

12.3. Running the analysis pipeline

When the index has been generated, the main part of the analysis can be performed, which can be done by simple performing the commands below:

gemBS map
gemBS call
gemBS extract
gemBS report

or, more simply:

gemBS run

If everything completes successfully, then the methylation files in the extract and report directories can be compared against the previously computed results files that can be found in the *_gold directories.