7. Methylation & SNP Extraction

The BCF output files from gemBS are a rich source of information, but for most analyses it is more convenient to work with summary files, focusing on particular aspects of the data contained within the BCFs.

The gemBS pipeline can produce either or both of two sets of extracted methylation calls, gemBS style and ENCODE style. Between the two sets are minor differences in format and larger differences in how they are filtered, and they are intended for different downstream analyses. The pipeline can also produce SNP calls, which are intended to be used to allow identification and verification of samples as well as for investigating the relationship between SNPs and DNA methylation.

7.1. gemBS style output files

The underlying philosophy of the gemBS style output files is to produce as set of cytosine sites, either CpG or non-CpG, where the genotype is known with a certain degree of confidence. Of these, typically only sites that are called as being homozygous C (on either strand) are considered for subsequent analysis. This approach is suitable for high quality (mainly high coverage) samples where it is possible to get good genotype estimates across most of the genome, and allows analyses of differential methylation to be made with high confidence that any differences are not due to sequence variation at the genomic location. Cytosines in both CpG and non-CpG context can be extracted and information on the genotype calling (with calling confidence) is provided to allow genome context to be studied.

The standard gemBS filtered output is a BED3+8 format with cytosines in CpG context (either one line per cytosine or one line per CpG) where the sequence context has been called as homozygous CC/GG with high confidence. Cytosines in non-CpG context can also be output (with an emphasis on sites where gemBS can determine with high confidence that the genotype is not CpG (homozygous or heterozygous). The filtering on genotype can be reduced to allow less confident calls and/or heterozygous sites.

gemBS BED3+8 methylation output

Field

Description

1

Contig or chromosome name.

2

Start position (0 offset)

3

End position (1 offset)

4

Reference call

5

Genotype call

6

Flags (genotype calling confidence + QC flags)

7

Methylation estimate (model based)

8

Non-converted base count

9

Converted base count

10

Total bases supporting genotype call

11

Total bases

7.2. ENCODE style output files

The ENCODE style files are designed to have a standardized format that is as similar as possible across different epigenetic assays, should apply to experiments with variable qualities and specifications (i.e., high coverage WGBS, low coverage WGBS, RRBS), and should be simple to view on a genome browser. The ENCODE files produce methylation estimates for cytosines in CpG, CHG and CHH contexts (based on the reference genome), and do not refer to genotypes called from the sequence data. The basic output files are described here and are in bedMethyl format (BED9+5); bigBed versions (suitable for viewing on a genome browser) are also produced. In addition, a bigWig file with the methylation level at all reference cytosines in CpG context covered in the experiment is also produced.

ENCODE BED9+5 bedMethyl format

Field

Description

1

Contig or chromosome name.

2

Start position (0 offset)

3

End position (1 offset)

4

Name of item

5

Score from 0 - 1000 (capped methylation informative coverage)

6

Strand: +, - or ?

7

Start of where display should be thick

8

End of where display should be thick

9

Colour value (RGB)

10

Methylation informative coverage

11

Percentage of reads showing methylation

12

Reference genotype

13

Sample genotype

14

Quality score for genotype call

7.3. SNP extraction

VCF/BCF files were designed to hold sequence variation, and there are many pipeline for analyzing SNPs from VCF/BCF files, and these should work without modification on the output BCFs from gemBS. However, there is the functionality built in to gemBS to extract a set of genotypes from a set of standard SNPs (which are by default, the list of SNPs used to form the dbSNP index (4.3. dbSNP index file). For each of the SNPs in the dbSNP index, a line in the VCF will be outputted if there is some sequence coverage at that position. In this way, any SNP position with sufficient coverage will have a genotype call, irrespective of whether the site is homozygous or not, or whether it contains a cytosine. The same effect could be achieved using a gVCF format, but given that gemBS outputs a line on average every two base positions (every base that is C or G), the gVCF format provides little or no benefit over simply listing every base position.

The list of SNPs can be restricted using the –snp-list option or snp_list configuration key, which provides the name of a file containing a list of SNP ids (one per line) that will be included in the output file. An alternate dbSNP index can also be provided using the –snps-db option of snp_db configuration key (although be aware that for a SNP where the reference is not C or G, there will only be a line in the VCF if the site is not homozygous reference).

The output format is a 4 column tab separated text file (with a header line), where the 4 columns are contig, position, SNP id and genotype. The file name for the output file is the sample barcode with _snps.txt.gz appended.

7.4. The methylation extraction command: gemBS extract

As with the other gemBS commands, gemBS extract will by default use the parameters from the configuration file and perform the filtering for all samples. The samples to be extracted and the output files to be generated can be specified using the command line parameters. Extraction can be performed in parallel on a single machine using the -j JOBS option, and if run on multiple machines the different gemBS instances will cooperate correctly in sharing the tasks (in the default configuration)

USAGE:
 gemBS extract [FLAGS] [OPTIONS]

FLAGS:
 -s, --strand-specific           Output separate lines in CpG file for each strand
 -W, --bigwig-strand-specific    Output separate bigWig files for each strand
 -H, --allow-het                 Allow both homozygous and heterozygous sites
 -C, --cpg                       Output gemBS bed with CpG sites
 -N, --non-cpg                   Output gemBS bed with non-CpG sites
 -B, --bed-methyl                Output ENCODE standard output (bedMethyl, bigBed and bigWig)
 -S, --snps                      Output SNPs
     --no-md5                    Do not automatically calculate md5 sums
 -h, --help                      Prints help information

OPTIONS:
 -n, --sample <SAMPLE>...                 Name of sample
 -b, --barcode <BARCODE>...               Barcode of sample
 -t, --threads <THREADS>                  Number of threads for extraction pipeline
 -j, --jobs <JOBS>                        Number of parallel jobs
 -c, --cores <CORES>                      Number of cores for a job (default - available cores / jobs)
 -T, --time <TIME>                        Time required for a job
 -m, --memory <MEMORY>                    Memory required for a job (default - available memory / jobs)
 -Q, --qual-threshold <QUAL_THRESHOLD>    Threshold for base quality scores
 -q, --phred_threshold <PHRED>            Minimum threshold for genotype PHRED score
 -I, --min-inform <N>                     Minimum threshold for informative reads
 -M, --min-nc <N>                         Minimum number of non-converted reads for non-CpG sites
 -R, --reference_bias <BIAS>              set bias to reference homozygote
     --snp-list <SNP_LIST>                Path to file with list of SNPs to output
     --snp-db <SNP_DB>                    Path to dbSNP_idx processed SNP database file

The command line parameters and configuration keys for gemBS extract are listed below:

Section

Key

Description

dbsnp

dbsnp_index

dbSNP style index of SNPs

call

bcf_dir

Location of input BCF files

extract

extract_dir

Location of output files

extract

allow_het

Allow heterozygous sites (gemBS style output)

extract

phred_threshold

Minimum genotype calling PHRED score (gemBS style output)

extract

min_inform

Minimum informative coverage (gemBS style output)

extract

strand_specific

Output strand-specific CpG cytosines (gemBS style output)

extract

min_nc

Minimum non-converted count for non-CpG site (gemBS style output)

extract

make_cpg

Make gemBS style CpG output files

extract

make_non_cpg

Make gemBS style non_CpG output files

extract

make_bedmethyl

Make ENCODE style bedMethyl output files

extract

reference_bias

Multiplicative weight for homozygous reference genotype probability

extract

make_bigwig

Make ENCODE style bigWig output files

extract

bigwig_strand_specific

Output strand-specific bigWig files

extract

make_snps

Extract SNPs from VCF/BCF file

extract

snp_list

File containing a list of SNPs to extract

extract

extract_threads

Threads for extraction step

cores

Number of cores

threads

Number of threads

jobs

Number of parallel jobs

memory

Memory usage

time

Time required