4. GemBS Index and associated files

Before any of the subsequent pipeline steps can be run, the reference has to be indexed to allow GEM3 to perform the mapping. This only needs to be performed once for a given reference sequence. If non-bisulfite (i.e., regular WGS) sequence data is present then two references will be produced, once for non-bisulfite data and the other for bisulfite data.

4.1. The reference FASTA files

Any standard genome reference can be used, and there is practically no limit on genome size or number of contigs. As with other input files for gemBS, the fasta reference file can be compressed. However, if control sequences have been ‘spiked in’ to the sequencing libraries to allow assessment of the under and over bisulfite conversion rates, it is important that these sequences are included in the index. This is done by supplying them as an additional fasta file to gemBS.

When the mapping is performed, the names of the conversion sequences (if used) are provided to GEM3, and it will use the reads mapping to the control sequences to estimate the conversion rates. Additional sequences can also be added to the index, for example to control for potential viral contamination the NCBI viral reference genome database can be added with only a marginal increase in index size or mapping time. The final results will only use the contigs in the main reference file (although all contigs will be used for the mapping QC report).

Note that internally GEM3 maps to both bisulfite strands and will generate two versions of the reference: one with C’s converted to T’s and the other with G’s converted to A’s. Only a single index is generated, but it is twice as large as a non-bisulfite GEM index.

In addition to the GEM3 index, the index command also generates the dbSNP index if required that provides a mapping for genomics positions to dbSNP IDs (see 4.3. dbSNP index file).

4.2. Running the index command

A single command will create all of the files that are required (GEM3 index, contig_sizes file, dbSNP index if requested).By default the GEM3 indexer will use all available threads, but this can be changed using the -t option.

If the key nonbs_index is set in the configuration file or at least one datafile is indicated as being non-bisulfite converted then a normal (non-bisulfite) reference will also be generated.

In the configuration file, keys affecting the generation of the GEM3 index should be in the index section, while keys affecting the generation of the dbSNP index should be in the dbsnp section.

USAGE:
 gemBS index [FLAGS] [OPTIONS]

FLAGS:
 -b, --bs-index       Generate bisulfite index
 -n, --nonbs-index    Generate non-bisulfite (regular) index
 -D, --dbsnp-index    Generate dbSNP index
 -h, --help           Prints help information

OPTIONS:
 -t, --threads <THREADS>                    Number of threads (default - all available cores)
 -c, --cores <CORES>                        Number of cores for a job (default - all available cores)
 -m, --memory <MEMORY>                      Memory required for a job (default - all available memory)
 -s, --sampling-rate <SAMPLING_RATE>        Text sampling rate - increasing will decrease index size but also
                                            performance
 -M, --min-contig-size <MIN_CONTIG_SIZE>    Contigs smaller than this will be filtered out during index generation
 -j, --dbsnp-jobs <JOBS>                    Number of parallel read jobs for dbsnp_index
 -d, --dbsnp-files <FILES>...               List of input files from dbSNP
 -S, --dbsnp-selected <FILE>                File with list of selected SNPs from dbSNP
     --dbsnp-type <DBSNP_TYPE>              Type of dbSNP input files [possible values: AUTO, BED, JSON, VCF]
     --dbsnp-chrom-alias <FILE>             Chromosome alias

Configuration file keys for the index command:

Section

Key

Description

index

reference

Reference FASTA file

index

extra_references

List of additional FASTA files with control sequences

index

reference_basename

Basename for index files

index

index

GEM3 bisulfite index file

index

nonbs_index

GEM3 regular (non-bisulfite) index file

index

index_dir

Directory for index file generation

index

contig_sizes

Contig sizes file

dbsnp

dbsnp_index

dbSNP index file

dbsnp

dbsnp_files

List of dbSNP files (only if dbSNP_index key is defined)

dbsnp

dbsnp_jobs

Number of parallel reads jobs for dbsnp_index

dbsnp

dbsnp_selected

File with SNP IDS to be marked as selected

dbsnp

dbsnp_type

dbSNP file type (AUTO, BED, JSON, VCF)

cores

Number of cores

threads

Number of threads

jobs

Number of parallel jobs

memory

Memory usage

time

Time required

The command creates 3 - 5 output files depending on the options chosen

reference.BS.gem

GEM bisulfite index file.

reference.BS.info

Information about the index process.

reference.contig_sizes

List of contigs and their sizes

reference.gem

GEM non-bisulfite index file

dbSNP_index (optional)

Index of dbSNP SNP names and positions

4.3. dbSNP index file

The dbSNP database can be used to annotate SNP positions from the caller. If this is required then the user should download the SNP files from dbSNP in BED, VCF or JSON format. Earlier versions of dbSNP used the BED format while more recent versions use VCF or JSON. In general the VCF format is more practical as it is much more compact (albeit with much less information, but this is not needed for our purposes), but any of the three formats should work. The dbSNP files should, of course, be mapped to the same version of the genome that is being used for gemBS.

To use these files, gemBS will pool the SNPs from all of the files and form a compact index that will be used during the calling process. As with the main reference index, this process only needs to be performed once for a given build of the database.

The dbSNP index can be configured in the dbsnp section of the configuration file. To get gemBS to use the dbSNP index, it is necessary to specify the dbSNP index file name using the dbSNP_index key in the configuration file. If the file is not present then it will be generated using the gemBS index command. For generation of the index, the input files have to be provided to the indexer. These can be specified using the dbSNP_files key. A list of files can be supplied, and the filenames in the list can contain shell wildcards (*, ? etc) allowing more concise specification of the input files for example:

dbSBP_files = ${dbSNP_dir}/*.bed.gz

Normally gemBS can figure out the filetype (BED, JSON or VCF), but this can also be specified using the key dbSNP_type. With some sets of files the contig names may not correspond with the names in the reference sequence. This occurs frequently with VCF and JSON format files from dbSNP where the contig names are often the accession numbers and e.g. NC_000001.11 rather than chr1. For these cases we can use the dbSNP_chrom_alias key to specify the different possible names for the chromosomes. The chromosome alias file should have one line per contig with the first column being the chromosome name folled by a tab separated list of aliases. For example, a chromosome alias file for Human hg38 can be found at https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromAlias.txt

The dbSNP index command has a special job scheduling option, dbsnp_jobs that indicates the number of parallel reads that should be performed. If the input files are in BED or JSON files there are normally many files, and using this option instruct dbsnp_index to read in multiple files at the same time. Each read job is itself multi-threaded and the optimum settfor for dbsnp_jobs will depend on the computer being used, but a rough estimate would be number of available cores / 4. For VCF format files it is common that only a single file is present containing all SNPs. In this case dbsnp_indx can still perform parallel reads, but only if the VCF is indexed. Normally the index can be found in th esame place as the VCF file (for example the input file dbsnp_152.vcf.gz for human dbSNP release 152 has the index file dbsnp_152.vcf.gz.csi). If the index file is not found in the same directory as the input file then dbsnp_jobs will be ignored and the file will be read in linearly.

Note

The dbSNP index can be used with both the call and extract commands in the gemBS pipeline. If used with the call command (preferred) then by default bs_call will emit a line in the VCF/BCF (with the ID column filled in with the SNP rs id) for every SNP in the database that has at least 1 read covering the position whether or not the position is heterozygous or not, which will clearly increase the output file size. If you are interested only in a subset of SNPs then the dbSNP_selected key can be used to specify a file containing a list of SNPs when the dbSNP index is built. In this case, bs_call will use the entire database to annotate the SNP names for positions that it would normally output (i.e., heterozygote positions and positions with a C or G allele), and in addition will emit a line for each selected SNP if not otherwise selected for output.

If the dbSNP index was not used during the calling step, then it can still be supplied at the SNP extraction step. However, in this case some SNP positions may be missing even if the position was covered (notably positions where the reference is A or T and the observed genotype was homozygous reference).