Source Code

A standalone version is available on GitHub.

Download

    git clone https://github.com/ncbi/dcode-cape

Installation

On Unix systems, type `make` to build the libkm static library and the programs kweight, formatFasta, snp2Fasta and cape. The executables are placed in the `bin` folder.

Library: libkm

This static C++ library includes all the classes and functions used by the programs. Other projects can use it by linking against the library and keeping only the headers.

libkm includes in-house modified source code from R and Rmath-julia for the calculation of the p-values. Additionally, source code from LibSVM is used for the SVM predictions.

Program: formatFasta

This program reads FASTA sequences from a single file, or from multiple files in one directory, and creates a binary file used by cape. This conversion reduces the time required to load the chromosome sequences needed during the calculations. A reverse option generates a conventional FASTA file from a binary file.

The input data must be in FASTA format.

Usage

    -v    Print info
    -h    Display this usage information.
    -r    Read binary and print fasta
    -i    Input file.
    -o    Output file.
    -d    Directory with fasta files. Extension: .fa (all files will be combined in one binary file)

Program: kweight

This program generates the kmer weights used by cape as one of the SVM model descriptors. The main input file is a BED file with chromosome positions. The program scans the chromosomes to generate control sequences and randomly selects a user-designated number of them. It requires the masked version of the chromosome FASTA files.

The output is a text or binary file with the identified kmers and their weights.

The format for the input file is:

    chromosome_name<tab>start_position<tab>end_position
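
As an illustration of this three-column layout, here is a minimal Python sketch that renders a list of peaks as tab-delimited lines. The function name `format_peaks` and the example coordinates are hypothetical, and the check that the start is smaller than the end is an assumption, not a documented requirement of kweight.

```python
def format_peaks(peaks):
    """Render (chromosome, start, end) tuples as tab-delimited lines."""
    lines = []
    for chrom, start, end in peaks:
        # Assumption: an interval must have start < end to be meaningful.
        if start >= end:
            raise ValueError(f"empty or inverted interval: {chrom} {start} {end}")
        lines.append(f"{chrom}\t{start}\t{end}\n")
    return "".join(lines)


# Hypothetical peaks, only to show the shape of the file content.
bed_text = format_peaks([("chr1", 1000, 1500), ("chr2", 20000, 20600)])
```

The resulting text can be written to a file and passed to kweight with the `-b` option.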

Usage

    -v    Print info
    -h    Display this usage information.
    -b    Coordinates of enhancers or DHS peaks.
    -m    Chromosomes masked binary fasta file. Created with formatFasta.
    -p    Output file with the p-value for all kmers
    -o    Order (default: 10).
    -g    Generate control from chromosomes. Default: NO.
    -n    Number of controls per peak (default: 10, if -g set default: 3).
    -c    Bed file with the control coordinates. This option will use your own control for the calculations.

Program: kmerge

This program merges multiple kweight files into a single file.

The output is a text or binary file with the identified kmers and their weights.

The input file is a list with the names of the kweight files to merge.

Usage

    -v    Print info
    -h    Display this usage information.
    -i    Text file with a list of kweight file names.
    -o    Output file.

Program: snp2Fasta

This program generates a FASTA file from the SNP coordinates, using a "length" to add bases before and after the SNP position. This file can be used to run FIMO.

The output is a FASTA file.

The format for the input file is:

    chromosome_name<tab>position<tab>snpID<tab>refAle<tab>altAle

Usage

    -v    Print info
    -h    Display this usage information.
    -i    Input file with SNP coordinates.
    -c    Chromosomes binary fasta file. Created with formatFasta.
    -o    Fasta file with the sequences
    -l    Sequence length to be added before and after the SNP position
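
To make the role of `-l` concrete, the sketch below extracts a window of `length` bases on each side of a SNP from an in-memory sequence. It is only an illustration and not the snp2Fasta implementation: the function names, the 1-based position convention, the clipping at the sequence ends and the FASTA header format are all assumptions.

```python
def flank_window(sequence, position, length):
    """Return the base at `position` plus `length` bases on each side.

    Assumption: `position` is 1-based. The window is clipped at the
    sequence boundaries instead of being padded.
    """
    if not 1 <= position <= len(sequence):
        raise ValueError("position outside the sequence")
    index = position - 1  # convert to a 0-based index
    start = max(0, index - length)
    end = min(len(sequence), index + length + 1)
    return sequence[start:end]


def fasta_record(snp_id, window):
    """Format one FASTA record (hypothetical header: the snpID only)."""
    return f">{snp_id}\n{window}\n"
```

With a length of 2, a SNP far from the sequence ends yields a window of 5 bases: two before, the SNP position itself, and two after.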

Program: snp2svmModel

This program generates an SVM model using the LibSVM training code.

The output is a model file.

Usage

    -v    Print info
    -h    Display this usage information.
    -i    Input config file

    Input conf file format (tab delimited); copy the following entries to your config file:

    in	input_file_name.txt		# Input file with SNP coordinates and the (1/-1) values for SNPs
    svm_type	0			Set type of SVM (default 0)
				0 -- C-SVC		(multi-class classification)
				1 -- nu-SVC		(multi-class classification)
				2 -- one-class SVM
				3 -- epsilon-SVR	(regression)
				4 -- nu-SVR		(regression)
    kernel_type	0			Set type of kernel function (default 0)
				0 -- linear: u'*v
				1 -- polynomial: (gamma*u'*v + coef0)^degree
				2 -- radial basis function: exp(-gamma*|u-v|^2)
				3 -- sigmoid: tanh(gamma*u'*v + coef0)
				4 -- precomputed kernel (kernel values in training_set_file)
    w1	4			Weight for the positive set. w1 = fold, which is the ratio of the
                    size of the negative set to the size of the positive set.
                    Therefore, w-1 should always be 1.
    probability	1				# Whether to train a SVC or SVR model for
                                  probability estimates, 0 or 1 (default 1)
    model	/path-to/svm.model		# SVM Model obtained from the data
    order	4,6,8,10,12				# Order (default: 4,6,8,10,12)
    chrs	/path-to/hg19.fa.bin	# Chromosomes files in binary mode.
                                      Format: hg19.fa.bin.
                                      Binary files created by formatFasta
    weight	/path-to/kmers_sigValue_sorted	# Kmers weight file. Generated with kweight or kmerge
    neighbors	100		# Number of bp to be added before and after the SNP position. Default 100
    fimo	fimo_output_file.txt		# Use FIMO output. Set to: 0 for not using FIMO output
    pwm_EnsembleID	pwm_EnsembleID_mapping	# File mapping TF names with Ensembl IDs.
                                              Provided in resources folder
    expression	57epigenomes.RPKM.pc	# Expression file
    expression_code	E116		# Tissue code used to extract expression data from
                                      the expression file.
                                      Extract this code from the tissue ID mapping file EG.name.txt
    abbrev-mtf-mapped	abbrev-mtf-mapped-to-whole-label.all.info.renamed	# TF name cutoff P-Value
                                                                              mapped by our group
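
The `w1` entry in the config above is the ratio of negative to positive examples. A small Python sketch of that arithmetic follows, assuming the 1/-1 labels have already been read from the input file into a list. The function name `positive_weight` is hypothetical and is not part of snp2svmModel.

```python
def positive_weight(labels):
    """Return the negative-to-positive ratio for a list of 1/-1 labels."""
    positives = sum(1 for label in labels if label == 1)
    negatives = sum(1 for label in labels if label == -1)
    if positives == 0:
        raise ValueError("no positive examples")
    return negatives / positives
```

For example, a training set with one positive SNP for every four negative SNPs gives a value of 4, which matches the `w1` value shown in the config.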

Program: cape

This program calculates the descriptors for the SVM models and predicts the probability of genetic variants disrupting the binding of major transcription factors in a given cellular context.

The program takes as input a LibSVM model, a file with the kmer weights (generated with kweight), a binary file (generated with formatFasta) and a text file with the SNP coordinates. All of this data is specified in a config file, which is passed to the program with the option -i.

This program generates a ZScore matrix and uses it as input to an SVM predict function extracted from the LibSVM source code.

The final output is a text file with the SNPs and the probability of genetic variants disrupting the binding of major transcription factors in a given cellular context.

The format for the input SNP coordinate file is:

    chromosome_name<tab>position<tab>snpID<tab>refAle<tab>altAle

Important note: each snpID in the input file must be unique.
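
Because the snpID values must be unique, it can help to check the input file before running cape. The following Python sketch reports repeated IDs. It assumes the five-column tab-delimited layout shown above, with the snpID in the third column; the function name `duplicated_snp_ids` is hypothetical.

```python
from collections import Counter


def duplicated_snp_ids(lines):
    """Return the sorted snpID values that appear more than once.

    Assumption: each non-empty line is tab-delimited and the snpID
    is the third column.
    """
    ids = [line.rstrip("\n").split("\t")[2] for line in lines if line.strip()]
    return sorted(snp_id for snp_id, count in Counter(ids).items() if count > 1)
```

An empty result means every snpID occurs exactly once.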

Usage

    -v    Print info
    -h    Display this usage information.
    -i    Input config file

    Input conf file format (tab delimited); copy the following entries to your config file:

    in	input_file_name.txt			# Input file with SNP coordinates
    out	output_file_name.out		# Output file with SNP coordinates and probabilities
    order	10					    # Order (default: 10)
    chrs	/path-to/hg19.fa.bin	# Chromosomes files in binary mode.
                                    Format: hg19.fa.bin. Binary files created by formatFasta
    weight	/path-to/10mers_sigValue_sorted		# Kmers weight file. Generated with kweight
    neighbors	100				    # Number of bp to be added before and after the SNP position. Default 100
    model	/path-to/svm.model		# SVM Model
    probability	1				    # 1 if the model uses probability estimates
    fimo	fimo_output_file.txt	# Use FIMO output. Set to: 0 for not using FIMO output
    pwm_EnsembleID	pwm_EnsembleID_mapping		# File mapping TF names with Ensembl IDs. Provided in resources folder
    expression	57epigenomes.RPKM.pc	 # Expression file
    expression_code	E116				 # Tissue code used to extract expression data from the expression file.
						                 Extract this code from the tissue ID mapping file EG.name.txt
    abbrev-mtf-mapped	abbrev-mtf-mapped-to-whole-label.all.info.renamed	 # TF name cutoff P-Value mapped by our group
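
Since the config file is tab delimited, it can also be generated from a script. The Python sketch below renders key/value pairs as one tab-separated entry per line. The function name `format_cape_config` is hypothetical, the values are the placeholders from the listing above, and leaving out the `#` comments is an assumption about what the parser accepts.

```python
def format_cape_config(settings):
    """Render key/value pairs as tab-delimited config lines."""
    return "".join(f"{key}\t{value}\n" for key, value in settings.items())


# Keys and placeholder values copied from the listing above.
config_text = format_cape_config({
    "in": "input_file_name.txt",
    "out": "output_file_name.out",
    "order": 10,
    "chrs": "/path-to/hg19.fa.bin",
    "weight": "/path-to/10mers_sigValue_sorted",
    "neighbors": 100,
    "model": "/path-to/svm.model",
    "probability": 1,
    "fimo": "fimo_output_file.txt",
})
```

The remaining entries (`pwm_EnsembleID`, `expression`, `expression_code`, `abbrev-mtf-mapped`) can be added to the dictionary in the same way.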

    ********************************************************************************
    For internal use at NCBI:

	    These parameters can be included in the input conf file to use FIMO
	    output through NCBI internal index files.
	    Please note that the "fimo" option should be set to "0" or simply
	    deleted from the input file.

    TibInfoFileName	tib/hg19/tib.info
    TFBSIdxDirName	common/tib/hg19/5			# The program reads files with name chrN.idx and chrN.tib

More description

Read this paper for a more detailed description of the algorithms, models and strategies used by these programs.