A standalone version is available on GitHub:
git clone https://github.com/ncbi/dcode-cape
On Unix systems, type `make` to build the libkm static library and the programs kweight, formatFasta, snp2Fasta, and cape. The executables are placed in the `bin` folder.
libkm is a static C++ library that includes all the classes and functions used by the programs. It can be used by other projects, provided the headers are kept.
libkm includes in-house modified source code from R and Rmath-julia for the calculation of p-values. Additionally, source code from LibSVM is used for the SVM predictions.
formatFasta reads FASTA sequences from a file, or from multiple files in one directory, and creates a binary file used by cape. This conversion reduces the time required to load the chromosome sequences needed during the calculations. It also has a reverse option that generates a conventional FASTA file from a binary file.
The input data must be in FASTA format.
    -v    Print info
    -h    Display this usage information.
    -r    Read binary and print fasta
    -i    Input file.
    -o    Output file.
    -d    Directory with fasta files. Extension: .fa (all files will be combined in one binary file)
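As a rough illustration of how these options combine, the following Python sketch assembles (but does not run) a formatFasta command line that packs every `.fa` file in a directory into one binary file. The helper name `format_fasta_command`, the `bin/formatFasta` path, and the example paths are assumptions made for illustration; only the `-d`, `-o`, and `-v` flags come from the usage text.

```python
import shlex


def format_fasta_command(fasta_dir, output_file, verbose=False,
                         executable="bin/formatFasta"):
    """Build the argument list for a directory-to-binary conversion.

    Hypothetical helper: it only assembles the documented flags and
    leaves running the command to the caller.
    """
    args = [executable, "-d", fasta_dir, "-o", output_file]
    if verbose:
        args.append("-v")
    return args


# Example paths are placeholders, not files shipped with the project.
cmd = format_fasta_command("genomes/hg19", "hg19.fa.bin", verbose=True)
print(shlex.join(cmd))
```

Once the executables are built, a list like `cmd` could be handed to `subprocess.run`.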
kweight generates the k-mer weights used by cape as one of the SVM model descriptors. The main input file is a BED file with chromosome positions. The program scans the chromosomes, generates control sequences, and randomly selects a user-designated number of them. It requires the masked version of the chromosome FASTA files.
The output is a text or binary file with the identified kmers and their weights.
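The random selection of controls can be pictured with a short sketch. This is not the kweight implementation: the function `pick_controls`, the candidate list, and the fixed seed are hypothetical, and the way kweight builds its candidate controls is not specified here. The sketch only shows drawing a user-chosen number of controls per peak at random.

```python
import random


def pick_controls(candidates, per_peak, seed=None):
    """Randomly choose `per_peak` control regions from `candidates`.

    `candidates` is a list of (chromosome, start, end) tuples; if fewer
    candidates than requested exist, all of them are returned.
    """
    rng = random.Random(seed)
    count = min(per_peak, len(candidates))
    return rng.sample(candidates, count)


# Hypothetical candidate controls for a single peak.
candidates = [("chr1", 1000 * i, 1000 * i + 500) for i in range(1, 21)]
controls = pick_controls(candidates, per_peak=3, seed=0)
print(controls)
```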
The format for the input file is:
chromosome_name<tab>start_position<tab>end_position
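A minimal sketch of producing a file in this three-column layout is shown here. The function names and the sample coordinates are invented for illustration, and the check that `start` is smaller than `end` is an assumption rather than a documented kweight requirement.

```python
def bed_line(chromosome, start, end):
    """Format one peak as chromosome<tab>start<tab>end."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return f"{chromosome}\t{start}\t{end}"


def write_bed(path, peaks):
    """Write an iterable of (chromosome, start, end) tuples to `path`."""
    with open(path, "w") as handle:
        for chromosome, start, end in peaks:
            handle.write(bed_line(chromosome, start, end) + "\n")


# Made-up coordinates, for illustration only.
example_peaks = [("chr1", 10500, 10900), ("chr2", 200000, 200450)]
print("\n".join(bed_line(*peak) for peak in example_peaks))
```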
    -v    Print info
    -h    Display this usage information.
    -b    Coordinates of enhancers or DHS peaks.
    -m    Chromosomes masked binary fasta file. Created with formatFasta.
    -p    Output file with the p-value for all kmers
    -o    Order (default: 10).
    -g    Generate control from chromosomes. Default: NO.
    -n    Number of controls per peak (default: 10, if -g set default: 3).
    -c    Bed file with the control coordinates. This option will use your own control for the calculations.
kmerge merges multiple kweight files into a single file.
The output is a text or binary file with the identified kmers and their weights.
The input file is a list of the names of the kweight files to merge.
    -v    Print info
    -h    Display this usage information.
    -i    Text file with a list of kweight file names.
    -o    Output file.
snp2Fasta generates a FASTA file from the SNP coordinates, using a "length" value to add bases before and after the SNP position. This file can be used to run FIMO.
The output is a FASTA file.
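The idea of padding a SNP with a fixed number of bases on each side can be sketched as follows. This is an illustration, not the snp2Fasta code: it assumes 1-based positions and clips the window at the sequence ends, neither of which is confirmed by the documentation, and the function names, toy sequence, and SNP ID are invented.

```python
def snp_window(sequence, position, length):
    """Return the SNP base plus up to `length` bases on each side.

    Assumes `position` is 1-based; the window is clipped at the
    sequence ends.
    """
    if not 1 <= position <= len(sequence):
        raise ValueError("position outside the sequence")
    start = max(0, position - 1 - length)
    end = min(len(sequence), position + length)
    return sequence[start:end]


def fasta_record(name, sequence):
    """Format a two-line FASTA record."""
    return f">{name}\n{sequence}"


# Toy chromosome and SNP ID, invented for the example.
toy_chromosome = "ACGTACGTAC"
print(fasta_record("rs_example", snp_window(toy_chromosome, 5, 2)))
```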
The format for the input file is:
chromosome_name<tab>position<tab>snpID<tab>refAle<tab>altAle
    -v    Print info
    -h    Display this usage information.
    -i    Input file with SNP coordinates.
    -c    Chromosomes binary fasta file. Created with formatFasta.
    -o    Fasta file with the sequences
    -l    Sequence length to be added before and after the SNP position
This program generates an SVM model using the LibSVM training code.
The output is a model file.
    -v    Print info
    -h    Display this usage information.
    -i    Input config file

    Input conf file format (tab delimited), copy the following lines to your config file:

    in    input_file_name.txt    # Input file with SNP coordinates and the (1/-1) values for SNPs
    svm_type    0    Set type of SVM (default 0)
        0 -- C-SVC (multi-class classification)
        1 -- nu-SVC (multi-class classification)
        2 -- one-class SVM
        3 -- epsilon-SVR (regression)
        4 -- nu-SVR (regression)
    kernel_type    0    Set type of kernel function (default 0)
        0 -- linear: u'*v
        1 -- polynomial: (gamma*u'*v + coef0)^degree
        2 -- radial basis function: exp(-gamma*|u-v|^2)
        3 -- sigmoid: tanh(gamma*u'*v + coef0)
        4 -- precomputed kernel (kernel values in training_set_file)
    w1    4    Weight for the positive set. w1 = fold, the ratio between the size of the negative set and the size of the positive set. Therefore, w-1 should always be 1.
    probability    1    # Whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 1)
    model    /path-to/svm.model    # SVM Model obtained from the data
    order    4,6,8,10,12    # Order (default: 4,6,8,10,12)
    chrs    /path-to/hg19.fa.bin    # Chromosome files in binary mode. Format: hg19.fa.bin. Binary files created by formatFasta
    weight    /path-to/kmers_sigValue_sorted    # Kmers weight file. Generated with kweight or kmerge
    neighbors    100    # Bp to be added before and after the SNP position. Default 100
    fimo    fimo_output_file.txt    # Use FIMO output. Set to: 0 for not using FIMO output
    pwm_EnsembleID    pwm_EnsembleID_mapping    # File mapping TF names with Ensembl IDs. Provided in resources folder
    expression    57epigenomes.RPKM.pc    # Expression file
    expression_code    E116    # Tissue code used to extract expression data from the expression file. Extract this code from the tissue ID mapping file EG.name.txt
    abbrev-mtf-mapped    abbrev-mtf-mapped-to-whole-label.all.info.renamed    # TF name cutoff P-Value mapped by our group
cape calculates the descriptors for the SVM models and predicts the probability of genetic variants disrupting the binding of major transcription factors in a given cellular context.
The program takes as input a LibSVM model, a file with the k-mer weights (generated with kweight), a binary file (generated with formatFasta), and a text file with the SNP coordinates. All these data are encapsulated in a config file, which is passed to the program using the option -i.
The program generates a Z-score matrix and uses it as input to an SVM predict function extracted from the LibSVM source code.
The final output is a text file with each SNP and the probability of that genetic variant disrupting the binding of major transcription factors in a given cellular context.
The format for the input SNP coordinate file is:
chromosome_name<tab>position<tab>snpID<tab>refAle<tab>altAle
Important note: the snpID in the input file must be a unique ID.
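The uniqueness requirement can be checked before running cape. The sketch below is a hypothetical pre-flight check, not part of the distributed programs: `parse_snp_lines` and the sample records are invented, and it assumes exactly five tab-separated fields per line with an integer position.

```python
def parse_snp_lines(lines):
    """Parse tab-delimited SNP records and reject duplicate snpIDs."""
    records = []
    seen_ids = set()
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(f"line {line_number}: expected 5 fields")
        chromosome, position, snp_id, ref_allele, alt_allele = fields
        if snp_id in seen_ids:
            raise ValueError(f"line {line_number}: duplicate snpID {snp_id}")
        seen_ids.add(snp_id)
        records.append((chromosome, int(position), snp_id,
                        ref_allele, alt_allele))
    return records


# Invented records that follow the documented column order.
example = ["chr1\t12345\tsnp_a\tA\tG", "chr2\t67890\tsnp_b\tC\tT"]
print(parse_snp_lines(example))
```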
    -v    Print info
    -h    Display this usage information.
    -i    Input config file

    Input conf file format (tab delimited), copy the following lines to your config file:

    in    input_file_name.txt    # Input file with SNP coordinates
    out    output_file_name.out    # Output file with SNP coordinates and probabilities
    order    10    # Order (default: 10)
    chrs    /path-to/hg19.fa.bin    # Chromosome files in binary mode. Format: hg19.fa.bin. Binary files created by formatFasta
    weight    /path-to/10mers_sigValue_sorted    # Kmers weight file. Generated with kweight
    neighbors    100    # Bp to be added before and after the SNP position. Default 100
    model    /path-to/svm.model    # SVM Model
    probability    1    # 1 if the model uses probability estimates
    fimo    fimo_output_file.txt    # Use FIMO output. Set to: 0 for not using FIMO output
    pwm_EnsembleID    pwm_EnsembleID_mapping    # File mapping TF names with Ensembl IDs. Provided in resources folder
    expression    57epigenomes.RPKM.pc    # Expression file
    expression_code    E116    # Tissue code used to extract expression data from the expression file. Extract this code from the tissue ID mapping file EG.name.txt
    abbrev-mtf-mapped    abbrev-mtf-mapped-to-whole-label.all.info.renamed    # TF name cutoff P-Value mapped by our group

    ********************************************************************************
    For internal use at NCBI:
    These parameters can be included in the input conf file to use FIMO output through NCBI internal index files.
    Please note that the "fimo" option should be set to "0" or simply deleted from the input file.

    TibInfoFileName    tib/hg19/tib.info
    TFBSIdxDirName    common/tib/hg19/5    # The program reads files with name chrN.idx and chrN.tib
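Because the config file is tab delimited, generating it from a script avoids accidental spaces. The following sketch renders a few of the documented keys as `key<tab>value` lines. The `render_config` helper is hypothetical, the values are the placeholder paths from the description, and whether cape runs with only this subset of keys (with `fimo` set to `0`) is an assumption that has not been verified.

```python
def render_config(settings):
    """Render (key, value) pairs as tab-delimited config lines."""
    return "\n".join(f"{key}\t{value}" for key, value in settings) + "\n"


# Keys come from the config description; the values are the placeholder
# paths used there and must be replaced with real files.
settings = [
    ("in", "input_file_name.txt"),
    ("out", "output_file_name.out"),
    ("order", "10"),
    ("chrs", "/path-to/hg19.fa.bin"),
    ("weight", "/path-to/10mers_sigValue_sorted"),
    ("neighbors", "100"),
    ("model", "/path-to/svm.model"),
    ("probability", "1"),
    ("fimo", "0"),
]
print(render_config(settings), end="")
```

The rendered text can be written to a file and passed to cape with the `-i` option.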
Read this paper for a more detailed description of the algorithms, models, and strategies used by these programs.