Getting started

The Integrative Genomic Analysis Pipeline (IGAP) generates nTBA values for any given sequences and can combine them with other data for further analysis.

IGAP is written in Python 3. It includes the BayesPI2 software in binary form, which is available for Linux and macOS operating systems.

 

Installation:

 

You can install the package for Python 3 using this command:

python setup.py install

Please run the following command to install the prerequisite packages:

pip install numpy scipy scikit-learn

 

After installing the package, you should also install bedtools and add it to your PATH in order to use the package. Please refer to this link for more information.

https://bedtools.readthedocs.io/en/latest/content/installation.html

 

The package also provides a command-line interface. You can run it with the igap command in your terminal.

 

Git repository content:

 

The repo has three subfolders:

Igap: Integrative Genomic Analysis Pipeline source code and binaries.

Sample: This folder contains sample files that show how you can use the library.

Tools: This folder contains tools for preprocessing input data and preparing it for the package.

 

Sample folder:

 

The Sample folder is the best place to see how you can use the package. Let's take a look at its contents:

 

Additional: This folder contains all the files required to run the demo sample.

Configs: This folder contains the different config files that the application needs in order to generate the output. You can inspect their contents to learn how to set the variables.

 

Experiment_data: In our project, we combine genomic modification data with the nTBA values in order to generate our output features. You can inspect the structure of each file to familiarize yourself with the standard input format of the package for any dataset of your own.

 

Fa_file: In this folder, you can find the .fa files preprocessed from the hg19 dataset. These files are required in order to calculate nTBA values.

Interaction: In this folder, you can find a text file for each cell type for chromosome 21. These files were generated from Hi-C data.

 

Pwm: In this folder, you can find the position weight matrices of each motif used for computing the nTBA value.

 

Region: This folder contains files for specific regions, for example HOT and TSS.

 

1-ntba.py: Reads the .fa files and generates nTBA values.

2-ntba_plot.py: Reads the nTBA files generated in the previous step and filters them by HOT, TSS, gene and enhancer regions. It then interpolates the missing points of each region and saves the data.

3-gm_plot.py: Performs the same processing as the previous step on genomic modification data, to make it ready to combine with the nTBA values.

4-freq.py: Reads all previous outputs and combines them with Hi-C data to calculate the frequency and generate the heat-map plot.

5-clustering.py: Reads the output of the previous step, generates a feature vector for each bin, and classifies the bins into clusters.

 

Usage:

 

Run the sample project:

 

You can simply import the igap package inside your Python code and use its functions. For more information about how to use the functions, take a look at the sample folder.

In order to run the sample project, execute the Python files inside the sample folder in numerical order to generate the feature vectors and classify each region of the chromosome.

 

 

Cmd_sample.bash:

This file is a sample of how you can generate the outputs using the command-line tools. Please type igap -h for more information about them.

 

Tools folder:

This folder contains scripts that we used to convert datasets to our standard input format. These scripts are written in Python 2. You can use them as follows.

 

Divide the hg19 dataset into our standard .fa files:

python divide_chr_make_windows.py --In_path in/fastas --Chr 22 --Out_path out/chromosome --Header NAME

 

Generate region files:

For enhancer:

python enhancer_HOT_regions.py --In_path in/all_enhancers/sorted_all_tissues.bed --Out_path out --Chr ALL --File_type ENH --Merge_regions Y --Min_size 100

 

 

For HOT:

 

python enhancer_HOT_regions.py --In_path in/HOT_input.bed --Out_path out --Chr ALL --File_type HOT --Min_size 100


For Gene and TSS region:

python genes_regions.py --In_path in/refFlat.txt --Gene_info in/human_gene_info_nov2017.csv --Out_path out

API Manual

 

Calculating nTBA values

To calculate the nTBA value for a sequence file, you need to call the calculate_ntba function from the BayesPi_nTBA class.

 

from igap.bayespi_ntba import BayesPi_nTBA

BayesPi_nTBA.calculate_ntba(chemical_potentials, reference_sequences, iterations, pwm_folder, output_folder, reuse_output, parallel_dict, use_interct=False, random_seed=False, only_collect=False, remove_temp=True, only_merge=False, input_folder="", window_size=50)

 

Inputs:

chemical_potentials: a list of strings containing the chemical potentials for which you want to calculate the nTBA value, for example ['none', '-10'].

reference_sequences: a .fa file in FASTA format containing the sequence information; as an example, see additional/fa_file/group1.fa in the sample folder.

iterations: number of iterations.

pwm_folder: a folder containing the position weight matrix files; as an example, see additional/pwm in the sample folder.

output_folder: a path (string) where the program stores all the outputs.

reuse_output: if set to True, the program will try to load any previously calculated output and continue from the last completed step of the previous run.

parallel_dict: path of a parallel-options file (string). You can configure the parallelism of the application with this file: define the number of threads, or set the variables required to run the program on a cluster. As an example, see additional/configs/parallel_options.txt in the sample folder.

use_interct: di-nucleotide dependent weight matrices predicted by BayesPI (*.interct). If specified, the di-nucleotide model will be used. The files should be stored in the same path as the PWM files, but with the .interct extension instead of .mlp.

random_seed: whether the program should use a fixed seed or a random seed when calculating the values.

only_collect: if True, skips the nTBA calculation and starts from the second step, merging the already calculated data stored in input_folder.

remove_temp: if True, automatically removes all temporary files after each step to reduce the storage needed.

only_merge: if True, skips the first two steps and only converts all dba files to bed files and merges them together.

input_folder: if only_merge or only_collect is True, set this to the folder where the previous calculations are stored (the output_folder of the previous execution).

window_size: length of the window for exporting nTBA values (default 50).

 

Output:

A file containing the nTBA values for each chemical potential.

 

 

Summarize the nTBA values for your regions of interest

 

You can define your regions of interest in a bed file, merge them with the nTBA values, and calculate the value over the center of those regions using the preprocess_ntba_data function. The program uses regression to infer the values of the missing points in each region.

 

from igap.preprocess_ntba_data import DataPreprocessor

DataPreprocessor.preprocess_ntba_data(config_file, output_folder, min_overlap=1E-9)

 

Inputs:

config_file: path of a JSON file in which you define all the parameters that configure the execution. As an example, see additional/configs/preprocess_ntba_mix.json in the sample folder. The parameters you need to specify are described below.

region_files: a list in which you can put several bed files to specify your regions of interest. For each entry, you need to assign a name and the path of the bed file in bed_file_address. The program processes the entries in order, starting from the first one. Your bed files need to follow the standard template; see the additional/region folder in the sample.

ntba_bed_files_address: a list of strings with the addresses of the nTBA files you calculated before, one per chemical potential. If you want to calculate over multiple chromosomes, use a wildcard symbol to address all the nTBA files of each chemical potential in one line.

chromosome: chromosome number as a string; if you want to calculate over multiple chromosomes, use 'mix' instead.

output_folder: a path (string) where the program stores all the outputs.

min_overlap: the minimum acceptable overlap between regions of interest and nTBA values.
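As a rough, hypothetical sketch of how these parameters might fit together in JSON (the key names are taken from the list above, but the nesting and all paths are assumptions; the authoritative layout is additional/configs/preprocess_ntba_mix.json in the sample folder):

```json
{
  "region_files": [
    { "name": "HOT", "bed_file_address": "additional/region/HOT_regions.bed" },
    { "name": "TSS", "bed_file_address": "additional/region/TSS_regions.bed" }
  ],
  "ntba_bed_files_address": ["output/1/ntba_none_chr*.bed"],
  "chromosome": "mix",
  "output_folder": "output/2",
  "min_overlap": 1e-9
}
```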


Output:

Three files per chemical potential: one contains all the points of each region, and the other two contain the mean and sum of the values over all the regions. You can use these files to plot the nTBA values for your regions.
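The missing-point inference mentioned above can be illustrated with a small, self-contained sketch. This is not IGAP's own code, just a demonstration of the general idea (here with simple linear interpolation via scipy; the positions and values are made up):

```python
import numpy as np
from scipy.interpolate import interp1d

# Hypothetical nTBA values over positions in one region, with gaps (NaN).
positions = np.arange(10)
values = np.array([0.2, 0.3, np.nan, 0.5, 0.6, np.nan, np.nan, 0.9, 1.0, 1.1])

# Fit on the known points and infer the missing ones.
known = ~np.isnan(values)
f = interp1d(positions[known], values[known], kind="linear")
filled = np.where(known, values, f(positions))

print(filled)  # the NaN gaps are now filled: 0.4 at position 2, 0.7 and 0.8 at 5-6
```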

 

Preprocess histone data:

 

You can preprocess the histone data in the same way as in the previous step. To do so, call the process_histon_data function from the HistDataPreprocess class.

from igap.preprocess_histon_data import HistDataPreprocess

HistDataPreprocess.process_histon_data('additional/configs/preprocess_gm_data.json', 'output/3')

 

Inputs:

config_file: like the previous function, but this time you need to define two sets of parameters, for the HOT and TSS regions. Please refer to additional/configs/preprocess_gm_data.json as an example.

output_folder: a path (string) where the program stores all the outputs.

 

Output:

Three files per cell type: one contains all the points of each region, and the other two contain the mean and sum of the values over all the regions. In each file, the value of each modification appears in its own column.

 

 

 

Frequency:

To calculate the frequency and draw the heat-map plot, use the calculate_frequency function from the Frequency class.

from igap.frequency import Frequency

Frequency.calculate_frequency('additional/configs/freq.json', 'output/4')

 

Inputs:

config_file_address: path of a JSON file in which you define all the parameters that configure the execution. As an example, see additional/configs/freq.json in the sample folder. The parameters you need to specify are described below.

config: create an entry for each of your regions in this list. Each entry needs a name, the bed_file path of your regions of interest, a chr_file (the all_data output file of the previous function) and cells, in which you define an entry for each of your desired cell types.

win_length: resolution of your Hi-C data.

potentials: a list of all chemical potentials, as strings.

chromosome: the chromosome for which you want to do the calculation.

histon_modification: names of the modifications you want to process.

cell_type_config: path to the Hi-C data for each of your cell types.

output_folder: a path (string) where the program stores all the outputs.
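Again as a hypothetical sketch only (key names from the parameter list above; every value, path, cell-type name and the exact nesting are assumptions; consult additional/configs/freq.json for the real layout):

```json
{
  "config": [
    {
      "name": "HOT",
      "bed_file": "additional/region/HOT_regions.bed",
      "chr_file": "output/2/all_data_HOT.csv",
      "cells": ["GM12878", "K562"]
    }
  ],
  "win_length": 50000,
  "potentials": ["none", "-10"],
  "chromosome": "21",
  "histon_modification": ["H3K4me1", "H3K27ac"],
  "cell_type_config": {
    "GM12878": "additional/interaction/GM12878_chr21.txt",
    "K562": "additional/interaction/K562_chr21.txt"
  },
  "output_folder": "output/4"
}
```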

 

Output:

Frequency of interactions among the different regions, for each chemical potential and cell type.

 

Clustering:

To cluster the different regions of the chromosome based on nTBA values, call the process_features function from igap.features_manager; then, based on the calculated features, you can cluster the regions with your preferred clustering algorithm. For an example of how to use this data to compute the cluster members, refer to sample five.

 

from igap.features_manager import FeaturesManager

data, normalized = FeaturesManager.process_features(files, 'output/5', cell)

 

Inputs:

files_list: the CSV files calculated with the frequency function.

output_folder: a path (string) where the program stores all the outputs.

tag_name: a custom name for distinguishing the output files.

 

Output:

The provided data is processed into raw and normalized values, which can be used to cluster the regions.
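The final clustering step is left to you; a minimal sketch with scikit-learn's KMeans (which the prerequisites already include) could look like the following. The feature matrix here is synthetic stand-in data, not real process_features output:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical normalized feature matrix: one row per genomic bin,
# one column per feature (e.g. nTBA / modification frequencies).
rng = np.random.default_rng(0)
normalized = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 4)),  # bins resembling one cluster
    rng.normal(1.0, 0.1, size=(20, 4)),  # bins resembling another
])

# Assign each bin to one of two clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(normalized)
print(labels)
```

Sample 5 in the Sample folder shows how this is done with the actual feature vectors.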