Getting started
The Integrative Genomic Analysis Pipeline (IGAP) generates nTBA values for any given sequences and can combine them with other genomic data for further analysis.
IGAP is written in Python 3. It includes the BayesPI2 software in binary form, which is available for the Linux and OS X operating systems.
Installation:
You can install the package for Python 3 using this command:
python setup.py install
Please run the following command to install the prerequisite packages:
pip install numpy scipy scikit-learn
After installing the package, you should also install bedtools and add it to your PATH in order to use the package. Please refer to this link for more information:
https://bedtools.readthedocs.io/en/latest/content/installation.html
The package also provides a command-line interface. You can run it with the igap command in a terminal.
git repository content:
The repo has three subfolders:
Igap: Integrative Genomic Analysis Pipeline source code and binaries.
Sample: sample files that show how to use the library.
Tools: tools to preprocess input data and prepare it for the package.
Sample folder:
The Sample folder is the best place to see how to use the package. Let's take a look at its contents:
Additional: this folder contains all of the files required to run the demo sample.
Configs: this folder contains the different config files the application needs to generate its output. Inspect them to learn how to set the variables.
Experiment_data: in our project, we combine genomic modification data with the nTBA values to generate the output features. These files show the structure and the standard input format the package expects for any dataset of your choice.
Fa_file: in this folder, you can find the preprocessed .fa files from the hg19 dataset. These files are required to calculate nTBA values.
Interaction: in this folder, you can find a text file per cell type for chromosome 21. These files were generated from Hi-C data.
Pwm: in this folder, you can find the position weight matrices of each motif, used to compute the nTBA values.
Region: this folder contains files for specific regions, for example HOT, TSS, etc.
1-ntba.py: reads the .fa files and generates the nTBA values.
2-ntba_plot.py: reads the nTBA files generated in the previous step and filters them by HOT, TSS, gene and enhancer regions, then interpolates the missing points of each region and saves the data.
3-gm_plot: the same as the previous step, but for genomic modification data, to make it ready to combine with the nTBA values.
4-freq.py: reads all previous outputs and combines them with Hi-C data to calculate the interaction frequency and generate the heat-map plot.
5-clustering.py: reads the output of the previous step, generates a feature vector per bin and classifies the bins into different clusters.
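As background on the Pwm folder above: a position weight matrix assigns each base a probability at each motif position. The toy sketch below is NOT the BayesPI2 nTBA computation, only a generic illustration of what a PWM encodes; the matrix values and base ordering are invented for the example.

```python
import numpy as np

# Toy position weight matrix; columns are A, C, G, T (assumed order),
# rows are motif positions. Values are invented for illustration.
pwm = np.array([
    [0.7, 0.1, 0.1, 0.1],   # position 1 strongly prefers A
    [0.1, 0.1, 0.7, 0.1],   # position 2 strongly prefers G
    [0.1, 0.7, 0.1, 0.1],   # position 3 strongly prefers C
])
index = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def pwm_score(seq):
    # Product of the per-position probabilities for one candidate site.
    return float(np.prod([pwm[i, index[base]] for i, base in enumerate(seq)]))

print(pwm_score('AGC'))   # best match for this toy matrix
print(pwm_score('TTT'))   # poor match
```

A TBA-style quantity aggregates such site scores over a whole sequence window; the exact formula used by BayesPI2 is implemented in the bundled binaries.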
Usage:
Run the sample project:
You can simply import the igap package in your Python code and use its functions. For more information about the functions, take a look at the Sample folder.
To run the sample project, execute the Python files inside the Sample folder in order; they generate the feature vectors and classify each region of the chromosome.
Cmd_sample.bash:
This file shows how you can generate the outputs using the command-line tools. Please type igap -h for more information about them.
Tools folder:
This folder contains scripts we used to convert datasets to our standard input format. They are written in Python 2. You can use them as follows.
Divide the hg19 dataset into our standard .fa files:
python divide_chr_make_windows.py --In_path in/fastas --Chr 22 --Out_path out/chromosome --Header NAME
Generate region files:
For enhancer:
python enhancer_HOT_regions.py --In_path in/all_enhancers/sorted_all_tissues.bed --Out_path out --Chr ALL --File_type ENH --Merge_regions Y --Min_size 100
For HOT:
python enhancer_HOT_regions.py --In_path in/HOT_input.bed --Out_path out --Chr ALL --File_type HOT --Min_size 100
For Gene and TSS region:
python genes_regions.py --In_path in/refFlat.txt --Gene_info in/human_gene_info_nov2017.csv --Out_path out
API Manual
Calculating ntba value
To calculate the ntba value for a sequence file, call the calculate_ntba function of the BayesPi_nTBA class.
from igap.bayespi_ntba import BayesPi_nTBA
BayesPi_nTBA.calculate_ntba(chemical_potentials, reference_sequences, iterations, pwm_folder, output_folder, reuse_output, parallel_dict, use_interct=False, random_seed=False, only_collect=False, remove_temp=True, only_merge=False, input_folder="", window_size=50)
Inputs:
chemical_potentials: a list of strings containing the chemical potentials for which to calculate the ntba values, for example ['none', '-10'].
reference_sequences: a .fa file in FASTA format containing the sequence information; see additional/fa_file/group1.fa in the Sample folder for an example.
iterations: number of iterations
pwm_folder: a folder containing the position weight matrix files; see additional/pwm in the Sample folder for an example.
output_folder: a path (string) where the program stores all outputs.
reuse_output: if True, the program tries to load any previously calculated output and continues from the last completed step of the previous run.
parallel_dict: path (string) of a parallel-options file. It configures the parallel behaviour of the application: the number of threads, or the variables required to run the program on a cluster. See additional/configs/parallel_options.txt in the Sample folder for an example.
use_interct: use the di-nucleotide dependent weight matrices predicted by BayesPI (*.interct). If set, the di-nucleotide model is used. Each .interct file must be stored in the same path as the corresponding pwm file, with the extension .interct instead of .mlp.
random_seed: whether the program uses a fixed seed or a random seed when calculating the values.
only_collect: if True, skips the ntba calculation and starts from the second step, merging the already-calculated data stored in input_folder.
remove_temp: if True, automatically removes all temp files after each step to reduce the storage needed.
only_merge: if True, skips the first two steps and only converts all dba files to bed files and merges them together.
input_folder: if only_merge or only_collect is True, set this to the folder holding all the previous calculations (the output_folder of the previous run).
window_size: length of the window for exporting the ntba values (default 50).
output:
A file containing the ntba values for each chemical potential.
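A call to calculate_ntba might look like the sketch below. All paths and values are assumptions modeled on the Sample folder layout, and the call itself is commented out since it requires the installed package and input data:

```python
# Hypothetical parameter set for BayesPi_nTBA.calculate_ntba;
# every path and value below is an assumption, not a tested setup.
params = dict(
    chemical_potentials=['none', '-10'],      # as in the example above
    reference_sequences='additional/fa_file/group1.fa',
    iterations=100,                           # assumed iteration count
    pwm_folder='additional/pwm',
    output_folder='output/1',
    reuse_output=True,                        # resume a previous run if possible
    parallel_dict='additional/configs/parallel_options.txt',
)

# from igap.bayespi_ntba import BayesPi_nTBA
# BayesPi_nTBA.calculate_ntba(**params, window_size=50)
```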
Summarize the ntba values for your regions of interest
You can define your regions of interest in a bed file, merge them with the ntba values, and calculate the value over the center of those regions using the preprocess_ntba_data function. The program uses regression to infer the values of the missing points in each region.
from igap.preprocess_ntba_data import DataPreprocessor
DataPreprocessor.preprocess_ntba_data(config_file, output_folder, min_overlap=1E-9)
Inputs:
config_file: path of a JSON file in which you define all parameters for the run; see additional/configs/preprocess_ntba_mix.json in the Sample folder for an example. The parameters you need to specify inside it are described below.
region_files: a list of bed files specifying your regions of interest. For each entry, assign a name and the path of the bed file in bed_file_address. The program processes them in order, starting from the first one. Your bed files need to follow the standard template; see the additional/region folder in the sample.
ntba_bed_files_address: a list of strings with the paths of the ntba files previously calculated for each chemical potential. To process multiple chromosomes, use a wildcard to address all ntba files of a given chemical potential in one line.
chromosome: chromosome number as a string; to process multiple chromosomes, use 'mix' instead.
output_folder: a path (string) where the program stores all outputs.
min_overlap: minimum acceptable overlap between the regions of interest and the ntba values.
output:
Three files per chemical potential: one contains all the points per region, and the other two contain the mean and the sum of the values over all regions. You can use these files to plot the ntba values for your regions.
Preprocess histone data:
You can preprocess histone data in the same way as in the previous step by calling the process_histon_data function of the HistDataPreprocess class.
from igap.preprocess_histon_data import HistDataPreprocess
HistDataPreprocess.process_histon_data('additional/configs/preprocess_gm_data.json', 'output/3')
input:
config_file: like the previous function, but here you define two sets of parameters, one for the HOT and one for the TSS regions. Please refer to additional/configs/preprocess_gm_data.json for an example.
output_folder: a path (string) where the program stores all outputs.
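The two parameter sets might be laid out as in the sketch below. The key names and values are assumptions; additional/configs/preprocess_gm_data.json in the Sample folder defines the real schema:

```python
import json

# Hypothetical histone-preprocessing config with one parameter set per
# region type (HOT and TSS); key names are assumptions, see
# additional/configs/preprocess_gm_data.json for the real schema.
config = {
    "HOT": {
        "region_files": [{"name": "HOT", "bed_file_address": "additional/region/HOT.bed"}],
        "chromosome": "21",
    },
    "TSS": {
        "region_files": [{"name": "TSS", "bed_file_address": "additional/region/TSS.bed"}],
        "chromosome": "21",
    },
}

with open("preprocess_gm_example.json", "w") as fh:
    json.dump(config, fh, indent=2)
```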
output:
Three files per cell type: one contains all the points per region, and the other two contain the mean and the sum of the values over all regions. In each file, the value of each modification appears in its own column.
Frequency:
To calculate the interaction frequency and draw the heat-map plot, use the calculate_frequency function of the Frequency class.
from igap.frequency import Frequency
Frequency.calculate_frequency('additional/configs/freq.json', 'output/4')
inputs:
config_file_address: path of a JSON file in which you define all parameters for the run; see additional/configs/freq.json in the Sample folder for an example. The parameters you need to specify inside it are described below.
config: a list with one entry per region. Each entry needs a name, the bed_file path of your regions of interest, a chr_file (the all_data output file of the previous function) and cells, with one entry per cell type of interest.
win_length: resolution of your Hi-C data.
potentials: a list of all chemical potentials, as strings.
chromosome: the chromosome for which to run the calculation.
histon_modification: names of the modifications to process.
cell_type_config: path to the Hi-C data for each cell type.
output_folder: a path (string) where the program stores all outputs.
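Assembled from the parameters above, a frequency config might look like the sketch below. Every name, path and value is an assumption; additional/configs/freq.json in the Sample folder is the authoritative example:

```python
import json

# Hypothetical frequency config; all entries below (cell name, file
# paths, resolution) are assumptions, see additional/configs/freq.json.
config = {
    "config": [
        {
            "name": "HOT",
            "bed_file": "additional/region/HOT.bed",
            "chr_file": "output/2/all_data_HOT.csv",   # assumed filename
            "cells": [{"name": "GM12878"}],            # assumed cell entry
        }
    ],
    "win_length": 50000,                 # Hi-C resolution in bp (assumed)
    "potentials": ["none", "-10"],
    "chromosome": "21",
    "histon_modification": ["H3K4me3"],  # assumed modification name
    "cell_type_config": {"GM12878": "additional/interaction/GM12878_chr21.txt"},
}

with open("freq_example.json", "w") as fh:
    json.dump(config, fh, indent=2)
```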
output:
The frequency of interaction among the different regions, per chemical potential and cell type.
Clustering:
To cluster the different regions of a chromosome based on ntba values, call the process_features function from igap.features_manager; based on the calculated features, you can then cluster the regions with the clustering algorithm of your choice. For an example of how to compute the cluster members from these data, refer to sample five.
from igap.features_manager import FeaturesManager
data, normalized = FeaturesManager.process_features(files, 'output/5', cell)
inputs:
files_list: csv files calculated with the frequency function.
output_folder: a path (string) where the program stores all outputs.
tag_name: a custom name for distinguishing the output files.
output:
The function processes the provided data and returns both the normalized and the raw values, which can be used to cluster the regions.
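As sample five suggests, the returned normalized features can be fed to any clustering algorithm. A minimal sketch with scikit-learn's KMeans, using a random placeholder matrix in place of the actual process_features output:

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder standing in for the `normalized` matrix returned by
# FeaturesManager.process_features: one feature vector per genomic bin.
rng = np.random.default_rng(0)
normalized = rng.random((120, 8))

# Group the bins into four clusters; the cluster count is a free choice.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(normalized)
labels = kmeans.labels_   # one cluster assignment per bin
print(len(labels))
```

Any other scikit-learn clusterer (e.g. AgglomerativeClustering) accepts the same feature matrix.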