Introduction
Our workflow takes as input the gene annotations of a collection of genomes and their corresponding Hi-C maps. It preprocesses these data,
runs the GraphTeam algorithm on them, and then performs additional analyses. For further information on the
workflow's individual steps, we refer to our paper.
After downloading, the workflow can be run directly on the data that is provided for download as well. It is a Snakemake workflow,
which allows for many custom adjustments. For example, it can be executed only partially or distributed across multiple cores.
Detailed documentation on how to handle such a workflow can be found in Snakemake's documentation.
Running the workflow on custom data requires the input files to be in certain formats and to be placed at the expected locations. Both requirements are described below.
Requirements
To run the workflow properly, the following dependencies need to be satisfied:
Data Placement
For each organism in your dataset:
- Create a folder data/genomic/{organism name} and copy an Ensembl data table of the organism into it.
- Create a folder data/hic_maps/{organism name} and copy all intra-chromosomal Hi-C maps of the organism into it.
Moreover, the workflow supports a deeper analysis of the identified graph teams using the Gene Ontology database.
To perform this analysis, a version of the database in OBO format has to be placed in data/go together with a
GO annotation file of the desired organism.
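The folder layout described above can be prepared with a short script. The following is only a sketch: the function name create_layout and the organism names in the usage note are placeholders and not part of the workflow, and the input files still have to be copied into the folders by hand.

```python
from pathlib import Path


def create_layout(root, organisms, with_go=False):
    """Create the input folders described above below `root`.

    `organisms` is a list of organism names chosen by the user.
    Returns the list of folders that now exist.
    """
    root = Path(root)
    created = []
    for organism in organisms:
        # One folder for the annotation table, one for the Hi-C maps.
        for kind in ("genomic", "hic_maps"):
            folder = root / "data" / kind / organism
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    if with_go:
        # Only needed for the optional Gene Ontology analysis.
        go_folder = root / "data" / "go"
        go_folder.mkdir(parents=True, exist_ok=True)
        created.append(go_folder)
    return created
```

Calling create_layout(".", ["human", "mouse"], with_go=True) from the package directory would, for instance, create data/genomic/human, data/hic_maps/human, data/genomic/mouse, data/hic_maps/mouse and data/go.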
Data Formats
Gene Annotations
The gene annotations of each organism should be a single file in the format of an Ensembl data table that contains the annotations
of all chromosomes. Each line of such a table contains a chromosome identifier, which assigns the line's annotation to a
chromosome. Make sure that this identifier is the same as the one used in the file names of the Hi-C maps. The file name suffix
must be as specified in the config.yaml file.
Hi-C Maps
Intra-chromosomal Hi-C maps should start with two header lines. Each further line begins with a line identifier
followed by tab-separated Hi-C values.
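To make the expected layout concrete, here is a minimal reader for such a map. It is a sketch and not the parser used by the workflow. It assumes that the line identifier is separated from the values by a tab as well and that all values are numeric; the function name read_hic_map is hypothetical.

```python
def read_hic_map(handle):
    """Read a Hi-C map from an open text file.

    Assumed layout: two header lines, then one row per line of the form
    identifier<TAB>value<TAB>value...
    Returns (header lines, row identifiers, rows of float values).
    """
    # The first two lines are headers and are kept as plain text.
    header = [next(handle).rstrip("\n"), next(handle).rstrip("\n")]
    row_ids = []
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines, e.g. at the end of the file
        identifier, *values = line.split("\t")
        row_ids.append(identifier)
        rows.append([float(value) for value in values])
    return header, row_ids, rows
```

A file can be read with `with open(path) as handle: header, row_ids, rows = read_hic_map(handle)`. For a valid intra-chromosomal map, every row should contain the same number of values.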
The data normalization requires all maps to have the same resolution. However, maps with different resolutions may be
used if the highest resolution among all Hi-C maps divides all other used resolutions without remainder.
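The divisibility condition can be checked before running the workflow. The sketch below reads "highest resolution" as the smallest bin size in base pairs, which is an interpretation on our side; the function name resolutions_compatible is hypothetical.

```python
def resolutions_compatible(resolutions_bp):
    """Check whether maps with the given resolutions can be combined.

    `resolutions_bp` holds bin sizes in base pairs. The finest map
    (smallest bin size) must divide every other bin size without remainder.
    """
    resolutions_bp = list(resolutions_bp)
    if not resolutions_bp:
        raise ValueError("at least one resolution is required")
    if any(r <= 0 for r in resolutions_bp):
        raise ValueError("resolutions must be positive numbers of base pairs")
    finest = min(resolutions_bp)
    return all(r % finest == 0 for r in resolutions_bp)
```

For example, maps with 50000 bp, 100000 bp and 250000 bp bins pass the check, whereas 40000 bp and 100000 bp do not, because 100000 is not a multiple of 40000.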
File names should contain a substring "chrID." ("chrID.chrID." for inter-chromosomal maps), where "ID" is the identifier
of the chromosome(s) belonging to the map, which is/are also used in the corresponding annotation file. A map's resolution
in base pairs has to be stated directly in front of the file name suffix. The suffix itself must be as specified in
the config.yaml file.
See the provided Hi-C maps for an example.
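The naming rules can be checked with a small helper before the workflow is started. This is a sketch under assumptions: the file names in the usage note are invented for illustration, the suffix is assumed to include its leading dot (e.g. ".tsv"), and the function name parse_map_name is hypothetical. The actual suffix is the one set in config.yaml.

```python
import re

# "chrID." optionally followed by a second "chrID." (inter-chromosomal maps).
CHR_PATTERN = re.compile(r"chr([^.]+)\.(?:chr([^.]+)\.)?")
# Digits at the very end of the name once the suffix has been removed.
RESOLUTION_PATTERN = re.compile(r"(\d+)$")


def parse_map_name(file_name, suffix):
    """Return (chromosome IDs, resolution in bp) for a Hi-C map file name.

    `suffix` is assumed to include the leading dot, e.g. ".tsv".
    """
    if not suffix or not file_name.endswith(suffix):
        raise ValueError(f"{file_name!r} does not end with suffix {suffix!r}")
    stem = file_name[: -len(suffix)]
    chromosomes = CHR_PATTERN.search(stem)
    if chromosomes is None:
        raise ValueError(f"no 'chrID.' substring found in {file_name!r}")
    resolution = RESOLUTION_PATTERN.search(stem)
    if resolution is None:
        raise ValueError(f"no resolution in front of the suffix in {file_name!r}")
    # One ID for intra-chromosomal maps, two for inter-chromosomal maps.
    ids = tuple(g for g in chromosomes.groups() if g is not None)
    return ids, int(resolution.group(1))
```

With the invented name "sample.chr1.100000.tsv" and suffix ".tsv", the helper returns the chromosome ID "1" and the resolution 100000. With "sample.chrX.chr2.50000.tsv" it returns the IDs "X" and "2" and the resolution 50000.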
Execution
Before you execute the workflow, do not forget to configure the config.yaml file and adjust the choice of delta thresholds.
The workflow can be started from the command line by switching to the package directory and executing
> snakemake
Identified graph teams will be written to the graph_teams sub-directory.