BiBiServ2 - GraphTeams

Introduction

Our workflow takes as input the gene annotations of a collection of genomes and their corresponding Hi-C maps. It preprocesses these data, runs the GraphTeam algorithm on them, and then performs additional analyses. For details on the individual steps of the workflow, we refer to our paper.

After downloading, the workflow can be run directly on the example data, which is provided for download as well. It is implemented as a Snakemake workflow, which allows many custom adjustments. For example, it can be executed only partially or distributed over multiple cores. Detailed instructions on handling such a workflow can be found in the Snakemake documentation.

Running the workflow on custom data requires the input files to be in certain formats and to be placed at the expected locations. Both are described below.

Requirements

In order to run the workflow properly the following dependencies need to be satisfied:

Data Placement

For each organism of your dataset:

Create a folder data/genomic/{organism name} and copy an Ensembl data table of the organism into it.
Create a folder data/hic_maps/{organism name} and copy all intra-chromosomal Hi-C maps of the organism into it.

Moreover, the workflow allows a deeper analysis of the identified graph teams using the Gene Ontology (GO) database. To perform this analysis, a version of the database in OBO format has to be placed in data/go together with a GO annotation file of the desired organism.
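The folder layout described above can be prepared with a few lines of Python. This is only a minimal sketch: the function name create_layout and the organism names in the usage note are hypothetical, and the script creates empty folders only. The data files themselves still have to be copied in by hand.

```python
from pathlib import Path


def create_layout(root, organisms):
    """Create the input folders described above below `root`.

    `root` is assumed to be the workflow's package directory.
    Returns the list of folders that now exist.
    """
    root = Path(root)
    folders = []
    for organism in organisms:
        # One folder for the gene annotations, one for the Hi-C maps.
        for kind in ("genomic", "hic_maps"):
            folders.append(root / "data" / kind / organism)
    # Optional: folder for the GO database (OBO) and GO annotation file.
    folders.append(root / "data" / "go")
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

Calling create_layout(".", ["organism_a", "organism_b"]) from the package directory would create data/genomic/organism_a, data/hic_maps/organism_a, the same two folders for organism_b, and data/go.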

Data Formats

Gene Annotations

The gene annotations of each organism should be provided as a single file in the format of an Ensembl data table that contains the annotations of all chromosomes. Each line of such a table contains a chromosome identifier, which assigns the line's annotation to a chromosome. Make sure that this identifier is the same as the one used in the file names of the Hi-C maps. The file name suffix must be as specified in the config.yaml file.

Hi-C Maps

Intra-chromosomal Hi-C maps should contain two header lines, followed by data lines that each start with a line identifier followed by tab-separated Hi-C values. Data normalization requires all maps to have the same resolution. However, maps of different resolutions may be used if the highest resolution among all Hi-C maps divides every other resolution used without remainder. File names should contain a substring "chrID." ("chrID.chrID." for inter-chromosomal maps), where "ID" is the identifier of the chromosome(s) the map belongs to, as also used in the corresponding annotation file. A map's resolution in base pairs has to be stated directly in front of the file name suffix. The suffix itself must be as specified in the config.yaml file. See the provided Hi-C maps for an example.
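Before starting a run, it may help to check custom Hi-C maps against these rules. The following Python sketch is not part of the workflow. All function names, the suffix ".matrix" and the example file names are hypothetical. It also assumes that the line identifier is separated from the values by a tab, and that "highest resolution" means the smallest bin size in base pairs.

```python
import re


def parse_map_name(file_name, suffix):
    """Return (chromosome IDs, resolution in bp) encoded in a file name.

    Expects "chrID." (or "chrID.chrID.") in the name and the resolution
    directly in front of `suffix`.
    """
    if not suffix or not file_name.endswith(suffix):
        raise ValueError(f"{file_name!r} does not end with {suffix!r}")
    stem = file_name[: len(file_name) - len(suffix)]
    resolution = re.search(r"(\d+)$", stem)
    if resolution is None:
        raise ValueError(f"no resolution in front of the suffix: {file_name!r}")
    chromosomes = re.findall(r"chr([^.]+)\.", stem)
    if not chromosomes:
        raise ValueError(f"no 'chrID.' substring in {file_name!r}")
    return chromosomes, int(resolution.group(1))


def resolutions_compatible(resolutions):
    """True if the smallest bin size divides all others without remainder."""
    finest = min(resolutions)
    return all(r % finest == 0 for r in resolutions)


def read_map(lines):
    """Skip the two header lines and parse 'identifier<TAB>value<TAB>...'."""
    rows = {}
    for line in lines[2:]:
        line = line.rstrip("\n")
        if not line:
            continue
        identifier, *values = line.split("\t")
        rows[identifier] = [float(value) for value in values]
    return rows
```

With these assumptions, parse_map_name("sample.chr1.100000.matrix", ".matrix") yields the chromosome list ["1"] and the resolution 100000, and resolutions_compatible([100000, 250000]) is False because 100000 does not divide 250000.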

Execution

Before executing the workflow, do not forget to configure the config.yaml file and adjust the choices of the delta threshold. The workflow can be started from the command line by switching to the package directory and executing:
> snakemake
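For scripted runs, the call can also be wrapped in Python. The sketch below is one possible way to do this, not the authors' own launcher. The helper names are hypothetical. It relies only on Snakemake's standard options --cores (number of cores to use) and -n (dry run, which lists the jobs without executing them), and on the fact that Snakemake accepts target names as positional arguments.

```python
import subprocess


def build_snakemake_command(cores=1, dry_run=False, targets=()):
    """Assemble the snakemake command line as a list of arguments."""
    command = ["snakemake", "--cores", str(cores)]
    if dry_run:
        command.append("-n")  # only show what would be done
    # Optional targets restrict the run to a part of the workflow.
    command.extend(targets)
    return command


def run_workflow(package_dir, **options):
    """Run snakemake inside the package directory (requires snakemake)."""
    command = build_snakemake_command(**options)
    return subprocess.run(command, cwd=package_dir, check=True)
```

A call such as run_workflow(".", cores=4, dry_run=True) corresponds to executing snakemake --cores 4 -n in the package directory. Valid target names depend on the rules and output files defined in the workflow's Snakefile and are not listed here.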

Identified graph teams will be written to the graph_teams sub-directory.