BiBiServ2 - newdist

The newdist software package provides implementations of three similarity measures of different expression power to quantify genomic relationships on the basis of the gene connection model.

All core methods are implemented in Python while some helper scripts are written in BASH; Some of the core methods will output integer linear programs (ILPs) in a format suitable for IBM's solver CPLEX. newdist has the following system requirements:

Python 2.7
NetworkX ≥ 1.10
IBM ILOG CPLEX Optimization Studio
GNU/Unix (optional)
Biopython ≥ 1.6 (optional)

In the gene connection model, genomes are compared based on their bipartite gene connection graph: Given two genomes S and T, a gene connection graph G(S, T) of S and T is a bipartite graph with one vertex for each gene of S and one vertex for each gene of T. An edge between two vertices, one from S and one from T, indicates that there is some connection between the two genes represented by these vertices. The input to the methods in this software package are gene connection graphs that can be constructed from BLAST tables with the provided script pairwise_similarities.py.

The provided methods facilitate the inter-species comparison of gene orders between two genomes by means of counting conserved adjacencies, which are defined as follows: Given an integer θ ≥ 1, a pair of index positions (i, i') with i' ≤ i + θ in a string is a (θ-) adjacency. Further, a pair of adjacencies between two genomes S and T is conserved if

their corresponding genes are connected in their gene connection graph G(S, T) and
their corresponding genes' relative orientation is identical

The different similarity measures provided by newdist are expressed by the following three problem statements:

Problem 1 (total adjacency model). Given two genomes S and T and a gene connection graph G(S,T), count the number of pairs of index positions (i,i') in S and (j,j') in T that form a conserved adjacency. In other words, compute adj(S,T) = |{(i, i',j, j')) | 1 ≤ i < i' ≤ |S|, 1 ≤ j < j' ≤ |T| and (i,i' || j,j')}|.

Problem 2 (gene matching model). Given two genomes S and T, a gene connection graph G(S,T) and a real-valued parameter α ∈ [0, 1], find a bipartite matching M in G(S, T) such that the induced sequences S^M and T^M maximize the measure F_α (M ) = α · adj (S M , T M ) + (1 − α) · edg (M), where edg(M) = |M| is the size of matching M. (The induced sequences S^M and T^Mare the subsequences of S and T, respectively, that contain those characters incident to edges of M.)

Solving Problem 2 is NP-hard even for 1-adjacencies. Therefore we provide a third, intermediate measure, which is more efficient to compute in practice, while still producing one-to-one correspondences between gene extremities. It is defined as the size of the largest subset of non-conflicting conserved adjacencies found in a pair of genomes, where two conserved adjacencies are denoted conflicting if their intervals in either genome are overlapping.

Problem 3 (adjacency matching model). Given two genomes S and T and a gene connection graph G(S,T), let C be the set of conserved adjacencies between S and T. Compute the size |C^*| of a maximum cardinality set of non-conflicting conserved adjacencies C^* ⊆ C.

newdist provides the following core scripts:

enumerate_adjs.py - script to solve Problem 1 and first part of Problem 3 (see paper);
write_ffadj_ilp.py - script to construct an ILP in CPLEX format solving Problem 2;
matching_simple_adjs.py - script to solve the second part of Problem 3 for 1-adjacencies;
write_generalized_adjs_matching_ilp.py - script to construct an ILP in CPLEX format solving Problem 3 for θ-adjacencies with θ > 1;
identify_anchors.py - script to presolve simple subgraphs as preprocessing for ILPs constructed by write_ffadj_ilp.py and write_generalized_adjs_matching_ilp.py;
sol_to_matching.py - a script to extract a matching solution from '*.sol' files produced by CPLEX.