BiBiServ2 - ROCOCO

Introduction

Rococo reconstructs ancestral gene clusters. Given the topology of a phylogenetic tree and the gene orders of the leaf nodes, it calculates optimal sets of gene clusters for the inner nodes. The optimization criterion combines two properties: parsimony, i.e. the number of gains and losses of gene clusters has to be minimal, and consistency, i.e. for each ancestral node, there must exist at least one gene order that contains all the reconstructed clusters.

The underlying model, the labelling problem and the method are introduced in [STO:WIT:2009] by Stoye and Wittler. A more recent, extensive description can be found in [WIT:2010].

To keep the input and output files well organized, the structure of the phylogenetic tree and its annotation are stored in two separate files. One file describes the structure of the tree, and the second one annotates the leaves.

INPUT :: phylogenetic tree

The tree file contains the structure of the phylogenetic tree in a form similar to the Newick tree format [FEL:2005]. This format allows a simple, flat ASCII representation of a tree.

Leaves are represented by their names.
Internal nodes are represented by a pair of matched parentheses. Between them are representations of the nodes that are immediately descended from that node, separated by commas.
Internal nodes can have names. These names follow the right parenthesis for that internal node.
A name can be any string of printable characters except blanks, colons, semicolons, parentheses, and square brackets. It can also be empty. An unnamed node will be named automatically.
The tree ends with a semicolon.
Blanks, tabs or newlines can be placed almost anywhere in the tree.
Lines starting with a '#' are ignored and can be used to comment the file at any position. Empty lines can be added at any position to clarify the structure of the file.

Example:

# toyexample.tree
# This file contains the phylogenetic tree of some toy example species.
# Use the corresponding genome file toyexample.cog
(
    (Leaf_1,Leaf_2),
    (Leaf_3,Leaf_4)
)Root;
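The rules listed above are compact enough to sketch a reader for the tree file. The following Python is only an illustration, not Rococo's own parser: the function names are hypothetical, and the `unnamed_N` scheme for automatically named nodes is an assumption, since the naming scheme Rococo actually uses is not specified here.

```python
import re

def strip_comments(text):
    """Drop empty lines and lines starting with '#' (leading blanks tolerated)."""
    kept = [ln for ln in text.splitlines()
            if ln.strip() and not ln.lstrip().startswith("#")]
    return "".join(kept)

def parse_tree(text):
    """Parse a tree file into nested (name, children) tuples."""
    # Names cannot contain blanks, so all whitespace can be removed safely.
    s = re.sub(r"\s+", "", strip_comments(text))
    pos = 0
    auto = 0  # counter for automatically named nodes (assumed scheme)

    def read_name():
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos] not in "(),:;[]":
            pos += 1
        return s[start:pos]

    def read_node():
        nonlocal pos, auto
        children = []
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume '('
            children.append(read_node())
            while pos < len(s) and s[pos] == ",":
                pos += 1  # consume ','
                children.append(read_node())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')'
        name = read_name()  # the name follows the right parenthesis
        if not name:
            auto += 1
            name = f"unnamed_{auto}"
        return (name, children)

    root = read_node()
    if s[pos:] != ";":
        raise ValueError("the tree must end with a semicolon")
    return root

def leaf_names(node):
    """Collect the names of all leaves below a node, left to right."""
    name, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]

toy = """# toyexample.tree
(
    (Leaf_1,Leaf_2),
    (Leaf_3,Leaf_4)
)Root;
"""
tree = parse_tree(toy)
print(tree[0], leaf_names(tree))
```

The leaf names collected this way are the ones for which the genome file is expected to contain a block.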

INPUT :: genome family

The input requirements of Rococo are designed to conform to the output of GhostFam, which is part of the Gecko project. If you use this tool to determine families of homologous genes, you can simply export the results from GhostFam in csd-format and use this file as input for Rococo.

The genome file contains a block for each leaf node used in the tree file. Each block starts with a line containing only the name of the leaf. The subsequent lines specify the gene order. For each gene, there is one line containing the following fields, separated by tabs, commas or semicolons:

gene family identifier (number or string)
orientation of the gene ('+' or '-')
any string (not used by Rococo)
genome specific identifier of the gene (string)
description(function) of the gene (string)
name of the gene (string)

The first two items are mandatory, whereas the other items are optional. Each occurrence of the gene family identifier '0' is interpreted as a unique gene. This can be used for genes not belonging to any gene family, unidentified genes, or genes not belonging to the core genome. Lines starting with a '#' or '*' are ignored and can be used to comment the file at any position. Empty lines and blanks can be added at any position to clarify the structure of the file. A line consisting of '|' indicates the end of a linear chromosome, and a line consisting of ')' indicates the end of a circular chromosome. If a chromosome is not terminated explicitly, it is treated as a single linear chromosome.

Example:

# Gene orders of the mitochondrial genome of 30 species without the tRNA.

# The orders are taken from the OGRe Database (http://drake.physics.mcmaster.ca/ogre/).



                    Epigonichthys_lucayanus

                    *

                    1       +               epiluc01        Cytochrome c oxidase subunit 1  COX1

                    3       +               epiluc02        Cytochrome c oxidase subunit 3  COX3

                    6       +               epiluc03        NADH dehydrogenase subunit 3    ND3

                    2       +               epiluc04        Cytochrome c oxidase subunit 2  COX2

                    13      +               epiluc05        ATP synthase protein 8  ATP8

                    12      +               epiluc06        ATP synthase F0 subunit 6       ATP6

                    7       +               epiluc07        NADH-ubiquinone oxidoreductase chain 4L ND4L

                    # end of chromosome 1

                    |

                    8       +               epiluc08        NADH dehydrogenase subunit 4    ND4

                    10      +               epiluc09        NADH dehydrogenase subunit 6    ND6

                    9       -               epiluc10        NADH dehydrogenase subunit 5    ND5

                    11      +               epiluc11        cytochrome b subunit    CYTB

                    15      +               epiluc12        ?       RNS

                    14      +               epiluc13        ?       RNL

                    4       +               epiluc14        NADH dehydrogenase subunit 1    ND1

                    5       +               epiluc15        NADH dehydrogenase subunit 2    ND2

                    

                    Halocynthia_roretzi

                    *

                    1       +               halror01        Cytochrome c oxidase subunit I  COX1

                    13      +               halror02                ATP8

                    6       +               halror03                ND3

                    8       +               halror04                ND4

                    10      +               halror05                ND6

                    3       +               halror06        Cytochrome c oxidase subunit III        COX3

                    7       +               halror07                ND4L

                    15      +               halror08                RNS

                    2       +               halror09        Cytochrome c oxidase subunit II COX2

                    11      +               halror10                CYTB

                    5       +               halror11                ND2

                    9       +               halror12                ND5

                    14      +               halror13                RNL

                    4       +               halror14                ND1

                    12      +               halror15                ATP6

                    

...
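The genome file format described above can be read with a few lines of code. The sketch below is a hedged illustration in Python, not part of Rococo: the function name, the field keys and the returned data layout are assumptions. It also assumes that a line with a single field is a leaf name, which follows from the rule that gene lines carry at least a family identifier and an orientation.

```python
import re

# Field order as listed in the format description; the key names are hypothetical.
FIELDS = ("family", "orientation", "unused", "gene_id", "description", "name")

def parse_genomes(text):
    """Return {leaf name: [chromosome, ...]}, where each chromosome is a
    dict with the keys 'circular' (bool) and 'genes' (list of dicts)."""
    genomes = {}
    current = None  # chromosome list of the leaf currently being read
    genes = []      # genes of the chromosome currently being read

    def flush(circular=False):
        nonlocal genes
        if genes:
            current.append({"circular": circular, "genes": genes})
        genes = []

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#*":
            continue  # empty line or comment
        if line in ("|", ")"):
            if current is None:
                raise ValueError("chromosome end before the first leaf name")
            flush(circular=(line == ")"))
            continue
        fields = [f.strip() for f in re.split(r"[\t,;]", line)]
        if len(fields) == 1:
            # Assumption: a single field is the name of the next leaf.
            if current is not None:
                flush()  # unterminated chromosome counts as linear
            current = genomes.setdefault(fields[0], [])
            continue
        if current is None:
            raise ValueError("gene line before the first leaf name")
        if fields[1] not in ("+", "-"):
            raise ValueError(f"invalid orientation in line: {raw!r}")
        genes.append(dict(zip(FIELDS, fields)))
    if current is not None:
        flush()
    return genomes

sample = (
    "Leaf_1\n"
    "1\t+\t\tg1\tsome gene\tA\n"
    "2\t-\n"
    "|\n"
    "0\t+\n"
    ")\n"
    "Leaf_2\n"
    "# a comment\n"
    "3;+;;g3\n"
)
genomes = parse_genomes(sample)
for leaf, chromosomes in genomes.items():
    print(leaf, [len(c["genes"]) for c in chromosomes])
```

Optional fields that are missing from a line are simply absent from the corresponding gene dictionary, and anything beyond the sixth field is dropped.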

OUTPUT :: Zipped Output

The output consists of several files:

Summary file: This file contains some information about the input and the reconstruction results. Reconstructed gene clusters are reported in a condensed form: Overlapping gene clusters are joined to CARs (Contiguous Ancestral Regions). Elements enclosed by '(...)' are reconstructed to occur contiguously but in arbitrary order, whereas elements enclosed by '[...]' are reconstructed to occur exactly in the given order. This structure can be nested, e.g. '[1,2,(3,4,5),6]' describes the gene orders (1,2,3,4,5,6), (1,2,3,5,4,6), (1,2,4,3,5,6), (1,2,4,5,3,6), (1,2,5,3,4,6), and (1,2,5,4,3,6) (or the respective inverse gene orders).
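To make the bracket notation concrete, the following Python sketch expands a CAR string into the gene orders it describes. It is an illustration written for this text, not code taken from Rococo; the function names are hypothetical. It enumerates the orders as written and does not generate the inverse orders mentioned above.

```python
from itertools import permutations, product

def parse_car(text):
    """Parse CAR notation into nested (kind, items) tuples, where kind is
    '[' for a fixed order or '(' for an arbitrary order. Genes stay strings."""
    closing = {"[": "]", "(": ")"}
    s = text.replace(" ", "")
    pos = 0

    def parse_element():
        nonlocal pos
        if pos < len(s) and s[pos] in closing:
            kind = s[pos]
            pos += 1  # consume the opening bracket
            items = [parse_element()]
            while pos < len(s) and s[pos] == ",":
                pos += 1  # consume ','
                items.append(parse_element())
            if pos >= len(s) or s[pos] != closing[kind]:
                raise ValueError(f"expected {closing[kind]!r} at position {pos}")
            pos += 1  # consume the closing bracket
            return (kind, items)
        start = pos
        while pos < len(s) and s[pos] not in ",[]()":
            pos += 1
        return s[start:pos]

    node = parse_element()
    if pos != len(s):
        raise ValueError(f"unexpected text at position {pos}")
    return node

def expand(node):
    """Return all gene orders (tuples of gene labels) described by a node."""
    if isinstance(node, str):
        return [(node,)]
    kind, items = node
    # '[...]' keeps the given order, '(...)' allows every permutation.
    arrangements = [items] if kind == "[" else permutations(items)
    orders = []
    for arrangement in arrangements:
        for parts in product(*(expand(item) for item in arrangement)):
            orders.append(tuple(gene for part in parts for gene in part))
    return orders

for order in expand(parse_car("[1,2,(3,4,5),6]")):
    print(",".join(order))
```

For '[1,2,(3,4,5),6]' this prints the six gene orders listed above. Note that the number of orders grows factorially with the size of a '(...)' group, so full enumeration is only practical for small CARs.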

Labeling files: For each internal node of the given tree, one file containing a detailed description of the reconstructed labeling is created. Each of these files gives a short summary of the CAR sizes and then lists all CARs. For each CAR, its structure (as described above) is given and the annotations of the genes it comprises are listed. The end of each file lists the gene clusters that contradict other reconstructed clusters and were therefore deleted during the optimization phase.

Parameter

Rococo-Compare

Use the 'Open file'-button to open reconstruction results (labeling files) of different internal nodes or from different runs of Rococo. In the 'File-open'-dialog, you can select several files at a time by using 'Shift' or 'Ctrl' to open them simultaneously. For each file, one line is added to the 'Comparison'-field. Files can be closed by pressing the 'X'-button of a specific file or by using the 'Close all files'-button. If the file names are too long, they are shown in shortened form. To see the full file name, hover the mouse pointer over the shortened file name.

Choose a reference file by selecting the radio button at the left of one of the files. Choose a CAR from the reference file using '<<' or '>>'. Details of the CAR are shown in the 'CAR Details' field at the bottom of the window. For all other files, only CARs with overlapping gene content are shown. The common genes are highlighted in red. Choose a CAR from any file using '<<' or '>>'. Use '+/-' to invert the order of a CAR. (The signs of the genes are lost in the process.)

'Filter by...': Here, you can filter CARs by specifying a minimal size or by searching for a string that has to be contained as a whole word in the CAR details. (You can also use regular expressions, e.g. '.*\d[\])][,\])].*' to filter for flexible CARs.) From the reference file, only matching CARs will be shown. CARs of the other files that have overlapping gene content but do not meet the filter criteria are shown in gray, with common genes highlighted in yellow.
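The effect of the example expression can be tried outside the program. The sketch below uses Python's `re` module as a stand-in, which is an assumption: the regular expression engine used by Rococo-Compare is not stated here, although this particular pattern uses only syntax common to most engines. The function name is hypothetical. The pattern looks for a digit followed by a closing bracket, followed in turn by a comma or another closing bracket, i.e. a bracketed group that ends inside a larger CAR.

```python
import re

# The example pattern from the 'Filter by...' description, unchanged.
NESTED_GROUP = re.compile(r".*\d[\])][,\])].*")

def matches_filter(car_structure):
    """True if the whole CAR structure string matches the example pattern."""
    return NESTED_GROUP.fullmatch(car_structure) is not None

for car in ["[1,2,(3,4,5),6]", "[1,(2,3)]", "[1,2,3,4]", "(1,2,3)"]:
    print(car, matches_filter(car))
```

Under these assumptions, the first two structures match because an inner group closes before the end of the CAR, while the last two do not, since their only closing bracket is the final character.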