ROCOCO

Introduction

Rococo reconstructs ancestral gene clusters. Given the topology of a phylogenetic tree and the gene orders of the leaf nodes, it calculates optimal sets of gene clusters for the inner nodes. The optimization criterion combines two properties: parsimony, i.e. the number of gains and losses of gene clusters has to be minimal, and consistency, i.e. for each ancestral node, there must exist at least one gene order that contains all the reconstructed clusters.

The underlying model, the labelling problem and the method are introduced in [STO:WIT:2009] by Stoye and Wittler. A more recent, extensive description can be found in [WIT:2010].

To keep the input and output files clearly arranged, the structure of the phylogenetic tree and its annotation are separated into two files. One file describes the structure of the tree, and the second one annotates the leaves.

INPUT :: phylogenetic tree

The tree file contains the structure of the phylogenetic tree in a form similar to the Newick tree format [FEL:2005]. This format enables a simple, flat ASCII representation of a tree.
  • Leaves are represented by their names.
  • Internal nodes are represented by a pair of matched parentheses. Between them are representations of the nodes that are immediately descended from that node, separated by commas.
  • Internal nodes can have names. These names follow the right parenthesis for that internal node.
  • A name can be any string of printable characters except blanks, colons, semicolons, parentheses, and square brackets. It can also be empty. An unnamed node will be named automatically.
  • The tree ends with a semicolon.
  • Blanks, tabs or newlines can be placed nearly everywhere in the tree.
  • Lines starting with a '#' are ignored and can be used to comment the file at any position. Empty lines can be added at any position to clarify the structure of the file.
Example:
# toyexample.tree
# This file contains the phylogenetic tree of some toy example species.
# Use the corresponding genome file toyexample.cog
(
    (Leaf_1,Leaf_2),
    (Leaf_3,Leaf_4)
)Root;
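
The rules above can be illustrated with a small reader for this tree format. This is only a sketch and not part of Rococo: the function names `parse_tree` and `leaf_names`, the tuple representation, and the automatic names `node_1`, `node_2`, ... are assumptions made for the example (the manual does not state how Rococo names unnamed nodes).

```python
import re


def parse_tree(text):
    """Parse a tree file into nested (name, children) tuples."""
    # Ignore comment lines starting with '#', then drop all whitespace,
    # since blanks, tabs and newlines carry no meaning in the format.
    lines = [ln for ln in text.splitlines() if not ln.lstrip().startswith("#")]
    s = re.sub(r"\s+", "", "".join(lines))
    pos = 0
    counter = 0

    def read_name():
        nonlocal pos
        start = pos
        # A name ends at the first character that is not allowed in names.
        while pos < len(s) and s[pos] not in "(),;:[]":
            pos += 1
        return s[start:pos]

    def read_node():
        nonlocal pos, counter
        children = []
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume '('
            children.append(read_node())
            while pos < len(s) and s[pos] == ",":
                pos += 1  # consume ','
                children.append(read_node())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')'
        name = read_name()
        if not name:
            counter += 1
            name = f"node_{counter}"  # hypothetical automatic name
        return (name, children)

    root = read_node()
    if pos >= len(s) or s[pos] != ";":
        raise ValueError("the tree must end with ';'")
    return root


def leaf_names(node):
    """Collect the names of all leaves below a node, left to right."""
    name, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]
```

Applied to the toy example, `leaf_names(parse_tree(text))` lists the four leaves, which are the names the genome file has to provide blocks for.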

INPUT :: genome family

The input requirements of Rococo are designed to conform to the output of GhostFam, which is part of the project Gecko. If you use this tool to determine families of homologous genes, you can simply export the results from GhostFam in csd-format and use this file as input for Rococo.

The genome file contains a block for each leaf node used in the tree file. Each block starts with a line containing only the name of the leaf. The subsequent lines specify the gene order. For each gene, there is one line containing the following fields, separated by tabs, commas, or semicolons:

  • gene family identifier (number or string)
  • orientation of the gene ('+' or '-')
  • any string (not used by Rococo)
  • genome specific identifier of the gene (string)
  • description (function) of the gene (string)
  • name of the gene (string)

The first two items are mandatory, whereas the other items are optional. Each appearance of the gene identifier '0' is interpreted as a unique gene. This can be used for genes not belonging to any gene family, unidentified genes, or genes not belonging to the core genome. Lines starting with a '#' or '*' are ignored and can be used to comment the file at any position. Empty lines and blanks can be added at any position to clarify the structure of the file. A line consisting of '|' indicates the end of a linear chromosome, and a line consisting of ')' indicates the end of a circular chromosome. If a chromosome is not terminated explicitly, it is treated as one linear chromosome.

Example:
# Gene orders of the mitochondrial genome of 30 species without the tRNA.
# The orders are taken from the OGRe Database (http://drake.physics.mcmaster.ca/ogre/).

Epigonichthys_lucayanus
*
1 + epiluc01 Cytochrome c oxidase subunit 1 COX1
3 + epiluc02 Cytochrome c oxidase subunit 3 COX3
6 + epiluc03 NADH dehydrogenase subunit 3 ND3
2 + epiluc04 Cytochrome c oxidase subunit 2 COX2
13 + epiluc05 ATP synthase protein 8 ATP8
12 + epiluc06 ATP synthase F0 subunit 6 ATP6
7 + epiluc07 NADH-ubiquinone oxidoreductase chain 4L ND4L
# end of chromosome 1
|
8 + epiluc08 NADH dehydrogenase subunit 4 ND4
10 + epiluc09 NADH dehydrogenase subunit 6 ND6
9 - epiluc10 NADH dehydrogenase subunit 5 ND5
11 + epiluc11 cytochrome b subunit CYTB
15 + epiluc12 ? RNS
14 + epiluc13 ? RNL
4 + epiluc14 NADH dehydrogenase subunit 1 ND1
5 + epiluc15 NADH dehydrogenase subunit 2 ND2

Halocynthia_roretzi
*
1 + epiluc01 Cytochrome c oxidase subunit I COX1
13 + halror02 ATP8
6 + halror03 ND3
8 + halror04 ND4
10 + halror05 ND6
3 + halror06 Cytochrome c oxidase subunit III COX3
7 + halror07 ND4L
15 + halror08 RNS
2 + halror09 Cytochrome c oxidase subunit II COX2
11 + halror10 CYTB
5 + halror11 ND2
9 + halror12 ND5
14 + halror13 RNL
4 + halror14 ND1
12 + halror15 ATP6

...
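
The genome file format described above can be read with a few lines of code. The following is a sketch under stated assumptions, not Rococo's own parser: the function name `parse_genomes` and the returned dictionary layout are made up for this example, only the two mandatory fields are kept, and a line is taken to be a leaf name whenever its second field is not '+' or '-'. The special meaning of the identifier '0' is not handled here.

```python
import re


def parse_genomes(text):
    """Return {leaf_name: [chromosome, ...]}, where each chromosome is a dict
    with 'circular' (bool) and 'genes' (list of (family, orientation))."""
    genomes = {}
    current = None  # chromosome list of the leaf currently being read
    genes = []      # genes of the chromosome currently being read

    def close(circular):
        nonlocal genes
        if current is not None and genes:
            current.append({"circular": circular, "genes": genes})
        genes = []

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#*":
            continue  # empty lines and comment lines are ignored
        if line == "|":
            close(False)  # end of a linear chromosome
        elif line == ")":
            close(True)   # end of a circular chromosome
        else:
            fields = [f.strip() for f in re.split(r"[\t,;]", line)]
            if len(fields) >= 2 and fields[1] in ("+", "-"):
                if current is None:
                    raise ValueError("gene line before any leaf name")
                genes.append((fields[0], fields[1]))
            else:
                # Assumed heuristic: anything else starts a new leaf block.
                close(False)  # an unterminated chromosome counts as linear
                current = genomes.setdefault(line, [])
    close(False)
    return genomes
```

The keys of the returned dictionary can then be compared with the leaf names of the tree file to check that every leaf has a genome block.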

OUTPUT :: Zipped Output

The output consists of several files:

Summary file: This file contains some information about the input and the reconstruction results. Reconstructed gene clusters are reported in a condensed form: Overlapping gene clusters are joined to CARs (Contiguous Ancestral Regions). Elements enclosed by '(...)' are reconstructed to occur contiguously but in arbitrary order, whereas elements enclosed by '[...]' are reconstructed to occur exactly in the given order. This structure can be nested, e.g. '[1,2,(3,4,5),6]' describes the gene orders (1,2,3,4,5,6), (1,2,3,5,4,6), (1,2,4,3,5,6), (1,2,4,5,3,6), (1,2,5,3,4,6), and (1,2,5,4,3,6) (or the respective inverse gene orders).
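
To make the CAR notation concrete, the following sketch expands a CAR string into the gene orders it describes. It is an illustration only: the function names `parse_car` and `expand` are hypothetical, gene identifiers are kept as strings, and the inverse gene orders are deliberately left out.

```python
from itertools import permutations, product


def parse_car(text):
    """Parse e.g. '[1,2,(3,4,5),6]' into nested (kind, items) tuples,
    where kind is 'fixed' for '[...]' and 'free' for '(...)'."""
    text = text.replace(" ", "")
    closing = {"[": "]", "(": ")"}
    pos = 0

    def read():
        nonlocal pos
        ch = text[pos]
        if ch in closing:
            kind = "fixed" if ch == "[" else "free"
            pos += 1  # consume the opening bracket
            items = [read()]
            while text[pos] == ",":
                pos += 1
                items.append(read())
            if text[pos] != closing[ch]:
                raise ValueError(f"expected {closing[ch]!r} at position {pos}")
            pos += 1  # consume the closing bracket
            return (kind, items)
        start = pos
        while pos < len(text) and text[pos] not in ",[]()":
            pos += 1
        return text[start:pos]  # a single gene identifier

    return read()


def expand(node):
    """List every gene order described by a parsed CAR (without inverses)."""
    if isinstance(node, str):
        return [(node,)]
    kind, items = node
    # 'free' blocks allow any arrangement of their elements,
    # 'fixed' blocks only the given one.
    arrangements = permutations(items) if kind == "free" else [tuple(items)]
    orders = []
    for arrangement in arrangements:
        for parts in product(*(expand(item) for item in arrangement)):
            orders.append(tuple(g for part in parts for g in part))
    return orders
```

For '[1,2,(3,4,5),6]', `expand(parse_car(...))` yields exactly the six gene orders listed above. Note that the number of orders grows factorially with the size of a '(...)' block, so full enumeration is only practical for small CARs.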

Labeling files: For each internal node of the given tree, one file containing a detailed description of the reconstructed labeling is created. Each of these files gives a short summary of the CAR sizes and then lists all CARs. For each CAR, its structure (as described above) is given and the annotations of the comprised genes are listed. At the end of each file, the gene clusters which are in contradiction with other reconstructed clusters and have therefore been deleted during the optimization phase are given.

Parameter

Name: Description
Runtime: Maximum runtime in minutes, up to one hour (default: 2 minutes).
Minimum cluster size: Minimum number of genes in a cluster. The minimum cluster size has no effect for the adjacency models.
Maximum cluster size: Maximum number of genes in a cluster; must be larger than or equal to the minimum cluster size. The maximum cluster size has no effect for the adjacency models.
Model: You can choose between different gene cluster models:
  • common intervals on sequences without duplications: A set of genes {g_1, ..., g_n} is contained in a gene order as a common interval, if all genes g_1 to g_n occur contiguously in the gene order. That means: in any order and with any sign but not interrupted by any other gene.
  • framed common intervals on sequences without duplications: A framed common interval [a{b_1,...,b_n}c] is contained in a signed gene order, if all genes b_1 to b_n occur contiguously in the gene order, framed by gene a on the left and gene c on the right, or by gene -c on the left and gene -a on the right.
  • nested common intervals on sequences without duplications: Nested common intervals are defined recursively. A nested common interval {g_1,g_2} is contained in a gene order, if the genes g_1 and g_2 are adjacent in the gene order. A nested common interval {{{{g_1,g_2},...},g_(n-1)},g_n} is contained in a gene order, if the genes g_1 to g_n occur contiguously and the nested common interval {{{g_1,g_2},...},g_(n-1)} is contained in the gene order as well.
  • unsigned adjacencies on sequences: An unsigned adjacency {a,b} is contained in a gene order, if the genes a and b are adjacent in the gene order.
  • signed adjacencies on sequences: A signed adjacency {a,b} is contained in a signed gene order, if gene a is followed by gene -b or b by -a in the gene order.
See [WIT:2010] or [STO:WIT:2009] for further details of the definitions.
Approach: You can specify whether to use only one of the sparse approaches or the two-phase approach (sparse and dense phase). Restricting the run to the sparse phase yields a shorter runtime but less precise results (fewer true positives). See [WIT:2010] or [STO:WIT:2009] for further details.
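
Three of the cluster model definitions above can be checked directly on a gene order. The sketch below is an illustration, not Rococo's implementation: the function names are hypothetical, and it assumes a single linear chromosome without duplications, given as a list of non-zero signed integers (the sign encodes the orientation).

```python
def has_common_interval(order, genes):
    """True if the genes occur contiguously, in any order and with any sign."""
    wanted = {abs(g) for g in genes}
    n = len(wanted)
    return any({abs(g) for g in order[i:i + n]} == wanted
               for i in range(len(order) - n + 1))


def has_framed_common_interval(order, a, inner, c):
    """True if the inner genes occur contiguously and are framed by
    a on the left and c on the right, or by -c on the left and -a on the right."""
    wanted = {abs(g) for g in inner}
    n = len(wanted)
    for i in range(len(order) - n - 1):
        left, right = order[i], order[i + n + 1]
        middle = {abs(g) for g in order[i + 1:i + n + 1]}
        if middle == wanted and (left, right) in ((a, c), (-c, -a)):
            return True
    return False


def has_unsigned_adjacency(order, a, b):
    """True if a and b are neighbours, regardless of sign and direction."""
    pair = {abs(a), abs(b)}
    return any({abs(x), abs(y)} == pair for x, y in zip(order, order[1:]))
```

For the gene order `[1, 2, -4, 3, 5, 6]`, the set {3, 4} is a common interval, the framed common interval [2{3,4}5] is contained, and it remains contained when the whole order is read in the inverse direction, `[-6, -5, -3, 4, -2, -1]`.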

Rococo-Compare

Use the 'Open file'-button to open reconstruction results (labeling files) of different internal nodes or from different runs of Rococo. In the 'File-open'-dialog, you can select several files at a time by using 'Shift' or 'Ctrl' to open them simultaneously. For each file, one line in the 'Comparison'-field is added. Files can be closed by pressing the 'X'-button of a specific file or by using the 'Close all files'-button. If the file names are too long, they are shown in shortened form. To see the full file name, hold the mouse pointer over the shortened file name.

Choose a reference file by selecting the radio button at the left of one of the files. Choose a CAR from the reference file using '<<' or '>>'. Details of the CAR are shown in the 'CAR Details' field at the bottom of the window. For all other files, only CARs with overlapping gene content are shown. The common genes are highlighted in red. Choose a CAR from any file using '<<' or '>>'. Use '+/-' to invert the order of a CAR. (Signs will get lost.)

'Filter by...': Here, you can filter CARs by specifying a minimal size or by searching for some string which has to be contained as a whole word in the CAR details. (You can also use regular expressions, e.g. '.*\d[\])][,\])].*' to filter for flexible CARs.) From the reference file, only matching CARs will be shown. CARs of the other files that have overlapping gene content but do not meet the filter criteria are shown in gray, with common genes highlighted in yellow.
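
To see what the regular expression quoted above selects, it can be tried outside the tool. The sketch below uses Python's `re` module, which is an assumption: the manual does not state which regular expression dialect Rococo-Compare uses, and the function `filter_cars` is a hypothetical stand-in that only looks at the CAR structure string, not at the full CAR details.

```python
import re

# The expression from the text: a digit, then ']' or ')',
# then ',', ']' or ')'.
FLEXIBLE = re.compile(r".*\d[\])][,\])].*")


def filter_cars(cars, min_size=0, pattern=None):
    """Keep CAR strings with at least min_size genes that match the pattern."""
    kept = []
    for car in cars:
        # Count gene tokens, i.e. everything between brackets and commas.
        size = len(re.findall(r"[^,\[\]()\s]+", car))
        if size >= min_size and (pattern is None or pattern.fullmatch(car)):
            kept.append(car)
    return kept
```

With this pattern, a nested CAR such as '[1,2,(3,4,5),6]' is kept, whereas a CAR without any inner block, such as '[1,2,3]', is filtered out.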