Introduction
Rococo reconstructs ancestral gene clusters. Given the topology of a phylogenetic tree and
the gene orders of the leaf nodes, it calculates optimal sets of gene clusters for the inner
nodes. The optimization criterion combines two properties: parsimony, i.e. the number of
gains and losses of gene clusters has to be minimal, and consistency, i.e. for each ancestral
node, there must exist at least one gene order that contains all the reconstructed clusters.
The underlying model, the labelling problem and the method are introduced in
[STO:WIT:2009] by Stoye and
Wittler. A more recent, extensive description can be found in
[WIT:2010].
To keep the input and output files clearly arranged, the structure of the phylogenetic
tree and its annotation are separated into two files. One file describes the structure
of the tree and the second one annotates the leaves.
INPUT :: phylogenetic tree
The tree file contains the structure of the phylogenetic
tree in a form similar to the Newick tree format [FEL:2005].
This format enables a simple, flat ASCII representation of a tree.
- Leaves are represented by their names.
- Internal nodes are represented by a pair of matched parentheses. Between them are representations
of the nodes that are immediately descended from that node, separated by commas.
- Internal nodes can have names. These names follow the right parenthesis for that internal node.
- A name can be any string of printable characters except blanks, colons, semicolons, parentheses,
and square brackets. It can also be empty. An unnamed node will be named automatically.
- The tree ends with a semicolon.
- Blanks, tabs or newlines can be placed nearly everywhere in the tree.
- Lines starting with a '#' are ignored and can be used to comment the file at any position.
Empty lines can be added at any position to clarify the structure of the file.
Example:
# toyexample.tree
# This file contains the phylogenetic tree of some toy example species.
# Use the corresponding genome file toyexample.cog
(
    (Leaf_1,Leaf_2),
    (Leaf_3,Leaf_4)
)Root;
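The format rules above can be illustrated with a minimal Python sketch of a reader for such tree files. It is not Rococo's own parser: the function names `parse_tree` and `leaf_names` and the naming scheme `unnamed_1`, `unnamed_2`, ... for unnamed nodes are assumptions made for this example.

```python
import re

def parse_tree(text):
    """Parse a Newick-like tree string into nested (name, children) tuples."""
    # Drop comment lines (starting with '#'), then remove all whitespace.
    lines = [ln for ln in text.splitlines() if not ln.lstrip().startswith("#")]
    s = re.sub(r"\s+", "", "".join(lines))
    if not s.endswith(";"):
        raise ValueError("the tree must end with a semicolon")
    pos = 0
    unnamed = 0

    def parse_node():
        nonlocal pos, unnamed
        children = []
        if s[pos] == "(":
            pos += 1  # skip the opening parenthesis
            while True:
                children.append(parse_node())
                if s[pos] == ",":
                    pos += 1
                elif s[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected {s[pos]!r} at position {pos}")
        # The (possibly empty) name runs up to the next structural character.
        start = pos
        while s[pos] not in ",();":
            pos += 1
        name = s[start:pos]
        if not name:
            unnamed += 1
            name = f"unnamed_{unnamed}"  # hypothetical automatic name
        return name, children

    root = parse_node()
    if pos != len(s) - 1:
        raise ValueError("unexpected text before the final semicolon")
    return root

def leaf_names(node):
    """Collect the leaf names of a parsed tree from left to right."""
    name, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]

toy = """# toyexample.tree
(
    (Leaf_1,Leaf_2),
    (Leaf_3,Leaf_4)
)Root;"""
tree = parse_tree(toy)
print(tree[0], leaf_names(tree))
```

The leaf names collected this way are the names that need a block in the genome file.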
INPUT :: genome family
The input requirements of Rococo are designed to conform to the output of GhostFam, which is part of the
Gecko project. If you use this tool to
determine families of homologous genes, you can simply export the results from GhostFam in csd format and use
this file as input for Rococo.
The genome file contains a block for each leaf node used in the tree file. Each block starts with a line
containing only the name of the leaf. In the subsequent lines, the gene order is specified. For each gene,
there is a line containing the following items, separated by tabs, commas or semicolons:
- gene family identifier (number or string)
- orientation of the gene ('+' or '-')
- any string (not used by rococo)
- genome specific identifier of the gene (string)
- description (function) of the gene (string)
- name of the gene (string)
The first two items are mandatory, whereas the other items are optional.
Each appearance of the gene identifier '0' is interpreted as a unique gene. This can be used for genes not
belonging to any gene family, unidentified genes, or genes not belonging to the core genome.
Lines starting with a '#' or '*' are ignored and can be used to comment the file at any position. Empty
lines and blanks can be added at any position to clarify the structure of the file. A line consisting of
'|' indicates the end of a linear chromosome, and a line consisting of ')' indicates the end of a circular
chromosome. If a chromosome is not terminated explicitly, it is treated as one linear chromosome.
Example:
# Gene orders of the mitochondrial genome of 30 species without the tRNA.
# The orders are taken from the OGRe Database (http://drake.physics.mcmaster.ca/ogre/).
Epigonichthys_lucayanus
*
1 + epiluc01 Cytochrome c oxidase subunit 1 COX1
3 + epiluc02 Cytochrome c oxidase subunit 3 COX3
6 + epiluc03 NADH dehydrogenase subunit 3 ND3
2 + epiluc04 Cytochrome c oxidase subunit 2 COX2
13 + epiluc05 ATP synthase protein 8 ATP8
12 + epiluc06 ATP synthase F0 subunit 6 ATP6
7 + epiluc07 NADH-ubiquinone oxidoreductase chain 4L ND4L
# end of chromosome 1
|
8 + epiluc08 NADH dehydrogenase subunit 4 ND4
10 + epiluc09 NADH dehydrogenase subunit 6 ND6
9 - epiluc10 NADH dehydrogenase subunit 5 ND5
11 + epiluc11 cytochrome b subunit CYTB
15 + epiluc12 ? RNS
14 + epiluc13 ? RNL
4 + epiluc14 NADH dehydrogenase subunit 1 ND1
5 + epiluc15 NADH dehydrogenase subunit 2 ND2
Halocynthia_roretzi
*
1 + halror01 Cytochrome c oxidase subunit I COX1
13 + halror02 ATP8
6 + halror03 ND3
8 + halror04 ND4
10 + halror05 ND6
3 + halror06 Cytochrome c oxidase subunit III COX3
7 + halror07 ND4L
15 + halror08 RNS
2 + halror09 Cytochrome c oxidase subunit II COX2
11 + halror10 CYTB
5 + halror11 ND2
9 + halror12 ND5
14 + halror13 RNL
4 + halror14 ND1
12 + halror15 ATP6
...
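As an illustration of the genome file layout described above, the following Python sketch reads such a file into a dictionary. It is only a sketch under assumptions: the function name `parse_genomes` and the returned data structure are invented for this example, only the two mandatory items are kept, and a line with a single item is taken to be a leaf name. The items must be separated by real tabs, commas or semicolons.

```python
import re

def parse_genomes(text):
    """Return {leaf_name: [chromosome, ...]}; each chromosome is a dict with
    'circular' (bool) and 'genes' (list of (family, orientation) pairs)."""
    genomes = {}
    current = None  # chromosome list of the leaf currently being read
    genes = []      # genes of the chromosome currently being read

    def close(circular):
        nonlocal genes
        if genes:
            current.append({"circular": circular, "genes": genes})
        genes = []

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#*":
            continue  # empty line or comment
        if line in ("|", ")"):
            close(circular=(line == ")"))  # explicit end of a chromosome
            continue
        fields = [f.strip() for f in re.split(r"[\t,;]", line)]
        if len(fields) == 1:
            # A single item is assumed to be a leaf name starting a new block.
            if current is not None:
                close(circular=False)
            current = genomes.setdefault(fields[0], [])
            continue
        if current is None:
            raise ValueError("gene line found before any leaf name")
        family, orientation = fields[0], fields[1]
        if orientation not in ("+", "-"):
            raise ValueError(f"invalid orientation {orientation!r}")
        genes.append((family, orientation))

    if current is not None:
        close(circular=False)  # an unterminated chromosome counts as linear
    return genomes

sample = "Leaf_1\n*\n1\t+\t\tg01\tsome function\tCOX1\n2\t-\n|\n3\t+\nLeaf_2\n1,+\n2;+\n)\n"
for leaf, chromosomes in parse_genomes(sample).items():
    print(leaf, chromosomes)
```

The sketch keeps the family identifier '0' as it is; giving each such gene its own unique identity is left out here.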
OUTPUT :: Zipped Output
The output consists of several files:
Summary file: This file contains some
information about the input and the reconstruction results.
Reconstructed gene clusters are reported in a condensed form:
Overlapping gene clusters are joined to CARs (Contiguous Ancestral Regions).
Elements enclosed by '(...)' are reconstructed to occur contiguously but in
arbitrary order, whereas elements enclosed by '[...]' are reconstructed to occur
exactly in the given order. This structure can be nested, e.g. '[1,2,(3,4,5),6]'
describes the gene orders (1,2,3,4,5,6), (1,2,3,5,4,6), (1,2,4,3,5,6),
(1,2,4,5,3,6), (1,2,5,3,4,6), and (1,2,5,4,3,6) (or their respective inverse gene orders).
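To make the CAR notation concrete, the following Python sketch expands a CAR string into the gene orders it describes. The function names `parse_car` and `expand` are assumptions made for this example and are not part of Rococo. The sketch lists only the forward orders, not the inverse ones, and it does not validate malformed input beyond mismatched brackets.

```python
from itertools import permutations, product

def parse_car(text):
    """Parse CAR notation into nested (kind, children) tuples.
    kind is 'fixed' for [...] and 'free' for (...); genes stay strings."""
    text = text.replace(" ", "")
    closing = {"[": "]", "(": ")"}
    pos = 0

    def parse_element():
        nonlocal pos
        if text[pos] in closing:
            kind = "fixed" if text[pos] == "[" else "free"
            end = closing[text[pos]]
            pos += 1
            children = [parse_element()]
            while text[pos] == ",":
                pos += 1
                children.append(parse_element())
            if text[pos] != end:
                raise ValueError(f"expected {end!r} at position {pos}")
            pos += 1
            return kind, children
        # A plain gene: read up to the next structural character.
        start = pos
        while pos < len(text) and text[pos] not in ",[]()":
            pos += 1
        return text[start:pos]

    node = parse_element()
    if pos != len(text):
        raise ValueError("unexpected trailing text")
    return node

def expand(node):
    """List all gene orders (forward direction only) described by a CAR."""
    if isinstance(node, str):
        return [[node]]
    kind, children = node
    # '[...]' keeps the given order, '(...)' allows any order of its elements.
    arrangements = [children] if kind == "fixed" else permutations(children)
    orders = []
    for arrangement in arrangements:
        for parts in product(*(expand(child) for child in arrangement)):
            orders.append([gene for part in parts for gene in part])
    return orders

for order in expand(parse_car("[1,2,(3,4,5),6]")):
    print(",".join(order))
```

For '[1,2,(3,4,5),6]' this prints the six gene orders listed above.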
Labeling files: For each internal node of the given
tree, one file containing a detailed description of the reconstructed labeling
is created. Each of these files gives a short summary of the CAR sizes and then
lists all CARs. For each CAR, its structure (as described above) is given and the
annotations of the comprised genes are listed. At the end of each file, the gene
clusters which are in contradiction with other reconstructed clusters and have
therefore been deleted during the optimization phase are given.
Parameter
Runtime |
Maximum runtime in minutes, up to one hour (default: 2 minutes). |
Minimum cluster size |
Minimum cluster size (number of genes in a cluster). The minimum cluster size has no effect for adjacency models. |
Maximum cluster size |
Maximum cluster size (number of genes in a cluster); must be larger than or equal to the minimum cluster size.
The maximum cluster size has no effect for adjacency models. |
Model |
You can choose between different gene cluster models:
- common intervals on sequences without duplications:
A set of genes {g_1, ..., g_n} is contained in a gene order as a common interval,
if all genes g_1 to g_n occur contiguously in the gene order. That means: in any
order and with any sign but not interrupted by any other gene.
- framed common intervals on sequences without duplications:
A framed common interval [a{b_1,...,b_n}c] is contained in a signed gene order,
if all genes b_1 to b_n occur contiguously in the gene order, framed by gene a
on the left and gene c on the right, or by gene -c on the left and gene -a on the right.
- nested common intervals on sequences without duplications:
Nested common intervals are defined recursively. A nested common interval {g_1,g_2} is
contained in a gene order, if the genes g_1 and g_2 are adjacent in the gene order.
A nested common interval {{{{g_1,g_2},...},g_(n-1)},g_n} is contained in a gene order,
if the genes g_1 to g_n occur contiguously and the nested common interval
{{{g_1,g_2},...},g_(n-1)} is contained in the gene order as well.
- unsigned adjacencies on sequences: An unsigned adjacency
{a,b} is contained in a gene order, if the genes a and b are adjacent in the gene order.
- signed adjacencies on sequences: A signed adjacency {a,b}
is contained in a signed gene order, if gene a is followed by gene -b or b by -a
in the gene order.
See [WIT:2010] or [STO:WIT:2009]
for further details of the definitions. |
Approach |
You can specify whether to use only one of the sparse approaches
or the two-phase approach (sparse and dense phase). Restricting the run to the sparse
phase yields a shorter runtime but less precise results (fewer true positives).
See [WIT:2010] or
[STO:WIT:2009]
for further details of the definitions. |
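The definitions of the cluster models listed under 'Model' can be illustrated with a short Python sketch that tests whether a cluster is contained in a gene order. This is not Rococo's implementation. The function names are invented for this example, and the sketch assumes a single linear chromosome without duplications, written as a list of signed integers such as [1, -3, 2, 4, 5]. Nested common intervals and signed adjacencies are left out.

```python
def positions(order):
    """Map each gene (ignoring its sign) to its index in the gene order."""
    return {abs(g): i for i, g in enumerate(order)}

def has_common_interval(order, genes):
    """True if all given genes occur contiguously, in any order and sign."""
    pos = positions(order)
    genes = {abs(g) for g in genes}
    if not all(g in pos for g in genes):
        return False
    idx = [pos[g] for g in genes]
    return max(idx) - min(idx) + 1 == len(genes)

def has_unsigned_adjacency(order, a, b):
    """True if genes a and b are direct neighbours, regardless of sign."""
    pos = positions(order)
    return a in pos and b in pos and abs(pos[a] - pos[b]) == 1

def has_framed_common_interval(order, a, inner, c):
    """True if [a{inner}c] occurs: a, then the inner genes in any order and
    sign, then c; or the same read on the reverse strand (-c ... -a)."""
    inner = {abs(g) for g in inner}
    n = len(inner)
    reverse = [-g for g in reversed(order)]  # the inverse gene order
    for seq in (order, reverse):
        for i in range(len(seq) - n - 1):
            window = seq[i:i + n + 2]
            middle = {abs(g) for g in window[1:-1]}
            if window[0] == a and window[-1] == c and middle == inner:
                return True
    return False

order = [1, -3, 2, 4, 5]
print(has_common_interval(order, {2, 3}))
print(has_unsigned_adjacency(order, 3, 2))
print(has_framed_common_interval(order, 1, {2, 3}, 4))
```

Checking the inverse gene order in `has_framed_common_interval` covers the case in which the frame appears as -c on the left and -a on the right.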
Rococo-Compare
Use the 'Open file'-button to open reconstruction results (labeling files) of different
internal nodes or from different runs of Rococo. In the 'File-open'-dialog, you can
select several files at a time by using 'Shift' or 'Ctrl' to open them simultaneously.
For each file, one line in the 'Comparison'-field is added. Files can be closed by
pressing the 'X'-button of a specific file or by using the 'Close all files'-button.
If the file names are too long, they are shown in shortened form. To see the full file
name, hold the mouse pointer on the shortened file name.
Choose a reference file by selecting the radio button at the left of one of the files.
Choose a CAR from the reference file using '<<' or '>>'. Details of the CAR are shown
in the 'CAR Details' field at the bottom of the window.
For all other files, only CARs with overlapping gene content are shown. The common genes
are highlighted in red. Choose a CAR from any file using '<<' or '>>'.
Use '+/-' to invert the order of a CAR. (Signs will get lost.)
'Filter by...': Here, you can filter CARs by specifying a minimal size
or by searching for some string which has to be contained as a whole word in the CAR details.
(You can also use regular expressions, e.g. '.*\d[\])][,\])].*' to filter for flexible CARs.)
From the reference file, only matching CARs will be shown. CARs of the other files with
overlapping gene content that do not meet the filter criteria are shown in gray, with
common genes highlighted in yellow.
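The regular expression given above as an example can be tried outside of Rococo-Compare. The following Python sketch applies it to two CAR strings. It assumes that the filter has to match the whole text, as the leading and trailing '.*' suggest, and it uses Python's `re` module, which may differ in detail from the regular expression engine of Rococo-Compare. The function name `matches_filter` is invented for this example.

```python
import re

# The pattern from the text: a digit, then a closing bracket, then a comma
# or another closing bracket, i.e. a bracketed group nested in another group.
FLEXIBLE = re.compile(r".*\d[\])][,\])].*")

def matches_filter(car_structure):
    """True if the whole CAR string matches the example pattern."""
    return FLEXIBLE.fullmatch(car_structure) is not None

print(matches_filter("[1,2,(3,4,5),6]"))  # nested group: matches
print(matches_filter("[1,2,3]"))          # no nested group: no match
```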