r2cat - Manual
The program r2cat has two main objectives:
- Match a set of contigs to a related reference genome to order and orient
the contigs based on that reference. The mapping is visualized in an interactive
synteny plot.
- Create and visualize a synteny plot of two genomes.
Note: The matching works well for prokaryotic
genomes. Huge eukaryotic genomes will quite likely cause memory problems.
Prerequisites for ordering the contigs:
To run the program the following has to be provided:
- The contigs of a newly sequenced genome in a multiple FASTA
file.
- An already finished reference genome in FASTA format, which should be
related to the contig genome. The contigs are mapped onto it to determine
their order and orientation. The reference genome itself can consist
of several replicons, which should then be saved in one multiple FASTA
file.
Alternatively two genomes in FASTA format can be used to calculate a synteny
plot.
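
Before starting a matching run, it can help to check that both input files really contain the expected sequences. The following is a minimal Python sketch, not part of r2cat, that lists the record names and lengths of a (multiple) FASTA file. The function name `fasta_lengths` is chosen here for illustration only.

```python
import io


def fasta_lengths(handle):
    """Return {record name: sequence length} for a (multiple) FASTA stream.

    The record name is the first whitespace-delimited word after '>'.
    Records with identical names overwrite each other in this sketch.
    """
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


if __name__ == "__main__":
    # Tiny in-memory example; for real data pass open("contigs.fasta").
    demo = io.StringIO(">contig1 first\nACGT\nAC\n>contig2\nGG\n")
    for record, length in fasta_lengths(demo).items():
        print(record, length)
```

A contig file should report one entry per contig, and a reference with several replicons should report one entry per replicon.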
Running r2cat
Start the program and choose "Match new" from the "File" menu. Select two FASTA
files. For this search, the contigs are the queries and the reference genome is the
target. Click on "Start Matching". The program then tries to find regions of up to 8%
difference that have at least 44 exact matches of possibly overlapping 11-mers, each
no further apart than 64 bases.

After the matching has finished, click on "Continue" to see the matches as a dot plot.
The contigs are ordered (and stacked) along the y axis in the order of the underlying
FASTA file. The contigs can now be sorted based on their matches using
"Options->Sort queries". After that, the ordering can be adjusted manually with
"Window->Show queries/contigs".

At this point, you might want to save the matches and the order with "save project"
from the "File" menu. All the matching information and the order of the contigs are
then saved to a file. This file can be edited manually or parsed by other programs.
Each reference and contig has a section specifying its length and the associated file
as well as other information. After that, all matching regions are given in a
tab-separated table. A saved project can later be loaded without repeating the
time-consuming matching.

After the matching and ordering, the order and orientation of the contigs can be
written to a FASTA file. Select "File->Export contigs as FASTA file" from the menu and
select a file to write the results to. The contigs are written to that file in the
displayed order and are reverse complemented where necessary.
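
To illustrate the idea of matching by exact 11-mers, here is a minimal Python sketch. It is not r2cat's implementation: it only collects exact k-mer hits between a contig and a reference on both strands, and it leaves out the later steps described above (requiring at least 44 hits, limiting the distance between hits to 64 bases, and the 8% difference threshold). All function names are hypothetical.

```python
from collections import defaultdict

# Translation table for complementing DNA bases (other symbols stay as they are).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_index(target, k=11):
    """Map every k-mer of the target to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    return index


def seed_hits(query, index, k=11):
    """Collect exact k-mer hits as (query_pos, target_pos, strand).

    For strand '-', query_pos refers to the reverse complemented query.
    """
    hits = []
    for strand, seq in (("+", query), ("-", reverse_complement(query))):
        for q in range(len(seq) - k + 1):
            for t in index.get(seq[q:q + k], ()):
                hits.append((q, t, strand))
    return hits


if __name__ == "__main__":
    reference = "TTGACCATGCAAGTCGGATCCA"
    contig = reverse_complement(reference[3:16])
    for hit in seed_hits(contig, kmer_index(reference)):
        print(hit)
```

In a dot plot, hits on the "+" strand line up along one diagonal direction and hits on the "-" strand along the other, which is what indicates that a contig has to be reverse complemented.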
treecat - Manual
The program treecat can be used to estimate an ordering for a set of contigs. A
so-called layout graph shows the unique order where possible and gives alternatives
where necessary. The layout graph is calculated based on the matches to several
related genomes as well as the information given by a phylogenetic tree of the
involved species.
Prerequisites
Please provide the following information / files:
- The contigs of a newly sequenced genome in a multiple FASTA file.
- Several already finished reference genomes, each in one FASTA file. One
genome can consist of several replicons and should be saved in one multiple
FASTA file.
- A phylogenetic tree of the species in Newick format. The species in the tree
must be named after the FASTA files, using the filenames without
extension.
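
Since the names in the tree have to agree with the FASTA filenames, a quick consistency check before running treecat can save time. The following Python sketch is an assumption about how such a check could look and is not part of treecat. It extracts leaf names from a Newick string with a simple regular expression (quoted labels are not handled) and compares them with the filenames without extension.

```python
import re
from pathlib import Path


def newick_leaf_names(newick):
    """Return the set of leaf labels in a Newick string.

    A leaf label directly follows '(' or ','. Labels of inner nodes follow
    ')' and are therefore not matched. Quoted labels are not supported.
    """
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick))


def check_tree_against_files(newick, fasta_paths):
    """Return (leaves without a FASTA file, FASTA files without a leaf)."""
    leaves = newick_leaf_names(newick)
    stems = {Path(p).stem for p in fasta_paths}
    return sorted(leaves - stems), sorted(stems - leaves)


if __name__ == "__main__":
    # Hypothetical tree and file names, for illustration only.
    tree = "((genomeA:0.1,genomeB:0.2)inner:0.05,genomeC:0.3);"
    files = ["refs/genomeA.fasta", "refs/genomeB.fna", "refs/genomeD.fas"]
    missing_files, missing_leaves = check_tree_against_files(tree, files)
    print("in tree but no file:", missing_files)
    print("file but not in tree:", missing_leaves)
```

Both returned lists should be empty before the tree and the reference files are passed to treecat.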
Running treecat
When treecat is run, it displays an input mask where the user can specify the
files mentioned above. Additionally, a project directory has to be given where a few
files will be cached and where the result of the contig ordering algorithm will be
stored. Note that if the phylogenetic tree is left out, the algorithm weights each
reference genome equally. After providing all necessary information, the algorithm
can be started using the run button. Three phases follow:
- All matches from the contigs to each reference genome that are longer than 64
bases and have less than 8% errors are calculated. Note: The matching component
of treecat works well for prokaryotic genomes. Huge eukaryotic genomes will
quite likely cause memory problems. The matches are cached in the project
directory in files with the extension *.r2c, which can be opened with r2cat to
visualize the matches.
- The matches are used to calculate a contig adjacency graph that assigns each
pair of contig ends a likelihood of being adjacent. If available, the
phylogenetic information is used to weight the connections.
- From the contig adjacency graph a layout graph is computed that shows the
most promising edges. The layout graph in neato format will be written to the
project directory.
The layout graph is the final result of treecat. It can be visualized using
the neato program of the Graphviz package:
    neato -Tps -o layout_graph.ps layout_graph.neato
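
neato reads undirected graphs written in the Graphviz DOT language. To give an impression of what such an input can look like, the following Python sketch writes a small graph whose nodes are contig ends and whose edges carry a weight as label. The node names, the weights and the function `to_neato` are made up for illustration. The exact content of the file written by treecat is not described here and may differ.

```python
def to_neato(edges, name="layout_graph"):
    """Render (left_end, right_end, weight) triples as an undirected DOT graph."""
    lines = [f"graph {name} {{"]
    for left, right, weight in edges:
        # Quote the node names so that arbitrary contig names are accepted.
        lines.append(f'  "{left}" -- "{right}" [label="{weight:.2f}"];')
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical adjacencies between contig ends.
    example = [
        ("contig1_right", "contig2_left", 0.8),
        ("contig2_right", "contig3_left", 0.55),
    ]
    print(to_neato(example), end="")
```

Saving this output to a file and passing it to neato as shown above produces a drawing of the example graph.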