BiBiServ2 - CG-CAT

r2cat - Manual

The program r2cat has two main objectives:

Match a set of contigs to a related reference genome to order and orient the contigs based on that reference. The mapping is visualized in an interactive synteny plot.
Create and visualize a synteny plot of two genomes.

Note: The matching works well for prokaryotic genomes. Huge eukaryotic genomes will quite likely cause memory problems.

Prerequisites for ordering the contigs:

To run the program the following has to be provided:

The contigs of a newly sequenced genome in a multiple FASTA file.
An already finished reference genome in FASTA format, which should be related to the genome of the contigs. The contigs are mapped onto it to determine their order and orientation. If the reference genome consists of several replicons, they should be saved together in one multiple FASTA file.

Alternatively two genomes in FASTA format can be used to calculate a synteny plot.

Running r2cat

Start the program and choose "Match new" from the "File" menu. Select two FASTA files. For this search, the contigs are the queries and the reference genome is the target. Click on "Start Matching". The program then tries to find regions of up to 8% difference that have at least 44 exact matches of possibly overlapping 11-mers, each of which is not further than 64 bases apart from the next. After the matching has finished, click on "Continue" to see the matches as a dot plot.

The contigs are ordered (and stacked) along the y axis in the order of the underlying FASTA file. They can now be sorted based on their matches using "Options->Sort queries". After that, the ordering can be adjusted manually with "Window->Show queries/contigs".

At this point, you might want to save the matches and the order with "save project" from the "File" menu. All the matching information and the order of the contigs are then saved to a file. This file can be edited manually or parsed by other programs. Each reference and contig has a section specifying its length, the associated file, and other information. After that, all matching regions are given in a tab-separated table. A saved project can later be loaded without repeating the time-consuming matching.

After the contigs have been matched and ordered, their order and orientation can be written to a FASTA file. Select "File->Export contigs as FASTA file" from the menu and choose a file to write the results to. The contigs are written to that file in the displayed order and are reverse complemented if necessary.
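The export step can be pictured with a small Python sketch. It is only an illustration of "write the contigs in the displayed order, reverse complemented if necessary" and not r2cat's actual code. The function names, the in-memory dictionary of contigs, the (name, reversed) layout list, and the line width of 60 are all assumptions made for this example.

```python
# Illustrative sketch only; none of these names come from r2cat.

# Translation table for complementing plain DNA letters (N stays N).
_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def export_contigs(contigs, layout, line_width=60):
    """Build FASTA text for the contigs in the given order.

    contigs: dict mapping contig name to sequence (assumed input format)
    layout:  list of (name, is_reversed) pairs in the displayed order
    """
    lines = []
    for name, is_reversed in layout:
        seq = contigs[name]
        if is_reversed:
            seq = reverse_complement(seq)
        lines.append(">" + name)
        # Wrap the sequence into lines of fixed width.
        for start in range(0, len(seq), line_width):
            lines.append(seq[start:start + line_width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = {"contig1": "AACG", "contig2": "TTT"}
    print(export_contigs(demo, [("contig2", False), ("contig1", True)]), end="")
```

In this sketch the orientation is passed in explicitly. In r2cat it results from the matches to the reference and from any manual adjustment.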

treecat - Manual

The program treecat can be used to estimate an ordering for a set of contigs. A so-called layout graph shows the unique order where possible and gives alternatives where necessary. The layout graph is calculated from the matches to several related genomes and from the information given by a phylogenetic tree of the involved species.

Prerequisites

Please provide the following information / files:

The contigs of a newly sequenced genome in a multiple FASTA file.
Several already finished reference genomes, each in one FASTA file. One genome can consist of several replicons and should be saved in one multiple FASTA file.
A phylogenetic tree of the species in Newick format. The species names in the tree must match the filenames of the FASTA files without their extension.
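Because a mismatch between tree names and filenames is easy to overlook, the naming rule can be checked before starting treecat. The following Python sketch is a hypothetical helper and not part of treecat. It assumes a simple Newick string with unquoted leaf labels, and it treats "without extension" as "without the last suffix" (so a name like genome.fasta.gz would not be handled). The example tree and file names are invented.

```python
import re
from pathlib import Path


def newick_leaf_names(newick):
    """Collect leaf labels from a simple Newick string.

    Assumption: labels are unquoted. A leaf label directly follows
    "(" or ",", whereas labels of inner nodes follow ")".
    """
    return set(re.findall(r"[(,]\s*([^\s(),:;]+)", newick))


def check_tree_against_fasta(newick, fasta_paths):
    """Compare tree leaves with FASTA filenames (extension removed).

    Returns two sorted lists: leaves without a matching file, and
    files without a matching leaf.
    """
    leaves = newick_leaf_names(newick)
    stems = {Path(p).stem for p in fasta_paths}
    return sorted(leaves - stems), sorted(stems - leaves)


if __name__ == "__main__":
    tree = "((genomeA:0.1,genomeB:0.2):0.05,genomeC:0.3);"
    files = ["refs/genomeA.fasta", "refs/genomeB.fna"]
    leaves_without_file, files_without_leaf = check_tree_against_fasta(tree, files)
    print("Leaves without file:", leaves_without_file)
    print("Files without leaf:", files_without_leaf)
```

Both returned lists should be empty for the files that are expected to appear in the tree.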

Running treecat

When treecat is started, it displays an input mask where the user can specify the files mentioned above. Additionally, a project directory has to be given, where a few files will be cached and where the result of the contig ordering algorithm will be stored. Note that if the phylogenetic tree is left out, the algorithm weights each reference genome equally. After all necessary information has been provided, the algorithm can be started using the run button. Three phases follow:

All matches from the contigs to each reference genome that are longer than 64 bases and have less than 8% errors are calculated. Note: The matching component of treecat works well for prokaryotic genomes. Huge eukaryotic genomes will quite likely cause memory problems. The matches are cached in the project directory in files with the extension *.r2c, which can be opened with r2cat to visualize the matches.
The matches are used to calculate a contig adjacency graph that gives, for each pair of contig ends, a likelihood that they are adjacent. If available, the phylogenetic information is used to weight the connections.
From the contig adjacency graph, a layout graph is computed that shows the most promising edges. The layout graph is written to the project directory in neato format.

The layout graph is the final result of treecat. It can be visualized using the neato program of the Graphviz package:

neato -Tps -o layout_graph.ps layout_graph.neato