-----------------------   Gecko2   ---------------------------------------
-    ***  GEne Cluster detection in proKaryotic and genOmes ***          -
- Katharina Jahn (kjahn@cebitec.uni-bielefeld.de), Leon Kuchenbecker     -
- AG Genome Informatics, Technical Faculty, Bielefeld University Germany -
- This software is an extension of Gecko (developed by Thomas Schmidt)   -
--------------------------------------------------------------------------

USER MANUAL
-----------

1. Installation

For instructions concerning installation and system requirements,
please consult the Readme.txt provided with the program files.

##################################################################

2. Getting started ...

To get a first impression of the functionality of Gecko2, we
recommend downloading a sample genome file, e.g.
'ghostFamOutStandard.cog'.

##################################################################

3. General Information

Gecko2 is a further development of the original Gecko software.
It allows for the systematic detection of gene cluster conservation
in a large number of genomes. The enhancement over the original Gecko
tool is the use of set-distance based gene cluster models that allow
for the detection of gene clusters with diverse conservation patterns
including gaps and missing genes in cluster occurrences. It is also
tolerant of errors in gene homology assignment. Gecko2 provides a
ranking of the predictions based on statistical significance and an
interactive visualization that supports the functional evaluation of
the clustered genes.

The original version of Gecko is still available on the download page.
GhostFam, the data preparation tool that was integrated into the
original Gecko, is now available as a stand-alone tool.


4. Gecko2

Given a set of genomes in which each gene is assigned to a family
of homologous genes, Gecko2 detects sets of genes that appear in an
approximately conserved neighborhood among the genomes.
A typical Gecko2 session is divided into three parts: genome selection,
cluster detection, and manual evaluation of the predictions. These parts
are described in more detail below.


4.1 Data Preparation

For Gecko2, the basic requirement is that each genome is given as a
sequence of characters (here, numbers), where each character
represents a gene family containing at least one gene. All genes in a
family should be homologs performing the same (or a very similar)
function. We recommend two types of input data. The first is based on
the classification of genes in the COG database; the second is the
family classification exported from GhostFam (see the description of
the original version of Gecko for details).

Input files should have the file extension '.cog' and have to be
organized as follows:

GenomeName <COMMA> Descriptive Text  <NEWLINE>
Descriptive Text (ignored) <NEWLINE> <Genome Content> <NEWLINE>

Each line of <Genome Content> describes the family and function of a
single gene. The lines are listed in the order in which the genes occur
in the genome:

COG no. <TAB> Strand (+ or -) <TAB> COG functional category <TAB>
Gene Name <TAB> functional annotation <NEWLINE>

Example from the file COG_data.cog (see supplementary data in the
manual section of Gecko on the BiBiServ):

Aquifex aeolicus, complete genome - 0..1551335
1529 proteins
0480    +   J   fusA    elongation factor EF-G
0050    +   J   tufA1   elongation factor EF-Tu
0051    +   J   rpsJ    ribosomal protein S10
            ...
0459    +   O   mopA    GroEL
0000    -   -   ----    putative protein
0612    -   R   ymxG    processing protease

Clostridium acetobutylicum ATCC824, complete genome - 0..3940880
3672 proteins
0593    +   L   dnaA    DNA replication initiator protein, ATPase
0592    +   L   dnaN    DNA polymerase III beta subunit
2501    +   S   ----    Small conserved protein, ortholog of YAAA B.subtilis

This file can be created by downloading each genome from the NCBI
database (http://www.ncbi.nlm.nih.gov/genomes/static/eub_g.html),
rearranging the columns, and removing any additional columns. All
genomes then have to be merged into one file.
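
The COG file layout described above can be read with a few lines of
Python. The following is a minimal sketch, not the parser used by
Gecko2. The names Gene, Genome, and parse_cog are hypothetical, and the
sketch assumes that genome blocks are separated by blank lines and that
the gene columns are tab-separated, as in the format description.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Gene:
    family: str      # COG number, kept as text to preserve leading zeros
    strand: str      # '+' or '-'
    category: str    # COG functional category
    name: str        # gene name
    annotation: str  # functional annotation


@dataclass
class Genome:
    name: str
    description: str
    genes: list = field(default_factory=list)


def parse_cog(text):
    """Parse .cog-formatted text into a list of Genome records."""
    genomes = []
    # Assumption: genome blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln for ln in block.splitlines() if ln.strip()]
        if len(lines) < 2:
            continue
        name, _, description = lines[0].partition(",")
        genome = Genome(name.strip(), description.strip())
        # lines[1] is the second descriptive line, which is ignored.
        for line in lines[2:]:
            fields = line.split("\t")
            if len(fields) < 5:
                continue  # skip malformed rows instead of guessing
            genome.genes.append(Gene(*(f.strip() for f in fields[:5])))
        genomes.append(genome)
    return genomes


sample = (
    "Aquifex aeolicus, complete genome - 0..1551335\n"
    "1529 proteins\n"
    "0480\t+\tJ\tfusA\telongation factor EF-G\n"
    "0050\t+\tJ\ttufA1\telongation factor EF-Tu\n"
    "\n"
    "Clostridium acetobutylicum ATCC824, complete genome - 0..3940880\n"
    "3672 proteins\n"
    "0593\t+\tL\tdnaA\tDNA replication initiator protein, ATPase\n"
)
genomes = parse_cog(sample)
for genome in genomes:
    print(genome.name, [gene.family for gene in genome.genes])
```

Only the first comma of the header line separates the genome name from
the descriptive text, so commas inside the description or inside a
functional annotation are preserved.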

The GhostFam data file is structured like the COG data file, but the
field <Genome Content> is defined as follows:

FamilyID (number only) <TAB> Strand (+ or -) <TAB> COG functional
category (usually unknown) <TAB> LongGeneID <TAB> functional
annotation <TAB> Gene Name <TAB> TrEMBL ID <NEWLINE>

Example from 'sample.csd':

B.longum
*
62  +   ?   Biflo_0001  cold shock protein cspA AAN23868.1
336 +   ?   Biflo_0002  chaperone   groEL AAN23869.1
0   -   ?   Biflo_0003  hypothetical protein    BL0003 AAN23870.1
0   +   ?   Biflo_0004  hypothetical protein    BL0004 AAN23871.1
             ...
580 -   ?   Biflo_1726  uracil-DNA glycosylase  ung AAN25596.1
0   +   ?   Biflo_1727  hypothetical protein    BL1813  AAN25594.1

B.subtilis
*
503 +   ?   Bacsu_0001      dnaA    CAB11777.1
504 +   ?   Bacsu_0002  DNA polymerase III (beta subunit)   dnaN    CAB11778.1
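
A single GhostFam gene line can be split into the seven fields listed
above as sketched below. This is an illustration only: the names
GhostFamGene and parse_ghostfam_row are hypothetical, and the strict
check for exactly seven tab-separated fields is an assumption of this
sketch, not documented Gecko2 behavior.

```python
from collections import Counter
from typing import NamedTuple


class GhostFamGene(NamedTuple):
    family_id: int
    strand: str
    category: str
    long_gene_id: str
    annotation: str
    gene_name: str
    trembl_id: str


def parse_ghostfam_row(line):
    """Split one tab-separated gene line into its seven fields."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    return GhostFamGene(int(fields[0]), *fields[1:])


rows = [
    "62\t+\t?\tBiflo_0001\tcold shock protein\tcspA\tAAN23868.1",
    "336\t+\t?\tBiflo_0002\tchaperone\tgroEL\tAAN23869.1",
    "0\t-\t?\tBiflo_0003\thypothetical protein\tBL0003\tAAN23870.1",
]
genes = [parse_ghostfam_row(row) for row in rows]

# Count how many genes of this genome fall into each family.
family_sizes = Counter(gene.family_id for gene in genes)
print(family_sizes)
```

Rows with a missing or extra column raise a ValueError, which makes
formatting problems in a hand-edited file visible before the file is
loaded.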

After an input file has been selected via 'File'->'Open session or
genome file', Gecko2 determines from the file ending whether to load a
genome file (.cog) or a stored session (.gck). If a genome file is
selected, it is parsed and all chromosomes found are listed in a table.
By ticking the check boxes next to the chromosomes in the table, one can
choose the chromosomes that should be part of the search for approximate
gene clusters. Different chromosomes of one genome can be marked and
grouped by clicking on the 'Group' button. Gecko2 suggests a grouping of
chromosomes based on chromosome names. A grouping can be reverted by
marking the grouped chromosomes and clicking on the 'Ungroup' button.
Genome selection is finished by clicking on the 'OK' button. The genomes
are then visualized in a genome browser, which allows the user to
inspect the genomes, the genes they contain, and the gene annotations.
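
The choice between a genome file and a stored session, which is made
from the file ending, can be pictured as follows. The function name
classify_input is hypothetical, and the case-insensitive comparison is
an assumption of this sketch rather than documented Gecko2 behavior.

```python
from pathlib import Path


def classify_input(path):
    """Return the kind of input file, judged by the file ending alone."""
    suffix = Path(path).suffix.lower()  # assumption: case is ignored
    if suffix == ".cog":
        return "genome file"
    if suffix == ".gck":
        return "session"
    raise ValueError(f"unsupported file ending: {suffix!r}")


print(classify_input("ghostFamOutStandard.cog"))
print(classify_input("my_results.gck"))
```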


4.2 Cluster Detection

After clicking the 'start computation' button, the user is asked to
select a search mode, as well as global and model-dependent parameters,
before the actual search begins. The first step is to choose between the
median, center, and reference gene cluster models. For all models, the
minimum cluster size, the maximum distance, and optionally a quorum
parameter can be set.

The minimum cluster size defines the minimum number of genes that a gene
cluster must contain to be reported by the program.

The distance threshold gives an upper bound on the number of differences
that are allowed between a gene set and its approximate occurrences. In
median mode, the value gives the maximum set distance that is allowed
between the median and each approximate occurrence. In center and
reference mode, the value determines the maximum pairwise distance
between the center or the reference set, respectively, and each
approximate occurrence.

The quorum parameter determines the minimum number of genomes in which a
gene cluster must have an approximate occurrence in order to be
reported. By default, this value is set to the number of selected
genomes, so that only gene clusters with an approximate occurrence in
all genomes are reported.

If the reference mode is selected, one can choose between three
sub-modes. In the 'all against all' mode, gene clusters are predicted
using each input genome in turn as the reference genome. In the 'fixed
genome' mode, only one genome is used as the reference. It can be chosen
from a drop-down list containing the previously selected genomes. In the
'manual cluster' mode, a sequence of genes can be typed in manually or
pasted, e.g. after copying a cluster from the result list of a previous
run of Gecko2.

After all parameters are set, the computation can be started by clicking
the 'OK' button.
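
The interplay of minimum cluster size, distance threshold, and quorum
in reference mode can be illustrated with a small sketch. It is not the
Gecko2 search algorithm. It assumes that the set distance is the number
of gene families contained in exactly one of the two sets (the size of
the symmetric difference), and that the quorum counts every genome with
a sufficiently close occurrence, including the reference genome itself.
All function and variable names are hypothetical.

```python
def set_distance(a, b):
    """Number of gene families present in exactly one of the two sets."""
    return len(set(a) ^ set(b))


def passes_reference_filter(reference, occurrences, min_size,
                            max_distance, quorum):
    """Check a reference gene set against one candidate per genome.

    occurrences maps a genome name to the gene families found in the
    candidate interval of that genome.
    """
    if len(set(reference)) < min_size:
        return False
    supporting = [
        genome for genome, genes in occurrences.items()
        if set_distance(reference, genes) <= max_distance
    ]
    return len(supporting) >= quorum


reference = [1, 2, 3, 4]
occurrences = {
    "A": [1, 2, 3, 4],     # identical to the reference, distance 0
    "B": [1, 2, 4],        # family 3 is missing, distance 1
    "C": [1, 2, 7, 8, 9],  # 3 and 4 missing, 7, 8, 9 extra, distance 5
}

strict = passes_reference_filter(reference, occurrences,
                                 min_size=3, max_distance=2, quorum=3)
relaxed = passes_reference_filter(reference, occurrences,
                                  min_size=3, max_distance=2, quorum=2)
print(strict, relaxed)
```

With the quorum equal to the number of genomes, the candidate is
rejected because genome C is too far from the reference. Lowering the
quorum to 2 accepts it on the basis of genomes A and B.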

During computation, a progress bar shows the status of the computation.
If the computation takes too long, it can be terminated by clicking on
the 'Stop' button. For large distance thresholds, the reference-based
modes have much lower runtimes than the median and center modes.


4.3 Graphical Evaluation

After completion of computations, results are shown in tabular form
below the genome browser. The table contains the list of all predicted
gene clusters, listing the number of genes, the number of included
genomes, the score of the best occurrence combination (negative
logarithm of p-value) and a list with the IDs of all involved genomes.
By default, the gene cluster list is sorted by decreasing score. A gene
cluster can be selected with a double-click on its entry. Its best
occurrence is then visualized in the genome browser, and details about
the cluster are displayed in an information area. Additional
(and, if enabled by the user, also suboptimal) occurrences can be
selected from a separate list for median and center gene clusters, or be
selected with navigation buttons next to the genomes for reference gene
clusters.
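
The default ordering of the result table can be reproduced as sketched
below. The manual defines the score as the negative logarithm of the
p-value but does not state the base of the logarithm, so the natural
logarithm used here is an assumption. The cluster records and their
p-values are made-up illustration values, not Gecko2 output.

```python
import math


def score(p_value):
    """Negative logarithm of the p-value (natural log assumed)."""
    return -math.log(p_value)


# Hypothetical clusters: id, number of genes, and p-value.
clusters = [
    {"id": 1, "genes": 5, "p_value": 1e-3},
    {"id": 2, "genes": 8, "p_value": 1e-12},
    {"id": 3, "genes": 6, "p_value": 1e-7},
]

# Smaller p-values give larger scores, so the most significant
# cluster comes first when sorting by decreasing score.
ranked = sorted(clusters, key=lambda c: score(c["p_value"]), reverse=True)
print([c["id"] for c in ranked])
```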

The visualization of a selected gene cluster has been optimized for easy
inspection of the cluster. The genome browser shows the neighborhood of
the cluster on each genome, mouseover tooltips provide the annotation
data available for genes or chromosomes, and the information area allows
for a more detailed inspection of the search result. It is possible to
filter for clusters containing individual genes or functional gene
annotations by typing the respective information into the 'Search' field
above the genome browser.

The results of a Gecko2 session can be stored in a file with the ending
'.gck'. It may also be useful to copy individual clusters for re-use as
'manual cluster' input in an additional run of Gecko2 in reference
mode.