-----------------------   Gecko   --------------------------------
-    ***  GEne Cluster detection in proKaryotic genOmes ***      -
- Thomas Schmidt                                                 -
- (tschmidt@cebitec.uni-bielefeld.de) International NRW Graduate -
- School in Bioinformatics and Genome Research Center of         -
- Biotechnology (CeBiTec) Bielefeld University Germany           -
------------------------------------------------------------------

USER MANUAL
-----------

1. Installation

For instructions concerning installation and system requirements,
please consult the Readme.txt provided with the program files.

##################################################################

2. Getting started ...

To get a first impression about the functionality of Gecko and its
data preparation tool GhostFam, it is highly recommended to
download the "sample_data.zip" and decompress it into a
subdirectory, e.g. "../Gecko/data/".

The following manual is organized in a way, such that a complete
application run containing all important aspects of Gecko and
GhostFam is described. Therefore, those readers who are only
interested in Gecko and not in GhostFam, may skip Section 3 and
continue with the description of Gecko in Section 4.

##################################################################

3. GhostFam

The gene cluster detection with Gecko requires the representation
of genomes by sequences of numbers, where each number corresponds
to a certain family of homologous genes. The occurrence of a
number at a certain position in the sequence indicates that the
gene at that position belongs to the family identified by that
number.

3.1 Input data files

To successfully run a GhostFam session, three different types of
input files are required. The first input file is a text file
ending with the extension ".list". It must contain a list of the
genomes which have to be processed. Example:

B.longum    Biflo   bl  no_path
B.subtilis  Bacsu   bs  no_path
C.diphtheriae   Cordi   cd  no_path
E.coli K12  Ecoli   ec  no_path

Each line of the text file is reserved for one genome (make sure
to end the last line with a carriage return). A line is organized
as follows: Genome name <TAB> LongTag <TAB> ShortTag <TAB>
information about the source (ignored by GhostFam so far)
<NEWLINE>. Where LongTag and ShortTag are abbreviations for the
genome name used to identify the genome in different contexts.

For each genome an additional annotation file is required. By
default, the name of the file starts with the LongTag of the
genome followed by "_annotation.txt". Example:
"Biflo_annotation.txt". This text file contains information about
essential properties of the genes in the order of their occurrence
in the genome. Example:

Biflo_00001 bl0001  +   80  49..288 AAN23868.1      BL0001  cspA
cold shock protein  CspA; COG family: cold shock proteins;
PFAM_ID: CSD

Biflo_00002 bl0002  +   542 526..2151   AAN23869.1 BL0002  groEL
chaperone   GroEL; COG family: chaperonin GroEL (Hsp60
family);PFAM_ID: cpn60_TCP1

Biflo_00003 bl0003  -   97 complement(2248..2538)  AAN23870.1
BL0003  BL0003 hypothetical protein    COG family:
methyl-accepting chemotaxis protein

The organization is as follows: LongGeneID <TAB> ShortGeneID <TAB>
Strand (+ or -) <TAB> Length(AA) <TAB> Location on the genome
<TAB> TrEMBL-ID <TAB> Locus-Tag <TAB> Gene Name <TAB> function
<TAB> additional annotation <NEWLINE>.

Finally, the third type of input file is for each genome a list of
all TBLASTN hits of all its genes against the genes from all other
genomes in the genome list file (see above). The default naming of
this file is the LongTag plus the string "_vs_All.blast2p".
Example from "Biflo_vs_All.blast2p":

bl0001  pb0464  35.38   65  36  2   1   65  1   59  0.004   36.96
bl0001  cd0733  32.47   77  46  2   1   77  1   71  0.006   36.58
bl0002  bl0002  100.00  541 0   0   1   541 1   541 0.0   1027.7
bl0002  mt0443  71.85   540 148 1   1   540 1   536 0.0    743.8
bl0002  ma3936  72.66   534 141 2   1   534 1   529 0.0    740.0

The general form of a line is: ShortGeneID (1.Gene) <TAB>
ShortGeneID (2. Gene) <TAB> Matching AA (in %) <TAB>
Length(alignment) <TAB> no of non-matching AA <TAB> Gap openings
<TAB> Start(alignment in 1. Gene) <TAB> End(a.i.1.G.) <TAB>
Start(alignment in 2. Gene) <TAB> End(a.i.2.G.) <TAB> E-Value
<TAB> Score <NEWLINE>.


3.2 Starting a new GhostFam Session

After starting GhostFam a new session can be created from the menu
'File'->'New Session'. In the following file dialog a file
containing the list of genomes (see 3.1) has to be selected. After
a successful parsing the found genomes are listed in the table.
The next dialog allows the setting of some parameters for the
clustering algorithms as well as settings regarding the file and
directory options. In the this dialog the default options for the
filenames and extensions can be modified. In the general case none
of the options needs a modification upon first trial. Before
selecting 'OK', confirm that either the genome list file is
located in the same directory as the blast- and annotation files
or the directory those data is correctly denoted in 'Directory to
data files'. In the following, all data files are processed and
the families of paralogs are grouped together according to the
chosen parameters. The results are presented in the main window of
GhostFam. For further details on the grouping of genes to families
see Section 3.4.


3.3 The Main Window

The main window of GhostFam contains four different elements. The
table in the upper left region contains the list of all created
families. The families have a unique identifier (FamilyID) and are
ordered according to the number of genes (#CDS) they contain. The
third column of the table contains the LongGeneID of each gene in
the cluster. From this table one can select a gene family for a
more detailed inspection. This gene family is then shown
graphically on the right side next to the table. In the graphic,
each gene of a family is placed on a virtual circle and connected
by a black line to all other genes from the family if they have a
common significant blast hit. In the graphic, it is possible to
select a single gene by clicking on it. The selected gene is then
displayed in green color and all blast hits from this gene to
other genes are highlighted as follows: orange (highly significant
hit), yellow (significant hit), blue (less significant hit). The
classification of the hits is according to the settings of the
parameters (see 3.5). Under the menu 'Edit'->'Change Color
Display' there is also the option to disable the colors if
desired. When a gene is selected in the graphic, it is also
highlighted in the lower right table that contains the list of all
gene from a family together with further details and its
functional annotation. Also it is possible to select a gene from
this list which is then highlighted in the graphic. The lower left
table contains for a selected gene all its blast hits to other
genes inside and outside the family. By selecting a hit from this
list, the corresponding line in the graphic is highlighted in
green color. If the blast hit points to a gene that is not in the
actual family the last column shows in which family the gene can
be found. By clicking on the familyID in this column, the
corresponding family is selected in the upper left table and drawn
in the graphical representation. The first column in this table
contains characters allowing an easy overview of the quality of a
blast hit. The characters have the following meaning: 'H' ->
highly significant hit, 'S' -> significant hit, -> 'L' low
significant hit, '!!' -> highly significant hit to a gene not in
the current family, '!' -> significant hit to a gene not in the
current family. If the graphical representation is of special
interest for documentation, it can be exported under
'Edit'->'Export Graphic' in the portable network graphic format
(PNG) or as JPEG bitmap file (JPG).

The search field in the tool bar allows the efficient search for
genes by their LongGeneID (case sensetive) or annotation. Each
found gene is displayed with its corresponding family. In
addition, there is also the option to search directly for a
distinct family by entering the familyID (e.g. '234') into the
search field.


3.4 Building the Families

The creation of gene families from the BLAST data is divided into
consecutive steps. The first two steps (the parsing of the data
and the creation of families of paralogs) is done immediately
after a new session is created. However, if the data files have
been updated or the creation of the paralog families has not
provided the expected results, these steps can be repeated using
the file menu entry 'Algorithms'->'Restart Parser' resp.
'Algorithms'->'Regroup Paralogs'. Before the paralog families are
regrouped it might be necessary to change the thresholds for the
identification of paralogs, this can be done using the dialog from
'Edit'->'Preferences' (see also 3.5). A further option to modify
the family classification is the use of the manual family editor
under 'Edit'->'Manual Editor' (see 3.6). The editor allows to
perform individual modifications of the gene families to optimize
the automatic classification. After the families of paralogs are
grouped to cluster of sufficient quality, the third step merging
corresponding families of paralogs to one family of homologs can
be performed under 'Algorithms'->'Create Homologous Groups'.
Afterwards, again using the manual editor the grouping can be
modified by hand.

If the grouping of the genes to the families is completed, the
export option under 'File'->'Export' prepares the data file which
can be imported as clustered sequence data file (.csd) to Gecko.


3.5 Settings and Impact of Parameters

An important part for the creation of the gene families is the
identification of orthologous and paralogous genes. Therefore,
from the dialog 'Edit'->'Preferences' it is possible to change the
default thresholds for the classifications of orthologs and
paralogs. To establish the groups of paralogs, three values from a
blast hit of two genes are concerned, i.e. the E-Value (EV), the
percent rate of identical amino acids in their alignment (PI), and
the coverage (CO) of the blast hit. The coverage of a hit is the
fraction the length of the alignment divided by the length (number
of amino acids) of the longer gene.

In GhostFam, two genes A and B called paralogous, if the coverage
of the hit from A and B exceeds the threshold CO and either the
rate of matching amino acids of A and B exceeds PI or their
E-Value is below EV. The families of paralogs are grouped such
that if two genes A and B are paralogs it is ensured that they
share the same family.

In principle the characterization of the orthologous genes is
similar to the definition of paralogs. Therefore, a further set of
threshold values is provided together with a new parameter called
overlap ratio (OR) of the grouping. When starting the grouping
process that creates the families of homologs, two families are
said to form a pair of orthologous gene families, if the number of
genes in the smaller family having an ortholog in the larger
family, divided by the total number of genes in the smaller
family, is larger that OR. This condition is tested for each pair
(F,F') of families, where F is fixed to the largest occurring
family, and F' iterates over all families of paralogs that have
been detected. Afterwards, those pairs fulfilling the test
condition are grouped together into one family of homologs and the
next grouping step continues without those families. If finally
all families are grouped together it is guaranteed, that there is
no ungrouped pair of gene families left that share more than (OR
* size of the smaller family) orthologous genes.


3.6 Manual Editor

Since the grouping of genes to their families of paralogs and
homologs is designed to be fully automated and comparably fast, it
is not feasible to test neither for each gene nor for each family
different alternative scenarios, which might result in a more
plausible grouping from the biological point of view. Instead,
GhostFam provides an manual editor that allows to perform
corrections in the assignment of the genes to their families by
hand. At this point it should be stated that if the results of a
grouping are not looking sensible at all, the first choice should
always be a re-grouping with a different set of parameters, since
the proper correction of gene families by hand needs some
experience and even more time.

The editor can be started from 'Edit'->'Manual Editor' and the
appearance of the window looks like a duplicated main window.
Indeed, all functions concerning the graphics and tables from the
main window are also available in the editor. In principle, the
editor allows to perform three different operations (with the five
buttons in the center of the window).

The first operation is the cutting of a family into two families.
This operation is required, if the similarity of a few genes from
two different groups lead to their grouping into one family (e.g.
in the family with ID 30 in the sample.sdf clearly shows such
characteristic). The split of a family is performed by the
following steps: 1) Selecting the family from the upper list on
one side of the editor, 2) marking of genes to be split from the
family (by double click on the gene in the graphic or marking the
checkbox in the annotation table), 3) clicking on the button in
the center to create a new family from the selected genes, 4)
checking the split (the newly family created from the selected
genes is shown opposite to the original family).

The second operation is the movement of a set of marked gene from
one family to another. The movement done as follows: 1) Selecting
on one side of the editor the source family and on the other side
the family which receives the genes. 2) marking the genes to be
transferred in the source family. 3) clicking on the button in the
center to move the genes from the source to the target family 4)
checking the movement (the updated families are immediately
shown).

The third editor option is the merging of two families. This is
the simplest step and can be easily performed by choosing on the
both sides the families to be merged and then clicking on the
merge button in the center of the window.

Finally, to exit the manual editor choose 'ok' from the buttons in
the center.

##################################################################


4. Gecko

Given a set of genomes in which each gene is assigned to a family
of homologous genes, Gecko detects sets of genes that appear in a
conserved neighborhood among the genomes. After the detection step
different preprocessing phases allow also the reconstruction of
only partially conserved gene clusters. A typical Gecko session is
divided into five parts (Data Preparation, Cluster Detection,
Graphical Evaluation, Reconstruction, Final Evaluation). These
parts are described in the following in more detail.


4.1 Data Preparation

For Gecko, the basic requirement is that the genomes are given as
sequences of stings where each character (here each number)
represents a certain family containing at least one gene. All
genes in a family should be homologs performing the same (or very
similar) function. Therefore, Gecko supports two different types
of input data. The first type is based on the classification of
genes in the COG database, the second source of data is the
exported family classification from GhostFam (see 3.4).

Input files based on COG data base should have the file extension
'.cog' and have to be organized as follows:

GenomeName <COMMA> Descriptive Text (ignored) <NEWLINE>
Descriptive Text (ignored) <NEWLINE> <Genome Content> <NEWLINE>

Where in <Genome Content> each line contains information about the
family and function of single genes in the order of their
occurrence in the genome:

COG no. <TAB> Strand (+ or -) <TAB> COG functional category <TAB>
Gene Name <TAB> functional annotation <NEWLINE>

Example from the file COG_data.cog (see supplementary data in the
manual section of Gecko on the BiBiServ):

Aquifex aeolicus, complete genome - 0..1551335
1529 proteins
0480    +   J   fusA    elongation factor EF-G
0050    +   J   tufA1   elongation factor EF-Tu
0051    +   J   rpsJ    ribosomal protein S10
            ...
0459    +   O   mopA    GroEL
0000    -   -   ----    putative protein
0612    -   R   ymxG    processing protease

Clostridium acetobutylicum ATCC824, complete genome - 0..3940880
3672 proteins
0593    +   L   dnaA    DNA replication initiator protein, ATPase
0592    +   L   dnaN    DNA polymerase III beta subunit
2501    +   S   ----    Small conserved protein, ortholog of YAAA B.subtilis

This file can be simply created by downloading each single genome
from the NCBI database
(http://www.ncbi.nlm.nih.gov/genomes/static/eub_g.html) and
cutting out additional columns. Then all genomes have to be merged
into one file.

The creation of a GhostFam data file (clustered sequence data file
'.csd') is described in detail in 3.4. The file is structured like
the COG data file, but the field <Genome Content> is defined as
follows:

FamilyID (number only) <TAB> Strand (+ or -) <TAB> COG functional
category (usually unknown) <TAB> LongGeneID <TAB> functional
annotation <TAB> Gene Name <TAB> TrEMBL ID <NEWLINE>

Example from 'sample.csd':

B.longum
*
62  +   ?   Biflo_0001  cold shock protein cspA AAN23868.1
336 +   ?   Biflo_0002  chaperone   groEL AAN23869.1
0   -   ?   Biflo_0003  hypothetical protein    BL0003 AAN23870.1
0   +   ?   Biflo_0004  hypothetical protein    BL0004 AAN23871.1
             ...
580 -   ?   Biflo_1726  uracil-DNA glycosylase  ung AAN25596.1
0   +   ?   Biflo_1727  hypothetical protein    BL1813  AAN25594.1

B.subtilis
*
503 +   ?   Bacsu_0001      dnaA    CAB11777.1
504 +   ?   Bacsu_0002  DNA polymerase III (beta subunit)   dnaN    CAB11778.1


After selecting the input file and type under 'Sequences'->'Open
Data File', the file is parsed and all found genomes are listed in
the table. By marking the genomes in the table, one can choose
which genomes from the data file should be part of the search for
conserved neighborhoods.

In the menu 'Sequences', there is the option to view the input
data file ('Show Data') or to view some statistical information
about the genomes ('Show Info'). Under 'Sequences'->'Select' there
is also the option to change the selection of the input sequences.


4.2 Cluster Detection

After finishing the data preparation, the cluster detection can be
started from the menu 'Algorithm'->'Run Algorithm'. In the
following dialog the user has to set the parameters for the
minimal size of a cluster to be detected, and k' the minimal
number of genome sequences a cluster has to be found in. There are
two further options available in the 'Algorithm' menu. The option
'Optimize sequence order' allows (if enabled) an optimization of
the genome sequence order such that the genome with the lowest
number of genes is processed first and therefore listed on the top
of the graphical output. Disabling this option preserves the order
of the genomes from the input data file in the output, but might
result in a significantly longer computation time. The second
option 'Remove 0 from sequences' deletes (if enabled) all genes
from the genome that are not member of a COG cluster or a gene
family of at least two genes. The enabling of this option might
result in a faster computation time, depending on the number of
such un-grouped genes. On the other hand, the deletion of genes
from the genome immediately causes the creation of artificial
neighborhoods, e.g. if one deletes from the sequence of
(1,2,3,0,5) the 0, automatically 3 and 5 become neighbors.

IMPORTANT: If the cluster detection terminates with an 'out of
memory error', restart Gecko with more memory assigned to the JVM
(see Readme.txt for details).


4.3 Graphical Evaluation

The results of the cluster detection algorithm are presented in
the table in the upper left region of the main window. The
clusters are sorted by the number of different genes they contain.
If two clusters contain the same number of genes, the cluster that
is found in less genomes is listed first. The first column in the
table contains for each gene a unique cluster identifier. The
following columns show the number of different genes in the
cluster and the number of sequences in which the cluster occurs.
The fourth column '#subcluster' indicates whether the cluster is
basic (0 subclusters) or is the result of a merge of many clusters
(>= 2 subclusters).

The typical situation in which two clusters are merged to one is
the following: A cluster contains the genes 1,2,3 and is located
in three genomes. A second cluster containing the genes 1,2,3,4 is
only found in two of the three genomes. If no merging is
performed, theses two clusters appear at separate positions in the
table, although it is obvious the cluster containing the genes
1,2,3 is a part of the other cluster. Therefore, the two clusters
are merged to one, and the original clusters are stored as
subclusters of the merged one. These subclusters are listed in the
table in the lower left.

The columns five to seven indicate if the gene clusters shows a
characteristic pattern containing additional information. These
pattern are only of interest in the final evaluation and described
in Section 4.5. The last column contains the list of all gene
families in the gene cluster.

From both tables one can select a single cluster for a more
detailed graphical representation right next to the table. In the
graphic, each genome that (partially) contains the selected gene
cluster appears as a sequence of arrows and numbers. The arrows
represent the genes whose neighborhoods are conserved. The
familyID or COG number of a gene is shown in the center of the
arrow. The blue italic number between the arrows indicate how many
genes separate the conserved regions. By moving the mouse over an
arrow, further information about the single gene (Position in the
genome, LongGeneID, annotation) appears as a tool-tip. The
direction in which an arrow is pointing indicates on which strand
the gene is located. By clicking on a blue italic number, all
genes from the interleaving region are display in a new window
together with all available information about their annotation and
function. This extended view is also available for a region of
connected arrows. By clicking with the left mouse button on one of
the connected genes from a genome, all the genes from this region
are displayed with the available information about their names,
positions, IDs, and functional annotations.

A further extremely helpful option comparing the organization of
gene clusters between two or more genomes is the centering tool.
It is activated by a click with the right mouse button on the gene
along which the other regions should be aligned. The alignment
shifts the graphic for each genome, such that the first occurrence
of the chosen gene in each genomes appears in one column. If the
first occurrence in a genome appears on the opposite strand, the
whole graphical representation for a genome is flipped. To
indicate such a flip, the genome is now drawn on a yellow
background.

A further useful tool for the localization of single genes in
different clusters is the filter field located in the toolbar of
Gecko. In this field one can enter one or more numbers (separated
by a comma) which are used as a filter for the list of all
clusters. When this filter is applied, the left table shows only
those gene clusters containing at least one gene from the filter
field. To show all clusters again, one can use the trash box
symbol (rightmost icon in the toolbar) or re-run the filtering
with an empty filter field.


4.4 Reconstruction

An additional preprocessing step allows a further optimization of
the clustering. Consider the following example: Given the two gene
clusters A with the genes 1,2,3,4 and B with the genes 1,2,4,5. In
many cases such a situation hints to false family classification,
i.e. that the genes in family 3 and 5 are homologs. To easily
overcome the problem that A and B are not reported together, Gecko
provides the option to run a further preprocessing to combine gene
clusters that show a user defined rate of identical genes. In the
example a rate of less than 0.75 will result in a combination of
the clusters A and B. The postprocessing can be started from
'Algorithm'->'Run Postprocessing'. After setting the identity
rate, the postprocessing starts and when finished, the cluster
list in the upper left table is updated as well as the check-boxes
for the characteristic patterns.


4.5 Final Evaluation

For the final evaluation, all options from the Graphical
Evaluation (see 4.3) are available, and also additional features
of gene clusters are highlighted using the columns P1, P2, and P3
in the list of all clusters.

P1: The 'Gene Replacement Pattern' indicates that a gene cluster
has a set of subclusters which contain the same number of genes,
but exactly one gene from each subcluster does not appear in the
intersection of the gene sets. Example: A gene cluster contains
the genes 1,2,3,4,5 and the two subclusters with the genes 1,2,3,4
and 1,2,3,5. Here, it is most likely, that the gene '4' from one
cluster is replaced by gene '5' in the other cluster.

P2: The 'Cluster Separation Pattern' is highlighted for a gene
cluster if in at least on genome, the set of neighboring genes is
divided into two or more regions which together result in the
complete cluster. Example: Given a gene cluster C of the genes
1,2,3,4,5 with subclusters X containing 1,2,3, Y containing 3,4,
and Z containing 4,5. Here the complete cluster is separated in
some genomes in the parts X,Y, and Z.

P3: The 'Duplication Pattern' simply indicates that the gene
cluster contains at least one paralogous gene or one region of
internal duplication. Given the genome G1=(1,2,3,...,1,2,3) and
G2=(1,2,2,3). The resulting gene cluster contains a paralogous
gene (the '2' in G2) and a region of internal duplication (the
region 1,2,3 appear twice in G1).

Another feature, also available without the preprocessing, is the
export of a graphical representation into a file in the portable
network graphic format (PNG) or as JPEG bitmap file (JPG). This
can be done by the menu 'Options'->'Export MainCluster View' for
the complete cluster or by 'Options'->'Export SubCluster View' for
a selected subcluster.