----------------------- Gecko2 --------------------------------------------
- *** GEne Cluster detection in proKaryotic and genOmes ***               -
-                                                                         -
- Katharina Jahn (kjahn@cebitec.uni-bielefeld.de), Leon Kuchenbecker      -
- AG Genome Informatics, Technical Faculty, Bielefeld University, Germany -
- This software is an extension of Gecko (developed by Thomas Schmidt)    -
---------------------------------------------------------------------------

USER MANUAL
-----------

1. Installation

For instructions concerning installation and system requirements, please
consult the Readme.txt provided with the program files.

##################################################################

2. Getting started ...

To get a first impression of the functionality of Gecko2, it is
recommended to download a sample genome file, e.g.
'ghostFamOutStandard.cog'.

##################################################################

3. General Information

Gecko2 is a further development of the original Gecko software. It allows
for the systematic detection of gene cluster conservation in a large
number of genomes. The enhancement over the original Gecko tool is the use
of set-distance based gene cluster models, which allow the detection of
gene clusters with diverse conservation patterns, including gaps and
missing genes in cluster occurrences. These models are also tolerant of
errors in gene homology assignment. Gecko2 provides a ranking of the
predictions based on statistical significance and an interactive
visualization that supports the functional evaluation of the clustered
genes.

The original version of Gecko is still available on the download page.
GhostFam, the data preparation tool that was integrated into Gecko, is
now available as a stand-alone tool.

4. Gecko2

Given a set of genomes in which each gene is assigned to a family of
homologous genes, Gecko2 detects sets of genes that appear in an
approximately conserved neighborhood among the genomes.
A typical Gecko session is divided into three parts: genome selection,
cluster detection, and manual evaluation of predictions. These parts are
described in more detail below.

4.1 Data Preparation

The basic requirement of Gecko2 is that the genomes are given as
sequences of strings where each character (here each number) represents a
certain family containing at least one gene. All genes in a family should
be homologs performing the same (or a very similar) function.

We recommend two different types of input data. The first type is based
on the classification of genes in the COG database; the second is the
exported family classification from GhostFam (see the description of the
original version of Gecko for details).

Input files should have the file extension '.cog' and have to be
organized as follows:

    GenomeName Descriptive Text
    Descriptive Text (ignored)
    (gene lines)

where each gene line contains information about the family and function
of a single gene, in the order of the genes' occurrence in the genome:

    COG no.  Strand (+ or -)  COG functional category  Gene Name  functional annotation

Example from the file COG_data.cog (see supplementary data in the manual
section of Gecko on the BiBiServ):

    Aquifex aeolicus, complete genome - 0..1551335
    1529 proteins
    0480  +  J  fusA   elongation factor EF-G
    0050  +  J  tufA1  elongation factor EF-Tu
    0051  +  J  rpsJ   ribosomal protein S10
    ...
    0459  +  O  mopA   GroEL
    0000  -  -  ----   putative protein
    0612  -  R  ymxG   processing protease

    Clostridium acetobutylicum ATCC824, complete genome - 0..3940880
    3672 proteins
    0593  +  L  dnaA   DNA replication initiator protein, ATPase
    0592  +  L  dnaN   DNA polymerase III beta subunit
    2501  +  S  ----   Small conserved protein, ortholog of YAAA B.subtilis

This file can be created by downloading each genome from the NCBI
database (http://www.ncbi.nlm.nih.gov/genomes/static/eub_g.html),
rearranging the columns, and cutting out the additional columns. Then all
genomes have to be merged into one file.
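The layout described above can be read with a few lines of code. The
following is a minimal sketch, not the parser used by Gecko2. It assumes
that fields are separated by whitespace (tabs or spaces), that genomes
are separated by blank lines, and that each genome starts with two header
lines of which the second is ignored. The names Gene, Genome and
parse_cog are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gene:
    family: str      # COG number, kept as text so leading zeros survive
    strand: str      # "+" or "-"
    category: str    # COG functional category
    name: str        # gene name ("----" if unnamed)
    annotation: str  # free-text functional annotation (may be empty)


@dataclass
class Genome:
    header: str      # first header line (genome name and description)
    genes: List[Gene]


def parse_cog(text: str) -> List[Genome]:
    """Parse '.cog'-style text. Assumption: blank lines separate genomes."""
    genomes: List[Genome] = []
    block: List[str] = []
    # The appended empty string flushes the last block.
    for line in text.splitlines() + [""]:
        if line.strip():
            block.append(line.rstrip())
            continue
        if len(block) >= 2:
            genes = []
            for gene_line in block[2:]:  # skip the two header lines
                fields = gene_line.split(None, 4)
                if len(fields) < 4:
                    continue  # skip malformed lines rather than guess
                fields += [""] * (5 - len(fields))
                genes.append(Gene(*fields))
            genomes.append(Genome(header=block[0], genes=genes))
        block = []
    return genomes


SAMPLE = """\
Aquifex aeolicus, complete genome - 0..1551335
1529 proteins
0480 + J fusA elongation factor EF-G
0050 + J tufA1 elongation factor EF-Tu

Clostridium acetobutylicum ATCC824, complete genome - 0..3940880
3672 proteins
0593 + L dnaA DNA replication initiator protein, ATPase
"""

genomes = parse_cog(SAMPLE)
for genome in genomes:
    print(genome.header, "->", [gene.family for gene in genome.genes])
```

Family numbers are kept as strings so that identifiers such as 0050 are
not changed by integer conversion. If the real files use a different
separator, only the split call would need to change.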
The GhostFam data file is structured like the COG data file, but the
fields are defined as follows:

    FamilyID (number only)  Strand (+ or -)  COG functional category (usually unknown)  LongGeneID  functional annotation  Gene Name  TrEMBL ID

Example from 'sample.csd':

    B.longum
    *
    62   +  ?  Biflo_0001  cold shock protein  cspA  AAN23868.1
    336  +  ?  Biflo_0002  chaperone  groEL  AAN23869.1
    0    -  ?  Biflo_0003  hypothetical protein  BL0003  AAN23870.1
    0    +  ?  Biflo_0004  hypothetical protein  BL0004  AAN23871.1
    ...
    580  -  ?  Biflo_1726  uracil-DNA glycosylase  ung  AAN25596.1
    0    +  ?  Biflo_1727  hypothetical protein  BL1813  AAN25594.1

    B.subtilis
    *
    503  +  ?  Bacsu_0001  dnaA  CAB11777.1
    504  +  ?  Bacsu_0002  DNA polymerase III (beta subunit)  dnaN  CAB11778.1

After an input file has been selected via 'File'->'Open session or genome
file', Gecko determines automatically from the file ending whether it is
loading a genome file (.cog) or a stored session (.gck). If a genome file
is selected, it is parsed and all chromosomes found are listed in a
table. By ticking the check boxes next to the chromosomes in the table,
one can choose the chromosomes that should be part of the search for
approximate gene clusters. Different chromosomes of one genome can be
marked and grouped by clicking on the 'Group' button. Gecko2 suggests a
grouping of chromosomes based on chromosome names. This can be reverted
by marking the grouped chromosomes and clicking on the 'Ungroup' button.
Genome selection is finished by clicking on the 'OK' button. The genomes
are then visualized in a genome browser, which allows inspection of the
genomes, the contained genes, and the gene annotations.

4.2 Cluster Detection

After clicking the 'start computation' button, the user is asked to
select a search mode as well as global and model-dependent parameters
before the actual search begins. The first step is to choose between the
median, center, and reference gene cluster model. For all models, the
minimum cluster size, the maximum distance and, optionally, a quorum
parameter can be set.
The minimum cluster size defines the minimum number of genes that a gene
cluster must contain to be reported by the program.

The distance threshold gives an upper bound on the number of differences
that are allowed between a gene set and its approximate occurrences. In
median mode, the value gives the maximum set distance that is allowed
between the median and the approximate occurrences. In center and
reference mode, the value determines the maximum pairwise distance
between the center (or the reference set, respectively) and each
approximate occurrence.

The third parameter, the quorum, determines the minimum number of genomes
in which a gene cluster must have an approximate occurrence in order to
be reported. By default, this value is set to the number of selected
genomes, so that only gene clusters with an approximate occurrence in all
genomes are reported.

If the reference mode is selected, one can choose between three
sub-modes. In the 'all against all' mode, gene clusters are predicted
using all input genomes, one after the other, as the reference genome. In
the 'fixed genome' mode, only one genome is used as reference. It can be
chosen from a drop-down list containing the previously selected genomes.
In the 'manual cluster' mode, a sequence of genes can be typed in
manually, or pasted, e.g. after copying a cluster from the result list of
a previous run of Gecko2.

After all parameters are set, the computation can be started by clicking
the 'OK' button. During the computation, a progress bar shows its status.
If the computation takes too long, it can be terminated by clicking on
the 'Stop' button. For large distance thresholds, the reference-based
modes have much lower runtimes than the median and center modes.

4.3 Graphical Evaluation

After completion of the computation, results are shown in tabular form
below the genome browser.
The table contains the list of all predicted gene clusters. For each
cluster it lists the number of genes, the number of included genomes, the
score of the best occurrence combination (negative logarithm of the
p-value) and the IDs of all involved genomes. By default, the gene
cluster list is sorted by decreasing score.

A gene cluster can be selected with a double-click on its entry. Its best
occurrence is then visualized by the genome browser, and details about
the cluster are displayed in an information area. Additional (and, if
enabled by the user, also suboptimal) occurrences can be selected from a
separate list for median and center gene clusters, or with the navigation
buttons next to the genomes for reference gene clusters.

The visualization of a selected gene cluster has been optimized for easy
inspection of the cluster: the genome browser shows the neighborhood on
each genome, mouseover tooltips provide the annotation data available for
genes or chromosomes, and the information area allows a more detailed
inspection of the search result. It is possible to filter for clusters
containing individual genes or functional gene annotations by typing the
respective information into the 'Search' field above the genome browser.

The results of a Gecko2 session can be stored in a file with the ending
'.gck'. It may also be useful to copy individual clusters for re-use as
'fixed cluster' input in an additional run of Gecko2 under reference
mode.
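The default ordering of the result table (score as the negative logarithm
of the p-value, sorted by decreasing score) can be illustrated with a
small sketch. The manual does not state the base of the logarithm; base
10 is assumed here. The Prediction class, the rank function and the
p-values are hypothetical and serve only as an illustration. They are not
taken from Gecko2 or from any real run.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    genes: int       # number of genes in the cluster
    genomes: int     # number of genomes with an occurrence
    p_value: float   # p-value of the best occurrence combination

    @property
    def score(self) -> float:
        # Assumption: base-10 logarithm; a p-value of 0 would raise an error.
        return -math.log10(self.p_value)


def rank(predictions: List[Prediction]) -> List[Prediction]:
    """Return the predictions sorted by decreasing score."""
    return sorted(predictions, key=lambda p: p.score, reverse=True)


# Made-up values for illustration only.
predictions = [
    Prediction(genes=5, genomes=4, p_value=1e-3),
    Prediction(genes=9, genomes=6, p_value=1e-12),
    Prediction(genes=7, genomes=5, p_value=1e-7),
]

ranked = rank(predictions)
for p in ranked:
    print(p.genes, p.genomes, round(p.score, 2))
```

Because the score is a negative logarithm, a smaller p-value yields a
larger score, so the most significant predictions appear at the top of
the list.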