----------------------- Gecko -------------------------------- - *** GEne Cluster detection in proKaryotic genOmes *** - - Thomas Schmidt - - (tschmidt@cebitec.uni-bielefeld.de) International NRW Graduate - - School in Bioinformatics and Genome Research Center of - - Biotechnology (CeBiTec) Bielefeld University Germany - ------------------------------------------------------------------ USER MANUAL ----------- 1. Installation For instructions concerning installation and system requirements, please consult the Readme.txt provided with the program files. ################################################################## 2. Getting started ... To get a first impression about the functionality of Gecko and its data preparation tool GhostFam, it is highly recommended to download the "sample_data.zip" and decompress it into a subdirectory, e.g. "../Gecko/data/". The following manual is organized in a way, such that a complete application run containing all important aspects of Gecko and GhostFam is described. Therefore, those readers who are only interested in Gecko and not in GhostFam, may skip Section 3 and continue with the description of Gecko in Section 4. ################################################################## 3. GhostFam The gene cluster detection with Gecko requires the representation of genomes by sequences of numbers, where each number corresponds to a certain family of homologous genes. The occurrence of a number at a certain position in the sequence indicates that the gene at that position belongs to the family identified by that number. 3.1 Input data files To successfully run a GhostFam session, three different types of input files are required. The first input file is a text file ending with the extension ".list". It must contain a list of the genomes which have to be processed. Example: B.longum Biflo bl no_path B.subtilis Bacsu bs no_path C.diphtheriae Cordi cd no_path E.coli K12 Ecoli ec no_path Each line of the text file is reserved for one genome (make sure to end the last line with a carriage return). A line is organized as follows: Genome name LongTag ShortTag information about the source (ignored by GhostFam so far) . Where LongTag and ShortTag are abbreviations for the genome name used to identify the genome in different contexts. For each genome an additional annotation file is required. By default, the name of the file starts with the LongTag of the genome followed by "_annotation.txt". Example: "Biflo_annotation.txt". This text file contains information about essential properties of the genes in the order of their occurrence in the genome. Example: Biflo_00001 bl0001 + 80 49..288 AAN23868.1 BL0001 cspA cold shock protein CspA; COG family: cold shock proteins; PFAM_ID: CSD Biflo_00002 bl0002 + 542 526..2151 AAN23869.1 BL0002 groEL chaperone GroEL; COG family: chaperonin GroEL (Hsp60 family);PFAM_ID: cpn60_TCP1 Biflo_00003 bl0003 - 97 complement(2248..2538) AAN23870.1 BL0003 BL0003 hypothetical protein COG family: methyl-accepting chemotaxis protein The organization is as follows: LongGeneID ShortGeneID Strand (+ or -) Length(AA) Location on the genome TrEMBL-ID Locus-Tag Gene Name function additional annotation . Finally, the third type of input file is for each genome a list of all TBLASTN hits of all its genes against the genes from all other genomes in the genome list file (see above). The default naming of this file is the LongTag plus the string "_vs_All.blast2p". Example from "Biflo_vs_All.blast2p": bl0001 pb0464 35.38 65 36 2 1 65 1 59 0.004 36.96 bl0001 cd0733 32.47 77 46 2 1 77 1 71 0.006 36.58 bl0002 bl0002 100.00 541 0 0 1 541 1 541 0.0 1027.7 bl0002 mt0443 71.85 540 148 1 1 540 1 536 0.0 743.8 bl0002 ma3936 72.66 534 141 2 1 534 1 529 0.0 740.0 The general form of a line is: ShortGeneID (1.Gene) ShortGeneID (2. Gene) Matching AA (in %) Length(alignment) no of non-matching AA Gap openings Start(alignment in 1. Gene) End(a.i.1.G.) Start(alignment in 2. Gene) End(a.i.2.G.) E-Value Score . 3.2 Starting a new GhostFam Session After starting GhostFam a new session can be created from the menu 'File'->'New Session'. In the following file dialog a file containing the list of genomes (see 3.1) has to be selected. After a successful parsing the found genomes are listed in the table. The next dialog allows the setting of some parameters for the clustering algorithms as well as settings regarding the file and directory options. In the this dialog the default options for the filenames and extensions can be modified. In the general case none of the options needs a modification upon first trial. Before selecting 'OK', confirm that either the genome list file is located in the same directory as the blast- and annotation files or the directory those data is correctly denoted in 'Directory to data files'. In the following, all data files are processed and the families of paralogs are grouped together according to the chosen parameters. The results are presented in the main window of GhostFam. For further details on the grouping of genes to families see Section 3.4. 3.3 The Main Window The main window of GhostFam contains four different elements. The table in the upper left region contains the list of all created families. The families have a unique identifier (FamilyID) and are ordered according to the number of genes (#CDS) they contain. The third column of the table contains the LongGeneID of each gene in the cluster. From this table one can select a gene family for a more detailed inspection. This gene family is then shown graphically on the right side next to the table. In the graphic, each gene of a family is placed on a virtual circle and connected by a black line to all other genes from the family if they have a common significant blast hit. In the graphic, it is possible to select a single gene by clicking on it. The selected gene is then displayed in green color and all blast hits from this gene to other genes are highlighted as follows: orange (highly significant hit), yellow (significant hit), blue (less significant hit). The classification of the hits is according to the settings of the parameters (see 3.5). Under the menu 'Edit'->'Change Color Display' there is also the option to disable the colors if desired. When a gene is selected in the graphic, it is also highlighted in the lower right table that contains the list of all gene from a family together with further details and its functional annotation. Also it is possible to select a gene from this list which is then highlighted in the graphic. The lower left table contains for a selected gene all its blast hits to other genes inside and outside the family. By selecting a hit from this list, the corresponding line in the graphic is highlighted in green color. If the blast hit points to a gene that is not in the actual family the last column shows in which family the gene can be found. By clicking on the familyID in this column, the corresponding family is selected in the upper left table and drawn in the graphical representation. The first column in this table contains characters allowing an easy overview of the quality of a blast hit. The characters have the following meaning: 'H' -> highly significant hit, 'S' -> significant hit, -> 'L' low significant hit, '!!' -> highly significant hit to a gene not in the current family, '!' -> significant hit to a gene not in the current family. If the graphical representation is of special interest for documentation, it can be exported under 'Edit'->'Export Graphic' in the portable network graphic format (PNG) or as JPEG bitmap file (JPG). The search field in the tool bar allows the efficient search for genes by their LongGeneID (case sensetive) or annotation. Each found gene is displayed with its corresponding family. In addition, there is also the option to search directly for a distinct family by entering the familyID (e.g. '234') into the search field. 3.4 Building the Families The creation of gene families from the BLAST data is divided into consecutive steps. The first two steps (the parsing of the data and the creation of families of paralogs) is done immediately after a new session is created. However, if the data files have been updated or the creation of the paralog families has not provided the expected results, these steps can be repeated using the file menu entry 'Algorithms'->'Restart Parser' resp. 'Algorithms'->'Regroup Paralogs'. Before the paralog families are regrouped it might be necessary to change the thresholds for the identification of paralogs, this can be done using the dialog from 'Edit'->'Preferences' (see also 3.5). A further option to modify the family classification is the use of the manual family editor under 'Edit'->'Manual Editor' (see 3.6). The editor allows to perform individual modifications of the gene families to optimize the automatic classification. After the families of paralogs are grouped to cluster of sufficient quality, the third step merging corresponding families of paralogs to one family of homologs can be performed under 'Algorithms'->'Create Homologous Groups'. Afterwards, again using the manual editor the grouping can be modified by hand. If the grouping of the genes to the families is completed, the export option under 'File'->'Export' prepares the data file which can be imported as clustered sequence data file (.csd) to Gecko. 3.5 Settings and Impact of Parameters An important part for the creation of the gene families is the identification of orthologous and paralogous genes. Therefore, from the dialog 'Edit'->'Preferences' it is possible to change the default thresholds for the classifications of orthologs and paralogs. To establish the groups of paralogs, three values from a blast hit of two genes are concerned, i.e. the E-Value (EV), the percent rate of identical amino acids in their alignment (PI), and the coverage (CO) of the blast hit. The coverage of a hit is the fraction the length of the alignment divided by the length (number of amino acids) of the longer gene. In GhostFam, two genes A and B called paralogous, if the coverage of the hit from A and B exceeds the threshold CO and either the rate of matching amino acids of A and B exceeds PI or their E-Value is below EV. The families of paralogs are grouped such that if two genes A and B are paralogs it is ensured that they share the same family. In principle the characterization of the orthologous genes is similar to the definition of paralogs. Therefore, a further set of threshold values is provided together with a new parameter called overlap ratio (OR) of the grouping. When starting the grouping process that creates the families of homologs, two families are said to form a pair of orthologous gene families, if the number of genes in the smaller family having an ortholog in the larger family, divided by the total number of genes in the smaller family, is larger that OR. This condition is tested for each pair (F,F') of families, where F is fixed to the largest occurring family, and F' iterates over all families of paralogs that have been detected. Afterwards, those pairs fulfilling the test condition are grouped together into one family of homologs and the next grouping step continues without those families. If finally all families are grouped together it is guaranteed, that there is no ungrouped pair of gene families left that share more than (OR * size of the smaller family) orthologous genes. 3.6 Manual Editor Since the grouping of genes to their families of paralogs and homologs is designed to be fully automated and comparably fast, it is not feasible to test neither for each gene nor for each family different alternative scenarios, which might result in a more plausible grouping from the biological point of view. Instead, GhostFam provides an manual editor that allows to perform corrections in the assignment of the genes to their families by hand. At this point it should be stated that if the results of a grouping are not looking sensible at all, the first choice should always be a re-grouping with a different set of parameters, since the proper correction of gene families by hand needs some experience and even more time. The editor can be started from 'Edit'->'Manual Editor' and the appearance of the window looks like a duplicated main window. Indeed, all functions concerning the graphics and tables from the main window are also available in the editor. In principle, the editor allows to perform three different operations (with the five buttons in the center of the window). The first operation is the cutting of a family into two families. This operation is required, if the similarity of a few genes from two different groups lead to their grouping into one family (e.g. in the family with ID 30 in the sample.sdf clearly shows such characteristic). The split of a family is performed by the following steps: 1) Selecting the family from the upper list on one side of the editor, 2) marking of genes to be split from the family (by double click on the gene in the graphic or marking the checkbox in the annotation table), 3) clicking on the button in the center to create a new family from the selected genes, 4) checking the split (the newly family created from the selected genes is shown opposite to the original family). The second operation is the movement of a set of marked gene from one family to another. The movement done as follows: 1) Selecting on one side of the editor the source family and on the other side the family which receives the genes. 2) marking the genes to be transferred in the source family. 3) clicking on the button in the center to move the genes from the source to the target family 4) checking the movement (the updated families are immediately shown). The third editor option is the merging of two families. This is the simplest step and can be easily performed by choosing on the both sides the families to be merged and then clicking on the merge button in the center of the window. Finally, to exit the manual editor choose 'ok' from the buttons in the center. ################################################################## 4. Gecko Given a set of genomes in which each gene is assigned to a family of homologous genes, Gecko detects sets of genes that appear in a conserved neighborhood among the genomes. After the detection step different preprocessing phases allow also the reconstruction of only partially conserved gene clusters. A typical Gecko session is divided into five parts (Data Preparation, Cluster Detection, Graphical Evaluation, Reconstruction, Final Evaluation). These parts are described in the following in more detail. 4.1 Data Preparation For Gecko, the basic requirement is that the genomes are given as sequences of stings where each character (here each number) represents a certain family containing at least one gene. All genes in a family should be homologs performing the same (or very similar) function. Therefore, Gecko supports two different types of input data. The first type is based on the classification of genes in the COG database, the second source of data is the exported family classification from GhostFam (see 3.4). Input files based on COG data base should have the file extension '.cog' and have to be organized as follows: GenomeName Descriptive Text (ignored) Descriptive Text (ignored) Where in each line contains information about the family and function of single genes in the order of their occurrence in the genome: COG no. Strand (+ or -) COG functional category Gene Name functional annotation Example from the file COG_data.cog (see supplementary data in the manual section of Gecko on the BiBiServ): Aquifex aeolicus, complete genome - 0..1551335 1529 proteins 0480 + J fusA elongation factor EF-G 0050 + J tufA1 elongation factor EF-Tu 0051 + J rpsJ ribosomal protein S10 ... 0459 + O mopA GroEL 0000 - - ---- putative protein 0612 - R ymxG processing protease Clostridium acetobutylicum ATCC824, complete genome - 0..3940880 3672 proteins 0593 + L dnaA DNA replication initiator protein, ATPase 0592 + L dnaN DNA polymerase III beta subunit 2501 + S ---- Small conserved protein, ortholog of YAAA B.subtilis This file can be simply created by downloading each single genome from the NCBI database (http://www.ncbi.nlm.nih.gov/genomes/static/eub_g.html) and cutting out additional columns. Then all genomes have to be merged into one file. The creation of a GhostFam data file (clustered sequence data file '.csd') is described in detail in 3.4. The file is structured like the COG data file, but the field is defined as follows: FamilyID (number only) Strand (+ or -) COG functional category (usually unknown) LongGeneID functional annotation Gene Name TrEMBL ID Example from 'sample.csd': B.longum * 62 + ? Biflo_0001 cold shock protein cspA AAN23868.1 336 + ? Biflo_0002 chaperone groEL AAN23869.1 0 - ? Biflo_0003 hypothetical protein BL0003 AAN23870.1 0 + ? Biflo_0004 hypothetical protein BL0004 AAN23871.1 ... 580 - ? Biflo_1726 uracil-DNA glycosylase ung AAN25596.1 0 + ? Biflo_1727 hypothetical protein BL1813 AAN25594.1 B.subtilis * 503 + ? Bacsu_0001 dnaA CAB11777.1 504 + ? Bacsu_0002 DNA polymerase III (beta subunit) dnaN CAB11778.1 After selecting the input file and type under 'Sequences'->'Open Data File', the file is parsed and all found genomes are listed in the table. By marking the genomes in the table, one can choose which genomes from the data file should be part of the search for conserved neighborhoods. In the menu 'Sequences', there is the option to view the input data file ('Show Data') or to view some statistical information about the genomes ('Show Info'). Under 'Sequences'->'Select' there is also the option to change the selection of the input sequences. 4.2 Cluster Detection After finishing the data preparation, the cluster detection can be started from the menu 'Algorithm'->'Run Algorithm'. In the following dialog the user has to set the parameters for the minimal size of a cluster to be detected, and k' the minimal number of genome sequences a cluster has to be found in. There are two further options available in the 'Algorithm' menu. The option 'Optimize sequence order' allows (if enabled) an optimization of the genome sequence order such that the genome with the lowest number of genes is processed first and therefore listed on the top of the graphical output. Disabling this option preserves the order of the genomes from the input data file in the output, but might result in a significantly longer computation time. The second option 'Remove 0 from sequences' deletes (if enabled) all genes from the genome that are not member of a COG cluster or a gene family of at least two genes. The enabling of this option might result in a faster computation time, depending on the number of such un-grouped genes. On the other hand, the deletion of genes from the genome immediately causes the creation of artificial neighborhoods, e.g. if one deletes from the sequence of (1,2,3,0,5) the 0, automatically 3 and 5 become neighbors. IMPORTANT: If the cluster detection terminates with an 'out of memory error', restart Gecko with more memory assigned to the JVM (see Readme.txt for details). 4.3 Graphical Evaluation The results of the cluster detection algorithm are presented in the table in the upper left region of the main window. The clusters are sorted by the number of different genes they contain. If two clusters contain the same number of genes, the cluster that is found in less genomes is listed first. The first column in the table contains for each gene a unique cluster identifier. The following columns show the number of different genes in the cluster and the number of sequences in which the cluster occurs. The fourth column '#subcluster' indicates whether the cluster is basic (0 subclusters) or is the result of a merge of many clusters (>= 2 subclusters). The typical situation in which two clusters are merged to one is the following: A cluster contains the genes 1,2,3 and is located in three genomes. A second cluster containing the genes 1,2,3,4 is only found in two of the three genomes. If no merging is performed, theses two clusters appear at separate positions in the table, although it is obvious the cluster containing the genes 1,2,3 is a part of the other cluster. Therefore, the two clusters are merged to one, and the original clusters are stored as subclusters of the merged one. These subclusters are listed in the table in the lower left. The columns five to seven indicate if the gene clusters shows a characteristic pattern containing additional information. These pattern are only of interest in the final evaluation and described in Section 4.5. The last column contains the list of all gene families in the gene cluster. From both tables one can select a single cluster for a more detailed graphical representation right next to the table. In the graphic, each genome that (partially) contains the selected gene cluster appears as a sequence of arrows and numbers. The arrows represent the genes whose neighborhoods are conserved. The familyID or COG number of a gene is shown in the center of the arrow. The blue italic number between the arrows indicate how many genes separate the conserved regions. By moving the mouse over an arrow, further information about the single gene (Position in the genome, LongGeneID, annotation) appears as a tool-tip. The direction in which an arrow is pointing indicates on which strand the gene is located. By clicking on a blue italic number, all genes from the interleaving region are display in a new window together with all available information about their annotation and function. This extended view is also available for a region of connected arrows. By clicking with the left mouse button on one of the connected genes from a genome, all the genes from this region are displayed with the available information about their names, positions, IDs, and functional annotations. A further extremely helpful option comparing the organization of gene clusters between two or more genomes is the centering tool. It is activated by a click with the right mouse button on the gene along which the other regions should be aligned. The alignment shifts the graphic for each genome, such that the first occurrence of the chosen gene in each genomes appears in one column. If the first occurrence in a genome appears on the opposite strand, the whole graphical representation for a genome is flipped. To indicate such a flip, the genome is now drawn on a yellow background. A further useful tool for the localization of single genes in different clusters is the filter field located in the toolbar of Gecko. In this field one can enter one or more numbers (separated by a comma) which are used as a filter for the list of all clusters. When this filter is applied, the left table shows only those gene clusters containing at least one gene from the filter field. To show all clusters again, one can use the trash box symbol (rightmost icon in the toolbar) or re-run the filtering with an empty filter field. 4.4 Reconstruction An additional preprocessing step allows a further optimization of the clustering. Consider the following example: Given the two gene clusters A with the genes 1,2,3,4 and B with the genes 1,2,4,5. In many cases such a situation hints to false family classification, i.e. that the genes in family 3 and 5 are homologs. To easily overcome the problem that A and B are not reported together, Gecko provides the option to run a further preprocessing to combine gene clusters that show a user defined rate of identical genes. In the example a rate of less than 0.75 will result in a combination of the clusters A and B. The postprocessing can be started from 'Algorithm'->'Run Postprocessing'. After setting the identity rate, the postprocessing starts and when finished, the cluster list in the upper left table is updated as well as the check-boxes for the characteristic patterns. 4.5 Final Evaluation For the final evaluation, all options from the Graphical Evaluation (see 4.3) are available, and also additional features of gene clusters are highlighted using the columns P1, P2, and P3 in the list of all clusters. P1: The 'Gene Replacement Pattern' indicates that a gene cluster has a set of subclusters which contain the same number of genes, but exactly one gene from each subcluster does not appear in the intersection of the gene sets. Example: A gene cluster contains the genes 1,2,3,4,5 and the two subclusters with the genes 1,2,3,4 and 1,2,3,5. Here, it is most likely, that the gene '4' from one cluster is replaced by gene '5' in the other cluster. P2: The 'Cluster Separation Pattern' is highlighted for a gene cluster if in at least on genome, the set of neighboring genes is divided into two or more regions which together result in the complete cluster. Example: Given a gene cluster C of the genes 1,2,3,4,5 with subclusters X containing 1,2,3, Y containing 3,4, and Z containing 4,5. Here the complete cluster is separated in some genomes in the parts X,Y, and Z. P3: The 'Duplication Pattern' simply indicates that the gene cluster contains at least one paralogous gene or one region of internal duplication. Given the genome G1=(1,2,3,...,1,2,3) and G2=(1,2,2,3). The resulting gene cluster contains a paralogous gene (the '2' in G2) and a region of internal duplication (the region 1,2,3 appear twice in G1). Another feature, also available without the preprocessing, is the export of a graphical representation into a file in the portable network graphic format (PNG) or as JPEG bitmap file (JPG). This can be done by the menu 'Options'->'Export MainCluster View' for the complete cluster or by 'Options'->'Export SubCluster View' for a selected subcluster.