Author: D. Doerr

Family Free Genome Comparison (FFGC) is a self-contained workflow system that provides functionality for all steps of a family-free gene order analysis, starting from annotated genome sequences: (1) the computation of local sequence alignment scores between genes of two or more gene order sequences using BLAST+ (the necessary BLAST+ binaries are included in the binary releases of this software); (2) the establishment of gene relationships; and (3) the actual family-free gene order analysis.

Supplied with a set of genome sequences, FFGC creates a project with its own self-contained folder hierarchy. When creating the directories, FFGC includes parameter values in the folder names for all steps of the workflow. This makes it possible to run several analyses with different parameter settings side by side, and it keeps their generated data separate and easy to maintain.
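To illustrate why encoding parameter values in folder names keeps analyses apart, here is a minimal sketch. The naming scheme, the function `step_directory`, and the step and parameter names are all hypothetical; FFGC's actual directory layout may differ.

```python
from pathlib import Path


def step_directory(project_root, step, params):
    """Build a directory path whose name records the parameter values of a step.

    Hypothetical scheme for illustration only, not FFGC's real layout.
    """
    # Sort the keys so the same settings always map to the same folder name.
    suffix = "_".join(f"{key}{value}" for key, value in sorted(params.items()))
    name = f"{step}_{suffix}" if suffix else step
    return Path(project_root) / name
```

Two runs of the same step with different settings then resolve to different directories, so neither overwrites the other's output.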

The create_project command of FFGC accepts three alternative input variants: (i) files in GenBank format, (ii) pairs of (multi-record) FASTA and General Feature Format (GFF) files, and (iii) FASTA files that are already in the expected format.

For input variant (ii), each FASTA file must be associated with a single organism, and each of its records must correspond to a contig or chromosome of the organism's genome. Further, each FASTA file and its corresponding GFF file must share the same name, except for the file ending '.gff'.
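The naming rule for variant (ii) can be checked before a project is created. The following is a minimal sketch, assuming the GFF file carries the FASTA file's name with its extension replaced by `.gff` (for example `genomeA.fasta` and `genomeA.gff`). The helper `pair_fasta_with_gff` and the set of recognized FASTA extensions are hypothetical and not part of FFGC.

```python
from pathlib import Path

# Assumed FASTA extensions; adjust to the files at hand.
FASTA_SUFFIXES = {".fasta", ".fa", ".fna", ".faa"}


def pair_fasta_with_gff(directory):
    """Return (pairs, missing): FASTA files that have a same-named .gff file,
    and FASTA files for which no such GFF file exists."""
    pairs, missing = [], []
    for fasta in sorted(Path(directory).iterdir()):
        if fasta.suffix.lower() not in FASTA_SUFFIXES:
            continue
        gff = fasta.with_suffix(".gff")  # same name, different file ending
        if gff.is_file():
            pairs.append((fasta, gff))
        else:
            missing.append(fasta)
    return pairs, missing
```

Any file reported in `missing` would need a matching GFF file before it can be used as variant (ii) input.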

For input variant (iii), each FASTA record must correspond to an atomic unit of a gene order sequence and carry a unique record ID that indicates its association with a chromosome or contig. Further, the order of the records in each FASTA file must already match the order of the intended gene order sequence.
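The uniqueness requirement on record IDs can be verified with a few lines of code. This sketch assumes that the record ID is the first whitespace-delimited token of the FASTA header line, which is the common convention. It does not check how the ID encodes the chromosome or contig, since that format is not specified here. The function names are hypothetical.

```python
from collections import Counter


def fasta_record_ids(lines):
    """Yield the ID of each FASTA record: the first whitespace-delimited
    token after '>' on a header line (empty string if the header is blank)."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            yield header.split()[0] if header else ""


def duplicate_ids(lines):
    """Return record IDs that occur more than once, in order of first appearance."""
    counts = Counter(fasta_record_ids(lines))
    return [record_id for record_id, n in counts.items() if n > 1]
```

An empty result from `duplicate_ids(open(path))` indicates that all record IDs in the file are distinct.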

For input variants (i) and (ii), create_project allows annotated genomic features to be filtered by type. The selected annotations are then extracted and, in all further analysis, treated as the atomic units of gene order sequences; they are henceforth called genes. FFGC supports family-free analysis at either the nucleotide or the protein sequence level.

After the project is created, the genome sequences, intermediary data, and outputs of the various family-free analyses are maintained by the rule-based workflow management system snakemake. The workflow and parameter settings of individual family-free analyses can be adjusted in the configuration file config.yaml located in the root directory of each FFGC project.

Figure: Snakemake workflow schema of FFGC

The figure above gives an overview of the snakemake workflow provided by FFGC. Each terminal rule (a leaf of the tree, whose name starts with "all_") represents a workflow task associated with a family-free analysis. A task is executed by running snakemake -p <task> in the root directory of an FFGC project.
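For scripted use, the same invocation can be wrapped in a small helper. This is a sketch only: the functions `snakemake_command` and `run_task` are hypothetical, the check on the "all_" prefix simply mirrors the naming convention described above, and the task name must be taken from the actual workflow.

```python
import subprocess


def snakemake_command(task):
    """Build the command line for one workflow task (a rule named 'all_...')."""
    if not task.startswith("all_"):
        raise ValueError(f"expected a terminal rule starting with 'all_', got {task!r}")
    # -p makes snakemake print the shell commands it executes.
    return ["snakemake", "-p", task]


def run_task(project_root, task):
    """Run the task from the project's root directory; raises on a non-zero exit."""
    subprocess.run(snakemake_command(task), cwd=project_root, check=True)
```

Separating command construction from execution keeps the validation testable without snakemake being installed.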

