Family Free Genome Comparison (FFGC) is a self-contained workflow
system that provides functionality for all steps of a family-free
gene order analysis starting from annotated genome sequences: (1)
the computation of local sequence alignment scores between genes of
two or more gene order sequences using BLAST+
(the necessary BLAST+ binaries are included in the binary releases of
this software); (2) the establishment of gene relationships; and
(3) the actual family-free gene order analysis.
Supplied with a set of genome sequences,
FFGC creates a project associated with
a self-contained folder hierarchy. When creating the directories,
FFGC includes parameter values in
folder names for all steps of the workflow, so that data generated
by several analyses with different parameter settings can be
computed side by side, kept apart, and maintained easily.
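The idea of encoding parameter values in folder names can be
illustrated with a minimal sketch. The naming scheme, the step name
"blast", and the parameter names "evalue" and "stringency" are
hypothetical and chosen only for illustration; the names that FFGC
actually generates may look different.

```python
def param_dir_name(step, params):
    """Build a folder name that encodes a step's parameter values.

    The "<step>_<key><value>_..." scheme is an assumption made for
    this sketch; it is not FFGC's actual naming scheme.
    """
    parts = [step]
    # Sort the keys so that the same settings always give the same name.
    for key in sorted(params):
        parts.append(f"{key}{params[key]}")
    return "_".join(parts)


# Two runs of the same step with different settings map to different
# folders, so their results can coexist in one project.
run_a = param_dir_name("blast", {"evalue": "1e-5", "stringency": 0.5})
run_b = param_dir_name("blast", {"evalue": "1e-10", "stringency": 0.5})
print(run_a)
print(run_b)
```

Because every parameter value is part of the path, rerunning a step
with a new setting writes to a new folder instead of overwriting
earlier results.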
The create_project command of
FFGC accepts three alternative input
variants: (i) files in GenBank format, (ii)
pairs of (multi-record) FASTA and
General Feature Format (GFF) files, and (iii)
FASTA files that are already in the expected format.
For input variant (ii), each FASTA file must be associated
with a single organism and each of its records must correspond to a
contig or chromosome of the organism's genome. Further, each
corresponding GFF file must have the same name as its FASTA file,
except for the file ending '.gff'.
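The naming requirement for input variant (ii) can be checked before
creating a project. The following is a minimal sketch, not part of
FFGC: the helper pair_fasta_gff and the set of FASTA file endings
are assumptions, as is the interpretation that 'genome.fasta' is
paired with 'genome.gff'.

```python
from pathlib import Path

# File endings treated as FASTA here; this set is an assumption.
FASTA_SUFFIXES = {".fa", ".fasta", ".fna", ".faa"}


def pair_fasta_gff(directory):
    """Return (pairs, missing) for the FASTA files in a directory.

    pairs   -- list of (fasta_path, gff_path) tuples
    missing -- FASTA paths without a same-named '.gff' file
    """
    pairs, missing = [], []
    for fasta in sorted(Path(directory).iterdir()):
        if fasta.suffix.lower() not in FASTA_SUFFIXES:
            continue
        # Same name, but with the file ending replaced by '.gff'.
        gff = fasta.with_suffix(".gff")
        if gff.is_file():
            pairs.append((fasta, gff))
        else:
            missing.append(fasta)
    return pairs, missing
```

Calling pair_fasta_gff on the input folder and inspecting the
second return value shows which genomes still lack an annotation
file.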
For input variant (iii), each FASTA record must correspond to
an atomic unit of a gene order sequence, featuring a unique
record ID that indicates its association with a
chromosome or contig. Further, the order of the records in each
FASTA file must already match the order of the intended gene order
sequence.
For input variants (i) and (ii),
create_project allows annotated
genomic features to be filtered by type. The selected annotations
are then extracted and, in further analysis, treated as the atomic
units of gene order sequences, henceforth called
genes. FFGC
supports family-free analysis at either the
nucleotide or the protein sequence level.
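Filtering annotated features by type can be pictured with a short
sketch for GFF input. It relies only on the general GFF layout (nine
tab-separated columns, with the feature type in the third column);
the function name and the toy records are made up for illustration
and do not reflect how create_project is implemented.

```python
def filter_gff_by_type(lines, wanted_types):
    """Yield the GFF feature lines whose type column is in wanted_types."""
    wanted = set(wanted_types)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        columns = line.rstrip("\n").split("\t")
        # A GFF feature line has nine columns; the third is the type.
        if len(columns) == 9 and columns[2] in wanted:
            yield line


# Toy annotation, invented for this example.
example = [
    "##gff-version 3\n",
    "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1\n",
    "chr1\tsrc\tCDS\t150\t850\t.\t+\t0\tID=c1;Parent=g1\n",
    "chr1\tsrc\ttRNA\t1200\t1275\t.\t-\t.\tID=t1\n",
]
kept = list(filter_gff_by_type(example, {"CDS"}))
print(len(kept))
```

Here only the CDS record is kept, so it alone would become a gene in
the resulting gene order sequence.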
After the project is created, the genome sequences,
intermediary data, and outputs of the various family-free analyses
are maintained by the rule-based workflow management
system snakemake.
The workflow and parameter settings of individual family-free
analyses can be adjusted in the configuration file
config.yaml located in the root directory of
each FFGC project.
The figure above gives an overview of the
snakemake workflow provided by
FFGC. Each terminal rule (a leaf of the
tree, starting with "all_") represents a workflow task associated
with a family-free analysis. The tasks are executed by running
snakemake -p <task> in the root directory
of an FFGC project.
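For scripted use, the same call can be issued from Python. The
following is a minimal sketch under stated assumptions: the helper
names are hypothetical, and the task name "all_example" is a
placeholder, not an actual rule of the FFGC workflow.

```python
import subprocess


def snakemake_command(task):
    """Build the argument list for 'snakemake -p <task>'."""
    # Terminal rules of the workflow start with "all_".
    if not task.startswith("all_"):
        raise ValueError(f"not a terminal rule: {task!r}")
    return ["snakemake", "-p", task]


def run_task(project_root, task):
    """Run a task from the root directory of an FFGC project."""
    return subprocess.run(snakemake_command(task), cwd=project_root, check=True)


# "all_example" is a placeholder task name used only for illustration.
print(snakemake_command("all_example"))
```

Passing the project root as the working directory mirrors running
the command by hand inside that directory.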