Usage
Type "mmfind" or "mmfind -h" to display information about the program and
the options that modify its behaviour:
mmfind [options] <multiple_fasta_file>
Evaluates an alignment in multiple FASTA format for mismatches.
Quality scores are considered if a file with scores is supplied
(the file should have the same name as the alignment file but the
extension "qual"; example: test.fa and test.qual). Scores should be
in FASTA format and are mapped onto the aligned sequences. mmfind is
a command-line tool written in Python. It has been tested with
Python 2.6, 2.7 and 3.1.
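The naming convention for the score file, and the mapping of scores onto a gapped sequence, can be sketched as follows. This is only an illustration and not mmfind's own code: the function names quality_file_for and map_scores are hypothetical, and the sketch assumes that gaps are written as "-" and carry no score.

```python
import os


def quality_file_for(fasta_path):
    """Return the expected path of the score file for a FASTA file."""
    base, _ext = os.path.splitext(fasta_path)
    return base + ".qual"


def map_scores(aligned_seq, scores, gap="-"):
    """Map the scores of an ungapped read onto its aligned (gapped) form.

    Gap positions get None (an assumption of this sketch).
    """
    bases = sum(1 for ch in aligned_seq if ch != gap)
    if len(scores) != bases:
        raise ValueError("number of scores does not match number of bases")
    remaining = iter(scores)
    return [None if ch == gap else next(remaining) for ch in aligned_seq]
```

For example, quality_file_for("test.fa") yields "test.qual", matching the example given above.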
FILTERING
-a <integer> | (aligned:) minimum number of aligned sequences (default: 2).
An alignment that contains fewer sequences is not processed
(error code: -602). |
-L <integer> | (length:) minimum alignment length (default: 200, error code: -605). |
-p <integer> | (polymorphism cutoff:) ignore alignments whose percentage of mismatches
per alignment length exceeds the given value (default: 3, error code: -656). |
-l <integer> | (length:) maximum length of mismatches to be reported (default: 3). |
-b <integer> | (border distance:) minimum distance between a mismatch and the alignment
ends for the mismatch to be reported (default: 80). |
-s <integer> | (score:) minimum average score of the bases of a mismatch (default: 20). |
-n <integer> | (neighborhood scores:) minimum average score of the 10 bases neighboring
a mismatch (5 upstream, 5 downstream) (default: 15). |
-A | (all mismatches:) prevent filtering and display all mismatches (default:
use default filtering, see above). |
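The alignment-level filters (-a, -L and -p) can be pictured with the following sketch. It is not taken from the mmfind source: the function name check_alignment is hypothetical, the order in which the checks are applied is an assumption, and the number of mismatching positions is assumed to be computed elsewhere. The return value pairs a status (1 = OK, 0 = rejected) with the error code listed for the option, or None if no filter applies.

```python
def check_alignment(sequences, mismatch_count,
                    min_seqs=2, min_len=200, max_perc=3):
    """Return (status, error) for one alignment.

    sequences:      list of aligned sequences (gaps included)
    mismatch_count: number of mismatching positions (assumed to be given)
    """
    if len(sequences) < min_seqs:            # option -a
        return 0, -602
    aln_len = len(sequences[0])
    if aln_len < min_len:                    # option -L
        return 0, -605
    if 100.0 * mismatch_count / aln_len > max_perc:   # option -p
        return 0, -656
    return 1, None
```

With the default values, an alignment of two sequences of length 250 with 10 mismatching positions (4 percent) would be rejected with error code -656.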
OUTPUT OPTIONS:
-o <basename_of_outfiles> | (outfile:) files to which the reports should be appended (default:
<basename_of_infile>.alignments.csv and <basename_of_infile>.mismatches.csv). |
-d | (description:) write a descriptive header line to the report files (default:
no header line). |
Output files
Columns in the alignments file
ID | Alignment ID. |
ALN_LEN | Length of the aligned sequences including the gaps. |
MISM | All differences between the aligned sequences. |
SNVS | Single Nucleotide Variations = single-base SNPs. |
SNV_B | Single Nucleotide Variation bases. |
MNVS | Multiple Nucleotide Variations = multi-base SNPs. |
MNV_B | Multiple Nucleotide Variation bases. |
S_IND | Single-base InDels. |
S_IND_B | Single-base InDel bases. |
M_IND | Multi-base InDels. |
M_IND_B | Multi-base InDel bases. |
P_PERC | Percentage of polymorphic bases per aligned bases. |
STATUS | If the alignment is OK the status is 1, otherwise 0. |
ERROR | Error code:
-42: sequences of the multiple FASTA file are not equal in length.
-602: not enough sequences in the alignment.
-605: alignment too short.
-656: fraction of mismatches too high. |
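When post-processing the alignments file, the error codes can be translated back into their descriptions. The following is a minimal sketch under stated assumptions: the names ERROR_MESSAGES, describe_error and has_equal_lengths are hypothetical and not part of mmfind, and the length check only illustrates the condition behind error code -42.

```python
# Error codes and descriptions as listed for the ERROR column.
ERROR_MESSAGES = {
    -42: "sequences of the multiple FASTA file are not equal in length",
    -602: "not enough sequences in the alignment",
    -605: "alignment too short",
    -656: "fraction of mismatches too high",
}


def describe_error(code):
    """Translate an error code from the ERROR column into text."""
    return ERROR_MESSAGES.get(int(code), "unknown error code: %s" % code)


def has_equal_lengths(sequences):
    """True if all aligned sequences have the same length (cf. -42)."""
    return len(set(len(seq) for seq in sequences)) <= 1
```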
Columns in the mismatches file
ID | Alignment ID. |
TYPE | Mismatch type (SNV, MNV, InDel, Mixed). |
ALN_S | Position of the first base of a polymorphism in the alignment. |
PS_CONS | Consensus sequence of the polymorphism represented in ambiguity code. |
PS_LEN | Length of the polymorphism. |
CONS | Consensus sequence of 100 bp upstream and 100 bp downstream of the polymorphism. |
MINBORD | Minimal distance of the polymorphism to the start or the end of the alignment. |
PSAVGSC | Average score of the polymorphic site. |
NGAVGSC | Average score of the neighboring 2x5 bases. |
N_COUNT | Number of N's in the consensus sequence. |
STATUS | Result of 3 tests, encoded as a bit field:
1: length of the polymorphism <= maximal polymorphism length (= binary 1 = decimal 1),
2: distance to the start or the end of the alignment > minimal border distance (= binary 10 = decimal 2),
3: average score of the neighboring 2x5 bases >= cutoff and average score
of the polymorphic site >= cutoff (= binary 100 = decimal 4).
The values of the passed tests are added up; if all 3 tests pass, the status is binary 111 = decimal 7. |
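The STATUS value of the mismatches file can be composed and decoded with bitwise operations. The sketch below is an illustration, not the mmfind implementation: the names mismatch_status and decode_status are hypothetical, and the default cutoffs are taken from the documented defaults of the options -l, -b, -s and -n.

```python
LENGTH_OK = 1   # test 1, binary 001
BORDER_OK = 2   # test 2, binary 010
SCORE_OK = 4    # test 3, binary 100


def mismatch_status(ps_len, min_border, ps_avg_score, ng_avg_score,
                    max_len=3, border=80, min_score=20, min_neighbor_score=15):
    """Combine the three tests into one STATUS value (0 to 7)."""
    status = 0
    if ps_len <= max_len:
        status |= LENGTH_OK
    if min_border > border:
        status |= BORDER_OK
    if ng_avg_score >= min_neighbor_score and ps_avg_score >= min_score:
        status |= SCORE_OK
    return status


def decode_status(status):
    """Report which of the three tests a STATUS value has passed."""
    return {
        "length": bool(status & LENGTH_OK),
        "border": bool(status & BORDER_OK),
        "score": bool(status & SCORE_OK),
    }
```

A status of 5 (binary 101), for instance, means that the length and score tests passed while the polymorphism lies too close to an alignment end.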
Examples
Processing one multiple alignment (e.g. "example.mfa" and "example.qual" in the
directory "test1" in the downloaded archive):
mmfind -d test1/example.mfa
or
cd test1
mmfind -d example.mfa
In both cases two files ("example.alignments.csv" and "example.mismatches.csv") are
written to the directory from which the script is called. The -d option is not
required; it adds a descriptive header line to the output files.
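The default output names follow from the basename of the input file and the current working directory. A small sketch of that rule, assuming the basename is the file name without its extension (the function name default_outfiles and the workdir parameter are hypothetical and only serve the illustration):

```python
import os


def default_outfiles(infile, workdir=None):
    """Return the default (alignments, mismatches) report paths."""
    if workdir is None:
        workdir = os.getcwd()   # directory from which the script is called
    base = os.path.splitext(os.path.basename(infile))[0]
    return (os.path.join(workdir, base + ".alignments.csv"),
            os.path.join(workdir, base + ".mismatches.csv"))
```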
Processing of multiple alignments (e.g. files in the directory "test2" in the
downloaded archive):
mmfind -d -o test2/summary test2
With the -o option the results are not written to separate files for each alignment
but are summarized in the two files "summary.alignments.csv" and
"summary.mismatches.csv". Only FASTA files with one of the following extensions are
evaluated: fa, fasta, fas or mfa.
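The selection of input files in a directory can be sketched as follows. This is an assumption-laden illustration rather than mmfind's code: the function name find_alignment_files is hypothetical, and the sketch assumes that extensions are matched case-sensitively and that subdirectories are not searched.

```python
import os

# Extensions named in the text above.
FASTA_EXTENSIONS = ("fa", "fasta", "fas", "mfa")


def find_alignment_files(directory):
    """Return the sorted paths of all FASTA files directly in a directory."""
    files = []
    for name in sorted(os.listdir(directory)):
        ext = os.path.splitext(name)[1].lstrip(".")
        path = os.path.join(directory, name)
        if ext in FASTA_EXTENSIONS and os.path.isfile(path):
            files.append(path)
    return files
```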
For a more user-friendly alignment format, use the -c option: for each multiple FASTA
alignment, a corresponding Clustal-like version is written.