Now you are interested in aligning two strains of E. coli. In order to save about half the amount space, you decide to build an index comprised of the first sequence only according to instructions b) of Section 3.
mkvtree -dna -suf -tis -sti1 -pl 8 -db ecoli.fna
Using this index (which has the same name as the sequence it contains) means every time you run mga you have to specify the missing second sequence.
mga -l 20 -greedy -o example -mem ecoli_O157.fna ecoli.fna
The last mga run created a greedy alignment of short gaps which you can inspect in the file example.align:
73:229-301 90:229-318 Sbjct: ------------------ttaccacaggtaacggtgcgggctgacgcgtacaggaaacac 60 !!!!!!!!!!!!!!!!!! Query: ccaccatcaccattaccattaccacaggtaacggtgcgggctgacgcgtacaggaaacac 60 Sbjct: agaaaaaagcccgcacctgacagtgcgggct 91 ! Query: agaaaaaagcccgcacctgacagtgcgggc- 91
The query sequence, i.e. strain O157:H7 is approximately 900k bases longer than strain K-12 MG1655. This difference is reflected in a number of long, unaligned gaps where the region or part of the query sequence is substantially longer than the other one. The most extreme case is presented below:
2707:1634868-1637574 172666:2138604-2311269 Gap67: 2707 (min. length) + 169959 (diff.) = 172666 (max. length)
Suppose you want to align two strains of M. tuberculosis using MUMs as anchors. This is only possible when creating an index as before:
mkvtree -dna -suf -tis -sti1 -pl 8 -db mtub.fna
Now you want to align not just short gaps but indeed every gap that is similar in both sequences:
mga -l 50 -maxdist 1000 -greedy -o example -mum mtub_CDC1551.fna mtub.fna
You find several alignments that are not computed without option -maxdist. An excerpt of the longest one is shown below; note that it contains more than 10k bases.
10722:4134709-4145430 10721:4126967-4137687 Sbjct: gagagcggcgcagatgatcctaaccggacgcaccggcttgctggccctgatctgcgtcct 60 ! Query: -agagcggcgcagatgatcctaaccggacgcaccggcttgctggccctgatctgcgtcct 60 Sbjct: gccgatagcgctgtccccttggccggcaagggctttcgtgatgttgctggtggcgcttgc 120 Query: gccgatagcgctgtccccttggccggcaagggctttcgtgatgttgctggtggcgcttgc 120
In order to obtain the sequence data of putative SNPs you either have to include option -identity 0 in the previous call or you remove option -maxdist from it. Here is one putative SNP between two particularly long MUMs:
19138 3270781 3265104 !3289919 !3284242 Sbjct: c 1 Query: a 1 6919 3289920 3284243