Pairwise Sequence Alignment

Next: DTD for XML Output Up: Examples Previous: Multiple Sequence Alignment

Pairwise Sequence Alignment

Now you are interested in aligning two strains of E. coli. In order to save about half the amount space, you decide to build an index comprised of the first sequence only according to instructions b) of Section 3.

mkvtree -dna -suf -tis -sti1 -pl 8 -db ecoli.fna

Using this index (which has the same name as the sequence it contains) means every time you run mga you have to specify the missing second sequence.

mga -l 20 -greedy -o example -mem ecoli_O157.fna ecoli.fna

The last mga run created a greedy alignment of short gaps which you can inspect in the file example.align:

73:229-301 90:229-318 
Sbjct: ------------------ttaccacaggtaacggtgcgggctgacgcgtacaggaaacac        60
       !!!!!!!!!!!!!!!!!!                                          
Query: ccaccatcaccattaccattaccacaggtaacggtgcgggctgacgcgtacaggaaacac        60

Sbjct: agaaaaaagcccgcacctgacagtgcgggct                                     91
                                     !
Query: agaaaaaagcccgcacctgacagtgcgggc-                                     91

The query sequence, i.e. strain O157:H7 is approximately 900k bases longer than strain K-12 MG1655. This difference is reflected in a number of long, unaligned gaps where the region or part of the query sequence is substantially longer than the other one. The most extreme case is presented below:

2707:1634868-1637574 172666:2138604-2311269 
Gap67:    2707 (min. length) +  169959 (diff.) =  172666 (max. length)

Suppose you want to align two strains of M. tuberculosis using MUMs as anchors. This is only possible when creating an index as before:

mkvtree -dna -suf -tis -sti1 -pl 8 -db mtub.fna

Now you want to align not just short gaps but indeed every gap that is similar in both sequences:

mga -l 50 -maxdist 1000 -greedy -o example -mum mtub_CDC1551.fna mtub.fna

You find several alignments that are not computed without option -maxdist. An excerpt of the longest one is shown below; note that it contains more than 10k bases.

10722:4134709-4145430 10721:4126967-4137687 
Sbjct: gagagcggcgcagatgatcctaaccggacgcaccggcttgctggccctgatctgcgtcct      60
       !                                                         
Query: -agagcggcgcagatgatcctaaccggacgcaccggcttgctggccctgatctgcgtcct      60

Sbjct: gccgatagcgctgtccccttggccggcaagggctttcgtgatgttgctggtggcgcttgc     120
Query: gccgatagcgctgtccccttggccggcaagggctttcgtgatgttgctggtggcgcttgc     120

In order to obtain the sequence data of putative SNPs you either have to include option -identity 0 in the previous call or you remove option -maxdist from it. Here is one putative SNP between two particularly long MUMs:

19138 3270781 3265104
!3289919 !3284242 
Sbjct: c                                                                  1
      
Query: a                                                                  1

6919 3289920 3284243

Next: DTD for XML Output Up: Examples Previous: Multiple Sequence Alignment

Jan Krueger 2012-12-14