Sequence Analysis with Distributed Resources - Alignments

		Multiple Sequence Alignments
		In the last section we gained some experience with pairwise alignment tools. Those tools are well suited to finding global and local similarities between just two sequences. This way you can find conserved regions distributed over the whole sequence. If we want to align more than two sequences, we have to use multiple alignment tools. Methods that generalize the pairwise dynamic programming approach to multiple alignments are limited to small numbers of short sequences, because time and memory requirements grow exponentially with the number of sequences; the problem becomes computationally infeasible for much more than 10 or so proteins of average length. Therefore, all of the methods below make use of heuristics.
		Each tool has different restrictions in handling sequence data. In principle it is better to align only sequences of similar length; as a rule of thumb, the shortest and the longest sequence should not differ by more than about one hundred bases. A big advantage of multiple alignment tools is that you see global similarities between several sequences much better than in lots of pairwise alignments on several sheets of paper spread all over your desk.
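To get a feeling for why full dynamic programming stops being practical, the following minimal sketch counts the cells of the alignment lattice for N sequences. It assumes the straightforward generalization of the pairwise method, in which the lattice has one dimension per sequence and every cell looks at up to 2^N - 1 neighbouring cells. The function names and the sequence length of 300 are illustrative choices, not taken from any of the tools below.

```python
from math import prod


def dp_lattice_cells(lengths):
    """Number of cells in the full N-dimensional dynamic programming lattice.

    A sequence of length n contributes n + 1 positions (including the
    empty prefix), just as in the pairwise case.
    """
    return prod(n + 1 for n in lengths)


def predecessors_per_cell(num_sequences):
    """Upper bound on the predecessor cells each lattice cell must inspect.

    In every step each sequence either advances or receives a gap, and at
    least one sequence must advance: 2^N - 1 combinations.
    """
    return 2 ** num_sequences - 1


# Illustrative run: N proteins of length 300 each (an assumed "average" length).
for n in (2, 3, 5, 10):
    cells = dp_lattice_cells([300] * n)
    print(f"N={n:2d}  cells={cells:.2e}  predecessors per cell={predecessors_per_cell(n)}")
```

Even before counting the work per cell, the number of cells alone grows by a factor of about 300 with every additional sequence, which is why the tools in this section rely on heuristics.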
		ClustalW
		ClustalW [Thompson et al. 1994] is the most widely known multiple sequence alignment tool for DNA or proteins. It uses the fact that homologous sequences are evolutionarily related and builds up the alignment progressively by a series of pairwise alignments, following the branching order in a phylogenetic tree. The most closely related sequences are aligned first, and then the more distant ones are added gradually.
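The progressive idea can be sketched in a few lines: compute a distance for every pair of sequences, then repeatedly join the two closest groups, and record the order of the joins. That order is the order in which a progressive method would align things. Note the assumptions: the k-mer distance and the average-linkage clustering used here are simple stand-ins chosen for brevity, not the distance measure or tree-building method that ClustalW itself uses, and all names are hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k=2):
    """All substrings of length k that occur in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """Crude distance: 1 - Jaccard similarity of the two k-mer sets (a stand-in)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def progressive_order(seqs, k=2):
    """Return the list of joins (group_a, group_b), closest groups first.

    seqs maps a sequence name to its residues. Groups are tuples of names;
    the distance between two groups is the average over all member pairs.
    """
    names = list(seqs)
    dist = {
        frozenset((a, b)): kmer_distance(seqs[a], seqs[b], k)
        for a, b in combinations(names, 2)
    }

    def average_distance(g1, g2):
        total = sum(dist[frozenset((x, y))] for x in g1 for y in g2)
        return total / (len(g1) * len(g2))

    groups = [(name,) for name in names]
    joins = []
    while len(groups) > 1:
        g1, g2 = min(combinations(groups, 2), key=lambda p: average_distance(*p))
        joins.append((g1, g2))
        groups = [g for g in groups if g not in (g1, g2)] + [g1 + g2]
    return joins


example = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTTTGGGG"}
for step, (g1, g2) in enumerate(progressive_order(example), start=1):
    print(f"step {step}: align {g1} with {g2}")
```

In the example, s1 and s2 are nearly identical and are joined first; the unrelated s3 is added last, mirroring "closest first, more distant ones later".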
		DCA
		Divide-and-Conquer Multiple Sequence Alignment (DCA) [Stoye et al. 1997] is a program for producing fast, high-quality simultaneous multiple sequence alignments of amino acid, RNA, or DNA sequences. The program is based on the DCA algorithm, a heuristic approach to sum-of-pairs (SP) optimal alignment.
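The sum-of-pairs (SP) objective mentioned above simply adds up the pairwise costs of all pairs of rows in every column of the alignment. The sketch below assumes a cost formulation in which lower is better; the mismatch cost of 1 and gap cost of 2 are arbitrary illustrative values, not DCA's actual scoring parameters, and the function names are hypothetical.

```python
from itertools import combinations


def pair_cost(x, y, mismatch=1, gap=2):
    """Cost of placing symbols x and y in the same column."""
    if x == "-" and y == "-":
        return 0          # two gaps facing each other cost nothing
    if x == "-" or y == "-":
        return gap        # a residue facing a gap
    return 0 if x == y else mismatch


def sum_of_pairs_cost(alignment, mismatch=1, gap=2):
    """SP cost of an alignment given as a list of equally long gapped strings."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            total += pair_cost(x, y, mismatch, gap)
    return total


alignment = ["ACG-",
             "A-GT",
             "ACGT"]
print(sum_of_pairs_cost(alignment))  # columns 2 and 4 contribute 4 each -> 8
```

An SP-optimal alignment is one that minimizes this total over all possible ways of inserting gaps; heuristics such as DCA aim to get close to that minimum without searching the whole space.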
		DIALIGN
		While standard alignment methods rely on comparing single residues and imposing gap penalties, DIALIGN [Morgenstern 1999] constructs pairwise and multiple alignments by comparing whole segments of the sequences. No gap penalty is used. This approach is especially efficient where sequences are not globally related but share only local similarities, as is the case with genomic DNA and with many protein families.
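To illustrate what "comparing whole segments" means, the sketch below lists gap-free segment pairs between two sequences: stretches that lie on the same diagonal of the comparison matrix. It is deliberately simplified and makes assumptions that differ from DIALIGN: only runs of identical residues are reported, and a minimum length replaces DIALIGN's statistical weighting of segments. The function name and the `min_len` parameter are hypothetical.

```python
def matching_segments(a, b, min_len=3):
    """Return (start_in_a, start_in_b, length) for each maximal run of
    identical residues along a diagonal, keeping runs of at least min_len."""
    segments = []
    # A diagonal is identified by offset = position_in_a - position_in_b.
    for offset in range(-(len(b) - 1), len(a)):
        i, j = max(offset, 0), max(-offset, 0)
        run = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                run += 1
            else:
                if run >= min_len:
                    segments.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_len:  # a run that reaches the end of a sequence
            segments.append((i - run, j - run, run))
    return segments


a = "GGGACGTTT"
b = "CCACGTAA"
for start_a, start_b, length in matching_segments(a, b):
    print(start_a, start_b, a[start_a:start_a + length])
```

The two example sequences are unrelated apart from the shared stretch ACGT, which is exactly the situation in which a segment-based method has an advantage: the local similarity is found without having to pay gap penalties for the unrelated flanks.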
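		MSA
		MSA implements the Carrillo-Lipman heuristic, which uses bounds derived from pairwise alignments to restrict the part of the dynamic programming lattice that has to be searched for a sum-of-pairs optimal alignment.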
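The core of the Carrillo-Lipman idea can be written down as a small calculation. Assume an unweighted sum-of-pairs cost that is to be minimized, the optimal pairwise cost for every pair of sequences, and the cost U of any multiple alignment obtained by some quick heuristic. Since every pairwise projection of a multiple alignment costs at least the pairwise optimum, the projection of an optimal multiple alignment onto a pair (i, j) can exceed that pair's optimum by at most U minus the sum of all pairwise optima. The sketch below computes these per-pair limits; it is not MSA's code, and the names and numbers are hypothetical.

```python
def carrillo_lipman_bounds(pairwise_opt, upper_bound):
    """Per-pair cost limits for the projections of an optimal multiple alignment.

    pairwise_opt: dict mapping a pair (i, j) to the optimal pairwise cost.
    upper_bound:  SP cost of any known multiple alignment of the sequences.
    """
    slack = upper_bound - sum(pairwise_opt.values())
    if slack < 0:
        raise ValueError("upper bound is below the sum of pairwise optima")
    # Only cells lying on a pairwise path within this limit need to be
    # visited in the N-dimensional lattice.
    return {pair: cost + slack for pair, cost in pairwise_opt.items()}


# Made-up costs for three sequences, purely for illustration.
optima = {(0, 1): 4, (0, 2): 6, (1, 2): 5}
print(carrillo_lipman_bounds(optima, upper_bound=18))
```

The tighter the heuristic alignment (the smaller U), the smaller the slack and the smaller the region of the lattice that has to be explored.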
		BAliBASE
		In the exercises you have used three different alignment programs to align the same set of sequences. Which one generates the "best" alignment? This question is difficult to answer, as we do not know the evolutionary history of the sequences. In fact, we are trying to reconstruct that history from the sequences we can obtain today. BAliBASE [Thompson et al. 1999] is a database of manually refined multiple sequence alignments specifically designed for the evaluation and comparison of multiple sequence alignment programs. The alignments are categorised by sequence length, similarity, and presence of insertions and N/C-terminal extensions.
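Given a trusted reference alignment, one simple way to compare programs is to ask what fraction of the residue pairs aligned in the reference is also aligned in a program's output. The sketch below computes such a pair-based score. It is an illustration of the idea only, not the scoring program distributed with BAliBASE; the function names and the toy alignments are hypothetical.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of residue pairs that share a column.

    alignment maps a sequence name to its gapped row. A residue is
    identified by (name, index in the ungapped sequence).
    """
    names = sorted(alignment)  # fixed order so pairs compare equal across alignments
    counters = {name: 0 for name in names}
    pairs = set()
    for column in zip(*(alignment[name] for name in names)):
        residues = []
        for name, symbol in zip(names, column):
            if symbol != "-":
                residues.append((name, counters[name]))
                counters[name] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def reference_pair_score(test, reference):
    """Fraction of residue pairs aligned in the reference that the test alignment reproduces."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


reference = {"a": "ACGT", "b": "A-GT"}
test = {"a": "ACGT", "b": "AG-T"}
print(round(reference_pair_score(test, reference), 3))
```

In the toy example the test alignment misplaces the G of sequence b, so it recovers two of the three reference pairs. A score of 1.0 means that every pair aligned in the reference is also aligned in the test alignment.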