Sequence Analysis with Distributed Resources

		Database Search In this section you will learn about one of the most important tasks in molecular biology: comparing a sequence obtained in the lab (nucleotide or protein) with all known sequences collected in a database. This procedure is often referred to as homology search. The search results, sequences that are similar to the query, may indicate the function of a newly sequenced gene. NCBI's non-redundant (NR) protein database contains 2.5 million sequences with almost 850 million amino acids (June 2005). At this size, the direct approach of computing a full alignment between the query and every database sequence is too slow for routine use. Instead, efficient filtering or indexing methods are used to cut down the running time. These methods are not guaranteed to find the best match, but they are nevertheless invaluable tools in a molecular biologist's daily work.
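The filtering idea can be illustrated with a minimal sketch, assuming a simple k-mer (word) index: database sequences are indexed once by their short words, and only entries sharing enough words with the query are kept as candidates for a full alignment. The function names, the toy sequences, and the parameters `k` and `min_shared` are hypothetical and chosen for illustration only; real tools use more refined seeding and scoring.

```python
from collections import defaultdict

def build_kmer_index(database, k=3):
    """Map every length-k word to the set of database entries containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def candidate_hits(query, index, k=3, min_shared=2):
    """Return entries sharing at least min_shared distinct k-mers with the query."""
    counts = defaultdict(int)
    query_words = {query[i:i + k] for i in range(len(query) - k + 1)}
    for word in query_words:
        # .get avoids creating empty entries in the defaultdict index
        for name in index.get(word, ()):
            counts[name] += 1
    return sorted(name for name, c in counts.items() if c >= min_shared)

# Toy database with made-up sequences (not real proteins)
toy_db = {
    "seqA": "MKTAYIAKQR",
    "seqB": "GGGSSSPPPL",
    "seqC": "AYIAKQWWWW",
}
index = build_kmer_index(toy_db)
hits = candidate_hits("TAYIAKQ", index)
print(hits)
```

Only the candidates returned by the filter would then be aligned in full. A sequence whose similarity to the query does not produce enough exact shared words is missed, which is why such heuristics cannot guarantee the best match.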
		BLAST Probably the best-known database search tool is BLAST (Basic Local Alignment Search Tool), developed by S. Altschul et al. and first published in 1990 [Altschul et al. 1990].
		FASTA FASTA is another commonly used sequence database search tool, written by W.R. Pearson and D.J. Lipman in 1988 [Pearson et al. 1988].
		SSEARCH SSEARCH performs a rigorous Smith-Waterman alignment between a protein sequence and another protein sequence or a protein database, or between a DNA sequence and another DNA sequence or a DNA library. Because SSEARCH computes a full alignment between the query and every database sequence, it is the most sensitive of these tools, but it is also considerably slower than the heuristic methods.
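To make the cost of a rigorous search concrete, here is a minimal sketch of Smith-Waterman local alignment scoring, applied to every entry of a toy database. It is not SSEARCH's implementation: the scoring values (`match`, `mismatch`, `gap`), the linear gap penalty, and the example sequences are assumptions for illustration, whereas real searches use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # 0 lets a local alignment start anywhere
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def exhaustive_search(query, database):
    """Score the query against every database entry, best score first."""
    scores = {name: smith_waterman_score(query, seq) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy DNA "library" with made-up sequences
toy_library = {"s1": "GGACGGG", "s2": "TTTTTTT", "s3": "ACGTACGT"}
ranking = exhaustive_search("ACGT", toy_library)
print(ranking)
```

Each comparison fills a matrix of size len(query) × len(entry), so the total work grows with the product of the query length and the database size. This is the price paid for never missing the optimal local alignment.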
		HMMER Compared to BLAST or FASTA, HMMER aims to be significantly more accurate and better able to detect remote homologs, because of the strength of its underlying probability models.
		e2g e2g [Krüger et al. 2004] is a specialized tool for comparing a genomic sequence against all ESTs of the same organism. It uses an index structure that allows the matches to be computed very efficiently.