Bielefeld University Center of Biotechnology Institute of Bioinformatics BiBiServ
Database Search
In this section you will learn about one of the most important tasks in molecular biology: comparing a sequence obtained in the lab (nucleotide or protein) with all known sequences collected in a given database. This procedure is often referred to as homology search. The search results, that is, the database sequences similar to our query, may give an indication of the function of our newly sequenced gene.
NCBI's non-redundant (NR) protein database contains 2.5 million sequences with almost 850 million amino acids (June 2005). At this size, the direct approach of rigorously aligning the query sequence with each sequence in the database is too slow for routine use. Instead, efficient filtering or indexing methods are used to cut down the running time. These methods are not guaranteed to find the best match, but they are nevertheless invaluable tools in a molecular biologist's daily life.
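The filtering idea described above can be illustrated with a small sketch. It is not the algorithm of any particular tool: it assumes a toy in-memory database (a dictionary from sequence name to sequence), and the function names `build_kmer_index` and `candidate_hits` as well as the parameters `k` and `min_shared` are hypothetical. The index maps every short word (k-mer) to the sequences containing it, so that only sequences sharing enough words with the query need to be aligned in full.

```python
from collections import defaultdict


def build_kmer_index(database, k=3):
    """Map each k-mer to the set of database sequence names containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def candidate_hits(query, index, k=3, min_shared=2):
    """Return names of sequences sharing at least min_shared distinct k-mers with the query."""
    shared = defaultdict(int)
    seen = set()
    for i in range(len(query) - k + 1):
        kmer = query[i:i + k]
        if kmer in seen:
            continue  # count each distinct query k-mer only once
        seen.add(kmer)
        for name in index.get(kmer, ()):
            shared[name] += 1
    return {name for name, count in shared.items() if count >= min_shared}


# Toy database (made-up sequences, for illustration only).
database = {
    "s1": "MKTAYIAKQR",
    "s2": "GGGGGGGGGG",
    "s3": "TAYIAK",
}
index = build_kmer_index(database)
candidates = candidate_hits("KTAYIA", index)
print(sorted(candidates))
```

Only the candidates returned by the filter would then be passed to a slower, more careful alignment step. A sequence that is similar to the query but shares no exact k-mer with it would be missed, which is why such heuristics cannot guarantee finding the best match.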
Probably the best-known database search tool is BLAST (Basic Local Alignment Search Tool), developed by S. Altschul et al. and published in 1990 [Altschul et al. 1990].
FASTA is another commonly used sequence database search tool, written by W.R. Pearson and D.J. Lipman in 1988 [Pearson et al. 1988].
SSEARCH performs a rigorous Smith-Waterman alignment between a protein sequence and another protein sequence or a protein database, or between a DNA sequence and another DNA sequence or a DNA library. Because SSEARCH computes a full alignment between the query and every database sequence, it is the most sensitive of these tools. This sensitivity comes at the cost of longer running times.
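To make the cost of a rigorous search concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. It is not SSEARCH's implementation: the function name `smith_waterman_score` is hypothetical, and the simple match/mismatch scores and linear gap penalty are assumptions chosen for brevity, whereas real tools typically use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may start anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def rank_database(query, database):
    """Score the query against every database sequence, best score first."""
    scores = {name: smith_waterman_score(query, seq) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy DNA library (made-up sequences, for illustration only).
library = {"d1": "TTACGTTT", "d2": "CCCCCC", "d3": "GGACGGG"}
ranking = rank_database("ACGT", library)
print(ranking)
```

Each comparison fills a matrix with one cell per pair of positions, so the work grows with the product of the query length and the total database length. This is why an exhaustive search of this kind is slower than the heuristic tools.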
Compared to BLAST or FASTA, HMMER aims to be significantly more accurate and better able to detect remote homologs, owing to the strength of its underlying probability models (profile hidden Markov models).
e2g [Krüger et al. 2004] is a specialized tool for comparing a genomic sequence against all ESTs of the same organism. It uses an index structure that allows the matches to be computed very efficiently.