In this section you will learn about one
of the most important tasks in molecular biology: comparing a
sequence obtained in the lab (nucleotide or protein) with all known
sequences collected in a database. This procedure is often
referred to as homology search. The search results, that is, the
database sequences similar to the query, may give an indication of
the function of the newly sequenced gene.
Sequence databases are large: in June 2005, NCBI's non-redundant (NR)
protein database contained 2.5 million sequences with almost 850
million amino acids. At this size, the direct approach of fully
aligning the query sequence with each sequence in the database is too
slow for routine use. Instead, efficient filtering or indexing methods
are used to cut down the running time. These methods are heuristic, so
they are not guaranteed to find the best match, but they are
nevertheless invaluable tools in a molecular biologist's daily life.
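The idea behind index-based filtering can be illustrated with a minimal sketch. It is not how any particular tool is implemented; the function names, the word length `k=3`, the threshold `min_shared=2`, and the toy sequences are all assumptions made for illustration. The database is indexed once by its short words (k-mers), and only sequences that share enough words with the query are kept as candidates for a full alignment.

```python
from collections import defaultdict


def build_kmer_index(database, k=3):
    """Map each k-mer to the set of database sequence IDs that contain it."""
    index = defaultdict(set)
    for seq_id, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index


def candidate_hits(query, index, k=3, min_shared=2):
    """Return IDs of sequences sharing at least min_shared distinct query k-mers."""
    counts = defaultdict(int)
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    for kmer in query_kmers:
        # .get() avoids creating empty entries in the defaultdict index.
        for seq_id in index.get(kmer, ()):
            counts[seq_id] += 1
    return {seq_id for seq_id, n in counts.items() if n >= min_shared}


# Toy database with made-up sequences, for illustration only.
toy_db = {
    "seqA": "MKTAYIAKQR",
    "seqB": "GGSPLVWNDE",
    "seqC": "AYIAKQRMKT",
}
toy_index = build_kmer_index(toy_db)
hits = candidate_hits("TAYIAKQ", toy_index)
```

Only the sequences in `hits` would then be passed on to a slower, exact alignment step. A sequence that is related to the query but shares fewer than `min_shared` words is dropped, which is why such a filter cannot guarantee the best match.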
Probably the best-known database
search tool is BLAST (Basic Local Alignment Search Tool), introduced
by S. Altschul et al. in 1990 [Altschul et al. 1990].

FASTA is another commonly used sequence
database search tool, written by W.R. Pearson and D.J. Lipman in
1988 [Pearson et al. 1988].
SSEARCH performs a rigorous Smith-Waterman
alignment between a protein sequence and another protein sequence or
a protein database, or between a DNA sequence and another DNA
sequence or a DNA library. Because SSEARCH computes a full alignment
between the query and every database sequence, it is the most
sensitive tool to use, but this thoroughness comes at the cost of a
longer running time.
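To show what a full Smith-Waterman comparison involves, here is a minimal sketch of the scoring recurrence, followed by a naive scan over a database. It is not SSEARCH's implementation: the scoring values (`match=2`, `mismatch=-1`, `gap=-2`), the linear gap penalty, and the function names are assumptions chosen for brevity, whereas real tools typically use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b, using a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def rank_database(query, database):
    """Score the query against every sequence; return (score, id) pairs, best first."""
    scored = ((smith_waterman_score(query, seq), seq_id)
              for seq_id, seq in database.items())
    return sorted(scored, reverse=True)
```

Each comparison fills a table with one cell per pair of positions, so the work grows with the product of the query length and the total database length. That is the source of both the sensitivity and the running time mentioned above.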
e2g [Krüger et al. 2004] is a specialized tool for comparing a
genomic sequence against all expressed sequence tags (ESTs) of the
same organism. It uses an index structure that allows the matches to
be computed very efficiently.