BiBiServ2 - PoSSuMsearch

In biological sequence analysis, position specific scoring matrices (PSSMs) are widely used to represent sequence motifs. We present a new non-heuristic algorithm, called ESAsearch, to efficiently find matches of such matrices in large databases. Our approach preprocesses the search space, e.g. a complete genome or a set of protein sequences, and builds an enhanced suffix array which is stored on file. The enhanced suffix array only requires 9 bytes per input symbol, and allows to search a database with a PSSM in sublinear expected time. Since ESAsearch benefits from small alphabets, we present a variant operating on sequences recoded according to a reduced alphabet. We also address the problem of non-comparable PSSM-scores by developing a method which allows to efficiently compute a matrix similarity threshold for a PSSM, given an E-value or a P-value. Our method is based on dynamic programming. In contrast to other methods it employs lazy evaluation of the dynamic programming matrix: it only evaluates those matrix entries that are necessary to derive the sought similarity threshold.

The PoSSuMsearch program, included in the PoSSuM software distribution implements the algorithms Simplesearch, Lookaheadsearch, ESAsearch and Lazydistrib. A user can search for PSSMs in enhanced suffix arrays built by mkvtree from the VMATCH package (also included in the PoSSuM software distribution), as well as on plain sequence data in Fasta, GENBANK, EMBL, or SPROT format. The search algorithm can be chosen from the command line.

PSSMs are specified in a simple plain text format, where one file may contain multiple PSSMs. The alphabet a PSSM refers to, and alphabet character to PSSM column assignments can be specified on a per-PSSM basis for most flexible alphabet support. All implemented algorithms support alphabet transformations. PSSMs can contain integer as well as floating point scores. To prevent rounding errors for integer based PSSMs, PoSSuMsearch uses integer arithmetics for these, resulting in an additional speedup on most CPU architectures. Searching on the reverse strand of nucleotide sequences is implemented by PSSM transformation according to Watson-Crick base pairing. Hence it is sufficient to build the enhanced suffix array for one strand only. This can then be used to search both strands.

The cutoff can be specified as p-value, E-value, MSS (matrix similarity score), or raw score threshold. If only the best matches with the highest scores need to be known, then PoSSuMsearch can be asked to report only the k highest scoring matches without even specifying an explicit cutoff. To do so, the search algorithms dynamically adapt the threshold during the search. When using p- or E-values, the score threshold is determined by either the lazy dynamic programming algorithm introduced in this contribution, or read from file that stores the complete precalculated probability distribution. Background distributions can be specified arbitrarily by the user, or determined from a given sequence database. We provide a tool, PoSSuMdist, to generate a compressed file containing the complete precalculated probability distributions for a set of PSSMs.

Availability: The PoSSuM software package, including the program PoSSuMsearch implementing the ESAsearch algorithm, is available free of charge for non-commercial research institutions