BiBiServ2 - REPuter

Textual Output: The result of a run can be viewed or downloaded as a space-separated table. Optionally, the output can be filtered. The head of a sample output looks like this:

 # 235 -3 8 reputer_bibitest_1091788224_479525172.xmlrpc
  9 150 F  9 151  0 5.92e-02
  8 150 F  8 152  0 2.37e-01
 10 150 F 10 153 -1 4.44e-01
  9 150 F  9 154 -1 1.60e+00
 [1] [2] [3] [4] [5] [6] [7]
 ...

The first line, starting with '#', is a comment. It gives the sequence length (235), the maximum allowed distance ([-]3), the minimum repeat size (8), and the processed file. Each of the following lines describes one repeat found. The columns, numbered [1] to [7] in the sample, are:

[1] repeat length of the first part
[2] starting position of the first part
[3] match direction
[4] repeat length of the second part
[5] starting position of the second part
[6] distance of this repeat
[7] calculated e-value of this repeat
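
The column layout above maps directly onto a small record type. The following is a minimal sketch of reading such a table in Python; the names `Repeat` and `parse_repeats` and the field names are illustrative assumptions, not part of REPuter. Lines that are comments or do not have exactly seven columns are skipped.

```python
from typing import List, NamedTuple


class Repeat(NamedTuple):
    """One data line of the table (field names are illustrative)."""
    len1: int       # [1] repeat length of the first part
    start1: int     # [2] starting position of the first part
    mode: str       # [3] match direction
    len2: int       # [4] repeat length of the second part
    start2: int     # [5] starting position of the second part
    distance: int   # [6] distance of this repeat
    evalue: float   # [7] calculated e-value of this repeat


def parse_repeats(text: str) -> List[Repeat]:
    """Parse the data lines, skipping '#' comments and non-7-column lines."""
    repeats = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7 or fields[0].startswith("#"):
            continue
        l1, p1, mode, l2, p2, dist, ev = fields
        repeats.append(
            Repeat(int(l1), int(p1), mode, int(l2), int(p2), int(dist), float(ev))
        )
    return repeats


sample = """\
# 235 -3 8 reputer_bibitest_1091788224_479525172.xmlrpc
 9 150 F  9 151  0 5.92e-02
 8 150 F  8 152  0 2.37e-01
10 150 F 10 153 -1 4.44e-01
 9 150 F  9 154 -1 1.60e+00
"""

for repeat in parse_repeats(sample):
    print(repeat)
```

Once parsed, the records can be filtered with ordinary list comprehensions, for example by keeping only those with `distance == 0`.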

Theoretical Background

This tool reports maximal forward, reverse, complemented, and reverse complemented repeats for a given input sequence. The definition of 'maximality' follows [1]: a repeat is reported only if it cannot be extended to the left or to the right. Such repeats may contain shorter repeats, which are not explicitly reported.

Let the input sequence be a text string s of length n. The characters in s are indexed from 0 to n-1, so s can be written as s = s_0 s_1 ... s_{n-1}. Each reported repeat is denoted by a triple (l, i, j): its size, the starting position of one piece of the sequence, and the starting position of its repeat counterpart. We require the size l > 0 and the starting positions i, j ∈ [0, n-1].

REPuter distinguishes four different kinds of repeats:

Maximal forward repeat, MFR
Maximal reverse repeat, MRR
Maximal complemented repeat, MCR
Maximal palindromic (reverse complemented) repeat, MPR
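
The four kinds differ in how the second part relates to the first: unchanged, reversed, complemented, or reversed and complemented. As a rough illustration, the three string transformations can be sketched in Python as follows. The function names and the translation table are assumptions made for this example, and the sketch does not show how REPuter itself reports positions for the non-forward modes.

```python
# Watson-Crick complement table for DNA (upper and lower case).
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse(seq: str) -> str:
    """Return the sequence read backwards."""
    return seq[::-1]


def complement(seq: str) -> str:
    """Return the base-wise complement, keeping the reading direction."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Return the complement read backwards (the opposite strand)."""
    return complement(seq)[::-1]


print(reverse("gacagt"))             # tgacag
print(complement("gacagt"))          # ctgtca
print(reverse_complement("gacagt"))  # actgtc
```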

The triple (l, i, j) is an MFR if:

1. i ≠ j
   (The two parts do not start at the same position.)
2. s_i s_{i+1} ... s_{i+l-1} = s_j s_{j+1} ... s_{j+l-1}
   (Both parts of the repeat have the same size and consist of the same characters.)
3. If 0 ≤ i-1, then s_{i-1} ≠ s_{j-1}
   (If the first part of the repeat does not start at the very beginning of the sequence, then the characters immediately to the left of each part are different.)
4. If j+l ≤ n-1, then s_{i+l} ≠ s_{j+l}
   (If the second part of the repeat does not end at the very end of the sequence, then the characters immediately to the right of each part are different.)
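
The four conditions can be checked mechanically. The following Python sketch tests whether a triple (l, i, j) is an MFR of a string s; the function name `is_mfr` is hypothetical, and ordering the two positions so that i < j is an assumption added here so that the boundary tests on i-1 and j+l are sufficient.

```python
def is_mfr(s: str, l: int, i: int, j: int) -> bool:
    """Check the four MFR conditions for the triple (l, i, j) on string s."""
    n = len(s)
    if i > j:
        i, j = j, i  # assumption: treat the pair of positions as unordered
    # The triple must describe two substrings that lie inside s.
    if l <= 0 or i < 0 or j + l > n:
        return False
    # Condition 1: the two parts start at different positions.
    if i == j:
        return False
    # Condition 2: both parts consist of the same characters.
    if s[i:i + l] != s[j:j + l]:
        return False
    # Condition 3: not extendable to the left.
    if i - 1 >= 0 and s[i - 1] == s[j - 1]:
        return False
    # Condition 4: not extendable to the right.
    if j + l <= n - 1 and s[i + l] == s[j + l]:
        return False
    return True


s = "gacagtcagt" * 3
print(is_mfr(s, 20, 0, 10))  # True
print(is_mfr(s, 10, 0, 10))  # False: both parts can be extended to the right
```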

[1] Gusfield, D., Algorithms on Strings, Trees, and Sequences, Cambridge University Press, 1997

REPuter Sample Run

Consider the following 30-base input sequence, which is a three-fold repetition of 'gacagtcagt':

>5.seq
gacagtcagtgacagtcagtgacagtcagt

The REPuter engine produces the following raw data output, which starts with the input sequence name. Each following line describes one repeat: its size, the starting position of the first part, one of the four possible modes (F, P, R, C), the size and starting position of the second part, and finally the distance and e-value.

The output below therefore reports two repeats, both with a first part starting at position 0. The second part of the first repeat starts at position 10, that of the second repeat at position 20.

# /tmp/5.seq.flat 30
20 0 F 20 10 0 2.30e-10
10 0 F 10 20 0 2.41e-04

Drawing the sequence in dark blue and the repeats in light blue, the result might look like this:

[Figure: result example run]

Note that, according to the maximality rules for MFRs in the Theoretical Background section (the 'left character' rule 3 and the 'right character' rule 4), we do not report a repeat like "10 0 F 10", since this shorter repeat can be extended and is therefore part of "20 0 F 10".

Additionally, to keep the starting position information visible, each part of a repeat is displayed on a separate strand:

[Figure: forward repeats of example sequence]
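
The two forward repeats of the sample run can be cross-checked with a naive scan over all pairs of starting positions. This Python sketch is only an illustration of the MFR definition, not a description of how REPuter computes its results; the function name `maximal_forward_repeats` and the choice of `min_len=8` (borrowed from the minimum repeat size in the textual output sample) are assumptions.

```python
def maximal_forward_repeats(s: str, min_len: int = 1):
    """Return all triples (l, i, j) with i < j that are maximal forward repeats."""
    n = len(s)
    found = []
    for i in range(n):
        for j in range(i + 1, n):
            # Left rule: skip pairs that could be extended to the left.
            if i > 0 and s[i - 1] == s[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            l = 0
            while j + l < n and s[i + l] == s[j + l]:
                l += 1
            if l > 0 and l >= min_len:
                found.append((l, i, j))
    # Longest repeats first, then by starting positions.
    found.sort(key=lambda t: (-t[0], t[1], t[2]))
    return found


sequence = "gacagtcagtgacagtcagtgacagtcagt"
print(maximal_forward_repeats(sequence, min_len=8))
# → [(20, 0, 10), (10, 0, 20)]
```

With a smaller `min_len`, the scan also lists the short repeats inside each 'gacagtcagt' unit, such as 'cagt' at positions 2 and 6. The quadratic loop is acceptable for a 30-base example but would be too slow for long sequences.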


Nucleic Acid Repeat Calculation

In-/Output values

INPUT :: DNA input sequence

OUTPUT :: Textual Output
