The first algorithms for the prediction of RNA secondary
structures compute the structure with maximal number of base
pairs using Dynamic Programming in O(n^3) time and O(n^2)
space. Nowadays a more detailed energy model gives much
better results, but basically the algorithms remained
One of the most used implementations is the Vienna RNA packacke
. It contains several
RNA related tools, which we will use in the examples. A
slightly other approach is the program RNAshapes
, which computes only
dissimilar structures and thus reduces the suboptimal
structure space to the most meaningful ones.
A nice tool for the visualization of many RNA secondary
structures is RNAmovies
. It renders a sequential animation from
individual structures and thus allows for a convenient
overview of many alternative structures of one RNA
usually carry more information than each sequence alone. For
RNA, compensating base mutations are a strong evidence for
putative base pairs: If a putative G-C base pair mutates to an
A-U base pair these positions are likely to pair.
There are basically four possible ways
how to use multiple sequences for structure prediction:
- First align the sequences using standard alignment
tools. Then fold the alignment with RNAalifold and look for covarying
- Predict one (or more) consensus shapes for
the sequences, using the "consensus shapes" mode of
RNAshapes. Each (abstract) consensus shape comes with a set of
(concrete) structures, a shape representative structure
(shrep) for each of the sequences. Then, for each shape,
structurally align the shreps, e.g with the RNAforester.
- Simultaneously align and fold the sequences. Clearly,
this method yields the best results, but unfortunately needs
O(n^6) time and O(n^4) space already for two
sequences. A practical implementation of this algorithm is e.g.
The most reliable methods for RNA secondary structure
prediction are statistical methods using covariance or
mutual information. Unfortunately they need a large number
of sequences and a very good alignment, which at the moment
can not be made automatically.
Families of structurally related RNA molecules are modelled with stochastic context free grammars
(SCFGs). These are a generalization of HMMs (which we use for sequence families. SCFGs reflecting the fact
that RNA structures are not sequential, but have a branching, tree-like shape.
A SCFG is a conventional context free grammar (as widely used in computer science for the definition of
programming languages), essentially a set of rules which can be applied to derive sequences. These rules
reflect the properties of RNA -- e.g. two bases that form a base pair must be derived within the same rule.
The stochastic aspect comes from probabilities associated with the rules of the grammar. They are learned
from data using the program CMbuild from the Infernal package.
For searching a data base of family models with a given sequence, the sequence is parsed using the SCGF
with the probability parameters trained from the family members. This is implemented by the program CMfind
from the Infernal package.
The Rfam data base is currently the most important resource for RNA family models. It is maintained at
the Sanger Centre, using Infernal and lots of available structural data to build the family models.