Sequence Analysis with Distributed Resources - RNA Secondary Structure

		RNA Secondary Structure Prediction
		Single Sequence RNA Structure Prediction The first algorithms for the prediction of RNA secondary structures compute the structure with maximal number of base pairs using Dynamic Programming in O(n^3) time and O(n^2) space. Nowadays a more detailed energy model gives much better results, but basically the algorithms remained the same. One of the most used implementations is the Vienna RNA packacke. It contains several RNA related tools, which we will use in the examples. A slightly other approach is the program RNAshapes, which computes only dissimilar structures and thus reduces the suboptimal structure space to the most meaningful ones. A nice tool for the visualization of many RNA secondary structures is RNAmovies. It renders a sequential animation from individual structures and thus allows for a convenient overview of many alternative structures of one RNA molecule.
		Comparative RNA Structure Prediction Multiple sequences usually carry more information than each sequence alone. For RNA, compensating base mutations are a strong evidence for putative base pairs: If a putative G-C base pair mutates to an A-U base pair these positions are likely to pair. There are basically four possible ways how to use multiple sequences for structure prediction: First align the sequences using standard alignment tools. Then fold the alignment with RNAalifold and look for covarying positions. Predict one (or more) consensus shapes for the sequences, using the "consensus shapes" mode of RNAshapes. Each (abstract) consensus shape comes with a set of (concrete) structures, a shape representative structure (shrep) for each of the sequences. Then, for each shape, structurally align the shreps, e.g with the RNAforester. Simultaneously align and fold the sequences. Clearly, this method yields the best results, but unfortunately needs O(n^6) time and O(n^4) space already for two sequences. A practical implementation of this algorithm is e.g. Foldalign . The most reliable methods for RNA secondary structure prediction are statistical methods using covariance or mutual information. Unfortunately they need a large number of sequences and a very good alignment, which at the moment can not be made automatically.
		RNA family models Families of structurally related RNA molecules are modelled with stochastic context free grammars (SCFGs). These are a generalization of HMMs (which we use for sequence families. SCFGs reflecting the fact that RNA structures are not sequential, but have a branching, tree-like shape. A SCFG is a conventional context free grammar (as widely used in computer science for the definition of programming languages), essentially a set of rules which can be applied to derive sequences. These rules reflect the properties of RNA -- e.g. two bases that form a base pair must be derived within the same rule. The stochastic aspect comes from probabilities associated with the rules of the grammar. They are learned from data using the program CMbuild from the Infernal package. For searching a data base of family models with a given sequence, the sequence is parsed using the SCGF with the probability parameters trained from the family members. This is implemented by the program CMfind from the Infernal package. The Rfam data base is currently the most important resource for RNA family models. It is maintained at the Sanger Centre, using Infernal and lots of available structural data to build the family models.