next up previous
Next: Output Up: Efficient Multiple Genome Alignment Previous: The Program


Scripts

As explained in the introduction short gaps between matches are aligned by conventional alignment methods. Three of them are built-in and ready to use but there is no need to limit oneself to these methods. Any external tool can be used for that purpose as long as the resulting alignment is presented in a format mga can parse.

The idea is to have a short script that takes care of several steps:

  1. If the alignment tool can not read in FASTA files the script must call a conversion tool like readseq [Gil01] to provide the sequences in a suitable format.

  2. The alignment tool is called with appropriate parameters and is passed on the name of the file containing the sequence data.

  3. The resulting alignment must be stored either in GDE nucleotide format or in FASTA format. If the alignment tool is not able to output one of these formats the script must convert the sequences.

  4. The order of the sequences must be unchanged from the input file: mga depends on it.

  5. The filename of the aligned sequences must have the suffix .fna replaced by .gde--even if they are in FASTA format.

  6. The script carries the responsibility of deleting any intermediate files. Otherwise these files will accumulate.

After completion of the script mga will delete the input (.fna) and output (.gde) file.

Option -msascript expects the name of the script to be run. In order for this to work two things must be ensured: First, the ``x'' permission must be set. Second, the script must be in the search path of mga; alternatively you can specify the complete path.

mga is distributed along with an example script called myclustalw.sh. Its content is listed below:

#! /bin/sh
if [ $# -ne 1 ] ; then
  echo "Usage: $0 <inputfile>"
  exit 1
fi
inputfile=$1
options="-type=DNA -dnamatrix=IUB -output=GDE -outorder=input"
clustalw ${options} -infile=${inputfile}
if [ $? -ne 0 ] ;  then
  echo "failure: clustalw ${options} -infile=${inputfile}"
  exit 1
fi
dnd=`echo ${inputfile} | sed -e 's/fna\$/dnd/'`
rm -f ${dnd}

Inside this sh-script the filename can be accessed by refering to variable $1. ClustalW is called with several options, the last one specifying the location of the sequence data. The first option forces ClustalW to expect nucleotide data to prevent its recognition heuristics from failing in some cases. The second option chooses the IUB matrix for scoring the alignments. This is actually the default in recent versions. The options starting with -out are essential: they ensure that a) the resulting alignments are stored in a format mga can parse and b) the sequences are in the assumed order.

The last two lines of the script delete an intermediate file produced by ClustalW, the guide tree. Otherwise the files would accumulate.

This script does exactly what using -clustalw was doing in mga before version 2003-03-18. Now, this option has been slightly modified by omitting -dnamatrix=IUB in the call of ClustalW. This improves compatibility with older versions of that tool which do not have the IUB matrix built in.

Sometimes gaps appearing at the start and end of an alignment should be scored in a different way from gaps in the middle of an alignment. This can be achieved by parsing the FASTA headers of the sequences. mga writes out sequence number, length, start and end position of the sequence data. The prototypical FASTA header looks as follows (variables are enclosed by `<' and `>'):

> Seq=<seqnum> length=<length>, start=<start>, end=<end>


next up previous
Next: Output Up: Efficient Multiple Genome Alignment Previous: The Program
Jan Krueger 2012-12-14