Bielefeld University Center of Biotechnology Institute of Bioinformatics BiBiServ
 
XML Data Format - Exercise 1
In this exercise we convert a file of GenBank sequences in TinySeq XML format to FASTA format using a SAX parser.
  1. Browse to the Taxonomy Database at NCBI and download all nucleotide sequences from rat-kangaroos (Potoroidae) as a file in TinySeq XML format.
  2. What does the structure of this XML file look like? Do you think it is easier to parse such an XML file than a GenBank flat file? Why?
  3. The Simple API for XML (SAX) is a public domain API developed cooperatively by the members of the XML-DEV mailing list. It provides an event-driven interface to the process of parsing an XML document.

    Download the example code of a SAX parser written in Java. The program needs two classes:

    SAXDriver.java:
    The SAXDriver class uses Java's SAXParserFactory to create a parser for the XML file. Each time the parser encounters an event, such as the start of a new XML tag, it calls the matching method of the HandlerEvents class, which defines the action to perform.

    HandlerEvents.java:
    The HandlerEvents class extends the DefaultHandler class by implementing the startElement, characters, endElement, and endDocument methods. When the startElement event is encountered for a <TSeq> tag, the sequence counter is incremented. The startElement event for the <TSeq_length> tag sets a flag that tells the characters method to parse the length of the sequence and add it to the total length. When the endDocument event is encountered, the number of sequences and the total length are printed.

    Compile and use the program to count the number of entries and the total sequence length in the XML file you just downloaded.

  4. Change the parser so that it converts the XML file to a multi-FASTA file. The header of each sequence should include the accession-version and the definition of the sequence.
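
The event handling described in step 3 and the conversion asked for in step 4 can be sketched together as follows. This is only a minimal sketch, not the course's example code: the class names FastaHandler and TinySeqToFasta are hypothetical, the 60-character line width is an arbitrary choice, and the sample record in main is made up. It assumes that the accession-version, definition, and sequence are stored in the <TSeq_accver>, <TSeq_defline>, and <TSeq_sequence> elements; check this against the file you downloaded. Instead of a flag, it collects the text of each element and evaluates it in endElement, because a SAX parser may deliver the text of one element in several characters events.

```java
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not the course's HandlerEvents class).
class FastaHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();  // text of the current element
    private final StringBuilder fasta = new StringBuilder(); // accumulated FASTA output
    private String accver = "";
    private String defline = "";
    private int sequenceCount = 0;
    private long totalLength = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for the new element
        if (qName.equals("TSeq")) {
            sequenceCount++;
            accver = "";
            defline = "";
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // One text node may arrive in several chunks, so append instead of assigning.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        switch (qName) {
            case "TSeq_accver":  accver = value; break;   // assumed element name
            case "TSeq_defline": defline = value; break;  // assumed element name
            case "TSeq_length":  totalLength += Long.parseLong(value); break;
            case "TSeq_sequence":                         // assumed element name
                fasta.append('>').append(accver).append(' ').append(defline).append('\n');
                for (int i = 0; i < value.length(); i += 60) { // wrap at 60 characters
                    fasta.append(value, i, Math.min(i + 60, value.length())).append('\n');
                }
                break;
            default: break;
        }
        text.setLength(0);
    }

    int getSequenceCount() { return sequenceCount; }
    long getTotalLength() { return totalLength; }
    String getFasta() { return fasta.toString(); }
}

// Hypothetical driver (not the course's SAXDriver class).
class TinySeqToFasta {
    static FastaHandler parse(InputStream in) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            FastaHandler handler = new FastaHandler();
            parser.parse(in, handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse TinySeq XML", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Made-up record, used only when no file name is given.
        String sample = "<TSeqSet><TSeq><TSeq_accver>XX000001.1</TSeq_accver>"
                + "<TSeq_defline>made-up record</TSeq_defline><TSeq_length>8</TSeq_length>"
                + "<TSeq_sequence>ACGTACGT</TSeq_sequence></TSeq></TSeqSet>";
        InputStream in = args.length > 0
                ? new FileInputStream(args[0])
                : new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        FastaHandler handler = parse(in);
        System.out.print(handler.getFasta());
        System.err.println(handler.getSequenceCount() + " sequences, total length "
                + handler.getTotalLength());
    }
}
```

Run with a file name as the only argument, the sketch writes the FASTA records to standard output and the counts to standard error, so the FASTA output can be redirected into a file. Because the whole output is held in memory, a very large download would be better served by writing each record as soon as its sequence element ends.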