|
In this exercise we want to convert a file in GenBank XML
format to FASTA format using a SAX parser.
- Browse to the Taxonomy Database at NCBI and download all nucleotide sequences from rat-kangaroos (Potoroidae) as a TinySeq XML file.
- What does the structure of this XML file look like? Do you think it is easier to parse such an XML file than a GenBank flat file? Why?
- The Simple API for XML (SAX) is a public domain API developed cooperatively
by the members of the XML-DEV mailing list. It provides an event-driven
interface to the process of parsing an XML document.
Download the example code of a SAX parser written in Java. The program consists of two classes:
SAXDriver.java:
The SAXDriver class uses Java's SAXParserFactory to parse the XML file. Each time the parser
encounters an XML tag, it calls the corresponding method of the HandlerEvents class.
HandlerEvents.java:
The HandlerEvents class extends the DefaultHandler class by overriding the startElement,
characters, endElement, and endDocument methods. When the startElement event is encountered
for a <TSeq> tag, the sequence counter is incremented. The startElement event for the <TSeq_length>
tag sets a flag that tells the characters method to parse the length of the sequence and add it to the total length.
When the endDocument event is encountered, the
number of sequences and the total length are printed out.
Compile and use the program to count the number of entries and the total sequence length
in the XML file you just downloaded.
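
The event flow described above can be sketched in a single file. This is a minimal illustration, not the downloadable example code: the class names TinySeqCounter and CountingHandler and the helper method count are hypothetical, and the sketch assumes the parser is not namespace-aware (the SAXParserFactory default), so tag names arrive in qName.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TinySeqCounter {

    /** Hypothetical handler: counts <TSeq> tags and sums the <TSeq_length> values. */
    static class CountingHandler extends DefaultHandler {
        int sequenceCount = 0;
        long totalLength = 0;
        private boolean inLength = false;                        // flag set by startElement
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("TSeq")) {
                sequenceCount++;
            } else if (qName.equals("TSeq_length")) {
                inLength = true;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver the text of one element in several chunks,
            // so the chunks are collected and parsed only in endElement.
            if (inLength) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("TSeq_length")) {
                totalLength += Long.parseLong(text.toString().trim());
                inLength = false;
            }
        }

        @Override
        public void endDocument() {
            System.out.println("Number of sequences: " + sequenceCount);
            System.out.println("Total length: " + totalLength);
        }
    }

    /** Parses the stream and returns the handler holding the totals. */
    static CountingHandler count(InputStream in) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        CountingHandler handler = new CountingHandler();
        parser.parse(in, handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the downloaded TinySeq XML file.
        try (InputStream in = new FileInputStream(args[0])) {
            count(in);
        }
    }
}
```

Collecting the text in a buffer and converting it in endElement is a design choice of this sketch; parsing directly inside characters, as described above, works only if the length arrives in a single chunk.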
- Change the parser so that it converts the XML file to a multi-FASTA file. The header of each sequence
should include the accession-version and the definition of the sequence.
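
One possible shape for such a converter is sketched below. It is not the reference solution: the class names TinySeqToFasta and FastaHandler, the helper method convert, and the line width of 70 are hypothetical, and the sketch assumes that the accession-version, definition, and sequence are stored in the tags <TSeq_accver>, <TSeq_defline>, and <TSeq_sequence>; check these names against your downloaded file.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TinySeqToFasta {

    /** Hypothetical handler: collects the fields of each <TSeq> and emits one FASTA record. */
    static class FastaHandler extends DefaultHandler {
        private static final int LINE_WIDTH = 70;                // assumed sequence line width
        private final StringBuilder out;
        private final StringBuilder text = new StringBuilder();
        private String accver = "";
        private String defline = "";
        private String sequence = "";

        FastaHandler(StringBuilder out) {
            this.out = out;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0);                                   // start collecting text for this tag
            if (qName.equals("TSeq")) {                          // new record: clear the old fields
                accver = "";
                defline = "";
                sequence = "";
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);                      // text may arrive in several chunks
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String value = text.toString().trim();
            switch (qName) {
                case "TSeq_accver":   accver = value;   break;   // assumed tag name
                case "TSeq_defline":  defline = value;  break;   // assumed tag name
                case "TSeq_sequence": sequence = value; break;   // assumed tag name
                case "TSeq":          writeRecord();    break;
                default:              break;
            }
            text.setLength(0);
        }

        private void writeRecord() {
            out.append('>').append(accver).append(' ').append(defline).append('\n');
            for (int i = 0; i < sequence.length(); i += LINE_WIDTH) {
                int end = Math.min(i + LINE_WIDTH, sequence.length());
                out.append(sequence, i, end).append('\n');
            }
        }
    }

    /** Parses the stream and returns the whole multi-FASTA text. */
    static String convert(InputStream in) throws Exception {
        StringBuilder out = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(in, new FastaHandler(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the TinySeq XML file; FASTA goes to standard output.
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.print(convert(in));
        }
    }
}
```

The sketch builds the whole output in memory for simplicity; for large downloads, writing each record to the output as soon as its closing </TSeq> tag is seen would keep memory use low, which is the main reason to choose SAX in the first place.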
|
|