15 December 2006

Java 1.6 Mustang, StAX and Bioinformatics

About StAX, via xml.com: Most current XML APIs fall into one of two broad classes: event-based APIs like SAX and XNI or tree-based APIs like DOM and JDOM. Most programmers find the tree-based APIs to be easier to use; but such APIs are less efficient, especially with respect to memory usage. (...) However, the common streaming APIs like SAX are all push APIs. They feed the content of the document to the application as soon as they see it, whether the application is ready to receive that data or not. SAX and XNI are fast and efficient, but the patterns they require programmers to adopt are unfamiliar and uncomfortable to many developers. (...)

StAX shares with SAX the ability to read arbitrarily large documents. However, in StAX the application is in control rather than the parser. The application tells the parser when it wants to receive the next data chunk rather than the parser telling the client when the next chunk of data is ready. Furthermore, StAX exceeds SAX by allowing programs to both read existing XML documents and create new ones. Unlike SAX, StAX is a bidirectional API.
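
To make the "application is in control" point concrete, here is a minimal sketch of a pull loop using the cursor API (XMLStreamReader). The class name PullDemo and the method countElements are made up for this illustration; only the javax.xml.stream calls come from the JDK.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of the original program.
class PullDemo {
    /** Counts the start tags with the given local name. */
    static int countElements(String xml, String name) {
        int count = 0;
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                // The application pulls the next event when it is ready;
                // nothing is pushed to a callback as in SAX.
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && r.getLocalName().equals(name)) {
                    count++;
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return count;
    }

    public static void main(String[] args) {
        // prints 2
        System.out.println(countElements("<TSeqSet><TSeq/><TSeq/></TSeqSet>", "TSeq"));
    }
}
```

Because the loop belongs to the caller, it can stop early, skip ahead, or hand the reader to another method, which is awkward to do with SAX callbacks.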

I've tested StAX to see how it could be used to read the NCBI TinySeq XML format.
For each XML file, every TSeq sequence was parsed using the StAX API (XMLEventReader). Once in memory, all the sequences were printed to stdout using an XMLStreamWriter.

[the source code is here]


(...)
XMLInputFactory factory = XMLInputFactory.newInstance();
factory.setProperty("javax.xml.stream.isNamespaceAware", Boolean.FALSE);
factory.setProperty("javax.xml.stream.isCoalescing", Boolean.TRUE);
/** create a XML Event parser */
XMLEventReader parser = factory.createXMLEventReader(in);
TSeq seq= null;

/** loop over the events */
while(parser.hasNext()) {
    XMLEvent event = parser.nextEvent();

    if(event.isStartElement())
        {
        StartElement start=((StartElement)event);
        String localName= start.getName().getLocalPart();
        if(localName.equals("TSeq"))
            {
            seq= new TSeq();
            this.TSeqSet.addElement(seq);
            }
        else if(localName.equals("TSeq_seqtype"))
            {
            seq.type= start.getAttributeByName(new QName("value")).getValue();
            }
        else if(localName.equals("TSeq_gi"))
            {
            seq.gi= Integer.parseInt(parser.getElementText());
            }
        else if(localName.equals("TSeq_accver"))
            {
            seq.accver= parser.getElement
(...)
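
Since the listing above is only an excerpt, here is a small self-contained sketch of the same reading loop that can be compiled on its own. It is not the original source: the class TinySeqReadSketch and its nested Seq holder are hypothetical, and only four of the TSeq fields are handled.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical stand-in for the original STAXTinySeq class.
class TinySeqReadSketch {
    /** Hypothetical holder for a few TSeq fields. */
    static class Seq {
        String type;
        int gi;
        String accver;
        String sequence;
    }

    /** Reads every TSeq element from the stream into memory. */
    static List<Seq> parse(Reader in) {
        List<Seq> seqs = new ArrayList<Seq>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // merge adjacent text events so each element yields one string
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLEventReader parser = factory.createXMLEventReader(in);
            Seq seq = null;
            while (parser.hasNext()) {
                XMLEvent event = parser.nextEvent();
                if (!event.isStartElement()) continue;
                StartElement start = event.asStartElement();
                String name = start.getName().getLocalPart();
                if (name.equals("TSeq")) {
                    seq = new Seq();
                    seqs.add(seq);
                } else if (seq == null) {
                    continue; // outside any TSeq, e.g. the TSeqSet root
                } else if (name.equals("TSeq_seqtype")) {
                    Attribute a = start.getAttributeByName(new QName("value"));
                    if (a != null) seq.type = a.getValue();
                } else if (name.equals("TSeq_gi")) {
                    seq.gi = Integer.parseInt(parser.getElementText().trim());
                } else if (name.equals("TSeq_accver")) {
                    seq.accver = parser.getElementText();
                } else if (name.equals("TSeq_sequence")) {
                    seq.sequence = parser.getElementText();
                }
            }
            parser.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return seqs;
    }

    public static void main(String[] args) {
        String xml = "<TSeqSet><TSeq><TSeq_seqtype value=\"nucleotide\"/>"
                + "<TSeq_gi>27592135</TSeq_gi><TSeq_accver>CB017399.1</TSeq_accver>"
                + "<TSeq_sequence>GGAAGG</TSeq_sequence></TSeq></TSeqSet>";
        for (Seq s : parse(new StringReader(xml))) {
            System.out.println(s.gi + " " + s.accver + " " + s.type + " " + s.sequence);
        }
    }
}
```

The null checks on seq and on the value attribute are additions of this sketch; they keep the loop from failing on elements that appear outside a TSeq or on a TSeq_seqtype without its attribute.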

... and to write the sequences...
 (...)
XMLOutputFactory factory= XMLOutputFactory.newInstance();
XMLStreamWriter w= factory.createXMLStreamWriter(out);
w.writeStartDocument();
w.writeStartElement("TSeqSet");

for(TSeq seq: TSeqSet)
    {
    // one <TSeq> per record; writing "TSeqSet" here nests a second <TSeqSet>, as in the output below
    w.writeStartElement("TSeq");
    w.writeEmptyElement("TSeq_seqtype");
    w.writeAttribute("value", seq.type);
    w.writeStartElement("TSeq_gi");
    w.writeCharacters(String.valueOf(seq.gi));
    w.writeEndElement();
    w.writeStartElement("TSeq_accver");
    w.writeCharacters(seq.accver);
    w.writeEndElement();
    w.writeStartElement("TSeq_sid");
    w.writeCharacters(seq.sid);
    w.writeEndElement();
    w.writeStartElement("TSeq_taxid");
    w.writeCharacters(String.valueOf(seq.taxid));
    w.writeEndElement();
    w.writeStartElement("TSeq_orgname");
    w.writeCharacters(seq.orgname);
    w.writeEndElement();
    w.writeStartElement("TSeq_defline");
    w.writeCharacters(seq.defline);
    w.writeEndElement();
    w.writeStartElement("TSeq_length");
    w.writeCharacters(String.valueOf(seq.length));
    w.writeEndElement();
    w.writeStartElement("TSeq_sequence");
    w.writeCharacters(seq.sequence);
    w.writeEndElement();
    w.writeEndElement(); // closes the record element
    }

w.writeEndElement(); // closes the root TSeqSet
w.writeEndDocument();
w.flush();
(....)
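
The start/characters/end triple repeats for every text-only field, so it can be folded into a small helper. The sketch below is one possible way to do that, not the original code: TinySeqWriteSketch, textElement and toXml are hypothetical names, and only four fields are written.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical stand-in for the writing half of the original program.
class TinySeqWriteSketch {
    /** Writes the start tag, the text and the end tag of a text-only element. */
    static void textElement(XMLStreamWriter w, String name, String text)
            throws XMLStreamException {
        w.writeStartElement(name);
        w.writeCharacters(text); // markup characters such as '<' are escaped
        w.writeEndElement();
    }

    /** Serializes one record as a TSeqSet document and returns it as a string. */
    static String toXml(String type, int gi, String accver, String sequence) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartDocument();
            w.writeStartElement("TSeqSet");
            w.writeStartElement("TSeq");
            // the attribute attaches to the empty element opened just before it
            w.writeEmptyElement("TSeq_seqtype");
            w.writeAttribute("value", type);
            textElement(w, "TSeq_gi", String.valueOf(gi));
            textElement(w, "TSeq_accver", accver);
            textElement(w, "TSeq_sequence", sequence);
            w.writeEndElement(); // TSeq
            w.writeEndElement(); // TSeqSet
            w.writeEndDocument();
            w.flush();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toXml("nucleotide", 27592135, "CB017399.1", "GGAAGG"));
    }
}
```

Writing to a StringWriter here is only for convenience; passing System.out to createXMLStreamWriter, as the post does with out, streams the document without building it in memory.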

compiling and running...

pierre@linux:> javac org/lindenb/sandbox/STAXTinySeq.java

pierre@linux:> java org/lindenb/sandbox/STAXTinySeq tinyseq.xml


<?xml version="1.0" ?><TSeqSet><TSeqSet><TSeq_seqtype value="nucleotide"/>
<TSeq_gi>27592135</TSeq_gi><TSeq_accver>CB017399.1</TSeq_accver>
<TSeq_sid>gnl|dbEST|16653996</TSeq_sid><TSeq_taxid>9031</TSeq_taxid>
<TSeq_orgname>Gallus gallus</TSeq_orgname>
<TSeq_defline>pgn1c.pk016.a18 Chicken lymphoid cDNA library (pgn1c) Gallus gallus
cDNA clone pgn1c.pk016.a18 5' similar to ref|XP_176823.1 similar to Rotavirus X
associated non-structural protein (RoXaN) [Mus musculus] ref|XP_193795.1| similar
to Rotavirus X as></TSeq_defline>
<TSeq_length>671</TSeq_length><TSeq_sequence>GGA
AGGGCTGCCCCACCATTCATCCTTTTCTCGTAGTTTGTGCACGGTGCGGGAGGTTGTCTGAGTGACTTC
ACGGGTCGCCTTTGTGCAGTACTAGATATGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCT
GGGAAGCCTGGAGAGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGAC
TATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGG
CAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTGCTGG
AGCTATCGCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGGTACCATGCACATGGCTCTGTTTGATCC
CAGAAGTGATGACTACTTAGTGGTAAAAACACATTTCCAGACACACAACTTCAGAAAATGAGTGCAAGC
TTCAAGTCTGCCCTTTGTAGCCATAATGTGCTCAGCTCTCGGTCTGCTGAACAGAGTCTACTTGGCTCA
ATTCTTGGGGGAATCCCAGATGCTTTATTAGATTGTTTGAATGTCTCACGCCCTCTGAATCAGTGCCTT



That's it.
Pierre
