Parsing Mixed-Content XML with SAX


I have a sample mixed-content XML document (structure cannot be modified):

<items>
    <item>  ABC123    <status>UPDATE</status>
    <units>
        <unit Description="Each     ">EA     <saleprice>2.99</saleprice>
            <saleprice2/>
        </unit>
    </units>
    <warehouses>
        <warehouse>100<availability>2987.000</availability>
        </warehouse>
    </warehouses>
    </item>
</items>

I am attempting to parse this document with a SAX parser, but the mixed-content elements are causing problems: I get an empty String when I try to read the text of the <item> element.

My handler:

@Override
public void startElement(final String uri, 
        final String localName, final String qName, final Attributes attributes) throws SAXException {

    final String fixedQName = qName.toLowerCase();
    switch (fixedQName) {
        case "item":
            prod = new Product();
            //prod.setItem(content); <-- doesn't work, content is empty since element just started
            break;
    }

}

@Override
public void endElement(final String uri, final String localName, final String qName) throws SAXException {
    final String fixedQName = qName.toLowerCase();
    switch (fixedQName) {
        case "item":
            prod.setItem(content); // <-- doesn't work either, only returns an empty string
            // end element, set item
            productList.add(prod);
            break;
        case "status":
            prod.setStatus(content);
            break;
        // ... etc....
    }

}

@Override
public void characters(final char[] ch, final int start, final int length) throws SAXException {
    content = "";
    content = String.copyValueOf(ch, start, length).trim();
}

This handler works correctly for everything of interest except the <item> element, for which it always returns an empty string.

If I add a println() to characters() to print the content, I can see that the parser does eventually report the text of <item>, but later than I expected (on the next characters() invocation by the parser).

Referencing http://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html, I know I should aggregate the strings passed to characters(), but I don't see how that would work, since I also need to retrieve the other elements' data, and hard-coding an exception for the first element in characters() seems like the wrong approach.

How can I use SAX to retrieve the mixed-content <item>'s text, 'ABC123'?


Solution

  • If the item's content consists only of the text before the opening tag of the status element, you can capture it in startElement when status begins:

    @Override
    public void startElement(final String uri,
            final String localName, final String qName, final Attributes attributes) throws SAXException {

        final String fixedQName = qName.toLowerCase();
        switch (fixedQName) {
            case "item":
                prod = new Product();
                break;
            case "status":
                // content still holds the text reported just before <status>, i.e. the item's leading text
                prod.setItem(content);
                break;
        }
    }
    

    To understand why, consider the flow of events (simplified to the item and status elements, with text shown after trimming):

    • startElement item
    • characters "ABC123"
    • startElement status
    • characters "UPDATE"
    • endElement status
    • characters "" (whitespace only; this overwrites content before endElement item)
    • endElement item
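
The event flow above also points to a more general variant, closer to the aggregation the question mentions: append every characters() chunk to a buffer, read the buffer when the next start or end tag arrives, then clear it. The following is only a minimal sketch under assumptions. The names MixedContentSketch, ItemHandler and parse are hypothetical, Product is a stand-in with plain fields rather than the asker's class, and it still assumes that the item's own text is whatever precedes <status>.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class MixedContentSketch {

    // Hypothetical stand-in for the Product class used in the question.
    static final class Product {
        String item;
        String status;
    }

    static final class ItemHandler extends DefaultHandler {
        final List<Product> products = new ArrayList<>();
        // Collects all text reported since the most recent start or end tag.
        private final StringBuilder text = new StringBuilder();
        private Product prod;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            switch (qName.toLowerCase()) {
                case "item":
                    prod = new Product();
                    break;
                case "status":
                    // Assumption: the text between <item> and <status> is the item's value.
                    prod.item = text.toString().trim();
                    break;
                default:
                    break;
            }
            text.setLength(0); // start a fresh buffer for the element that just opened
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            switch (qName.toLowerCase()) {
                case "status":
                    prod.status = text.toString().trim();
                    break;
                case "item":
                    products.add(prod);
                    break;
                default:
                    break;
            }
            text.setLength(0); // discard text that belonged to the element that just closed
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver one text node in several chunks, so append instead of overwriting.
            text.append(ch, start, length);
        }
    }

    static List<Product> parse(String xml) {
        ItemHandler handler = new ItemHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
        return handler.products;
    }

    public static void main(String[] args) {
        String xml = "<items>\n"
                + "    <item>  ABC123    <status>UPDATE</status>\n"
                + "    <units>\n"
                + "        <unit Description=\"Each     \">EA     <saleprice>2.99</saleprice>\n"
                + "            <saleprice2/>\n"
                + "        </unit>\n"
                + "    </units>\n"
                + "    </item>\n"
                + "</items>";
        for (Product p : parse(xml)) {
            System.out.println(p.item + " / " + p.status);
        }
    }
}
```

Clearing the buffer on every tag boundary keeps each element's leading text separate, so the same pattern could presumably be extended to the unit and warehouse elements by adding cases for saleprice and availability in startElement.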