Tags: java, xml, parsing, stax

Infrequent Java 6 StAX parser bug in eventReader when parsing CDATA asCharacters


I have code that fetches character data from a StAX parser using an XMLEventReader. The code looks like this:

private String getNextCharacters(XMLEventReader eventReader) throws XMLStreamException {
    StringBuilder characters = new StringBuilder();
    XMLEvent event = eventReader.nextEvent();

    String data = event.asCharacters().getData();
    characters.append(data);

    while (eventReader.peek() != null && eventReader.peek().isCharacters()) {
        event = eventReader.nextEvent();
        data = event.asCharacters().getData();
        characters.append(data);
    }

    return characters.toString();
}

The while loop is there because adjacent character events are occasionally not coalesced into a single Characters event, and this seems to happen whether or not the IS_COALESCING property is set. Looping seemed like a reasonable workaround, but it appears to have exposed a second bug: occasionally I see ]]> appended to my string. This is very infrequent, about once in 5000 lines of XML, but it is reproducible. While debugging I found that it happens in the second characters event when the first event is CDATA. The parser seems to lose track of the CDATA section by the second event.
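For context, coalescing is normally requested on the factory before the reader is created. The following is only a minimal sketch, assuming the JDK's bundled StAX implementation; the class name CoalescingDemo and the helper readChunks are made up for illustration and are not part of the original code. It collects each character event as a separate chunk so you can see whether adjacent text and CDATA were merged.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class CoalescingDemo {

    // Hypothetical helper: returns the data of every character event, one entry per event.
    static List<String> readChunks(String xml, boolean coalesce) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // IS_COALESCING asks the parser to merge adjacent character data (including CDATA).
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.valueOf(coalesce));

        XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
        List<String> chunks = new ArrayList<String>();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isCharacters()) {
                    chunks.add(event.asCharacters().getData());
                }
            }
        } finally {
            reader.close();
        }
        return chunks;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<a>foo<![CDATA[bar]]>baz</a>";
        // How many chunks each call yields depends on the parser implementation.
        System.out.println("coalescing on:  " + readChunks(xml, true));
        System.out.println("coalescing off: " + readChunks(xml, false));
    }
}
```

Whatever the chunk boundaries turn out to be, joining the chunks should give the same text, which is why the loop in getNextCharacters is a reasonable defensive measure even when coalescing is requested.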

So, has anyone else seen this? Does anyone have a better workaround than simply stripping ]]> off the end of my string? I didn't find anything significant online or here.


Solution

  • Instead of

    data = event.asCharacters().getData();
    

    you could write

    Characters characters = event.asCharacters();
    data = characters.getData();
    
    if (characters.isCData()) {
        /* handle CDATA */
    } else if (characters.isWhiteSpace()) {
        /* handle whitespace */
    } else if (characters.isIgnorableWhiteSpace()) {
        /* handle ignorable whitespace */
    }
    

    HTH, Max
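To make the suggestion above concrete, here is a rough sketch of how the Characters flags could be inspected inside the question's loop. It is an illustration under assumptions, not a confirmed fix for the stray ]]>: the class CharactersDemo and the helpers describe and firstText are hypothetical names, and the loop peeks once per iteration instead of twice. Recording the kind of each chunk may help pinpoint which event carries the unexpected ]]>.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.XMLEvent;

class CharactersDemo {

    // Hypothetical helper: labels a chunk using the flags on the Characters interface.
    static String describe(Characters characters) {
        if (characters.isCData()) {
            return "cdata";
        } else if (characters.isIgnorableWhiteSpace()) {
            return "ignorable-whitespace";
        } else if (characters.isWhiteSpace()) {
            return "whitespace";
        }
        return "text";
    }

    // Variant of getNextCharacters: appends consecutive character events and
    // records what kind each one was. Returns "" if the next event is not characters.
    static String getNextCharacters(XMLEventReader eventReader, List<String> kinds)
            throws XMLStreamException {
        StringBuilder text = new StringBuilder();
        XMLEvent next = eventReader.peek();
        while (next != null && next.isCharacters()) {
            Characters characters = eventReader.nextEvent().asCharacters();
            kinds.add(describe(characters));
            text.append(characters.getData());
            next = eventReader.peek();
        }
        return text.toString();
    }

    // Hypothetical driver: returns the text that directly follows the first start tag.
    static String firstText(String xml, List<String> kinds) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.nextEvent().isStartElement()) {
                    return getNextCharacters(reader, kinds);
                }
            }
            return "";
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        List<String> kinds = new ArrayList<String>();
        String text = firstText("<a><![CDATA[x < y]]> tail</a>", kinds);
        System.out.println(text + " " + kinds);
    }
}
```

If a chunk labelled as plain text turns out to end in ]]> right after a CDATA chunk, that would support the theory that the parser loses the CDATA state between events.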