
Java StAX parser fails to parse a valid xml


Hi guys,

I've spent quite some time trying to work out whether this is a bug or a gap in my own understanding. Basically, I'm trying to react to a specific element and read its contents with a Transformer using the Java StAX API.

Everything works when the XML is pretty-printed or has whitespace between elements. However, as soon as the parser sees XML with no whitespace between elements, it breaks badly.

Below is the code and its output to illustrate the problem.

There are 3 sample XMLs: the first 2 show 2 different failure scenarios, while the last one shows correct processing:

  • In the first scenario, with no spaces, it skips some elements. In the example below it skips all but one "node" element. In my real-world scenario it skips every other node instead, probably because the node content is richer.

  • In the second scenario I added a space between the node elements only. As you can see, it fails to handle the end of the document properly.

  • In the last scenario I also added a space between the last node and the closing root element. Processing went as desired.

In my real-world scenario I expect single-line XML with no separators, so I need scenario 1 to work properly. I would also like to be sure that a valid change to the XML, such as adding a space between elements, will not break the processing as it does in scenario 2.

Please help!!!

Complete code for the single-class application test.StAXTest:

package test;

import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

public class StAXTest {
    private final static String XML1 = "<root><node></node><node></node></root>";
    private final static String XML2 = "<root><node></node> <node></node></root>";
    private final static String XML3 = "<root><node></node> <node></node> </root>";

    public static void main(String[] args) throws Exception {
        processXML(XML1);
        processXML(XML2);
        processXML(XML3);
    }

    private static void processXML(String xml) {
        try {
            System.out.println("XML Input:\n" + xml + "\nProcessing:");

            XMLInputFactory xif = XMLInputFactory.newInstance();
            XMLStreamReader reader = xif.createXMLStreamReader(new StringReader(xml));
            TransformerFactory tf = TransformerFactory.newInstance();

            int nodeCount = 0;

            while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
                String localName = reader.getLocalName();
                if (localName.equals("node")) {
                    Transformer t = tf.newTransformer();
                    StringWriter st = new StringWriter();
                    t.transform(new StAXSource(reader), new StreamResult(st));
                    String xmlNode = st.toString();
                    System.out.println(nodeCount + ": " + xmlNode);
                    nodeCount++;
                }
            }
        } catch (Throwable t) {
            t.printStackTrace(System.out);
        }
        System.out.println("------------------------------------------------");
    }
}

Application output, covering all 3 scenarios. Note that in the first scenario the transformed output contains 1 node, not 2, so the second node is completely "lost in translation".

XML Input:
<root><node></node><node></node></root>
Processing:
0: <?xml version="1.0" encoding="UTF-8"?><node/>
------------------------------------------------
XML Input:
<root><node></node> <node></node></root>
Processing:
0: <?xml version="1.0" encoding="UTF-8"?><node/>
1: <?xml version="1.0" encoding="UTF-8"?><node/>
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[-1,-1]
Message: found: END_DOCUMENT, expected START_ELEMENT or END_ELEMENT
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.nextTag(XMLStreamReaderImpl.java:1247)
    at com.newedge.test.StAXTest.processXML(StAXTest.java:35)
    at com.newedge.test.StAXTest.main(StAXTest.java:21)
------------------------------------------------
XML Input:
<root><node></node> <node></node> </root>
Processing:
0: <?xml version="1.0" encoding="UTF-8"?><node/>
1: <?xml version="1.0" encoding="UTF-8"?><node/>
------------------------------------------------

Solution

  • The problem is that after the transform method returns, the XMLStreamReader is left pointing at the next XML event to process (i.e. the second <node> opening tag or the </root> closing tag). However, when you call nextTag() at the top of the while loop, you advance the reader by one further event, so the event the transformer left it on is skipped.

    In your examples where there was whitespace following the </node> closing tags, it was the whitespace character data event that was being skipped. In other cases, an XML start-element or end-element event was being skipped, and that's why you were getting unexpected results.
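To make the whitespace event visible, here is a minimal sketch that lists the events a reader reports for a given input. The class name EventTrace and its helper methods are made up for illustration and are not part of the original code; the exact event type reported for whitespace may depend on the parser, so the sketch labels any whitespace-only text as WHITESPACE.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only.
class EventTrace {

    // Returns a readable name for every event after START_DOCUMENT.
    static List<String> trace(String xml) {
        List<String> events = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                events.add(describe(reader.next(), reader));
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    // Maps the current event to a short label.
    static String describe(int event, XMLStreamReader reader) {
        switch (event) {
            case XMLStreamConstants.START_ELEMENT:
                return "START(" + reader.getLocalName() + ")";
            case XMLStreamConstants.END_ELEMENT:
                return "END(" + reader.getLocalName() + ")";
            case XMLStreamConstants.SPACE:
                return "WHITESPACE";
            case XMLStreamConstants.CHARACTERS:
                return reader.isWhiteSpace() ? "WHITESPACE" : "TEXT";
            case XMLStreamConstants.END_DOCUMENT:
                return "END_DOCUMENT";
            default:
                return "OTHER(" + event + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println(trace("<root><node></node><node></node></root>"));
        System.out.println(trace("<root><node></node> <node></node></root>"));
    }
}
```

With the JDK's bundled parser, the second input should show a WHITESPACE entry between END(node) and the following START(node). That is the event nextTag() silently steps over, and it is why adding a space changes which event gets lost.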

    After invoking the transformer, you should check whether the reader's current eventType is START_ELEMENT or END_ELEMENT. If it is, the transformer has already advanced the reader to a tag, and you should not advance it any further. If the eventType is something else, or you didn't invoke the transformer, then you should call nextTag() to advance the reader to the next tag.

    I replaced your while loop with the following:

            int eventType = reader.nextTag();
            while (eventType == XMLStreamConstants.START_ELEMENT) {
                String localName = reader.getLocalName();
                if (localName.equals("node")) {
                    Transformer t = tf.newTransformer();
                    StringWriter st = new StringWriter();
                    t.transform(new StAXSource(reader), new StreamResult(st));
                    String xmlNode = st.toString();
                    System.out.println(nodeCount + ": " + xmlNode);
                    nodeCount++;
                    eventType = reader.getEventType();
                    if (eventType != XMLStreamConstants.START_ELEMENT && eventType != XMLStreamConstants.END_ELEMENT) {
                        eventType = reader.nextTag();
                    }
                } else {
                    eventType = reader.nextTag();
                }
            }

    When I then ran your code, it gave me the following output:

    XML Input:
    <root><node></node><node></node></root>
    Processing:
    0: <?xml version="1.0" encoding="UTF-8"?><node/>
    1: <?xml version="1.0" encoding="UTF-8"?><node/>
    ------------------------------------------------
    XML Input:
    <root><node></node> <node></node></root>
    Processing:
    0: <?xml version="1.0" encoding="UTF-8"?><node/>
    1: <?xml version="1.0" encoding="UTF-8"?><node/>
    ------------------------------------------------
    XML Input:
    <root><node></node> <node></node> </root>
    Processing:
    0: <?xml version="1.0" encoding="UTF-8"?><node/>
    1: <?xml version="1.0" encoding="UTF-8"?><node/>
    ------------------------------------------------
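The skip and the guard can also be reproduced without a Transformer, which may help when checking the logic in isolation. The sketch below is an assumption-laden stand-in, not the transformer's actual code: consumeSubtree is a hypothetical helper that reads the current element and then advances one event past its closing tag, mimicking the behaviour described in this answer. CursorSkipDemo and countNodes are likewise illustrative names.

```java
import java.io.StringReader;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, for illustration only.
class CursorSkipDemo {

    // Stand-in for the transform call (an assumption): consumes the current
    // element and leaves the reader one event past its closing tag.
    static void consumeSubtree(XMLStreamReader reader) throws XMLStreamException {
        int depth = 0;
        int event = reader.getEventType();
        do {
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
            event = reader.next(); // also runs once after the closing tag
        } while (depth != 0);
    }

    // Counts the "node" elements seen by the loop. With guarded == false the
    // loop always calls nextTag(), as in the question; with guarded == true it
    // skips that call when the reader already sits on a tag.
    static int countNodes(String xml, boolean guarded) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int count = 0;
            int event = reader.nextTag();
            while (event == XMLStreamConstants.START_ELEMENT) {
                boolean advance = true;
                if (reader.getLocalName().equals("node")) {
                    consumeSubtree(reader);
                    count++;
                    event = reader.getEventType();
                    boolean onTag = event == XMLStreamConstants.START_ELEMENT
                            || event == XMLStreamConstants.END_ELEMENT;
                    advance = !(guarded && onTag);
                }
                if (advance) {
                    event = reader.nextTag();
                }
            }
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><node></node><node></node></root>";
        System.out.println("unguarded: " + countNodes(xml, false));
        System.out.println("guarded:   " + countNodes(xml, true));
    }
}
```

For the no-whitespace input, the unguarded loop counts 1 node and the guarded loop counts 2, matching the outputs shown in the question and in this answer. Passing the second sample XML with guarded set to false makes nextTag() hit END_DOCUMENT, which the sketch rethrows as an IllegalStateException.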