XMLStreamReader how to work with nested elements of same type


I'm working with the XMLStreamReader and parsing the following XML:

<root>
    <element>
        <attribute>level0</attribute>
        <element>
            <attribute>level1</attribute>
            <element>
                <attribute>level2</attribute>
            </element>
        </element>
    </element>
</root>

I'm building out my XMLStreamReader:

XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(
                new ByteArrayInputStream(document.getBytes()));

Unfortunately, when reader.next() reaches the first closing </element> tag, I get the following exception:

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[7,14]
Message: XML document structures must start and end within the same entity. 

Is there a way to override the default behavior of the XMLStreamReader to get around this?

EDIT

Here is the code I am working with:

@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context)
        throws IOException, InterruptedException {
    String document = value.toString();
    System.out.println("'" + document + "'");
    try {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(
                new ByteArrayInputStream(document.getBytes()));
        String propertyName = "";
        String propertyValue = "";
        String currentElement = "";
        while (reader.hasNext()) {
            int code = reader.next();
            switch (code) {
            case START_ELEMENT:
                currentElement = reader.getLocalName();
                break;
            case CHARACTERS:
                if (currentElement.equalsIgnoreCase("element")) {
                    propertyName += reader.getText();
                } else if (currentElement.equalsIgnoreCase("attribute")) {
                    propertyValue += reader.getText();
                }
                break;
            }
        }
        reader.close();
        context.write(new Text(propertyName.trim()), new Text(propertyValue.trim()));
    } catch (Exception e) {
        e.printStackTrace();
    }
}
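
A note on the loop above: currentElement is set on START_ELEMENT but never cleared on END_ELEMENT, so the whitespace that follows each </attribute> is still appended to propertyValue, and the values of all three nesting levels end up in one string. A minimal sketch of one way to keep the levels apart is to track the open elements on a stack and count how many <element> tags are currently open. The class name NestedElements and the method attributesByDepth are made up for illustration, and the "depth=value" output format is an assumption, not something the mapper requires.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class NestedElements {

    // Returns one "depth=value" entry per <attribute>, where depth is the
    // zero-based nesting level of the enclosing <element>.
    static List<String> attributesByDepth(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        Deque<String> open = new ArrayDeque<>();   // names of the currently open elements
        List<String> result = new ArrayList<>();
        StringBuilder text = new StringBuilder();  // text of the <attribute> being read
        int depth = 0;                             // number of open <element> tags
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    open.push(reader.getLocalName());
                    if ("element".equals(reader.getLocalName())) {
                        depth++;
                    }
                    text.setLength(0);
                    break;
                case XMLStreamConstants.CHARACTERS:
                    // Only collect text while the innermost open element is <attribute>.
                    if ("attribute".equals(open.peek())) {
                        text.append(reader.getText());
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    String closed = open.pop();
                    if ("attribute".equals(closed)) {
                        result.add((depth - 1) + "=" + text.toString().trim());
                    } else if ("element".equals(closed)) {
                        depth--;
                    }
                    break;
                default:
                    break;
                }
            }
        } finally {
            reader.close();
        }
        return result;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<root><element><attribute>level0</attribute>"
                + "<element><attribute>level1</attribute>"
                + "<element><attribute>level2</attribute></element>"
                + "</element></element></root>";
        attributesByDepth(xml).forEach(System.out::println);
    }
}
```

Popping the stack on END_ELEMENT is what stops trailing whitespace from being attributed to the element that was just closed.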

Solution

  • There is nothing wrong with the example XML document or with the StAX parser, as the following test shows:

    @Test
    public void testSO_31815379() throws XMLStreamException, UnsupportedEncodingException {
        final String xml = 
            "<root>\n" +
            "    <element>\n" +
            "        <attribute>level0</attribute>\n" +
            "        <element>\n" +
            "            <attribute>level1</attribute>\n" +
            "            <element>\n" +
            "                <attribute>level2</attribute>\n" +
            "            </element>\n" +
            "        </element>\n" +
            "    </element>\n" +
            "</root>";
    
        final XMLStreamReader reader = XMLInputFactory.newInstance()
            .createXMLStreamReader(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        LOG.info("Using XMLStreamReader implementation: %s", reader.getClass().getName());
    
        reader.require(XMLStreamConstants.START_DOCUMENT, null, null);
        int event;
        while ((event = reader.next()) != XMLStreamConstants.END_DOCUMENT) {
            LOG.info(StaxUtils.eventDescription(reader));
        }
        reader.require(XMLStreamConstants.END_DOCUMENT, null, null);
        reader.close();
    }
    

    Output (StaxUtils.eventDescription is a custom helper method):

    Using XMLStreamReader implementation: com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl
    START_ELEMENT<{}root>
    CHARACTERS=<whitespace>
    START_ELEMENT<{}element>
    CHARACTERS=<whitespace>
    START_ELEMENT<{}attribute>
    CHARACTERS='level0'
    END_ELEMENT<attribute>
    CHARACTERS=<whitespace>
    START_ELEMENT<{}element>
    CHARACTERS=<whitespace>
    START_ELEMENT<{}attribute>
    CHARACTERS='level1'
    END_ELEMENT<attribute>
    CHARACTERS=<whitespace>
    START_ELEMENT<{}element>
    CHARACTERS=<whitespace>
    START_ELEMENT<{}attribute>
    CHARACTERS='level2'
    END_ELEMENT<attribute>
    CHARACTERS=<whitespace>
    END_ELEMENT<element>
    CHARACTERS=<whitespace>
    END_ELEMENT<element>
    CHARACTERS=<whitespace>
    END_ELEMENT<element>
    CHARACTERS=<whitespace>
    END_ELEMENT<root>
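
The source of StaxUtils.eventDescription is not shown, so the sketch below uses a hypothetical stand-in that produces a similar format. It also tries one possible explanation for the exception in the question: since the complete document parses cleanly, the string that reaches the parser may not be the complete document. This is an assumption about the asker's setup, not something the question confirms. The fragment used here is cut off after the first closing </element>, and the names StaxEventTrace and trace are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxEventTrace {

    // Hypothetical stand-in for StaxUtils.eventDescription (the original is not shown).
    static String eventDescription(XMLStreamReader reader) {
        switch (reader.getEventType()) {
        case XMLStreamConstants.START_ELEMENT:
            return "START_ELEMENT<{" + reader.getName().getNamespaceURI() + "}"
                    + reader.getLocalName() + ">";
        case XMLStreamConstants.END_ELEMENT:
            return "END_ELEMENT<" + reader.getLocalName() + ">";
        case XMLStreamConstants.CHARACTERS:
            return reader.isWhiteSpace()
                    ? "CHARACTERS=<whitespace>"
                    : "CHARACTERS='" + reader.getText() + "'";
        default:
            return "EVENT(" + reader.getEventType() + ")";
        }
    }

    // Reads every event up to END_DOCUMENT and returns the descriptions in order.
    static List<String> trace(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> events = new ArrayList<>();
        try {
            while (reader.next() != XMLStreamConstants.END_DOCUMENT) {
                events.add(eventDescription(reader));
            }
        } finally {
            reader.close();
        }
        return events;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Complete document: every start tag has a matching end tag.
        trace("<element><attribute>level0</attribute></element>")
                .forEach(System.out::println);

        // Assumed truncated input: the inner <element> is closed, the outer one is not.
        String fragment = "<element><attribute>level0</attribute>"
                + "<element><attribute>level1</attribute></element>";
        try {
            trace(fragment);
            System.out.println("Fragment parsed without error");
        } catch (XMLStreamException e) {
            System.out.println("Truncated input failed: " + e.getMessage());
        }
    }
}
```

If the fragment fails in the same way as the mapper does, it is worth printing the exact string passed to createXMLStreamReader and checking how the input format splits the file into records before changing anything in the parser.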