I'm working with the XMLStreamReader and parsing the following XML:
<root>
    <element>
        <attribute>level0</attribute>
        <element>
            <attribute>level1</attribute>
            <element>
                <attribute>level2</attribute>
            </element>
        </element>
    </element>
</root>
I create my XMLStreamReader like this:
XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(
        new ByteArrayInputStream(document.getBytes()));
Unfortunately, when I get to the first closing element tag with reader.next(), I get the following exception:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[7,14]
Message: XML document structures must start and end within the same entity.
Is there a way to override the default behavior of the XMLStreamReader to get around this?
EDIT
Here is the code I am working with:
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context)
        throws IOException, InterruptedException {
    String document = value.toString();
    System.out.println("'" + document + "'");
    try {
        // Encode explicitly (java.nio.charset.StandardCharsets); a bare getBytes()
        // would use the platform default charset.
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(
                new ByteArrayInputStream(document.getBytes(StandardCharsets.UTF_8)));
        String propertyName = "";
        String propertyValue = "";
        String currentElement = "";
        while (reader.hasNext()) {
            int code = reader.next();
            // START_ELEMENT and CHARACTERS are statically imported from XMLStreamConstants.
            switch (code) {
                case START_ELEMENT:
                    // Remember which element the following text belongs to.
                    currentElement = reader.getLocalName();
                    break;
                case CHARACTERS:
                    if (currentElement.equalsIgnoreCase("element")) {
                        propertyName += reader.getText();
                    } else if (currentElement.equalsIgnoreCase("attribute")) {
                        propertyValue += reader.getText();
                    }
                    break;
            }
        }
        reader.close();
        context.write(new Text(propertyName.trim()), new Text(propertyValue.trim()));
    } catch (Exception e) {
        e.printStackTrace();
    }
}
There is nothing wrong with the example XML document or with the StAX parser, as this test shows:
@Test
public void testSO_31815379() throws XMLStreamException, UnsupportedEncodingException {
    final String xml =
        "<root>\n" +
        " <element>\n" +
        " <attribute>level0</attribute>\n" +
        " <element>\n" +
        " <attribute>level1</attribute>\n" +
        " <element>\n" +
        " <attribute>level2</attribute>\n" +
        " </element>\n" +
        " </element>\n" +
        " </element>\n" +
        "</root>";
    final XMLStreamReader reader = XMLInputFactory.newInstance()
        .createXMLStreamReader(new ByteArrayInputStream(xml.getBytes("UTF-8")));
    LOG.info("Using XMLStreamReader implementation: %s", reader.getClass().getName());
    reader.require(XMLStreamConstants.START_DOCUMENT, null, null);
    int event;
    while ((event = reader.next()) != XMLStreamConstants.END_DOCUMENT) {
        LOG.info(StaxUtils.eventDescription(reader));
    }
    reader.require(XMLStreamConstants.END_DOCUMENT, null, null);
    reader.close();
}
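As a side note, the StaxUtils.eventDescription helper called in the test is not shown here. The following is only a hypothetical stand-in, assuming it formats the current event roughly as in the test's log output; the class name StaxEventDescriber and the describeAll method are invented for illustration and are not part of the original helper.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical stand-in for StaxUtils; the original helper is not shown. */
final class StaxEventDescriber {

    private StaxEventDescriber() {}

    /** Describes the event the reader is currently positioned on. */
    static String eventDescription(XMLStreamReader reader) {
        switch (reader.getEventType()) {
            case XMLStreamConstants.START_ELEMENT: {
                // The namespace URI may be null when the element has none.
                String ns = reader.getNamespaceURI();
                return "START_ELEMENT<{" + (ns == null ? "" : ns) + "}"
                        + reader.getLocalName() + ">";
            }
            case XMLStreamConstants.END_ELEMENT:
                return "END_ELEMENT<" + reader.getLocalName() + ">";
            case XMLStreamConstants.CHARACTERS: {
                String text = reader.getText();
                return text.trim().isEmpty()
                        ? "CHARACTERS=<whitespace>"
                        : "CHARACTERS='" + text + "'";
            }
            default:
                // Other event types are reported by their numeric code only.
                return "EVENT#" + reader.getEventType();
        }
    }

    /** Collects one description per event, stopping before END_DOCUMENT. */
    static List<String> describeAll(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> events = new ArrayList<>();
        try {
            while (reader.next() != XMLStreamConstants.END_DOCUMENT) {
                events.add(eventDescription(reader));
            }
        } finally {
            reader.close();
        }
        return events;
    }
}
```

Calling StaxEventDescriber.describeAll(xml) with the document from the test returns one line per event, which can then be logged in place of the LOG.info call inside the loop.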
Output of the test (StaxUtils.eventDescription is a custom helper method):
Using XMLStreamReader implementation: com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl
START_ELEMENT<{}root>
CHARACTERS=<whitespace>
START_ELEMENT<{}element>
CHARACTERS=<whitespace>
START_ELEMENT<{}attribute>
CHARACTERS='level0'
END_ELEMENT<attribute>
CHARACTERS=<whitespace>
START_ELEMENT<{}element>
CHARACTERS=<whitespace>
START_ELEMENT<{}attribute>
CHARACTERS='level1'
END_ELEMENT<attribute>
CHARACTERS=<whitespace>
START_ELEMENT<{}element>
CHARACTERS=<whitespace>
START_ELEMENT<{}attribute>
CHARACTERS='level2'
END_ELEMENT<attribute>
CHARACTERS=<whitespace>
END_ELEMENT<element>
CHARACTERS=<whitespace>
END_ELEMENT<element>
CHARACTERS=<whitespace>
END_ELEMENT<element>
CHARACTERS=<whitespace>
END_ELEMENT<root>
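Since the complete document parses cleanly, one possible explanation is that the string reaching the parser inside map() is only a fragment of the document, for example if the input format cuts each record at the first closing tag. This is an assumption; the question does not show the job configuration. A minimal sketch for checking whether a given string is a complete document (CompleteDocumentCheck and parseError are hypothetical names):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Sketch: checks whether a string is a complete, well-formed XML document. */
final class CompleteDocumentCheck {

    private CompleteDocumentCheck() {}

    /** Returns null if the whole input parses, otherwise the parser's error message. */
    static String parseError(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            try {
                // Pull every event; a truncated document fails inside next().
                while (reader.hasNext()) {
                    reader.next();
                }
            } finally {
                reader.close();
            }
            return null;
        } catch (XMLStreamException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String full = "<root>\n<element>\n<attribute>level0</attribute>\n</element>\n</root>";
        // Hypothetical fragment: the same document cut off before its closing tags.
        String fragment = "<root>\n<element>\n<attribute>level0</attribute>\n";
        System.out.println("full:     " + parseError(full));
        System.out.println("fragment: " + parseError(fragment));
    }
}
```

If the string printed by System.out.println in map() fails this check while the full file passes, the thing to fix is how the records are split, not the XMLStreamReader.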