I'm using StAX to parse an XML file and would like to know where each tag starts and ends. For that I'm trying to use getLocation().getCharacterOffset(), but it returns incorrect values for every tag beyond the first.
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader reader = factory.createXMLEventReader(
new StringReader("<root>txt1<tag>txt2</tag></root>"));
XMLEvent e;
e = reader.nextEvent(); // START_DOCUMENT
e = reader.nextEvent(); // START_ELEMENT "root"
e = reader.nextEvent(); // CHARACTERS "txt1"
e = reader.nextEvent(); // START_ELEMENT "tag"
Printing e.getLocation() after each event gives this:
<?xml version="null" encoding='null' standalone='no'?>
Line number = 1
Column number = 1
System Id = null
Public Id = null
Location Uri= null
CharacterOffset = 0
Line number = 1
Column number = 7
System Id = null
Public Id = null
Location Uri= null
CharacterOffset = 6
Line number = 1
Column number = 12
System Id = null
Public Id = null
Location Uri= null
CharacterOffset = 11
Line number = 1
Column number = 16
System Id = null
Public Id = null
Location Uri= null
CharacterOffset = 15
After <root> the CharacterOffset is correctly 6, but then after txt1 it is 11, while I expect to see 10. What offset exactly does it return?
This is probably a bug (or quirk) of the StAX implementation bundled with the Sun/Oracle JDK. With Woodstox, the offsets for the same four events are 0, 0, 6, 10, which seems to be correct.
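To compare implementations yourself, a minimal sketch like the one below may help. It walks every event of the sample document and records the event type together with the reported character offset. The class name OffsetDump and the method dump are made up for illustration; the printed offsets depend on which StAX implementation is on the class path, so no expected output is shown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of any library.
public class OffsetDump {

    // Returns one "eventType@characterOffset" entry per event.
    // Event types are the XMLStreamConstants ints
    // (1 = START_ELEMENT, 2 = END_ELEMENT, 4 = CHARACTERS,
    //  7 = START_DOCUMENT, 8 = END_DOCUMENT).
    static List<String> dump(String xml) {
        List<String> out = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLEventReader reader =
                    factory.createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                out.add(e.getEventType() + "@"
                        + e.getLocation().getCharacterOffset());
            }
            reader.close();
        } catch (XMLStreamException ex) {
            // Rethrow unchecked to keep the sketch short.
            throw new IllegalStateException(ex);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : dump("<root>txt1<tag>txt2</tag></root>")) {
            System.out.println(line);
        }
    }
}
```

Running it once with the JDK's built-in parser and once with Woodstox on the class path should make the difference in reported offsets visible side by side.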
Download Woodstox from http://wiki.fasterxml.com/WoodstoxHome and add the JARs (woodstox-core + stax2-api) to your class path. Then XMLInputFactory.newInstance() will automatically pick the Woodstox implementation.
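If you want to confirm which implementation was actually picked up, one possible check is to print the class of the factory that XMLInputFactory.newInstance() returns. This is only a sketch; the class name WhichStax is made up for illustration. With Woodstox on the class path I would expect a class under the com.ctc.wstx package, and otherwise the JDK's internal factory, but treat the exact names as an assumption to verify on your setup.

```java
import javax.xml.stream.XMLInputFactory;

// Hypothetical helper, not part of any library.
public class WhichStax {

    // Name of the concrete XMLInputFactory class chosen at runtime.
    static String implementationName() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println(implementationName());
    }
}
```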