XMLStreamReader: get character offset : XML from file


The Location object returned by XMLStreamReader.getLocation() has a method called getCharacterOffset().

Unfortunately, the Javadoc indicates that this method is ambiguously named: it can also return a byte offset (and this appears to be true in practice). Unhelpfully, this seems to happen when reading from files, for instance.

The Javadoc states:

Return the byte or character offset into the input source this location is pointing to. If the input source is a file or a byte stream then this is the byte offset into that stream, but if the input source is a character media then the offset is the character offset. (emphasis added)

I really need the character offset, and I'm pretty sure I'm being given the byte offset instead.

The UTF-8 encoded XML is contained in a partially corrupt 1 GB file, hence the need to use a lower-level API that doesn't complain about the lack of well-formedness until it really has no choice.

Question

What does the Javadoc mean when it says '...input source is a character media...'? How can I force it to treat my input file as 'character media', so that I get an accurate character offset rather than a byte offset?

Extra blah blah:

[I'm pretty sure this is what is going on: when I split the file apart (using certain known high-level tags), I get a few characters missing or extra, in a non-accumulating way. I'm putting the difference down to a few multi-byte characters throwing off the counter. Also, when I copy parts of the file (using 'head'/'tail' in PowerShell, for instance), the tool appears to correctly recognize (or assume) UTF-8 and does a good conversion to UTF-16, as far as I can see.]
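
To make the suspected mismatch concrete, here is a minimal sketch of how byte counts and char counts diverge for UTF-8 text. The class name ByteVsCharCount, the helper utf8ByteCount and the sample string are made up for illustration. Note that a Java char is a UTF-16 code unit, so "character offset" here means an offset in those units.

```java
import java.nio.charset.StandardCharsets;

public class ByteVsCharCount {

    // Number of bytes the text occupies once encoded as UTF-8.
    static int utf8ByteCount(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Hypothetical sample: 'é' needs 2 bytes and '€' needs 3 bytes in UTF-8.
        String text = "caf\u00e9 \u20ac5";
        System.out.println("chars: " + text.length());        // chars: 7
        System.out.println("bytes: " + utf8ByteCount(text));  // bytes: 10
    }
}
```

Every non-ASCII character before a given position pushes the byte offset further ahead of the char offset, so a byte offset used as a char index would land too far into the text.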


Solution

  • The offset is in units of the underlying Source.

    The XMLStreamReader only knows how many units it has read from the Source so the offset is calculated in those units.

    An InputStream works in units of bytes, so you end up with a byte offset.

    A Reader works in units of char, so you end up with an offset in chars.

    The docs for StreamSource are more explicit about what the term "character media" means.

    Maybe try something like:

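    // Wrapping the stream in a Reader makes the Source a character source; decode it explicitly as UTF-8.
    final Source source = new StreamSource(new InputStreamReader(new FileInputStream(new File("my.xml")), StandardCharsets.UTF_8));
    final XMLStreamReader xmlReader = XMLInputFactory.newFactory().createXMLStreamReader(source);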
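
To check which unit a particular parser actually reports, the sketch below may help. It feeds the same small in-memory document to the parser twice, once through a byte stream and once through a Reader, and collects getCharacterOffset() at each start tag. The class name OffsetUnitsDemo, its helper methods and the sample document are made up for illustration. The exact offsets, and whether the two runs differ at all, depend on the StAX implementation in use, so no expected output is shown.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.stream.StreamSource;

public class OffsetUnitsDemo {

    // Hypothetical sample: 'é' and '€' take 2 and 3 bytes in UTF-8 but one char each.
    static final String XML = "<root><a>\u00e9\u20ac</a><b/></root>";

    // Collects the offset the parser reports at every START_ELEMENT event.
    static List<Integer> startElementOffsets(XMLStreamReader reader) throws XMLStreamException {
        List<Integer> offsets = new ArrayList<>();
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                offsets.add(reader.getLocation().getCharacterOffset());
            }
        }
        reader.close();
        return offsets;
    }

    // Byte-oriented source: the parser sees raw UTF-8 bytes and decodes them itself.
    static List<Integer> offsetsFromBytes() throws XMLStreamException {
        byte[] bytes = XML.getBytes(StandardCharsets.UTF_8);
        StreamSource source = new StreamSource(new ByteArrayInputStream(bytes));
        return startElementOffsets(XMLInputFactory.newFactory().createXMLStreamReader(source));
    }

    // Character-oriented source: the parser only ever sees already-decoded chars.
    static List<Integer> offsetsFromChars() throws XMLStreamException {
        StreamSource source = new StreamSource(new StringReader(XML));
        return startElementOffsets(XMLInputFactory.newFactory().createXMLStreamReader(source));
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println("byte-stream source: " + offsetsFromBytes());
        System.out.println("reader source:      " + offsetsFromChars());
    }
}
```

If both lines print the same numbers, the parser in use is probably counting decoded chars either way. If the byte-stream numbers are larger after the non-ASCII text, it is counting bytes, and the Reader-based Source is the one to use.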