java xml stax xerces

XMLStreamReader.getLocation() returns unexpected character offset


Given the following code:

import javax.xml.stream.XMLInputFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

class Scratch {
    public static void main(String[] args) throws Exception {
        var document = "<foo>bar</foo>";
        try (var is = new ByteArrayInputStream(document.getBytes(StandardCharsets.UTF_8))) {
            var reader = XMLInputFactory.newInstance().createXMLStreamReader(is);
            System.out.println(reader.getLocation());
        }
    }
}

I expect the output to be

Line number = 1
Column number = 1
System Id = null
Public Id = null
Location Uri= null
CharacterOffset = 0

but instead it's

Line number = 1
Column number = 1
System Id = null
Public Id = null
Location Uri= null
CharacterOffset = 4

I'm very curious about the CharacterOffset = 4 which makes no sense to me.

Can someone explain why it's 4?

Edit: It's always 4 regardless of the root element (<foo> and <foobar> both give the same result).

(Furthermore, I need reliable information about the element/tag positions in the XML, so that I can later slice a DataBuffer based on that information.)

Running using

  • Zulu JDK 17
  • com.sun.org.apache.xerces Implementation

Solution

  • Summary:

    I am using Adoptium's Temurin Java 21.

    As shown in "Convert the string content into an XMLStreamReader", use a StringReader. It does not suffer from the inaccurate offsets you are seeing.


    More details regarding the results you are seeing...

    1. Explicitly declare the encoding

    When creating your XMLStreamReader from a ByteArrayInputStream, you should explicitly provide the encoding you want to use.

    So, instead of this:

    XMLStreamReader reader = XMLInputFactory
            .newDefaultFactory()
            .createXMLStreamReader(is);
    

    You can use this:

    XMLStreamReader reader = XMLInputFactory
            .newDefaultFactory()
            .createXMLStreamReader(is, "UTF-8");
    

    (Here, I replaced your var with the explicit class javax.xml.stream.XMLStreamReader.)

    You can see the JavaDoc for this method here: XMLInputFactory::createXMLStreamReader.

    The method takes the encoding name as a String rather than as a Charset. I expect the class predates StandardCharsets, or otherwise does not use this preferred approach.

    You may think you have already declared the encoding when defining the byte array input stream:

    new ByteArrayInputStream(document.getBytes(StandardCharsets.UTF_8))
    

    And indeed you have chosen how the bytes are encoded, but a byte stream carries no record of that choice, so the XMLInputFactory cannot make use of it. It needs to be told explicitly, as shown above.
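
    If you would rather not hand-type the charset name, one option is to derive the string from the StandardCharsets constant. The following is only a minimal sketch: the class name ExplicitEncoding and the helper rootElementName are made up for illustration, and the XMLStreamException is wrapped in an unchecked exception purely to keep the example short.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

class ExplicitEncoding {

    // Hypothetical helper: parses the bytes as UTF-8 and returns the root element's name.
    static String rootElementName(byte[] xml) {
        try {
            XMLStreamReader reader = XMLInputFactory
                    .newDefaultFactory()
                    // Charset.name() yields the "UTF-8" string the factory expects
                    .createXMLStreamReader(new ByteArrayInputStream(xml),
                            StandardCharsets.UTF_8.name());
            try {
                // Location as reported before any event has been pulled
                System.out.println(reader.getLocation());
                reader.nextTag(); // advance to the first START_ELEMENT
                return reader.getLocalName();
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<foo>bar</foo>".getBytes(StandardCharsets.UTF_8);
        System.out.println(rootElementName(xml));
    }
}
```

    This only removes the encoding auto-detection. The character offset reported for a byte stream can still be non-zero, for the reasons described in the sections below.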


    2. Auto-detecting the encoding

    Behind the scenes, Xerces uses the following class:

    com.sun.org.apache.xerces.internal.impl.XMLEntityManager
    

    This class uses whatever explicit encoding you have defined to build an appropriate byte array handler (which knows how the bytes have been encoded).

    If you do not pass an explicit encoding (as in the code in the question), then XMLEntityManager tries to auto-detect the encoding from the start of the byte stream. It first checks to see if a BOM has been used.

    You can see that in the source code here:

    // perform auto-detect of encoding if necessary
    if (encoding == null) {
        // read first four bytes and determine encoding
        final byte[] b4 = new byte[4];
        int count = 0;
        for (; count<4; count++ ) {
            b4[count] = (byte)rewindableStream.readAndBuffer();
        }
        .... // rest of code not shown here
    

    This consumes the first 4 bytes of your XML data stream.

    The stream is reset, if there is no BOM:

    stream.reset();
    

    However, the readAndBuffer() method shown above is called 4 times in that for loop, and each call increments the offset variable (fOffset++;), as you can see in the source code here.

    So, although the stream is reset, the offset state is not.

    That is where your result comes from:

    CharacterOffset = 4
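
    To make the mechanism concrete, here is a deliberately simplified model of it. This is not the Xerces code: CountingStream and OffsetDriftDemo are hypothetical names, and the class only mimics the behaviour described above, where sniffing bytes advances a counter and rewinding the stream does not wind the counter back.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a rewindable stream with a position counter.
class CountingStream {
    private final ByteArrayInputStream in;
    private int offset; // incremented on every byte handed out

    CountingStream(byte[] data) {
        this.in = new ByteArrayInputStream(data);
    }

    int read() {
        int b = in.read();
        if (b != -1) {
            offset++;
        }
        return b;
    }

    // Rewinds the underlying bytes to the start but leaves the counter untouched.
    void rewind() {
        in.reset();
    }

    int offset() {
        return offset;
    }
}

class OffsetDriftDemo {
    public static void main(String[] args) {
        byte[] xml = "<foo>bar</foo>".getBytes(StandardCharsets.UTF_8);
        CountingStream stream = new CountingStream(xml);

        // Sniff the first four bytes, as an encoding detector would
        for (int count = 0; count < 4; count++) {
            stream.read();
        }
        stream.rewind();

        System.out.println("offset after rewind = " + stream.offset()); // prints 4
        System.out.println("next byte = " + (char) stream.read());      // prints <
    }
}
```

    The data is back at the first byte, yet the counter still says 4, which is the same shape of result as in the question.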
    

    3. What if you provide an explicit charset of "UTF-8"?

    In that case, the code uses this info here.

    The code again checks for a BOM, but this time uses a 3-element array (the UTF-8 BOM is 3 bytes long):

    final int[] b3 = new int[3];
    

    So, in this case my version of your code produces this output, for the same basic reason as mentioned above:

    CharacterOffset = 3
    

    This time, the offset is 3, not 4. As above, the stream is reset; but the offset tracker is not reset.


    4. What if you use StringReader?

    In this case, the org.xml.sax.InputSource class will use the following to manage your input source:

    public InputSource (Reader characterStream)
    {
        setCharacterStream(characterStream);
    }
    

    In fact you can use any of the Reader subclasses - including StringReader.

    In these cases, the byte checking logic (described above) is bypassed. Therefore the offset remains accurate.

    Here is an example with a buffered reader and a char array reader:

    var is = new BufferedReader(new CharArrayReader(document.toCharArray()));
    

    This gives:

    CharacterOffset = 0
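
    For completeness, a self-contained version of the Reader-based approach could look like the sketch below. The names ReaderOffsets and initialOffset are only illustrative, and the checked XMLStreamException is wrapped to keep the example compact.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

class ReaderOffsets {

    // Hypothetical helper: returns the offset reported before any event has been read.
    static int initialOffset(String xml) {
        try {
            // A Reader supplies characters directly, so no byte-level encoding detection is needed
            XMLStreamReader reader = XMLInputFactory
                    .newDefaultFactory()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                return reader.getLocation().getCharacterOffset();
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("CharacterOffset = " + initialOffset("<foo>bar</foo>"));
    }
}
```

    One caveat for the buffer-slicing use case: the Location documentation describes the offset as a byte offset for byte streams but a character offset for character sources. With a Reader the numbers therefore count characters, and they will only line up with byte positions as long as the document contains nothing but single-byte characters in the encoding you slice.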
    

    I don't know whether this rises to the level of a bug, or is simply how Xerces works and was designed to work.