
How do I get the correct starting/ending locations of an XML tag with SAX?


There is a Locator in SAX, and it keeps track of the current location. However, when I call it in my startElement(), it always returns the ending location of the XML tag.

How can I get the starting location of the tag? Is there any way to gracefully solve this problem?


Solution

  • Unfortunately, the Locator interface in the org.xml.sax package of the Java standard library is, by definition, not designed to provide more detailed information about the location in the document. To quote from the documentation of the getColumnNumber method:

    The return value from the method is intended only as an approximation for the sake of diagnostics; it is not intended to provide sufficient information to edit the character content of the original XML document. For example, when lines contain combining character sequences, wide characters, surrogate pairs, or bi-directional text, the value may not correspond to the column in a text editor's display.

    According to that specification, the SAX driver will, on a best-effort basis, give you the position "of the first character after the text associated with the document event". So the short answer to the first part of your question is: no, the Locator does not provide the start location of a tag. Also, if your documents contain multi-byte characters, e.g., Chinese or Japanese text, the position you get from the SAX driver is probably not what you want.
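
    To see this behaviour for yourself, here is a minimal sketch that records what the Locator reports inside startElement(). The class name LocatorDemo and the helper startPositions are made up for this illustration, and the sketch assumes the parser bundled with the JDK, which supplies a Locator. Exact column values depend on the SAX driver, so none are promised here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any library.
final class LocatorDemo {

    /** Parses xml and returns one "name@line:column" entry per startElement event. */
    static List<String> startPositions(String xml) throws Exception {
        List<String> positions = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            @Override
            public void setDocumentLocator(Locator locator) {
                // The parser hands over its Locator before any other event.
                this.locator = locator;
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // Assumption: the parser supplied a Locator (the JDK parser does).
                positions.add(qName + "@" + locator.getLineNumber() + ":" + locator.getColumnNumber());
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return positions;
    }

    public static void main(String[] args) throws Exception {
        // The '<' of <child> sits at line 2, column 3 of this document.
        String xml = "<root>\n  <child a=\"1\">text</child>\n</root>";
        startPositions(xml).forEach(System.out::println);
    }
}
```

    With the JDK's parser, the column printed for child should lie past the opening '<' at column 3, which matches the "position after the event" wording quoted above.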

    If you need exact positions for tags, or even more fine-grained information about attributes, attribute content, etc., you would have to implement your own location provider.

    With all the potential encoding issues, Unicode characters, etc. involved, I guess this is too big a project to post here; the implementation will also depend on your specific requirements.

    Just a quick warning from personal experience: writing a wrapper around the InputStream you pass into the SAX parser is dangerous, because you don't know when the SAX parser will report its events relative to what it has already read from the stream.

    You could start by doing some counting of your own in the characters(char[], int, int) method of your ContentHandler, checking for line breaks, tabs, etc., in addition to using the Locator information; this should give you a better picture of where in the document you actually are. By remembering the position of the last event, you could calculate the start position of the current one. Bear in mind, though, that you might not see all line breaks, as some can appear inside tags and are therefore never passed to characters; you could deduce those from the Locator information.
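
    As a rough sketch of the "remember the last event" idea, the handler below stores the Locator position after every ContentHandler event and uses the stored value as the approximate start of the next start tag. StartPositionHandler, TagSpan and parse are names invented for this example. The start it reports is only a lower bound: comments, the XML declaration and a DOCTYPE are not reported to a ContentHandler, and some parsers have already consumed the leading '<' by the time they report the preceding text, so the real tag may begin at or slightly around that point.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: pairs each start tag with the position of the previous event.
final class StartPositionHandler extends DefaultHandler {

    /** Approximate span of a start tag; "start" is where the previous event ended. */
    static final class TagSpan {
        final String name;
        final int startLine, startColumn, endLine, endColumn;

        TagSpan(String name, int startLine, int startColumn, int endLine, int endColumn) {
            this.name = name;
            this.startLine = startLine;
            this.startColumn = startColumn;
            this.endLine = endLine;
            this.endColumn = endColumn;
        }

        @Override
        public String toString() {
            return name + " " + startLine + ":" + startColumn + " .. " + endLine + ":" + endColumn;
        }
    }

    final List<TagSpan> spans = new ArrayList<>();
    private Locator locator;
    private int lastLine = 1;   // position reported at the end of the previous event
    private int lastColumn = 1;

    /** Stores the current Locator position so the next event can use it as its start. */
    private void remember() {
        if (locator != null) {
            lastLine = locator.getLineNumber();
            lastColumn = locator.getColumnNumber();
        }
    }

    @Override public void setDocumentLocator(Locator locator) { this.locator = locator; }
    @Override public void startDocument() { remember(); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        int startLine = lastLine;
        int startColumn = lastColumn;
        remember(); // now holds the position after the start tag
        spans.add(new TagSpan(qName, startLine, startColumn, lastLine, lastColumn));
    }

    @Override public void endElement(String uri, String localName, String qName) { remember(); }
    @Override public void characters(char[] ch, int start, int length) { remember(); }
    @Override public void ignorableWhitespace(char[] ch, int start, int length) { remember(); }
    @Override public void processingInstruction(String target, String data) { remember(); }

    /** Parses xml and returns the approximate span of every start tag, in document order. */
    static List<TagSpan> parse(String xml) throws Exception {
        StartPositionHandler handler = new StartPositionHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.spans;
    }

    public static void main(String[] args) throws Exception {
        parse("<root>\n  <child a=\"1\">text</child>\n</root>").forEach(System.out::println);
    }
}
```

    If you also need to account for comments or CDATA boundaries, registering an org.xml.sax.ext.LexicalHandler and calling the same remember() logic from its callbacks would be one way to tighten the estimate, though the result stays an approximation for the reasons given above.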