java xml stream sax dom4j

Reading a single XML document from a stream using dom4j


I'm trying to read one XML document at a time from a stream using dom4j, process it, then proceed to the next document on the stream. Unfortunately, dom4j's SAXReader (using JAXP under the covers) keeps reading and chokes on the following document element.

Is there a way to get the SAXReader to stop reading the stream once it finds the end of the document element? Is there a better way to accomplish this?


Solution

  • I was able to get this to work with some gymnastics using some internal JAXP classes:

    • Create a custom scanner, a subclass of XMLNSDocumentScannerImpl
      • Create a custom driver, an implementation of XMLNSDocumentScannerImpl.Driver, inside the custom scanner that returns END_DOCUMENT when it sees an XML declaration or an element. Get the ScannedEntity from fElementScanner.getCurrentEntity(). If the entity has a PushbackReader, push back the remaining unread characters in the entity buffer onto the reader.
      • In the constructor, replace the fTrailingMiscDriver with an instance of this custom driver.
    • Create a custom configuration class, a subclass of XIncludeAwareParserConfiguration, that replaces the stock DOCUMENT_SCANNER with an instance of this custom scanner in its constructor.
    • Install an instance of this custom configuration class as the "com.sun.org.apache.xerces.internal.xni.parser.XMLParserConfiguration" property so it will be instantiated when dom4j's SAXReader class tries to create a JAXP XMLReader.
    • When passing a Reader to dom4j's SAXReader.read() method, supply a PushbackReader with a buffer size considerably larger than the one-character default. A size of at least 8192 should be enough to match the default buffer size of the XMLEntityManager inside JAXP's copy of Apache Xerces2.

    This isn't the cleanest solution, as it involves subclassing internal JAXP classes, but it does work.
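
The pushback step is the part that is easiest to get wrong, so here is a minimal sketch of it in isolation, using only java.io. It is not the custom scanner itself: the class name PushbackSketch, the method readThrough, and the idea of locating the end of a document by searching for a literal closing tag are all assumptions made for illustration. What it shows is why the PushbackReader needs a buffer at least as large as the consumer's read chunk: a buffered consumer over-reads past the end of the first document and has to return everything it did not use.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

public class PushbackSketch {
    // Assumed chunk size, chosen to mirror the 8192 figure mentioned in the answer.
    static final int CHUNK = 8192;

    /**
     * Reads from 'in' until 'endTag' has been seen, then pushes every character
     * read beyond the end of that tag back onto the reader.
     * Returns the text up to and including endTag, or null if the stream ends
     * before endTag is found (anything read in that case is discarded).
     */
    static String readThrough(PushbackReader in, String endTag) throws IOException {
        StringBuilder seen = new StringBuilder();
        char[] chunk = new char[CHUNK];
        int n;
        while ((n = in.read(chunk, 0, chunk.length)) != -1) {
            // Start the search a little before the new chunk so that a tag
            // split across two chunks is still found.
            int searchFrom = Math.max(0, seen.length() - endTag.length() + 1);
            seen.append(chunk, 0, n);
            int idx = seen.indexOf(endTag, searchFrom);
            if (idx >= 0) {
                int end = idx + endTag.length();
                int extra = seen.length() - end;
                char[] rest = new char[extra];
                seen.getChars(end, seen.length(), rest, 0);
                // 'extra' is always smaller than one chunk, so it fits in a
                // pushback buffer of CHUNK characters.
                in.unread(rest, 0, extra);
                return seen.substring(0, end);
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        String stream = "<a>1</a><a>2</a>";
        // The second constructor argument is the pushback buffer size; the
        // one-argument constructor would allow only a single character.
        try (PushbackReader in = new PushbackReader(new StringReader(stream), CHUNK)) {
            System.out.println(readThrough(in, "</a>")); // prints <a>1</a>
            System.out.println(readThrough(in, "</a>")); // prints <a>2</a>
        }
    }
}
```

With the default one-character pushback buffer, the unread call above would throw an IOException as soon as more than one character had been over-read, which is the failure the larger buffer is meant to avoid.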