
org.w3c.dom.Document conversion to java.io.InputStream not read properly by SAXParser


I'm using the following code to parse an org.w3c.dom.Document with a javax.xml.parsers.SAXParser.

    try {
        // --- Prepare our SAX parser ---
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);
        SAXParser parser = factory.newSAXParser();
        // parser.parse(xmlFile, xmlValidator); /* Does not validate unsaved changes */

        // --- Create a stream from our already parsed xml document ---
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        Source xmlSource = new DOMSource(xmlDocument);
        Result outputTarget = new StreamResult(outputStream);
        TransformerFactory.newInstance().newTransformer().transform(xmlSource, outputTarget);

        // --- Validate the xmlDocument ---
        parser.parse(new ByteArrayInputStream(outputStream.toByteArray()), xmlValidator);
    } catch (ParserConfigurationException | SAXException | TransformerException | TransformerFactoryConfigurationError | IOException e) {
        e.printStackTrace();
    }

When the document is parsed I get the error message

Line 1: Document root element 'MyRootName' must match DOCTYPE root 'null'.

If I just parse the xmlFile which the xmlDocument is based on, everything works just fine.

I have ensured that the xmlDocument is initialised and valid. I've even tried passing xmlDocument.getDocumentElement() to the DOMSource; I have confirmed that it is also valid and is what I expect (i.e. the root node of the document, with the correct name).

Why isn't the javax.xml.parsers.SAXParser reading the java.io.InputStream in the same way it reads the 'xmlFile' from the file system?

Edit

Related question (I've tried all of these solutions to no avail): how to create an InputStream from a Document or Node

I have found the cause, detailed here: Parsing xml with DOM, DOCTYPE gets erased


Solution

  • The issue wasn't with the parser but with the Transformer, which leaves out the <!DOCTYPE ...> line when it serializes the Document. To solve this, set an output property on the transformer so that it writes a DOCTYPE declaration pointing at the DTD file.

        // --- Create a transformer and transform our Document into an InputStream ---
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // By default the transformer strips out the DOCTYPE tag so we must re-add our DTD file declaration
        // File.separator keeps the path portable instead of hard-coding "\\"
        transformer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, xmlFile.getParent() + File.separator + xmlDocument.getDoctype().getSystemId());
        transformer.transform(xmlSource, outputTarget);
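
A self-contained way to see the effect of the output property is sketched below. It is only an illustration, not the code from the question: the class name DoctypeSketch, its helper methods, and the DTD name "my.dtd" are made up for the example, and the document is built in memory so no DTD file has to exist. Printing both results should show whether your transformer emits the DOCTYPE line without the property.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

class DoctypeSketch {

    // Builds a tiny in-memory document whose DOCTYPE names the hypothetical "my.dtd".
    static Document sampleDocument() throws Exception {
        DOMImplementation impl = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().getDOMImplementation();
        DocumentType doctype = impl.createDocumentType("MyRootName", null, "my.dtd");
        return impl.createDocument(null, "MyRootName", doctype);
    }

    // Serializes the document; when keepDoctype is true, the DOM's system ID
    // is copied into the transformer's DOCTYPE_SYSTEM output property.
    static String serialize(Document doc, boolean keepDoctype) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        DocumentType doctype = doc.getDoctype();
        if (keepDoctype && doctype != null && doctype.getSystemId() != null) {
            transformer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, doctype.getSystemId());
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        Document doc = sampleDocument();
        System.out.println(serialize(doc, false)); // property not set
        System.out.println(serialize(doc, true));  // property set
    }
}
```

The null checks matter in practice: a Document that was parsed from XML without a DOCTYPE returns null from getDoctype(), and setOutputProperty rejects a null value.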
    

    If you pass in only the DTD file name, the parser will search for it in the directory the program was launched from. It is therefore advisable to specify the full path to the DTD file, as I have above.
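
    As an alternative to concatenating path strings, the system ID can be resolved against the XML file's location as a URI. This is a rough sketch under the assumption that the system ID is a plain relative reference such as "my.dtd"; the class DtdPathSketch, the method resolveSystemId, and the example path are hypothetical names for illustration only.

```java
import java.io.File;
import java.net.URI;

class DtdPathSketch {

    // Resolves a DTD system ID against the directory that contains the XML file.
    // A relative ID such as "my.dtd" becomes an absolute file: URI;
    // an ID that is already an absolute URI is returned unchanged.
    static String resolveSystemId(File xmlFile, String systemId) {
        URI base = xmlFile.getAbsoluteFile().toURI();
        return base.resolve(systemId).toString();
    }

    public static void main(String[] args) {
        // Hypothetical location of the XML file; the file does not need to exist.
        File xmlFile = new File("/tmp/data/doc.xml");
        System.out.println(resolveSystemId(xmlFile, "my.dtd"));
    }
}
```

    The result could then be passed as the value of OutputKeys.DOCTYPE_SYSTEM. Note that URI.resolve throws an IllegalArgumentException for strings that are not valid URI references (for example, ones containing spaces or backslashes). Another option may be the SAXParser.parse(InputStream, DefaultHandler, String systemId) overload, whose third argument supplies the base used for resolving relative URIs.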