java xml-parsing jaxb stax woodstox

Validate and parse XML using Woodstox with a local DTD


I have seen multiple questions about parsing XML with Woodstox, unmarshalling with JAXB from an XMLStreamReader, and validating against schemas, but reading through them hasn't helped. What I need is to validate an incoming XML document against a local DTD and parse its entire contents into an object representation. The incoming XML can have a DOCTYPE that includes a DTD; this needs to be skipped and a local DTD used instead. The implementation should be very fast: validation and parsing together are expected to take less than 1 ms. So far I have managed to do the parsing alone, in 5 ms, using the code below. Adding validation by setting the schema on the unmarshaller (the commented-out lines) doesn't work.

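    import java.io.StringReader;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.JAXBElement;
    import javax.xml.stream.XMLStreamReader;
    import org.codehaus.stax2.XMLInputFactory2;

    // xmlif, unmarshaller and xml are declared elsewhere (e.g. as fields).
    xmlif = XMLInputFactory2.newInstance();
    // Skip the DOCTYPE that may be present in the incoming document.
    xmlif.setProperty(XMLInputFactory2.SUPPORT_DTD, false);
    JAXBContext ucontext;
    ucontext = JAXBContext.newInstance(XMLOuterElementClass.class);
    unmarshaller = ucontext.createUnmarshaller();
    /*SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.XML_DTD_NS_URI);
    Schema schema = sf.newSchema(new File("c:/resources/schma.dtd"));
    unmarshaller.setSchema(schema);*/

    XMLStreamReader xsr = xmlif
            .createXMLStreamReader(new StringReader(xml));
    //xsr = new StreamReaderDelegate(xsr);
    long start = System.currentTimeMillis();

    try {
        // Advance to the root element before handing the reader to JAXB.
        while (xsr.hasNext()) {
            // Compare strings with equals(), not ==.
            if (xsr.isStartElement()
                    && "XMLOuterElementClass".equals(xsr.getLocalName())) {
                break;
            }
            xsr.next();
        }
        JAXBElement<XMLOuterElementClass> jb = unmarshaller.unmarshal(xsr,
                XMLOuterElementClass.class);
        long end = System.currentTimeMillis();
        System.out.println("Total time taken in ms :" + (end - start));
    } finally {
        xsr.close();
    }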

Solution

  • There are multiple ways to do it, and the best way to get a more in-depth answer is to ask on the Woodstox user list (see https://groups.google.com/g/woodstox-user).

    One thing to note is that JAXB knows nothing about Stax2 (the Woodstox/Aalto extension of the basic StAX API), so validation has to be enabled through the Stax2 API rather than through JAXB. To enable "external" validation, call:

    xmlStreamReader2.validateAgainst(schemaFromDTD);
    

    You can do this right after constructing the stream reader (you need to cast it to XMLStreamReader2, or at least to Validatable). Note that you can validate when reading or when writing; both work similarly (in the latter case you enable it on the XMLStreamWriter2).
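
    As a minimal sketch of the validateAgainst approach described above, assuming Woodstox is on the classpath: the DTD is compiled into an XMLValidationSchema through the Stax2 validation factory and attached to the reader before any content is read. The class and method names (LocalDtdValidationSketch, loadDtd, readValidated) are hypothetical, and the loop simply drains the reader; in the question's setup the validating reader would instead be passed to unmarshaller.unmarshal(...).

    ```java
    import java.io.Reader;
    import java.io.StringReader;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamException;
    import com.ctc.wstx.stax.WstxInputFactory;
    import org.codehaus.stax2.XMLInputFactory2;
    import org.codehaus.stax2.XMLStreamReader2;
    import org.codehaus.stax2.validation.XMLValidationSchema;
    import org.codehaus.stax2.validation.XMLValidationSchemaFactory;

    // Hypothetical helper class, for illustration only.
    class LocalDtdValidationSketch {

        // One factory reused for all documents; DTD support is off so that
        // any DOCTYPE in the incoming document is skipped.
        private static final XMLInputFactory2 FACTORY = new WstxInputFactory();
        static {
            FACTORY.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        }

        /** Parses the DTD once; the returned schema can be cached and reused. */
        static XMLValidationSchema loadDtd(Reader dtdSource) throws XMLStreamException {
            XMLValidationSchemaFactory schemaFactory =
                    XMLValidationSchemaFactory.newInstance(XMLValidationSchema.SCHEMA_ID_DTD);
            // createSchema also has overloads taking a File or a URL.
            return schemaFactory.createSchema(dtdSource);
        }

        /**
         * Reads the whole document with the given schema attached and returns
         * the number of start elements seen. A validation problem surfaces as
         * an XMLStreamException while reading.
         */
        static int readValidated(XMLValidationSchema dtd, String xml) throws XMLStreamException {
            XMLStreamReader2 reader =
                    (XMLStreamReader2) FACTORY.createXMLStreamReader(new StringReader(xml));
            try {
                // Attach the validator before consuming any events.
                reader.validateAgainst(dtd);
                int startElements = 0;
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        startElements++;
                    }
                }
                return startElements;
            } finally {
                reader.close();
            }
        }
    }
    ```

    Loading the schema is kept separate from reading so that the DTD is parsed only once and the resulting XMLValidationSchema is shared across documents.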

    Another possibility is to define an XMLResolver (see the XMLInputFactory.RESOLVER property). It gets called when the parser tries to read an external DTD, that is, when the DOCTYPE contains a reference to an external file. A custom XMLResolver can then redirect this read to some other source.
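
    The resolver approach can be sketched with the standard StAX API alone. The names here (LocalDtdResolverSketch, LOCAL_DTD, newFactory, readText) are hypothetical, and the DTD text is inlined only to keep the example self-contained; in practice it would come from a local file or classpath resource. Note that DTD support has to stay enabled for the resolver to be consulted, and that this only redirects the read: whether the document is then validated against the substituted DTD depends on the StAX implementation and on XMLInputFactory.IS_VALIDATING.

    ```java
    import java.io.ByteArrayInputStream;
    import java.io.StringReader;
    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLResolver;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamReader;

    // Hypothetical helper class, for illustration only.
    class LocalDtdResolverSketch {

        // Stand-in for the local DTD (an assumption for this sketch).
        static final String LOCAL_DTD = "<!ELEMENT root (#PCDATA)>";

        /**
         * Builds a factory whose resolver answers every external-entity request
         * with the local DTD and records the system ID that was requested.
         */
        static XMLInputFactory newFactory(List<String> requestedSystemIds) {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // The resolver is only consulted when DTD processing is enabled.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.TRUE);
            XMLResolver resolver = (publicId, systemId, baseUri, namespace) -> {
                requestedSystemIds.add(systemId);
                return new ByteArrayInputStream(LOCAL_DTD.getBytes(StandardCharsets.UTF_8));
            };
            factory.setProperty(XMLInputFactory.RESOLVER, resolver);
            return factory;
        }

        /** Reads the whole document and returns its concatenated character data. */
        static String readText(XMLInputFactory factory, String xml) throws XMLStreamException {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                StringBuilder text = new StringBuilder();
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.CHARACTERS) {
                        text.append(reader.getText());
                    }
                }
                return text.toString();
            } finally {
                reader.close();
            }
        }
    }
    ```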

    Note that the first approach (the one you started with) is likely more efficient, as the schema only needs to be read and parsed once, assuming you load it once and reuse it afterwards. Validation itself should be fast: if parsing takes 4 milliseconds, validation should not add more than 1 millisecond, especially if those 4 milliseconds include JAXB processing (which is technically data binding, a layer above lower-level parsing).