java xml xsd sax jaxp

Resolving which version of an XML Schema to use for XML documents with a version attribute


I have to write some code to handle reading and validating XML documents that use a version attribute in their root element to declare a version number, like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?> 
<Junk xmlns="urn:com:initech:tps" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xsi:schemaLocation="urn:com:initech.tps:schemas/foo/Junk.xsd"
    VersionAttribute="2.0">

There are a bunch of nested schemas, so my code has an org.w3c.dom.ls.LSResourceResolver to figure out which schema to use. It implements this method:

LSInput resolveResource(String type,
                        String namespaceURI,
                        String publicId,
                        String systemId,
                        String baseURI)

Previous versions of the schema embedded the schema version in the namespace, so I could use the namespaceURI and systemId to decide which schema to provide. Now the version number has been moved to an attribute on the root element, and my resolver doesn't have access to that. How am I supposed to figure out the version of the XML document in the LSResourceResolver?


Solution

  • I had never had to deal with schema versions before this and had no idea what was involved. When the version was part of the namespace, I could throw all the schemas in together and let them get sorted out. With the version in the root element and the namespace shared across versions, there is no getting around reading the version information from the XML before starting the SAX parsing.

    I'm going to do something very similar to what Pangea suggested (gets +1 from me), but I can't follow the advice exactly because the document is too big to read into memory in full, even once. By using StAX I can minimize the amount of work done to get the version from the file, since the reader can stop as soon as it has seen the root element. See this developerWorks article, "Screen XML documents efficiently with StAX":

    The screening or classification of XML documents is a common problem, especially in XML middleware. Routing XML documents to specific processors may require analysis of both the document type and the document content. The problem here is obtaining the required information from the document with the least possible overhead. Traditional parsers such as DOM or SAX are not well suited to this task. DOM, for example, parses the whole document and constructs a complete document tree in memory before it returns control to the client. Even DOM parsers that employ deferred node expansion, and thus are able to parse a document partially, have high resource demands because the document tree must be at least partially constructed in memory. This is simply not acceptable for screening purposes.

    The code to get the version information (in Groovy) will look like:

    import javax.xml.stream.XMLInputFactory
    import javax.xml.stream.XMLStreamConstants
    import javax.xml.stream.XMLStreamReader

    // Reads only as far as the root element, then stops.
    def readVersionInfo(String inputFile) {
        def map = [:]
        def inputStream = new File(inputFile).newInputStream()
        XMLStreamReader reader = null
        try {
            reader = XMLInputFactory.newInstance().createXMLStreamReader(inputStream)
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    // The first start element is the root; nothing after it is needed
                    map.rootElementName = reader.localName
                    // A null namespace means "match on local name only";
                    // the value is null if the attribute is absent
                    map.versionIdentifier = reader.getAttributeValue(null, 'VersionAttribute')
                    return map
                }
            }
            return map  // no root element found; map stays empty
        } finally {
            reader?.close()      // does not close the underlying stream
            inputStream.close()
        }
    }
    

    Then I can use the version information to figure out which resolver to use and which schema documents to set on the SAXParserFactory.
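
As a rough illustration of that last step, here is a minimal Java sketch of a resolver that is handed the version up front, plus the wiring onto a SAXParserFactory. The class name VersionedResolver, the helpers resourcePath and newParserFactory, and the classpath layout /schemas/<version>/<file name> are all assumptions made for the example, not something from the original post; only the JAXP and DOM LS calls are standard.

```java
import java.io.InputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Source;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSResourceResolver;
import org.xml.sax.SAXException;

/** Hypothetical resolver that knows the document version before parsing starts. */
final class VersionedResolver implements LSResourceResolver {
    private final String version;
    private final DOMImplementationLS ls;

    VersionedResolver(String version) {
        this.version = version;
        try {
            // Standard way to obtain a factory for LSInput objects.
            this.ls = (DOMImplementationLS)
                    DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No DOM LS implementation available", e);
        }
    }

    // Assumed layout (not from the original post): /schemas/<version>/<file name>.
    static String resourcePath(String version, String systemId) {
        String fileName = systemId.substring(systemId.lastIndexOf('/') + 1);
        return "/schemas/" + version + "/" + fileName;
    }

    @Override
    public LSInput resolveResource(String type, String namespaceURI, String publicId,
                                   String systemId, String baseURI) {
        if (systemId == null) {
            return null; // nothing to map; the parser falls back to its default lookup
        }
        InputStream in =
                VersionedResolver.class.getResourceAsStream(resourcePath(version, systemId));
        if (in == null) {
            return null; // no schema bundled for this version and file name
        }
        LSInput input = ls.createLSInput();
        input.setPublicId(publicId);
        input.setSystemId(systemId);
        input.setBaseURI(baseURI);
        input.setByteStream(in);
        return input;
    }

    /** Builds a validating factory for one version; rootSchema is the top-level XSD. */
    static SAXParserFactory newParserFactory(String version, Source rootSchema)
            throws SAXException {
        SchemaFactory schemaFactory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        schemaFactory.setResourceResolver(new VersionedResolver(version));
        Schema schema = schemaFactory.newSchema(rootSchema);

        SAXParserFactory parserFactory = SAXParserFactory.newInstance();
        parserFactory.setNamespaceAware(true);
        parserFactory.setSchema(schema);
        return parserFactory;
    }
}
```

The version string read with StAX would be passed to newParserFactory together with a Source for that version's root schema. Because the resolver is created per version, resolveResource never needs to see the root element itself.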