Search code examples
xmlxsdxsd-validationxsd-1.1xsd2code

XSD ambiguities, multiple compositors in the same complex type


XSD complex types are, well... complex

Some facts first:

  • elements can have minOccurs, maxOccurs and can be childs
  • compositors (choice/all/sequence) can have their own minOccurs/maxOccurs
  • multiple compositors can be used in a row in the same complex type
  • even more, compositors can be nested (e.g. a choice between 2 sequences)

Because of all that, I think this is possible in an xsd/xml (pseudo-xsd):

<someComplexType>
    <sequence minOccurs=0>
      <element A>
      <element B>
    </sequence>
    <sequence>
      <element A>
      <element C>
    </sequence>
</someComplexType>

Not sure it is legal, but it is possible.

Now, let's say a sax/stax parser sees an element A.

What happens next? How should I determine if it is an A from the first sequence or an A from the second sequence...before checking the next element(s)?

If the next element is B, this A should probably(!) be a match for the first sequence.

If the next element is C, the same A should probably(!) be a match for the second sequence.

("Probably" because the whole thing above can be an optional nested part of another sequence which makes me see this as a recursive issue, it might occur at any depth and to make a decision one might have to check multiple elements following, at multiple depths)

With a similar xsd, while parsing the xml it is (I think) impossible to tell if the current element belongs to a specific compositor or not, until....later, not even sure when.

Anyway, here are my two questions:

  1. Is such an xsd legal? deterministic? following the Unique Particle Attribution (UPA) rule? or some god knows what other terminology?
  2. If this is legal xsd, how can a sax/stax parser tell when the current element matches something, anything? Not the parser itself, he does not care, but the caller which has to interpret the current element. I guess it has to know also the next elements before it can make the right decision (choose the best match), which for sax/stax is not possible based only on the current elem.

ps: trying to create an xml mapping tool (don't ask why) and I don't know how to match the parsed contents with the xsd structure, under such circumstances.

Thanks a lot


Solution

  • As any XSD processor will tell you, your content model is not allowed because it is ambiguous, for exactly the reasons you describe. (It violates the UPA constraint). The content can (and must) be changed from (AB | AC) to A (B|C).

    Of course, there are similar technologies (such as regular expression processors, or RelaxNG) where ambiguous grammars are permitted: parsing them requires techniques such as look-ahead or backtracking.

    If you're planning to implement a processor that does this task, I would recommend reading up on some of the computer science. A standard textbook - very old, but still perfectly usable - is the "Dragon book" by Aho and Ullmann (can't remember the title, and I left my copy in the office). More specifically regarding XSD, there are a couple of good papers by Thompson and Tobin which can be found on Henry Thompson's web site at the University of Edinburgh.