
How to enforce the presence of a certain element in the XML file?


I want to enforce a <a-special/> element to occur at least once in my document. For such a grammar, a document like this would be valid (since <a-special/> occurs):

<my-container>
    text <a id="1" type="B"/> text text <a-special/>
    text text <a id="5" type="B"/> text <a id="24" type="B"/>
    text <a id="5" type="C"/>
</my-container>

whereas this would be considered invalid (since <a-special/> does not occur):

<my-container>
    <a id="1" type="B"/> text text
    text <a id="5" type="B"/> text <a id="24" type="B"/>
    text <a id="5" type="C"/>
</my-container>

I have tried different things with the grammar below but I can't seem to make it work the way I need it.

<!ELEMENT my-container ( #PCDATA | a | a-special | b )*>

<!ELEMENT a-special EMPTY>

<!ELEMENT a EMPTY>
    <!ATTLIST a id CDATA #REQUIRED>
    <!ATTLIST a type CDATA #REQUIRED>

<!ELEMENT b EMPTY>
    <!ATTLIST b id CDATA #REQUIRED>
    <!ATTLIST b type CDATA #REQUIRED>

I know this is wrong but I was thinking about something like this:

<!ELEMENT my-container 
              a-special+ ( #PCDATA | a | b | a-special )*                           
            | ( #PCDATA | a | b )+ a-special+ ( #PCDATA | a | b | a-special )*
            >

The first part would match anything that starts with a-special, and the second part would match anything that has an a-special somewhere in the middle or at the end. Can this be done with a DTD grammar?


Solution

  • The constraint you want to enforce cannot be stated with an XML DTD.

    If your outermost element really is just a sequence of character data and empty children, the content-model-like expression you mention would (after supplying the missing commas) capture the constraint accurately:

    ((#PCDATA | a | b)*, a-special, (#PCDATA | a | b | a-special)*)
    

    This would be legal in SGML (or so I think, but I haven't checked). But the only allowable forms for mixed content in XML DTDs are

    (#PCDATA)
    (#PCDATA | x | y | ... | z)*
    (#PCDATA)*
    

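    The constraint described would be expressible in XSD or in RELAX NG, both of which let an element with mixed content carry an ordinary content model for its child elements.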
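
    As an illustration of the XSD route, here is a minimal sketch of an XSD 1.0 schema for the flat case described in the question. It assumes no target namespace, declares every attribute as xs:string, and assumes that b carries the same id and type attributes as a; none of this comes from the original post beyond the element names.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Sketch only: no target namespace, attribute types are assumptions. -->
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

      <xs:element name="my-container">
        <!-- mixed="true" allows character data between the child elements -->
        <xs:complexType mixed="true">
          <xs:sequence>
            <!-- zero or more a/b elements before the first a-special -->
            <xs:choice minOccurs="0" maxOccurs="unbounded">
              <xs:element ref="a"/>
              <xs:element ref="b"/>
            </xs:choice>
            <!-- the mandatory a-special -->
            <xs:element ref="a-special"/>
            <!-- after that, any mix of a, b and further a-special elements -->
            <xs:choice minOccurs="0" maxOccurs="unbounded">
              <xs:element ref="a"/>
              <xs:element ref="b"/>
              <xs:element ref="a-special"/>
            </xs:choice>
          </xs:sequence>
        </xs:complexType>
      </xs:element>

      <!-- an empty complex type corresponds to EMPTY in the DTD -->
      <xs:element name="a-special">
        <xs:complexType/>
      </xs:element>

      <xs:element name="a">
        <xs:complexType>
          <xs:attribute name="id" type="xs:string" use="required"/>
          <xs:attribute name="type" type="xs:string" use="required"/>
        </xs:complexType>
      </xs:element>

      <!-- assumption: b has the same attributes as a -->
      <xs:element name="b">
        <xs:complexType>
          <xs:attribute name="id" type="xs:string" use="required"/>
          <xs:attribute name="type" type="xs:string" use="required"/>
        </xs:complexType>
      </xs:element>

    </xs:schema>
    ```

    The content model stays deterministic because an a or b seen before the first a-special can only belong to the first choice, and one seen afterwards can only belong to the second.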

    If any elements other than the document element are allowed to be non-empty, then the constraint is not expressible with content models in any schema language I know of: content models function as a sort of context-free grammar, and the requirement that there be an a-special element somewhere in the document entails a form of context-sensitivity.

    As @potame observed in a comment, Schematron could formulate the constraint; so could an assertion in XSD 1.1, attached to the declaration of the document element.
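
    A Schematron rule for this could look roughly like the sketch below. It assumes ISO Schematron and element names in no namespace; the message text is made up for illustration.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Sketch only: assumes ISO Schematron and un-namespaced elements. -->
    <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
      <sch:pattern>
        <!-- the rule fires once, on the document element -->
        <sch:rule context="/*">
          <!-- the assertion fails if no a-special exists anywhere in the document -->
          <sch:assert test="//a-special">
            The document must contain at least one a-special element.
          </sch:assert>
        </sch:rule>
      </sch:pattern>
    </sch:schema>
    ```

    Because the test looks at the whole tree rather than at one element's children, it keeps working when a-special is nested at arbitrary depth. In XSD 1.1 the equivalent would be an xs:assert with a test along the lines of .//a-special inside the complex type of the document element.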

    One possible workaround: mark the specialness of the element in a different way, e.g. by pointing at some a elements in the document:

    <!ELEMENT my-container (#PCDATA|a|b)* >
    <!ATTLIST my-container specials IDREFS #REQUIRED >
    <!ELEMENT a EMPTY >
    <!ATTLIST a id ID #IMPLIED>
    <!ELEMENT b EMPTY>
    

    Since my-container/@specials is required, and an IDREFS value must contain at least one ID reference, it must name at least one element in the document. Since the only element type for which IDs are defined is a, the elements named by specials are guaranteed to be a elements.
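
    For illustration, a document that should be valid against these declarations might look like the following sketch. The ID values a1 and a2 are invented; note that ID values must be XML names, so purely numeric ids such as "1" from the question would need a prefix.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE my-container [
      <!ELEMENT my-container (#PCDATA|a|b)* >
      <!ATTLIST my-container specials IDREFS #REQUIRED >
      <!ELEMENT a EMPTY >
      <!ATTLIST a id ID #IMPLIED>
      <!ELEMENT b EMPTY>
    ]>
    <!-- specials points at the a element that plays the role of a-special -->
    <my-container specials="a2">
        text <a id="a1"/> text text <a id="a2"/>
        text <b/> text
    </my-container>
    ```

    A validating parser should reject the document if specials is missing, is empty, or names an ID that does not occur in the document.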