
Placement of #PCDATA in DTD mixed content


It is actually possible to specify that an element can contain both PCDATA and other elements. Such a content model is called mixed. To specify a mixed-content model, just list #PCDATA along with the child elements you want to allow:

<?xml version = "1.0" standalone="yes"?>
<!DOCTYPE DOCUMENT [
<!ELEMENT DOCUMENT (CUSTOMER)*>
<!ELEMENT CUSTOMER (NAME,DATE,ORDERS)>
<!ELEMENT NAME (LAST_NAME,FIRST_NAME)>
<!ELEMENT LAST_NAME (#PCDATA)>
<!ELEMENT FIRST_NAME (#PCDATA)>
<!ELEMENT DATE (#PCDATA)>
<!ELEMENT ORDERS (ITEM)*>
<!ELEMENT ITEM (PRODUCT, NUMBER, PRICE)>
<!ELEMENT PRODUCT (#PCDATA | PRODUCT_ID)*>
<!ELEMENT NUMBER (#PCDATA)>
<!ELEMENT PRICE (#PCDATA)>
<!ELEMENT PRODUCT_ID (#PCDATA)>
]>
<DOCUMENT>
    <CUSTOMER>
        <NAME>
            <LAST_NAME>Weber</LAST_NAME>
            <FIRST_NAME>Bill</FIRST_NAME>
        </NAME>
        <DATE>October 25, 2003</DATE>
        <ORDERS>
            <ITEM>
                <PRODUCT>Asparagus</PRODUCT>
                <NUMBER>12</NUMBER>
                <PRICE>$2.95</PRICE>
            </ITEM>
            <ITEM>
                <PRODUCT>Lettuce</PRODUCT>
                <NUMBER>6</NUMBER>
                <PRICE>$11.50</PRICE>
            </ITEM>
        </ORDERS>
    </CUSTOMER>
</DOCUMENT>

When checking the file with several validators (.NET XML parser, MSXML SAX, MSXML DOM, and Java's built-in parser), I noticed that validation passes if #PCDATA is first in the list. If another member comes before #PCDATA, the validators report errors.

Why must #PCDATA necessarily come first in a mixed-content declaration?
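
The observation can be reproduced outside the validators listed above. The following is a minimal sketch in Python, assuming its bundled expat parser, which is not a validating parser but does read the internal DTD subset and so should report the misordered declaration as a syntax error. The helper name dtd_parses, the cut-down document, and the PRODUCT_ID value A-1 are made up for illustration.

```python
import xml.parsers.expat

# Cut-down document (hypothetical); only the PRODUCT content model varies.
TEMPLATE = """<?xml version="1.0" standalone="yes"?>
<!DOCTYPE PRODUCT [
<!ELEMENT PRODUCT {model}>
<!ELEMENT PRODUCT_ID (#PCDATA)>
]>
<PRODUCT>Asparagus <PRODUCT_ID>A-1</PRODUCT_ID></PRODUCT>
"""


def dtd_parses(model: str) -> bool:
    """Return True if expat accepts the document with this content model."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(TEMPLATE.format(model=model), True)
        return True
    except xml.parsers.expat.ExpatError:
        # expat rejects a declaration that does not match the DTD grammar.
        return False


print(dtd_parses("(#PCDATA | PRODUCT_ID)*"))   # #PCDATA first
print(dtd_parses("(PRODUCT_ID | #PCDATA)*"))   # #PCDATA after a name
```

With a standard CPython build, the first call should report that the document parses and the second that it does not. Note that this only checks the syntax of the declaration; expat does not verify that the document content conforms to it.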


Solution

  • Yes, what you are specifying here is called mixed content, as defined in the W3C XML specification, §3.2.2, Mixed-content Declaration:

    [51] Mixed ::= '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*'
                   | '(' S? '#PCDATA' S? ')'

    And indeed the constraints for that are:

    1. #PCDATA must appear first;
    2. you can then list element names, separated by |, and each name may appear only once;
    3. finally, the only occurrence indicator allowed is *, and it is required whenever element names are listed.

    So, basically, #PCDATA must come first because the specification requires it.
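
Production [51] is small enough to transcribe into a regular expression, which makes the ordering rule easy to see. This is a rough sketch in Python, not a full DTD parser: the NAME pattern is simplified to ASCII (the real Name production allows many more characters), is_mixed is a hypothetical helper name, and the second alternative, '(' S? '#PCDATA' S? ')', is the text-only form the specification lists alongside the one quoted above. The check for duplicate names (constraint 2) is left out.

```python
import re

# Simplified, ASCII-only stand-in for the XML Name production (assumption).
NAME = r"[A-Za-z_:][A-Za-z0-9._:-]*"
# "S?" in the grammar: optional white space (space, tab, CR, LF).
S_OPT = r"[ \t\r\n]*"

# Direct transcription of production [51], both alternatives.
MIXED = re.compile(
    rf"\({S_OPT}#PCDATA(?:{S_OPT}\|{S_OPT}{NAME})*{S_OPT}\)\*"
    rf"|\({S_OPT}#PCDATA{S_OPT}\)"
)


def is_mixed(model: str) -> bool:
    """Return True if the whole string matches the Mixed production."""
    return MIXED.fullmatch(model) is not None


for model in (
    "(#PCDATA | PRODUCT_ID)*",   # #PCDATA first, closing ')*'
    "(PRODUCT_ID | #PCDATA)*",   # #PCDATA not first
    "(#PCDATA | PRODUCT_ID)+",   # wrong occurrence indicator
    "(#PCDATA)",                 # text-only form
):
    print(model, is_mixed(model))
```

Because the pattern has no branch in which a name precedes #PCDATA, the second model cannot match, which mirrors why the parsers reject it.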