XML validation of #PCDATA


I have this simple XML:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE input[
<!ELEMENT input (#PCDATA)>
<!ELEMENT file (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT type (#PCDATA)>
]>
<input>
This is the content <file><name>test.png</name><type>Image</type></file>
</input>

I expect this to be valid but some online validators report that it is invalid because the input and file elements contain non-text nodes.

If I remove the file element within the input element then the resulting XML is reported to be valid, so I assume the "non-text nodes" are the child elements (file in input, and name and type in file).

I expect this to be valid because the XML specification for an element specifies that an element is valid if it matches one of a set of conditions, one of which is:

The declaration matches Mixed, and the content (after replacing any entity references with their replacement text) consists of character data (including CDATA sections), comments, PIs and child elements whose types match names in the content model.

Note the "and child elements..." towards the end of that.

And the production for mixed is:

    Mixed      ::=      '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*'  
            | '(' S? '#PCDATA' S? ')' 

The second case is what I have for input and file: (#PCDATA)

The validity requirement for mixed content is that there can be child elements as long as their names match names in the content model, which they do.

Am I misunderstanding the specification or are these validators incorrect?

If I remove the declarations of the file, name and type elements from the DTD but leave the child elements in the content of the input element, then I get additional validation errors indicating no declaration of those types. I expect these errors because the validation requirement is that the child element names match names in the content model and, with those declarations removed, they don't match names in the content model.

But there are other validators that report the XML is valid even without the declarations of the file, name and type elements in the DTD. This too seems to be a fault of the validators as the validation requirement clearly says that the child element names must match names in the content model, which they don't, when those element declarations are removed.

I know there are various XML validation implementations and they do not all work the same so they cannot all be strictly correct. I am most interested in having a strictly correct understanding of the specification.

In strict conformance to the validity requirements of an element with content (#PCDATA):

  1. Can the content of that element include child elements?
  2. If so, must the names of those elements match names of elements in the DTD?

The specification appears to require only that the names of child elements match names of elements in the DTD. It seems reasonable to me that the content and attributes of such elements should also match their declarations in the DTD, but the specification doesn't actually say this. So, again, in strict conformance with the validity requirements of the specification, must the content and attributes of a child element of an element with content (#PCDATA) match their declarations in the DTD? If so, where in the specification does it say so?

Finally, can you recommend an easy-to-use XML validator (online or installable on Linux) that is strictly correct according to the specification?


Solution

  • Your element declaration,

    <!ELEMENT input (#PCDATA)>
    

    technically qualifies as allowing mixed content, but does not allow any elements to be mixed in.

    The section you cite says that mixed content may contain character data, optionally interspersed with child elements. This is supported by the production in that section. The part marked with ^^^ below is what allows elements to be mixed in, and only those elements listed by Name:

    Mixed ::= '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*'  
                               ^^^^^^^^^^^^^^^^^       
            | '(' S? '#PCDATA' S? ')' 
    

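    However, your declaration uses the second alternative, which names no elements, so no child elements are allowed. The same goes for file, which is also declared (#PCDATA) yet contains name and type. If you wish elements such as file to be allowed to be mixed in, instead declare input like this: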

    <!ELEMENT input (#PCDATA|file)*>
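
    Putting this together, a version of the document from the question that should validate might look like the sketch below. It assumes file is meant to hold exactly one name followed by one type, so file is given the element content model (name, type). That model is an assumption about intent and is not stated in the question.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE input [
    <!-- Mixed content: text, optionally interspersed with file elements -->
    <!ELEMENT input (#PCDATA | file)*>
    <!-- Element content (assumed): exactly one name, then one type, no text -->
    <!ELEMENT file (name, type)>
    <!-- Text only: no child elements permitted -->
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT type (#PCDATA)>
    ]>
    <input>
    This is the content <file><name>test.png</name><type>Image</type></file>
    </input>
    ```

    With these declarations, text may still appear directly inside input, but not inside file, where only whitespace may separate name and type.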
    

    Update to address follow-up comments

    Any & and < characters that appear in parsed character data are parsed, that is, interpreted as markup. The rules of well-formedness apply, and during validation the parsed markup must follow the grammar rules given by the schema (here, the DTD). An element with only #PCDATA in its content model does not implicitly allow interspersed elements that aren't mentioned in the content model.
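
    As an illustration of what an element declared (#PCDATA) may still contain, the sketch below keeps the original declaration of input and uses only character data, escaped markup, a CDATA section, a comment and a processing instruction. The PI target app is a made-up name chosen for the example.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE input [
    <!ELEMENT input (#PCDATA)>
    ]>
    <input>
    Plain text, an escaped &lt;file&gt; tag,
    <![CDATA[a literal <file> tag inside a CDATA section,]]>
    <!-- a comment -->
    <?app a processing instruction?>
    but no child elements.
    </input>
    ```

    Here the escaped and CDATA-wrapped tags are character data rather than markup, so they do not count as child elements of input.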

    Colloquially, mixed content typically implies the presence of interspersed elements; technically, mixed content may have zero or more elements [1]. Either way, the document is not valid if elements are interspersed with parsed data but not specified in the content model.


    [1] Again, note the spec says optionally interspersed. Here is the full definition:

    3.2.2 Mixed Content

    [Definition: An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements.]