CDATA inside PCDATA handling in XML


If we have the following XML element:

<x>a &lt; b</x>

and another one:

<y>a<![CDATA[ < ]]>b</y>

Do both elements x and y have the value of a < b? Is the second example valid, common, recommended or something like that?

AFAIK y has three child nodes (PCDATA a, CDATA < and PCDATA b), and some libraries parse it exactly like that. On the other hand, https://pugixml.org/ for one returns only a as the value for y (helper function).


Solution

  • There is a fundamental difference between the two:

    CDATA means Character Data, while PCDATA means Parsed Character Data, which already hints at why parsers may behave differently, depending on their conformance level.

    CDATA sections are strict and pure escapes of anything between the <![CDATA[ and ]]> delimiters. Nothing written in between is supposed to be interpreted as markup by the XML processor. A conforming XML parser only scans for the closing ]]> and passes the content through, unchanged, to whatever application has requested the XML (which is then free to process it by itself). This is why we can place almost any character data in here (anything except the sequence ]]>) without the XML becoming ill-formed.
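As a small illustration of that pass-through behavior, here is a sketch using Python's standard xml.etree.ElementTree. Neither Python nor the sample element appears in the question; both are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the CDATA section contains text that looks like markup.
doc = "<y><![CDATA[<b>1 < 2 &amp; 3</b>]]></y>"
elem = ET.fromstring(doc)

# Nothing inside the CDATA section is interpreted:
# no child element <b> is created and &amp; is not resolved.
cdata_text = elem.text
child_count = len(elem)

print(repr(cdata_text))
print(child_count)
```

The element ends up with plain text only, so the application sees the characters exactly as they were written between the delimiters.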

    &lt; is an entity reference, more specifically a reference to one of the five predefined entities (&lt;, &gt;, &amp;, &apos; and &quot;). Entity references are placeholders that get substituted by content, so they are part of PCDATA (Parsed Character Data): the XML parser parses the reference, resolves it, and substitutes its replacement text, here the character <.
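To make the comparison from the question concrete, the following sketch parses both elements with Python's standard xml.etree.ElementTree (an assumption for illustration; the question does not name this library). This particular API merges CDATA and ordinary text into one string, so both elements should report the same value.

```python
import xml.etree.ElementTree as ET

# The two elements from the question.
x = ET.fromstring("<x>a &lt; b</x>")
y = ET.fromstring("<y>a<![CDATA[ < ]]>b</y>")

# In x the parser resolves the entity reference;
# in y it passes the CDATA content through unchanged.
x_text = x.text
y_text = y.text

print(repr(x_text), repr(y_text), x_text == y_text)
```

Whether an API exposes one merged string or several nodes is a design choice of the library, not a difference in the information the document carries.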

    As for the value of the data, we may need to know more about the application that requests the XML. Within the domain of XML processing tools (XSD, XSLT, XPath, XQuery, etc.), it should come out the same in both cases: as a text() node, an xs:string or an xs:untypedAtomic value, depending on which function you use to access it. For example:

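    (: Query 1: atomizing the element yields xs:untypedAtomic :)
    let $t := <xml>Text <![CDATA[test]]> bla.</xml>
    return $t/data() instance of xs:untypedAtomic
    
    (: Query 2: string() yields xs:string :)
    let $t := <xml>Text <![CDATA[test]]> bla.</xml>
    return $t/string() instance of xs:string
    
    (: Query 3: the child axis yields a text node :)
    let $t := <xml>Text <![CDATA[test]]> bla.</xml>
    return $t/text() instance of text()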
    

    all result in true.

    For any application that is not working with the XML Data Model, however, the result should simply be the text that was between the element tags.
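A DOM-style API shows where the "three child nodes" view from the question comes from. The sketch below uses Python's standard xml.dom.minidom, chosen only for illustration; the helper names child_kinds and text_value are hypothetical. It assumes the default minidom builder, which keeps CDATA sections as separate nodes.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def child_kinds(xml_text):
    """Return (nodeType, data) for each child node of the root element."""
    root = parseString(xml_text).documentElement
    return [(child.nodeType, child.data) for child in root.childNodes]


def text_value(xml_text):
    """Concatenate all text and CDATA children of the root element."""
    root = parseString(xml_text).documentElement
    return "".join(
        child.data
        for child in root.childNodes
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    )


x_doc = "<x>a &lt; b</x>"
y_doc = "<y>a<![CDATA[ < ]]>b</y>"

# y is stored as text, CDATA section, text; x as a single text node.
y_kinds = child_kinds(y_doc)
x_kinds = child_kinds(x_doc)

# Reading only the first child of y gives just the leading text.
first_only = parseString(y_doc).documentElement.firstChild.data

print(y_kinds, x_kinds, repr(first_only))
print(repr(text_value(x_doc)), repr(text_value(y_doc)))
```

A helper that reads only the first text-like child would return just a for y, which would be consistent with the pugixml result reported in the question. Concatenating all text and CDATA children gives the same string for both elements.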

    There is an interesting note here, and a whole thread concerning this and related topics.