
Use DTD to Define an Element as CDATA?


In short, is it possible to use a DTD to define an element as containing CDATA?

I'm calling a third-party API that produces some invalid characters inside an element. Specifically, the data contains HTML entities such as &rsquo;, which XML does not predefine. When I attempt to parse this XML using SimpleXML, I of course get the parser error "Entity 'rsquo' not defined". Here's a simplified example of the structure I'm dealing with:

<items>
    <item>
        <name>Jim Smith</name>
        <description>Jim&rsquo;s description breaks my parser</description>
    </item>
</items>

Since I have no control over the API response, I've resorted to a dirty trick: injecting a CDATA section into the problem element just before I try to parse it:

$xml = str_replace("<description>", "<description><![CDATA[", $xml);
$xml = str_replace("</description>", "]]></description>", $xml);

This fixes the issue for me, but the overhead is probably too big, don't you think? The XML can be anywhere from 30K to 100K of data.

I'd rather use a DTD, but for the life of me I can't find anything in the specs that allows declaring an element's content as CDATA (in the same way I can declare #PCDATA). Below is what I'd like to do, but of course it's invalid because of the '#CDATA' declaration:

<!DOCTYPE items [
    <!ELEMENT items (item)>
    <!ELEMENT item (name, description)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT description (#CDATA)>
]>

Thanks for any insights!


Solution

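  • It is possible in SGML DTDs (for example, the HTML 4.01 DTD declares the content of the script element as CDATA), but not in XML DTDs, where CDATA is only available as an attribute type. That is why XHTML 1.0 had to declare script as #PCDATA instead.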
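
Because an XML DTD cannot declare an element's content as CDATA, the input has to be repaired before it reaches the parser. One alternative to wrapping the element in a CDATA section is to convert the undeclared HTML entities into literal characters first. The following is only a minimal sketch, assuming the response is UTF-8 and PHP 7 or later; replaceHtmlEntities is a hypothetical helper name, not part of SimpleXML or of the API in question. Note that the regular expression is context-blind, so it would also rewrite entity-like text inside an existing CDATA section.

```php
<?php
// Hypothetical helper (not a library function): turns named HTML entities
// that XML does not predefine into literal UTF-8 characters.
function replaceHtmlEntities(string $xml): string
{
    // The five entities every XML parser already understands; leave them alone.
    $predefined = ['amp', 'lt', 'gt', 'quot', 'apos'];

    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($predefined): string {
            if (in_array($m[1], $predefined, true)) {
                return $m[0];
            }
            // Decode the HTML entity (e.g. &rsquo;) to its character.
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            // Re-escape for XML in case the result is markup-significant.
            // Names PHP does not recognise come back unchanged from the
            // decode step and therefore end up as literal text (&amp;name;).
            return htmlspecialchars($decoded, ENT_QUOTES | ENT_XML1, 'UTF-8');
        },
        $xml
    );
}

// Usage with the sample structure from the question.
$xml = '<items><item><name>Jim Smith</name>'
     . '<description>Jim&rsquo;s description &amp; more</description></item></items>';

$items = simplexml_load_string(replaceHtmlEntities($xml));
echo $items->item->description, PHP_EOL;
```

Unlike the CDATA trick, this does not depend on knowing which element carries the bad entities, and it is not affected by a "]]>" sequence appearing in the data. It is still a single pass over the string, so the cost on a 30K to 100K document should be small, though that is worth measuring rather than assuming.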