In short, is it possible to use a DTD to define an element as containing CDATA?
I'm calling a third-party API that produces some invalid characters inside an element. Specifically, the data contains some HTML entities like &rsquo;. When I attempt to parse this XML using SimpleXML, I of course get the parser error "Entity 'rsquo' not defined". Here's a simplified example of the structure I'm dealing with:
<items>
<item>
<name>Jim Smith</name>
<description>Jim&rsquo;s description breaks my parser</description>
</item>
</items>
Since I don't have control to fix the API response... I've resorted to this dirty trick to inject a CDATA section inside the problem element just before I try to parse it:
$xml = str_replace("<description>", "<description><![CDATA[", $xml);
$xml = str_replace("</description>", "]]></description>", $xml);
This fixes the issue for me, but the overhead is probably too big, don't you think? The XML can be anywhere between 30K and 100K of data.
I'd rather use a DTD, but for the life of me I can't find any spec that allows declaring element content as CDATA (in the same way I can declare PCDATA). Below is what I'd like to do, but of course it's invalid because of the '#CDATA' content model I'm trying to use:
<!DOCTYPE items [
<!ELEMENT items (item)>
<!ELEMENT item (name, description)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT description (#CDATA)>
]>
Thanks for any insights!
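It is possible in SGML DTDs (e.g. the HTML 4.01 script element, which is declared with CDATA content), but not in XML DTDs, where CDATA exists only as an attribute type (hence the change to #PCDATA for script in XHTML 1.0).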
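Since an XML DTD cannot mark element content as CDATA, one possible alternative to wrapping the element in a CDATA section is to rewrite the HTML named entities before parsing. The following is only a minimal sketch under some assumptions: the response is UTF-8, and the helper name `replace_html_entities` is made up for illustration (it is not part of SimpleXML or of the original post). It leaves the five entities that XML predefines untouched and replaces any other recognised HTML entity with the character it stands for.

```php
<?php
// Hypothetical helper, not from the original post: replaces HTML named
// entities that XML does not predefine with the characters they stand for.
// Assumes the XML document is encoded as UTF-8.
function replace_html_entities(string $xml): string
{
    // XML predefines exactly these five; they must stay as they are.
    $xmlPredefined = ['amp', 'lt', 'gt', 'quot', 'apos'];

    $result = preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($xmlPredefined): string {
            if (in_array($m[1], $xmlPredefined, true)) {
                return $m[0];
            }
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            if ($decoded === $m[0]) {
                // Not a known HTML entity: leave it so the parser still reports it.
                return $m[0];
            }
            // Re-escape in case the entity decoded to markup (e.g. &AMP; or &LT;).
            return htmlspecialchars($decoded, ENT_QUOTES | ENT_XML1, 'UTF-8');
        },
        $xml
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $xml;
}

// Sample input mirroring the structure from the question.
$xml = '<items><item><name>Jim Smith</name>'
     . '<description>Jim&rsquo;s description breaks my parser</description>'
     . '</item></items>';

$doc = simplexml_load_string(replace_html_entities($xml));
echo $doc->item->description, PHP_EOL;
```

Unlike the CDATA wrapping, this touches only the entity references themselves, so properly escaped content such as `&amp;` inside the element keeps its meaning. It is still a single pass over the string, so for 30K to 100K of data the cost should be comparable to the two `str_replace` calls.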