Search code examples
phpxmlsimplexmlcdatalibxml2

PHP, SimpleXML, decoding entities in CDATA


I'm experiencing the following behavior:

$xml_string1 = "<person><name><![CDATA[ Someone&#039;s Name ]]></name></person>";
$xml_string2 = "<person><name> Someone&#039;s Name </name></person>";

$person = new SimpleXMLElement($xml_string1);
print (string) $person->name; # Someone&#039;s Name

$person = new SimpleXMLElement($xml_string2);
print (string) $person->name; # Someone's Name

$person = new SimpleXMLElement($xml_string1, LIBXML_NOCDATA);
print (string) $person->name; # Someone&#039;s Name

The php docs say that NOCDATA "Merge[s] CDATA as text nodes". To me this means that CDATA will then be treated the same as text nodes - or that the behavior of the 3rd example will now be the same as the 2nd example.

I don't have control over the XML (it's a feed from an external source), otherwise I'd just remove the CDATA tag as it does nothing and ruins the behavior I want.

Why does the above example behave the way that it does? Is there any way to make SimpleXML handle the CDATA nodes in the same way that it handles text nodes? What does "Merge CDATA as text nodes" actually do, since I don't seem to be understanding that option?

I'm currently decoding after I pull out the data, but the above example still doesn't make sense to me.


Solution

  • The purpose of CDATA sections in XML is to encapsulate a block of text "as is" which would otherwise require special characters (in particular, >, < and &) to be escaped. A CDATA section containing the character & is the same as a normal text node containing &amp;.

    If a parser were to offer to ignore this, and pretend all CDATA nodes were really just text nodes, it would instantly break as soon as someone mentioned "P&O Cruises" - that & simply can't be there on its own (rather than as &amp;, or &somethingElse;).

    The LIBXML_NOCDATA is actually pretty useless with SimpleXML, because (string)$foo neatly combines any sequence of text and CDATA nodes into an ordinary PHP string. (Something which people frequently fail to notice, because print_r doesn't.) This isn't necessarily true of more systematic access methods, such as DOM, where you can manipulate text nodes and CDATA nodes as objects in their own right.

    What it effectively does is go through the document, and wherever it encounters a CDATA section, it takes the content, escapes it, and puts it back as an ordinary text node, or "merges" it with any text nodes to either side. The text represented is identical, just stored in the document in a different way; you can see the difference if you export back to XML, as in this example:

    $xml_string = "<person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>";
    
    $person = new SimpleXMLElement($xml_string);
    echo 'CDATA retained: ', $person->asXML();
    // CDATA retained: <?xml version="1.0"?>
    // <person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>
    
    $person = new SimpleXMLElement($xml_string, LIBXML_NOCDATA);
    echo 'CDATA merged: ', $person->asXML();
    // CDATA merged: <?xml version="1.0"?>
    // <person><name>Welcome aboard this P&amp;O Cruises voyage!</name></person>
    

    If the XML document you're parsing contains a CDATA section which actually contains entities, you need to take that string and unescape it completely independent of the XML. One common reason to do this (other than laziness with poorly understood libraries) is to treat something marked up in HTML as just any old string inside an XML document, like this:

    <Comment>
    <SubmittedBy>IMSoP</SubmittedBy>
    <Text><![CDATA[I'm <em>really</em> bad at keeping my answers brief <tt>;)</tt>]]></Text>
    </Comment>