
SimpleXML: handle CDATA tag presence in node value


I need to preserve the <![CDATA[]]> wrapper when I parse an XML document.

For example, I have this node:

<Dest><![CDATA[some text...]]></Dest>

The XML file may also contain nodes without CDATA.

I then process all the nodes in a loop:

$dom = simplexml_load_file($path);
foreach($dom->children() as $child) {
 $nodeValue = (string) $child;
}

As a result, when I process the node in the example above, I get $nodeValue = some text...

But I need $nodeValue = <![CDATA[some text...]]>

Is there any way to do this?

File example:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
  <Params>
    <param>text</param>
    <anotherParam>text</anotherParam>
  </Params>
  <Content>
    <String>
      <Source>some another text</Source>
      <Dest>some another text 2</Dest>
    </String>
    <String>
      <Source>some another text 3</Source>
      <Dest><![CDATA[some text...]]></Dest>
    </String>
  </Content>
</Root>

Solution

  • As far as a parser like SimpleXML is concerned, the <![CDATA[ is not part of the text content of the XML element; it's just part of the serialization of that content. A similar confusion is discussed here: PHP, SimpleXML, decoding entities in CDATA

    What you need to look at is the "inner XML" of that element, which is tricky in SimpleXML (->asXML() will give you the "outer XML", e.g. <Dest><![CDATA[some text...]]></Dest>).

    Your best bet here is to use the DOM extension, which exposes the detailed structure of the document rather than just its content, and so distinguishes "text nodes" from "CDATA nodes". However, it's worth double-checking that you actually need this: for 99.9% of use cases, you shouldn't care whether somebody sent you <foo>bar &amp; baz</foo> or <foo><![CDATA[bar & baz]]></foo>, since by definition they represent the same string.
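
A minimal sketch of the DOM approach, assuming the goal is simply to get each element's value back with the CDATA markers restored where the source had them. The helper name nodeValueWithCdata() is made up for this example, and the XML is a trimmed copy of the file in the question:

```php
<?php
// Hypothetical helper (not part of DOM or SimpleXML): rebuilds an element's
// value, re-wrapping any CDATA sections in their original markers.
function nodeValueWithCdata(DOMElement $element): string
{
    $value = '';
    foreach ($element->childNodes as $child) {
        // DOMCdataSection extends DOMText, so it must be checked first.
        if ($child instanceof DOMCdataSection) {
            $value .= '<![CDATA[' . $child->data . ']]>';
        } elseif ($child instanceof DOMText) {
            // Plain text node: entities are already decoded at this point.
            $value .= $child->data;
        }
    }
    return $value;
}

$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
  <Content>
    <String>
      <Source>some another text 3</Source>
      <Dest><![CDATA[some text...]]></Dest>
    </String>
  </Content>
</Root>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml); // no LIBXML_NOCDATA flag, so CDATA nodes are kept

$values = [];
foreach ($doc->getElementsByTagName('String') as $string) {
    foreach ($string->childNodes as $node) {
        // Skip the whitespace-only text nodes between elements.
        if ($node instanceof DOMElement) {
            $values[$node->nodeName] = nodeValueWithCdata($node);
        }
    }
}

print_r($values);
// Source => some another text 3
// Dest   => <![CDATA[some text...]]>
```

If the rest of the code already works with SimpleXML, dom_import_simplexml($child) should return the matching DOMElement for a SimpleXMLElement, so the helper can probably be called from inside the existing loop without reloading the file.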