Extracting HTML from XML with PHP/SimpleXML


I am trying to extract data from an XML file (the file format cannot be changed). The XML data includes both content and appearance information, in the form of HTML tags, which are causing me grief. The relevant part of the XML looks like this:

<item>
    <p>Some text</p>
    <p> Some more text</p>
    <p><i>This</i> is important text.</p>
</item>

I need the contents of the node as a string (for later insertion into a DB). The text is always wrapped in <p> tags, so I try to iterate over those using this code:

$namediscussion = '';

foreach ($sectionxml->xpath('//p') as $p)
{
    $namediscussion = $namediscussion . $p . '<br />';
}

echo $namediscussion;

($sectionxml is the output of simplexml_load_string() from a parent node.)

The problem is that when I echo $namediscussion, what I get is:

Some text 
Some more text 
is important text.

Note the missing word that was in italics. How do I preserve this? I'd prefer to use SimpleXML, but if I have to go to DOM, that's fine too. Even direct string manipulation would work, but I can't seem to extract the entire string from the SimpleXML node.

Help greatly appreciated.


Solution

  • You are casting a SimpleXMLElement to a string, which discards the text inside the element's children, as the documentation for SimpleXMLElement::__toString explains:

    Does not return text content that is inside this element's children.
    

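    To fix the missing word, use SimpleXMLElement::asXML() instead of the string cast, and strip the tags from the result:

    $namediscussion = $namediscussion . strip_tags($p->asXML()) . '<br />';

    Note that strip_tags() also removes the <i> markup itself, so only the text is kept.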
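
For anyone who wants to try this end to end, here is a minimal sketch using the sample XML from the question. It builds two strings: one with all tags stripped (the approach above), and one that keeps the inner markup such as <i> and drops only the outer <p> wrapper. The names $plain, $markup and $fragment are made up for illustration, and the regex variant assumes the <p> elements carry no attributes and are never empty.

```php
<?php
// Sample input copied from the question.
$xml = '<item>
    <p>Some text</p>
    <p> Some more text</p>
    <p><i>This</i> is important text.</p>
</item>';

$sectionxml = simplexml_load_string($xml);

$plain  = '';  // text only, every tag removed
$markup = '';  // inner tags kept, outer <p> wrapper removed

foreach ($sectionxml->xpath('//p') as $p) {
    // asXML() returns the element as a string, including its child elements
    $fragment = $p->asXML();

    $plain .= strip_tags($fragment) . '<br />';

    // Assumption: the wrapper is exactly "<p>...</p>" with no attributes
    $markup .= preg_replace('~^<p>|</p>$~', '', $fragment) . '<br />';
}

echo $plain, "\n";
echo $markup, "\n";
```

If the italics need to survive into the database, the second variant is probably the one to use. For anything more irregular than plain <p> wrappers, iterating over the child nodes with DOM is likely the safer route.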