I'm trying to parse a specific HTML document, a kind of dictionary, with about 10,000 words and descriptions. It went well until I noticed that entries in a specific format don't get parsed properly.
Here is an example:
<?php
$html = '
<p>
<b>
<span>zot; zotz </span>
</b>
<span>Nista; nula. Isto
<b>zilch; zip.</b>
</span>
</p>
';
$xml = simplexml_load_string($html);
var_dump($xml);
?>
Result of var_dump() is:
object(SimpleXMLElement)#1 (2) {
["b"]=>
object(SimpleXMLElement)#2 (1) {
["span"]=>
string(10) "zot; zotz "
}
["span"]=>
string(39) "Nista; nula. Isto
"
}
As you can see, SimpleXML kept the text node inside the second <span> tag but left out its child <b> node and the text inside it.
I've also tried:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
with the same result.
Since this looked to me like a common problem when parsing HTML, I tried googling it, but the only place that acknowledges the problem is this blog: https://hakre.wordpress.com/2013/07/09/simplexml-and-json-encode-in-php-part-i/ and it does not offer a solution.
The posts and answers about parsing HTML on SO are too general to help here.
Is there a simple way of dealing with this? Or should I change my strategy?
Your observation is correct: SimpleXML only gives access to the child element-nodes here (for example $xml->span->b), not to the individual child text-nodes. The solution is to switch to DOMDocument, as it can access all child nodes, text and element alike.
// first <span> child of the root <p> element
$span = dom_import_simplexml($xml->span);
foreach ($span->childNodes as $child) {
printf(" - %s : %s\n", get_class($child), $child->nodeValue );
}
This example shows that dom_import_simplexml is used on the more specific <span> element-node, and the traversal is then done over the children of the corresponding DOMElement object.
The output:
- DOMText : Nista; nula. Isto
- DOMElement : zilch; zip.
- DOMText :
The first entry is the first text-node within the <span> element. It is followed by the <b> element (which again contains some text) and then by another text-node that consists of whitespace only.
The dom_import_simplexml function is especially useful when SimpleXMLElement is too simple for more differentiated data access within the XML document, as in the case you face here.
The example in full:
$html = <<<HTML
<p>
<b>
<span>zot; zotz </span>
</b>
<span>Nista; nula. Isto
<b>zilch; zip.</b>
</span>
</p>
HTML;
$xml = simplexml_load_string($html);
// first <span> child of the root <p> element
$span = dom_import_simplexml($xml->span);
foreach ($span->childNodes as $child) {
printf(" - %s : %s\n", get_class($child), $child->nodeValue );
}
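If you want the pieces of such a mixed-content element as data rather than printed lines, the same traversal can be wrapped in a small function. The following is only a minimal sketch: describeChildren is a hypothetical helper name (it is not part of SimpleXML or DOM), the array layout it returns is an assumption made for illustration, and whitespace-only text-nodes are skipped on the assumption that they carry no meaning in your dictionary.

```php
<?php
// Hypothetical helper (not a built-in): list the direct children of a node
// in document order, keeping both text-nodes and element-nodes.
function describeChildren(DOMNode $node): array
{
    $parts = [];
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Skip text-nodes that consist of whitespace only.
            $text = trim($child->nodeValue);
            if ($text !== '') {
                $parts[] = ['type' => 'text', 'value' => $text];
            }
        } elseif ($child instanceof DOMElement) {
            // For elements, record the tag name and the contained text.
            $parts[] = ['type' => $child->nodeName, 'value' => trim($child->textContent)];
        }
    }
    return $parts;
}

$xml = simplexml_load_string(
    '<p><b><span>zot; zotz </span></b><span>Nista; nula. Isto <b>zilch; zip.</b></span></p>'
);

// Same idea as above: import the <span> child of <p> into DOM, then walk it.
$parts = describeChildren(dom_import_simplexml($xml->span));
print_r($parts);
```

Each entry keeps its position, so the description text and the cross-referenced words inside <b> stay in the order in which they appear in the document.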