Assume $html_dom
contains a page that has HTML entities like  
. In the output below, I get an output like this  
.
$html_dom = new DOMDocument();
@$html_dom->loadHTML($html_doc);
$xpath = new DOMXPath($html_dom);
$query = '//div[@class="foo"]/div/p';
$my_foos = $xpath->query($query);
foreach ($my_foos as $my_foo)
{
echo html_entity_decode($my_foo->nodeValue);
die;
}
How do I handle this properly so that I don't get weird characters? I tried the following with no success:
$html_doc = mb_convert_encoding($html_doc, 'HTML-ENTITIES', 'UTF-8');
$html_dom = new DOMDocument();
$html_dom->resolveExternals = TRUE;
@$html_dom->loadHTML($html_doc);
$xpath = new DOMXPath($html_dom);
$query = '//div[@class="foo"]/div/p';
$my_foos = $xpath->query($query);
foreach ($my_foos as $my_foo)
{
echo html_entity_decode($my_foo->nodeValue);
die;
}
mb_convert_encoding was a good idea, but it does not work as expected because DOMDocument seems to be a little bit buggy when it comes to encoding.
Moving the mb_convert_encoding call to the actual node output did the trick.
$html_dom = new DOMDocument();
$html_dom->resolveExternals = TRUE;
@$html_dom->loadHTML($html_doc);
$xpath = new DOMXPath($html_dom);
$query = '//div[@class="foo"]/div/p';
$my_foos = $xpath->query($query);
foreach ($my_foos as $my_foo)
{
echo mb_convert_encoding($my_foo->nodeValue, 'HTML-ENTITIES', 'UTF-8');
die;
}
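For reference, here is a minimal, self-contained sketch of the same idea (convert at output time, not before parsing). It rests on a few assumptions that are not from the post above: the sample markup in $html_doc is made up, htmlentities() is swapped in for mb_convert_encoding() because the 'HTML-ENTITIES' target is deprecated as of PHP 8.2, and libxml_use_internal_errors() replaces the @ operator. Note that htmlentities() also escapes <, > and &, which mb_convert_encoding() does not, so this is a substitute and not a drop-in equivalent.

```php
<?php
// Hypothetical sample input: "&nbsp;" stands in for whatever entities the real page contains.
$html_doc = '<html><body><div class="foo"><div><p>a&nbsp;b</p></div></div></body></html>';

$html_dom = new DOMDocument();
// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$html_dom->loadHTML($html_doc);
libxml_clear_errors();

$xpath = new DOMXPath($html_dom);
$my_foos = $xpath->query('//div[@class="foo"]/div/p');

$results = [];
foreach ($my_foos as $my_foo) {
    // nodeValue is returned as UTF-8, so the decoded non-breaking space
    // arrives as the bytes 0xC2 0xA0. Re-encode it as an entity on output.
    $results[] = htmlentities($my_foo->nodeValue, ENT_QUOTES, 'UTF-8');
}

echo implode("\n", $results), "\n";
```

If the page is served as UTF-8 (for example via a Content-Type header with charset=UTF-8), echoing nodeValue directly may already be enough, since the stray characters typically come from UTF-8 bytes being displayed as ISO-8859-1.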