I'm trying to extract the DOM from a website in PHP and then run some XPath queries on it. The code should be simple, but I keep getting encoding errors.
I've already researched the error message and tried to apply encoding (as detailed in other Stack Overflow posts) using mb_convert_encoding(), but it doesn't fix the problem.
The website I'm trying to extract already serves UTF-8, so converting it to UTF-8 with mb_convert_encoding() to fix the issue doesn't make much sense, as far as I can tell.
Here is my code; it should run as-is if you copy it elsewhere. As you can see, I have tried both ways of applying the encoding at one point or another.
I think I'm using the correct function, loadHTML(), rather than loadHTMLFile(). Is it OK to fetch the page with file_get_contents() in order to feed it into this function?
<?php
$url = 'http://duckduckgo.com/';
if (!$file = file_get_contents($url)) {
    echo 'file_get_contents failed.';
}

$doc = new DOMDocument();
// First attempt: convert the page to HTML entities before parsing.
//$doc->loadHTML(mb_convert_encoding($file, 'HTML-ENTITIES', 'UTF-8'));
// Second attempt: prepend an XML declaration to hint at the encoding.
$doc->loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . $file);

$xpath = new DOMXpath($doc);
$elements = $xpath->query("*/div[@id='logo_homepage_link']");

if (!is_null($elements)) {
    foreach ($elements as $element) {
        echo "<br/>[" . $element->nodeName . "]";
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->nodeValue . "\n";
        }
    }
}
?>
The error is:
Warning: DOMDocument::loadHTML(): htmlCheckEncoding: unknown encoding UTF-8;charset=utf-8 in Entity, line: 11 in C:\Websites\domxpath\index.php on line 10
Not sure if it's a bug or a feature, but the parser is objecting to the doubled charset declaration in this line of the fetched page...
<meta http-equiv="content-type" content="text/html; charset=UTF-8;charset=utf-8">
If you replace this with a single UTF-8 declaration, it will at least get past this error...
$file = str_replace("UTF-8;charset=utf-8", "UTF-8", $file);
Just put this before your loadHTML() line.
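For context, here is a minimal sketch of how that fix slots into the original code. The str_replace() call is exactly the one above; the libxml_use_internal_errors() part is an extra suggestion on my part (not from the question), a standard way to collect libxml's parser warnings, like the one you're seeing, instead of having them printed.

<?php
$url = 'http://duckduckgo.com/';
$file = file_get_contents($url);

// The fix: collapse the doubled charset declaration before parsing.
$file = str_replace("UTF-8;charset=utf-8", "UTF-8", $file);

// Optional extra (my assumption, not part of the fix itself):
// collect parser warnings rather than printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($file);

// Inspect whatever warnings libxml collected during parsing.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message) . "\n";
}
libxml_clear_errors();

$xpath = new DOMXpath($doc);
$elements = $xpath->query("*/div[@id='logo_homepage_link']");
foreach ($elements as $element) {
    echo $element->nodeValue . "\n";
}
?>

One side note: DOMXPath::query() returns false on a malformed expression, not null, so the is_null() check in your original code will never trigger; checking for false would be more accurate.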