PHP DOMDocument loadHTML not encoding UTF-8 correctly


I'm trying to parse some HTML using DOMDocument, but when I do, I suddenly lose my encoding (at least that is how it appears to me).

$profile = "<div><p>various japanese characters</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile); 

$divs = $dom->getElementsByTagName('div');

foreach ($divs as $div) {
    echo $dom->saveHTML($div);
}

The result of this code is that I get a bunch of characters that are not Japanese. However, if I do:

echo $profile;

it displays correctly. I've tried saveHTML and saveXML, and neither displays correctly. I am using PHP 5.3.

What I see:

ã¤ãªãã¤å·ã·ã«ã´ã«ã¦ãã¢ã¤ã«ã©ã³ãç³»ã®å®¶åº­ã«ã9人åå¼ã®5çªç®ã¨ãã¦çã¾ãããå½¼ãå«ãã¦4人ã俳åªã«ãªã£ããç¶è¦ªã¯æ¨æã®ã»ã¼ã«ã¹ãã³ã§ãæ¯è¦ªã¯éµä¾¿å±ã®å®¢å®¤ä¿ã ã£ããé«æ ¡æ代ã¯ã­ã£ãã£ã®ã¢ã«ãã¤ãã«å¤ãã¿ãæè²è³éãåããªããã«ããªãã¯ç³»ã®é«æ ¡ã¸é²å­¦ã

What should be shown:

イリノイ州シカゴにて、アイルランド系の家庭に、9人兄弟の5番目として生まれる。彼を含めて4人が俳優になった。父親は木材のセールスマンで、母親は郵便局の客室係だった。高校時代はキャディのアルバイトに勤しみ、教育資金を受けながらカトリック系の高校へ進学

EDIT: I've simplified the code down to five lines so you can test it yourself.

$profile = "<div lang=ja><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
echo $dom->saveHTML();
echo $profile;

Here is the HTML that is returned:

<div lang="ja"><p>イリノイ州シカゴã«ã¦ã€ã‚¢ã‚¤ãƒ«ãƒ©ãƒ³ãƒ‰ç³»ã®å®¶åº­ã«ã€</p></div>
<div lang="ja"><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>

Solution

  • DOMDocument::loadHTML will treat your string as being in ISO-8859-1 (the HTTP/1.1 default character set) unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.
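
    The effect of that wrong guess can be reproduced without DOMDocument. The snippet below is only a sketch of the mechanism, assuming the mbstring extension is loaded; the variable names are made up for illustration. It reads the bytes of a UTF-8 string as if each one were an ISO-8859-1 character, which is roughly what happens to the input in the question:

    ```php
    <?php
    // A UTF-8 string; each of these two characters is three bytes long.
    $original = 'にて';

    // Treat every byte as one ISO-8859-1 character and re-encode as UTF-8.
    $misread = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

    echo mb_strlen($original, 'UTF-8'), "\n"; // 2
    echo mb_strlen($misread, 'UTF-8'), "\n";  // 6, one character per original byte
    ```

    Two characters become six, which is why the output looks like a long run of accented Latin letters instead of Japanese.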

    DOMDocument uses an HTML4 parser. If you're loading HTML5, you might want to look at alternative solutions.

    If you're dealing with simple snippets of (X)HTML, you could prepend an XML encoding declaration or a meta charset declaration to cause the string to be treated as UTF-8:

    $profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
    $dom = new DOMDocument();
    
    // This version preserves the original characters
    $contentType = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    $dom->loadHTML($contentType . $profile);
    echo $dom->saveHTML();
    
    // This version will HTML-encode non-ASCII characters as entities
    $dom->loadHTML('<meta charset="utf8">' . $profile);
    echo $dom->saveHTML();
    
    // This version will also HTML-encode non-ASCII characters as entities,
    // and won't work for LIBXML_DOTTED_VERSION >= 2.12.0
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
    echo $dom->saveHTML();
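
    Prepending a declaration means saveHTML() returns a whole document with the extra meta element in it. If you only want your fragment back, one option is to serialize the children of the body element one by one. This is a rough sketch, assuming PHP 7 or later; roundTripUtf8Fragment is a hypothetical helper name, not something DOMDocument provides:

    ```php
    <?php
    // Hypothetical helper: parse a UTF-8 fragment and return it re-serialized
    // without the <meta> element that was injected to signal the encoding.
    function roundTripUtf8Fragment(string $fragment): string
    {
        $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
        $dom = new DOMDocument();
        // @ hides the warnings libxml may raise on imperfect markup.
        @$dom->loadHTML($meta . $fragment);

        $body = $dom->getElementsByTagName('body')->item(0);
        if ($body === null) {
            return '';
        }

        // Serialize each top-level node of the fragment on its own.
        $html = '';
        foreach ($body->childNodes as $child) {
            $html .= $dom->saveHTML($child);
        }
        return $html;
    }

    echo roundTripUtf8Fragment('<div lang="ja"><p>イリノイ州シカゴにて</p></div>'), "\n";
    ```

    The parser places the meta element in an implied head, so it never appears among the body's children.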
    

    If you can't know whether the HTML already contains such a declaration, the workaround used in SmartDOMDocument should help:

    $profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
    $dom = new DOMDocument();
    $dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
    echo $dom->saveHTML();
    

    In PHP 8.2+, passing 'HTML-ENTITIES' to mb_convert_encoding triggers a deprecation warning, so the alternative would be:

    $profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
    $dom = new DOMDocument();
    $dom->loadHTML(mb_encode_numericentity($profile, [0x80, 0x10FFFF, 0, ~0], 'UTF-8'));
    echo $dom->saveHTML();
    

    (For a better explanation of that rather cryptic array, see here.)
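
    In short, the array is a conversion map of four integers: the first code point of the range to encode, the last code point, an offset added to each code point, and a bit mask applied to the result. A small sketch of what the map above does, assuming the mbstring extension is loaded (the variable names are illustrative):

    ```php
    <?php
    // Conversion map: [first code point, last code point, offset, mask].
    // Here: everything from U+0080 upward, no offset, all bits kept.
    $convmap = [0x80, 0x10FFFF, 0, ~0];

    // ASCII is left alone; anything above it becomes a decimal entity.
    $encoded = mb_encode_numericentity('aに', $convmap, 'UTF-8');
    echo $encoded, "\n"; // a&#12395;  (U+306B is 12395 in decimal)

    // The same map reverses the conversion.
    echo mb_decode_numericentity($encoded, $convmap, 'UTF-8'), "\n"; // aに
    ```

    Because the resulting string is pure ASCII, it no longer matters which encoding loadHTML assumes.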

    This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like these katakana), it's the safest alternative.