php unicode domdocument

What is DOMDocument doing to my string?


$dom = new DOMDocument('1.0', 'UTF-8');
$str = '<p>Hello®</p>';

var_dump(mb_detect_encoding($str));

$dom->loadHTML($str);
var_dump($dom->saveHTML());


Outputs

string(5) "UTF-8"

string(158) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Hello&Acirc;&reg;</p></body></html>
"

Why did my Unicode ® get converted to &Acirc;&reg; and how do I stop this?

Am I going crazy today?


Solution

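  • Your text editor shows "®" in UTF-8, but read as Latin-1 (or a similar single-byte encoding) the same bytes in the file say "Â®", and that is how PHP (here, DOMDocument::loadHTML()) is reading them. Writing the character as an entity reference (&reg;) removes this ambiguity, because the reference is plain ASCII and means the same thing in either encoding.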

    >>> print('®'.encode('utf-8').decode('latin-1'))
    Â®
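
The same round trip can be spelled out a little further. The sketch below continues in Python, like the snippet above, and is only an illustration of the mechanism, not of what DOMDocument does internally. The helper name to_entities is hypothetical; html.entities.codepoint2name is from the Python standard library.

```python
from html.entities import codepoint2name


def to_entities(text):
    """Replace each non-ASCII character with an HTML entity reference.

    Hypothetical helper for illustration: named entity when one exists,
    numeric reference otherwise.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")
        else:
            out.append("&#" + str(cp) + ";")
    return "".join(out)


# "\u00ae" is the registered sign, written as an escape so that this
# file's own encoding cannot interfere with the demonstration.
html = "<p>Hello\u00ae</p>"

raw = html.encode("utf-8")        # the bytes a UTF-8 editor saves: ... C2 AE ...
misread = raw.decode("latin-1")   # the same bytes read as Latin-1: two characters
print(to_entities(misread))       # <p>Hello&Acirc;&reg;</p>

# With the entity reference written out, the markup is pure ASCII,
# so both decodings agree and nothing can be misread.
safe = to_entities(html).encode("ascii")
assert safe.decode("utf-8") == safe.decode("latin-1")
print(safe.decode("latin-1"))     # <p>Hello&reg;</p>
```

The first print reproduces the &Acirc;&reg; from the question. The second shows why writing &reg; in the source string sidesteps the problem: there are no non-ASCII bytes left to interpret.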