Tags: php, xml, utf-8, domdocument

PHP DOMDocument - why is en dash "–" converted to â€“


I am using DOMDocument to extract some paragraphs.

Here is what the initial HTML file that I am importing looks like:

<html>
    <head>
        <title>Toxins</title>
    </head>

    <body>
        <p class=8reference><span>1.</span><span>Sivonen, K.; Jones, G. Cyanobacterial Toxins. In <i>Toxic Cyanobacteria in Water. A Guide to Their Public Health Consequences, Monitoring and Management</i>; Chorus, I., Bartram, J., Eds.; E. and F.N. Spon: London, UK, 1999; pp. 41–111.</span></p>
    </body>
</html>

When I run:

$dom_input = new \DOMDocument("1.0","UTF-8");
$dom_input->encoding = "UTF-8";
$dom_input->formatOutput = true;
$dom_input->loadHTMLFile($manuscript->getUploadRootDir().$manuscript->getFileName());

$paragraphs = $dom_input->getElementsByTagName('p');

foreach ($paragraphs as $paragraph) {
    if($paragraph->getAttribute('class') == "8reference") {
        var_dump($paragraph->nodeValue);
    }
}

The dash from "pp. 41–111" is converted to

pp. 41â€“111

Any idea why this happens, and how I can fix it so that I get proper UTF-8 values?

Thank you in advance.


Solution

  • It looks to me like the data is correct; you're just displaying it incorrectly.

    Are you outputting in UTF-8?

    An "Ã" or "â" followed by stray characters is the classic symptom of UTF-8 encoded data being displayed as if it were in some other encoding.

    For example, if you're outputting to a web browser, try setting the character set with a meta tag:

    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
    

    If you need to output in something other than UTF-8, you'll need to convert the data to that character set first.
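
The points above can be sketched in PHP. This is only an illustration, not the asker's code: the variable names ($text, $misread, $legacy) are made up for the example, it assumes the iconv extension is available, and it assumes Windows-1252 is the encoding the viewer wrongly falls back to. It sends the charset as an HTTP header (an alternative to the meta tag), reproduces the garbled output by deliberately misreading the UTF-8 bytes, and converts the string for a non-UTF-8 target.

```php
<?php
// Declare the output encoding as an HTTP header, before any output is sent.
// Under the CLI this call has no visible effect.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// The en dash U+2013 is stored as the three UTF-8 bytes E2 80 93.
$text = "pp. 41\u{2013}111";

// Reproduce the symptom: treat those UTF-8 bytes as if they were
// Windows-1252 (an assumed wrong encoding) and re-encode for display.
// E2 becomes "â", 80 becomes "€", 93 becomes a left double quotation mark.
$misread = iconv('Windows-1252', 'UTF-8', $text);
echo $misread, "\n"; // pp. 41â€“111

// If the output really has to be in another character set, convert first.
// In Windows-1252 the en dash is the single byte 0x96.
$legacy = iconv('UTF-8', 'Windows-1252', $text);
echo bin2hex($legacy), "\n";
```

The data in $text is never damaged here; only the interpretation of its bytes changes, which is why declaring the correct charset is usually enough to fix the display.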