I am using DOMDocument to extract some paragraphs.
Here is what the HTML file I am importing looks like:
<html>
<head>
<title>Toxins</title>
</head>
<body>
<p class=8reference><span>1.</span><span>Sivonen, K.; Jones, G. Cyanobacterial Toxins. In <i>Toxic Cyanobacteria in Water. A Guide to Their Public Health Consequences, Monitoring and Management</i>; Chorus, I., Bartram, J., Eds.; E. and F.N. Spon: London, UK, 1999; pp. 41–111.</span></p>
</body>
</html>
When I run:
$dom_input = new \DOMDocument("1.0", "UTF-8");
$dom_input->encoding = "UTF-8";
$dom_input->formatOutput = true;
$dom_input->loadHTMLFile($manuscript->getUploadRootDir() . $manuscript->getFileName());

$paragraphs = $dom_input->getElementsByTagName('p');
foreach ($paragraphs as $paragraph) {
    // Only dump paragraphs marked as references
    if ($paragraph->getAttribute('class') == "8reference") {
        var_dump($paragraph->nodeValue);
    }
}
The dash from "pp. 41–111" is converted to
pp. 41–111
Any idea why this happens, and how I can fix it so that I get proper UTF-8 values?
Thank you in advance.
It looks to me like the data is correct and you're just displaying it incorrectly.
Are you outputting in UTF-8?
The "Ã + something" pattern is the classic symptom of UTF-8 encoded data being shown as if it were in some other encoding.
For example, if you're outputting to a web browser, try setting the character set with a meta tag:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
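If the page is generated from PHP, the same declaration can also be sent as an HTTP header. The following is only a sketch: render_page() is a hypothetical helper made up for this example, not part of your code or of any library.

```php
<?php
// Hypothetical helper: wraps extracted text in a minimal HTML page
// that declares UTF-8 in a meta tag.
function render_page(string $text): string
{
    // Escape the text for HTML; UTF-8 characters such as the en dash are kept.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n<html>\n<head>\n"
        . "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\">\n"
        . "</head>\n<body>\n<p>{$safe}</p>\n</body>\n</html>\n";
}

// Declare the charset in the HTTP response as well, if headers can still be sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo render_page("pp. 41\xE2\x80\x93111");
```

The header has to be sent before any output, which is why the header() call comes before the echo.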
If you need to output in something other than UTF-8, you'll need to convert the text to that character set first.
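As a rough sketch of such a conversion, assuming the mbstring extension is loaded and taking Windows-1252 as an example target (the target encoding is an assumption here, pick whatever your output really needs):

```php
<?php
// UTF-8 input containing an en dash (bytes E2 80 93).
$text = "pp. 41\xE2\x80\x93111";

// Convert from UTF-8 to Windows-1252, where the en dash is the single byte 0x96.
$cp1252 = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');

echo bin2hex($cp1252), "\n"; // 70702e20343196313131
echo strlen($text), ' -> ', strlen($cp1252), "\n"; // 12 -> 10
```

Note that the target encoding must actually contain the characters you need. ISO-8859-1, for instance, has no en dash, so the character could not be represented there.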