I'm building a small parser that scrapes web pages and logs the data on them. One of the things to log is the title of each forum post. I'm using an XML parser to walk the DOM and extract this information, and I'm storing it like this:
// Strip out the post's title
$title = $page->find('a[rel=bookmark]', 0);
$title = htmlspecialchars_decode(html_entity_decode(trim($title->plaintext)));
This works for the most part, but some posts contain special HTML character codes like &#8211;, which is an en dash (–). How would I go about converting these character codes back into the characters they represent?
Thanks.
Use html_entity_decode. Here's a quick example.
$string = "hyphenated&#8211;words";
$new = html_entity_decode($string);
echo $new;
You should see...
hyphenated–words
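If the entity still comes back undecoded, the target character set is a likely culprit: before PHP 5.4 the default for html_entity_decode was ISO-8859-1, which has no en dash, so it can help to pass the flags and encoding explicitly. The following is a minimal sketch rather than a drop-in fix; the $raw string is a made-up sample title, and it assumes your script and log output are UTF-8.

```php
<?php
// Hypothetical sample: a scraped title that still contains entity references.
$raw = "hyphenated&#8211;words &amp; more";

// ENT_QUOTES decodes both single- and double-quote entities,
// ENT_HTML5 selects the HTML5 entity table, and 'UTF-8' is given
// explicitly so the en dash (U+2013) can be represented in the result.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, PHP_EOL; // prints: hyphenated–words & more
```

One call to html_entity_decode should be enough here; it already covers the entities that htmlspecialchars_decode handles, so wrapping one in the other mostly risks decoding text twice.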