php html htmlspecialchars

Converting special HTML characters back into their original strings


I'm building a small parser that scrapes web pages and logs the data on them. One of the things to log is the title of each forum post. I'm using an XML parser to look through the DOM and get this information, and I'm storing it like this:

// Strip out the post's title
$title = $page->find('a[rel=bookmark]', 0);
$title = htmlspecialchars_decode(html_entity_decode(trim($title->plaintext)));

This works for the most part, but some post titles contain numeric HTML character codes such as &#8211;, which is an en dash (–). How would I go about converting these special character codes back into their original characters?

Thanks.


Solution

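  • Use html_entity_decode, and pass the target charset explicitly. Before PHP 5.4 the default charset was ISO-8859-1, which cannot represent an en dash, so the entity was left undecoded. Note that the entity needs its trailing semicolon to be recognised. Here's a quick example.

    $string = "hyphenated&#8211;words";
    
    // 'UTF-8' lets the decoded en dash be represented in the output string
    $new = html_entity_decode($string, ENT_QUOTES, 'UTF-8');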
    
    echo $new;
    

    You should see...

    hyphenated–words
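
If the scraped titles are sometimes double-encoded (for example `&amp;#8211;` in the page source), a single call will only peel off one layer. The following is a minimal sketch of one way to handle that, not part of the original answer: `decode_title` and its `$maxPasses` parameter are hypothetical names introduced here, and the sketch assumes PHP 7 or later and UTF-8 output.

```php
<?php
// Hypothetical helper (assumption, not from the original answer):
// decode entities repeatedly until the string stops changing,
// with a small cap on the number of passes.
function decode_title(string $raw, int $maxPasses = 3): string
{
    $decoded = trim($raw);
    for ($i = 0; $i < $maxPasses; $i++) {
        // ENT_QUOTES decodes both quote styles; ENT_HTML5 uses the HTML5 entity table.
        $next = html_entity_decode($decoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        if ($next === $decoded) {
            break; // nothing left to decode
        }
        $decoded = $next;
    }
    return $decoded;
}

echo decode_title('hyphenated&#8211;words'), "\n";     // single-encoded input
echo decode_title('hyphenated&amp;#8211;words'), "\n"; // double-encoded input
```

Both calls should print hyphenated–words. Repeated decoding is a trade-off: a title that is meant to display a literal entity, such as `&amp;lt;`, would be decoded further than the author intended, so it may be safer to set `$maxPasses` to 1 when the source is known to be encoded only once.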