php html regex character-encoding character-entities

Turning HTML character entities to 'regular' letters... why is it only partially working?


I'm using all of the below to take a field called 'code' from my database, get rid of all the HTML entities, and print it 'as usual' to the site:

   <?php $code = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $code);
   $code = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $code); 
   $code = html_entity_decode($code); ?>

However the exported code still looks like this:

progid:DXImageTransform.Microsoft.AlphaImageLoader(src=’img/the_image.png’);

See what's going on there? How many other things can I run on the string to turn them into darn regular characters?!

Thanks!

Jack


Solution

  • ’ is what you get when you read the UTF-8 encoded character (RIGHT SINGLE QUOTATION MARK, U+2019) as if it were encoded as windows-1252. In other words, you have two problems: you're using the wrong encoding to read the wrong character.

    HTML attribute values are supposed to be enclosed in ASCII apostrophes or quotation marks, not curly quotes. The numeric entities you're converting should be &#39; or &#x27; (apostrophe) or &#34; or &#x22; (quotation mark). Instead, you appear to have &#146;, which represents the same character as &#x2019;, &#8217;, or &rsquo;.

    As for the second problem, the resulting text seems to be encoded as UTF-8, but at some point it's being read as if it were windows-1252. In UTF-8, the character is represented by the three-byte sequence E2 80 99, but windows-1252 converts each byte separately, to â, €, and ™. Wherever that's happening, it's not in the code you showed us.

    The good news is that your preg_replace code seems to be working correctly. ;) But I think the others are right when they say you can use html_entity_decode() alone for that part.
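
The two points above can be illustrated with a short sketch. The first part reproduces the garbling by deliberately reading the UTF-8 bytes E2 80 99 as windows-1252; the second part decodes entities with html_entity_decode() alone, stating UTF-8 as the target charset. The sample string in $code is hypothetical (the real value would come from the 'code' field), and the sketch assumes the mbstring extension is loaded.

```php
<?php
// Part 1: reproduce the mojibake.
// U+2019 (RIGHT SINGLE QUOTATION MARK) is the three bytes E2 80 99 in UTF-8.
$utf8Quote = "\xE2\x80\x99";

// Treat those bytes as windows-1252 and re-encode the result as UTF-8.
// Assumption: the mbstring extension is available.
$misread = mb_convert_encoding($utf8Quote, 'UTF-8', 'Windows-1252');
echo $misread, PHP_EOL; // prints â€™

// Part 2: decode named and numeric entities in a single call.
// Hypothetical sample input standing in for the 'code' database field.
$code = 'src=&#x2019;img/the_image.png&rsquo; &amp; &#39;x&#39;';

// ENT_QUOTES also decodes &#39;; the third argument fixes the output charset.
$decoded = html_entity_decode($code, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo $decoded, PHP_EOL; // prints src=’img/the_image.png’ & 'x'
```

If the decoded string still shows up as â€™ in the browser, the bytes are correct and something downstream is reading them as windows-1252. The charset the page declares is one place worth checking, although nothing in the question shows where the misreading actually happens.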