
removing special character encoding


I have a string coming from an XML file that appears to have been encoded with htmlspecialchars() twice:

$data = "string&amp;#44;s example";

I've tried replacing &amp; with just an ampersand, then calling htmlspecialchars_decode(), and then replacing bare ampersands with the word "and", but the output comes out like stringand#44; example. Is there a way to convert these character encodings correctly, or perhaps a regex to strip them out entirely (I could simply strip them and use the result as a value to check against later)?


Solution

  • This particular string needs to be processed twice by html_entity_decode to get its "real" value (the first of these calls can also be htmlspecialchars_decode). The first pass will convert &amp; to & and the second will convert the entity &#44; to the corresponding character, a comma.

    Pass the relevant parameters to the decoding calls explicitly, the flags and the character set in particular, since the defaults don't make sense in all cases. Also be sure that all of the incoming data is encoded this way, so that you don't get broken output by mistake.
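
The two passes described above could look roughly like the sketch below. It assumes the data is UTF-8 and that ENT_QUOTES | ENT_HTML401 is the right flag set for this feed; both are assumptions to adjust, and the variable names are only illustrative.

```php
<?php
// Sample input as it appears in the raw XML: a comma that was encoded twice.
$data = "string&amp;#44;s example";

// Assumption: these flags and this charset match how the data was produced.
// They are spelled out because the defaults differ between PHP versions.
$flags = ENT_QUOTES | ENT_HTML401;
$charset = 'UTF-8';

// First pass: &amp; becomes &, which leaves the numeric entity &#44;
$once = htmlspecialchars_decode($data, $flags);

// Second pass: &#44; becomes the character it stands for.
$decoded = html_entity_decode($once, $flags, $charset);

echo $decoded, PHP_EOL; // string,s example
```

Running both passes on a value that was only encoded once would decode text that was meant to stay literal, which is why it matters that every incoming value really is double-encoded.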