Assuming my project is UTF-8 throughout and has always been used with UTF-8 encoding, is there anything legitimate that could possibly break if I change all occurrences of htmlspecialchars($var) to htmlspecialchars($var, ENT_QUOTES, 'utf-8')?
I do know one thing: obviously, ENT_QUOTES differs from ENT_COMPAT in that it also escapes single quotation marks. Assuming I know that this alone won't break anything, is there anything else left over?
Differently worded:
Is there a conceivable result of htmlspecialchars() when used without the charset parameter, given only data that is valid in that charset, that would differ from htmlspecialchars() when used with the charset parameter?
(Is, at any point, htmlspecialchars($stringThatIsValidUTF8, ENT_QUOTES) !== htmlspecialchars($stringThatIsValidUTF8, ENT_QUOTES, 'utf-8')?)
My understanding says no, never. Another question here on Stack Overflow suggests no, too. So far, browsing my sandbox of the project with the changes also says no. However, I'm not sure whether I'm overlooking something.
I think the quote from the PHP manual in the other question answers it definitively:
For the purposes of this function, the charsets ISO-8859-1, ISO-8859-15, UTF-8, cp866, cp1251, cp1252, and KOI8-R are effectively equivalent, as the characters affected by htmlspecialchars() occupy the same positions in all of these charsets.
The characters ", &, > and so on all have the same code in each of those encodings. In UTF-8 they each occupy a single byte, and every byte of a multi-byte UTF-8 sequence has a value of 0x80 or above, so no part of a multi-byte character can be mistaken for one of them. Therefore, even if you have been processing valid UTF-8 data as ISO-8859-1 until now, the output will be identical when you switch to explicit UTF-8.
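The claim can be checked empirically with a minimal sketch like the one below. It assumes PHP 7 or later (for the \u{...} escapes and scalar type hints). The helper sameEscaping() and the sample strings are made up for illustration, and 'ISO-8859-1' is passed explicitly to stand in for the implicit default charset that PHP versions before 5.4 used.

```php
<?php
// Sketch only: sameEscaping() is a hypothetical helper, not part of PHP.
function sameEscaping(string $s): bool
{
    // 'ISO-8859-1' stands in for the implicit default charset of PHP before 5.4.
    $legacy   = htmlspecialchars($s, ENT_QUOTES, 'ISO-8859-1');
    $explicit = htmlspecialchars($s, ENT_QUOTES, 'UTF-8');

    return $legacy === $explicit;
}

// Hypothetical sample inputs; the \u{...} escapes produce valid UTF-8 bytes.
$samples = [
    'plain ASCII text',
    '<a href="x?a=1&b=2">it\'s</a>',
    "caf\u{00E9} & \u{20AC}5 <b>\u{65E5}\u{672C}\u{8A9E}</b>",
];

foreach ($samples as $s) {
    // Dumps whether both calls produced byte-identical output for this sample.
    var_dump(sameEscaping($s));
}
```

For valid UTF-8 input such as these samples, each comparison should come out true. The sketch makes no claim about malformed byte sequences, which the question excludes anyway.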