php html html-entities sanitization

Is it safe to unescape ampersand for user input?


After a few hours of bug searching, I found out the cause of one of my most annoying bugs.

When users are typing out a message on my site, they can title it with plaintext and html entities.

This means that in some instances users will type a title containing pictures built from HTML entities, like this face: ( ͡° ͜ʖ ͡°).

To prevent HTML injection, I use htmlspecialchars() on the title, and annoyingly it would convert the picture to its HTML entity format when it was output onto the page later on.

( &#865;&#176; &#860;&#662; &#865;&#176;)

I realized the problem here was that the title was being encoded as in the example above, and htmlspecialchars(), as well as doing what I wanted and encoding possible HTML injection, was turning the ampersand in each entity into

&amp;.

Un-escaping all the ampersands, changing them back to &, fixed my problem, and the face came out as expected.

However, I am unsure whether this is still safe from malicious HTML. Is it safe to decode the ampersands in user-entered titles? If not, how can I go about fixing this issue?


Solution

  • If your entities are displayed as text, then you're probably calling htmlspecialchars() twice.

    If you are not explicitly calling htmlspecialchars() twice, then the cause is probably browser-side auto-escaping, which can occur when the page containing the form uses an obsolete single-byte encoding such as Windows-1252. Escaping is the only way for the browser to correctly represent characters that are not present in the character set of that single-byte encoding. All current browsers (including Firefox, Opera, and IE) do this.

    Make sure you are using Unicode (UTF-8 in particular) encoding.

    To use Unicode as the encoding, add the <meta charset="utf-8" /> element to the HEAD section of the HTML page that contains the form, and don't forget to save the HTML page itself in UTF-8 encoding. To use Unicode in PHP, it's typically enough to use the multibyte (mb_-prefixed) string functions. Finally, database engines like MySQL have supported UTF-8 for a long time.

    As a temporary workaround, you can disable re-encoding of existing entities by setting the fourth parameter ($double_encode) of the htmlspecialchars() function to false.
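The difference between the default call, the $double_encode workaround, and a UTF-8 title can be sketched as follows. This is only an illustration under assumptions: the sample title, the variable names, and the use of numeric character references for the face are made up for the example, and the \u{...} escapes need PHP 7.0 or later.

```php
<?php
// Hypothetical title as it might reach PHP after the browser replaced
// characters it could not encode with numeric character references.
$submitted = '( &#865;&#176; &#860;&#662; &#865;&#176;) <b>hi</b>';

// Default call: every "&" becomes "&amp;", so the references show up as text.
$doubleEncoded = htmlspecialchars($submitted, ENT_QUOTES, 'UTF-8');

// Fourth argument false: existing references are left alone,
// while "<", ">" and quotes are still escaped.
$entitiesKept = htmlspecialchars($submitted, ENT_QUOTES, 'UTF-8', false);

// With UTF-8 end to end, the title arrives as real characters,
// so the plain default call is enough and no workaround is needed.
$utf8Title = "( \u{0361}\u{00B0} \u{035C}\u{0296} \u{0361}\u{00B0}) <b>hi</b>";
$utf8Escaped = htmlspecialchars($utf8Title, ENT_QUOTES, 'UTF-8');

echo $doubleEncoded, "\n", $entitiesKept, "\n", $utf8Escaped, "\n";
```

In all three cases the <b> tag is escaped, so the markup itself is never passed through; only the treatment of ampersands that already belong to an entity differs.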