javascript, html, regex, whitelist, blacklist

How do I blacklist characters in a UTF-8 string?


I have an HTML text input where users can write in a name for themselves. The name is just a user-friendly display name, it's not used to identify the user in the database or for anything on the back end.

I want to allow UTF-8 characters, so that people can input characters of their native language, whether it's Chinese or Swedish or whatever.

However, I want to blacklist certain characters, like <, >, [, ], ?, *, and so on, to stop any script kiddies from exploiting the input for an SQL injection or whatever.

I thought this would be straightforward code with lots of examples on the web, but the answer, if it's out there, is buried among examples of how to use a whitelist to validate email addresses (only English alphanumeric characters, no Asian or other language-specific characters), or, oddly enough, how to stop key presses for certain characters entirely.

I don't want to stop key presses entirely, as I think that might confuse the user in my case. Instead I'll output an error saying they can't use character X if they input a blacklisted one.

So, for a guy like me who totally sucks at regex, is there a straightforward way of blacklisting characters in JavaScript?

I would also go for a whitelist solution if it didn't inhibit the ability for users to put in the funky characters from whatever language they're using.


Solution

  • To start, you would want to clean the input data on both the client side and the server side. Anyone clever enough to craft an attack will be clever enough to disable JavaScript long enough to get the data they want into your forms.

    Now, as far as a JavaScript regex to prevent entry goes, there are lots of questions on SO that talk about this.

    //some technical stuff

    javascript regexp remove all special characters

    Does this set of regular expressions FULLY protect against cross site scripting?

    //simple js example to stop entry of unwanted characters.

    http://www.sitepoint.com/forums/showthread.php?t=142118

    From what I've read, the consensus seems to be that you should whitelist rather than blacklist. Perhaps someone from SO with more experience can point you towards the best way to handle your use case.
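
For what it's worth, here is a rough sketch of both approaches in plain JavaScript. The function names (findBlacklisted, validateDisplayName, isAllowedName) and the exact character lists are made up for illustration, so adjust them to your case. The whitelist version assumes an engine that supports Unicode property escapes (\p{L} and friends, ES2018), which is what lets it accept letters from any script. Treat this as a client-side usability check only; it does not replace handling the input properly on the server.

```javascript
// Hypothetical blacklist: the characters from the question plus a few more.
// This list is an assumption for illustration, not a vetted security policy.
const BLACKLIST = /[<>\[\]?*{}\\\/;=&%$"]/gu;

// Returns the distinct blacklisted characters found in `name`,
// in order of first appearance (empty array if there are none).
function findBlacklisted(name) {
  const matches = name.match(BLACKLIST) || [];
  return [...new Set(matches)];
}

// Blacklist check that produces an error message instead of blocking key presses.
function validateDisplayName(name) {
  const bad = findBlacklisted(name);
  if (bad.length > 0) {
    return { ok: false, error: "You can't use: " + bad.join(" ") };
  }
  return { ok: true, error: null };
}

// Whitelist alternative: letters, combining marks and digits from any script,
// plus space, period, apostrophe and hyphen. At least one character is required.
const WHITELIST = /^[\p{L}\p{M}\p{N} .'\-]+$/u;

function isAllowedName(name) {
  // Normalize first so composed and decomposed forms are treated alike.
  return WHITELIST.test(name.normalize("NFC"));
}
```

You could call validateDisplayName(input.value) from a blur or submit handler and show result.error next to the field when result.ok is false. The whitelist is stricter: it rejects anything you didn't think to allow, which is the behaviour the consensus above is arguing for.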