Tags: security, unicode, unique, string-comparison, owasp

How to Protect Against Unicode Security Vulnerabilities


"Five things everyone should know about Unicode" is a blog post showing how Unicode characters can be used as an attack vector for websites.

The main example given of such a real-world attack is a fake WhatsApp app submitted to the Google Play store. Its developer name contained a non-printable Unicode space, which made the name technically unique and allowed it to get past Google's filters. The Mongolian Vowel Separator (U+180E) is one such non-printable space character.
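
To make the trick concrete, here is a minimal Python sketch. It is not taken from the blog post: the two developer names and the `invisible_characters` helper are made up for illustration. It shows that a name containing U+180E compares unequal to the plain one, and that such characters can be spotted by their Unicode general category.

```python
import unicodedata

# Hypothetical names for illustration; the second one contains U+180E.
genuine = "WhatsApp Inc."
lookalike = "WhatsApp\u180e Inc."

def invisible_characters(text):
    """Return (index, code point) pairs for characters that are likely invisible."""
    hits = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        # Cf = format, Cc = control, Zl/Zp = line/paragraph separator,
        # Zs = space separator (flagged unless it is the plain ASCII space).
        if category in {"Cf", "Cc", "Zl", "Zp"} or (category == "Zs" and ch != " "):
            hits.append((index, f"U+{ord(ch):04X}"))
    return hits

print(genuine == lookalike)             # False
print(invisible_characters(genuine))    # []
print(invisible_characters(lookalike))  # [(8, 'U+180E')]
```

The category check accepts both `Cf` and `Zs` because U+180E has been classified differently across Unicode versions.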


Another attack is to substitute alternative Unicode characters that look similar to the expected ones. The Mimic tool shows how this can work.
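
A short Python illustration of the look-alike problem follows. The strings are made-up examples, not output from Mimic. The two names usually render identically, yet one of them starts with a Cyrillic letter.

```python
import unicodedata

latin = "admin"
mixed = "\u0430dmin"  # first letter is CYRILLIC SMALL LETTER A (U+0430)

print(latin == mixed)  # False

# Unicode normalization does not merge the two: NFKC leaves the Cyrillic letter alone.
print(unicodedata.normalize("NFKC", mixed) == latin)  # False

# Listing the character names exposes the substitution.
for ch in mixed:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```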

One example I can think of is protecting usernames when registering a new user: you don't want two usernames to be identical, or to look identical either.

How do you protect against this? Is there a list of these characters out there? Should it be common practice to strip all of these types of characters from all form inputs?


Solution

  • What you are talking about is called a homoglyph attack.

    There is a "confusables" list published by Unicode here; also have a look at this. There should be libraries based on these, or potentially on other databases. One such library is this one, which you can use in Java or JavaScript. Equivalents likely exist for other languages as well, or you can write one.

    The important thing, I think, is not to maintain your own database: a library or service is easy to build on top of good data.

    As for whether you should filter out similar-looking usernames, I think it depends. If users have an incentive to fake each other's usernames, then probably yes. For many other types of data, there may be no point in doing so. I don't think there is a generic best practice, other than that you should assess the risk in your own application, with your own data.

    Also, as a different approach to a different problem, what often works for Unicode input validation is the \w word character class in a regular expression, provided your regex engine is Unicode-aware. In such an engine, \w should match all Unicode classes of word characters, i.e. letters, modifiers and connectors in any language (and typically digits as well), but nothing else (no special characters). This does not protect against homoglyph attacks, but it may protect against some injections while keeping your application Unicode-friendly.
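
As a rough illustration of the `\w` approach, here is a minimal sketch in Python 3, whose `re` module treats `\w` as Unicode-aware for `str` patterns unless `re.ASCII` is set. The `is_word_only` helper and the sample inputs are hypothetical. In Python, `\w` also matches digits and the underscore, and other engines may draw the line slightly differently.

```python
import re

# In Python 3, \w on str patterns is Unicode-aware by default.
WORD_ONLY = re.compile(r"\w+")

def is_word_only(value):
    """Return True if value is non-empty and made up only of word characters."""
    return WORD_ONLY.fullmatch(value) is not None

print(is_word_only("Jos\u00e9"))     # True: accented Latin letter
print(is_word_only("名前"))           # True: CJK ideographs
print(is_word_only("bob<script>"))   # False: angle brackets are rejected
print(is_word_only("\u0430dmin"))    # True: a Cyrillic look-alike still passes
```

The last call shows the limitation mentioned above: a check like this filters out special characters, but a homoglyph is still a valid word character.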