Search code examples
regexvalidationbusiness-logic

Practical user validation (sensitivity and specificity)?


When I was first learning how to use regular expressions we were taught how to parse things like phone numbers (obviously always 5 digits, an optional space and a further 6 digits), email addresses (obviously always alphanumerics, then a single '@', then alphanumerics followed by a '.' and three letters) which we should always do to validate the data that the user enters.

Of course as I've developed I've learned how silly the basic approach can be, but the more I look, the more I question the concept altogether, the most open careful correct validation of something like an email address through regexes ends up being hundreds if not thousands of characters long in order to both accept all the legal cases and correctly reject only the illegal ones. Even worse, all that effort does absolutely nothing for the actual validity, the user may have accidentally added an 'a', or may not use that email address at all, or even is using someone else's address, or may even use a '+' symbol which is being flagged inappropriately.

Yet at the same time seemingly every site I come across still does this kind of technical checking, preventing me from putting more obscure characters in an email address or name, or objecting to the idea that someone would have more or less than a single title, then a single firstname and a single lastname, all made purely from latin characters yet without any form of check that it's my real name.

Is there a benefit to this? Once injection attacks are handled (which should be through methods other than sterilizing the input) is there any other point to these checks?

Or on the other hand, is there actually a sure fire way to actually validate user details other than to 'use' them in whatever way makes sense contextually and see if it falls over?


Solution

  • is there any other point to these checks?

    Certainly. Knowing that your data is valid is very important. In the case of email addresses, for example, sending an email to an address you haven't validated will, at the very least, lead to bounces. Enough bounces and your mailhost might block you for spamming. Not validating a phone number could lead to unnecessary costs if your app tries to send SMS to them. The list goes on and on.

    Or on the other hand, is there actually a sure fire way to actually validate user details other than to 'use' them in whatever way makes sense contextually and see if it falls over?

    Yes, but regex is generally bad way to validate data. If a phone number is supposed to be "5 digits a space then 6 digits", then your check is going to fail if I type "5 digits two spaces then 6 digits" or "5 digits a dash then 6 digits" or "11 digits". Use common sense, and expect any crazy format the user provides. Know what the absolute minimal requirement is. For example, if you need 11 digits total, then strip everything that's not a digit first. Then formatting doesn't matter.

    Also, read the RFCs. I can't count the number of times my email address has been rejected because it has a plus sign in it. The amount of those that were large tech-oriented company with programmers that should know better was rather disappointing.