
How to validate multilingual names in PHP?


I'm building a global website using PHP, and I want to let users enter their first and last names in their own language, not only in English. For example, Indian users should be able to enter their names in Indian letters, Russian users in Russian letters, and so on.

Right now, I allow first and last names to consist only of letters. So my question is: how should I validate the names? That is, how should I check that they consist only of letters? If I had only English names, it would look like this: preg_match('/[^A-Za-z]/', $fname.$lname), but now I have more than English letters.

Note: I can't write this validation formula again and again for every language and its letters.

Thanks for reading this question so far. Any ideas??


Solution

  • If you want to validate the names with a regex, you'll have to turn on Unicode mode with the /u modifier. In Unicode mode, the PCRE character classes match not only ASCII letters but alphabetic characters in any language and script. Suppose you used the [:alpha:] class, or \p{L}, which is what [:alpha:] expands to with Unicode on:

    $fname = 'हिन्दी';
    $lname = 'Русский';
    
    // Returns 1 if any character is NOT a letter, i.e. the name is invalid.
    preg_match('/[^[:alpha:]]/u', $fname.$lname);
    

    Here "Russkiy" validates as expected, but "Hindi" fails. Why? Hindi is written in an abugida script, which has vowel diacritics and inherent-vowel muters as part of its construction. One might assume that "ि", "्" and "ी" above register as letters, but they don't. They belong to a different category, \p{M}: marks that combine with other characters. So, to match languages written in abugidas (e.g. Indic scripts, as well as Myanmar, Thai, Tibetan, etc.), we should rather use:

    // Returns 1 if any character is neither a letter nor a combining mark.
    preg_match('/([^\p{L}\p{M}])/u', $fname.$lname);
    

    I've tentatively verified this combination as matching letter-and-combining-mark characters as expected in the following languages: Akkadian, Arabic, Armenian, Greek, Gujarati, Hebrew, Hindi, Japanese, Malayalam, Mandarin, Russian, Sinhalese, Sumerian, Tamil, Thai. Pending more exhaustive tests, it's a fair bet that the above covers most of your alphabetic bases.
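
For illustration, the letters-plus-marks check can be wrapped in a small helper. This is only a sketch: the function name isLettersOnly is hypothetical, and instead of searching for an offending character it anchors the pattern with \A and \z so that the whole string must consist of letters and marks. It also contrasts \p{L} on its own with \p{L}\p{M} on the Hindi example.

```php
<?php
// Hypothetical helper (not part of the original answer): true when $name is
// non-empty and consists only of Unicode letters and combining marks.
function isLettersOnly(string $name): bool
{
    // \p{L} = any letter, \p{M} = any combining mark, /u = UTF-8 mode.
    // preg_match() returns 1 on a match, 0 on no match, and false on error
    // (for example, when $name is not valid UTF-8), hence the strict === 1.
    return preg_match('/\A[\p{L}\p{M}]+\z/u', $name) === 1;
}

// Letters alone are not enough for Devanagari: the vowel signs are marks.
$lettersAlone = preg_match('/\A\p{L}+\z/u', 'हिन्दी');

var_dump($lettersAlone);              // 0: rejected when only \p{L} is allowed
var_dump(isLettersOnly('हिन्दी'));     // true
var_dump(isLettersOnly('Русский'));   // true
var_dump(isLettersOnly('R2D2'));      // false: digits are not letters
```

Comparing against 1 rather than relying on truthiness means invalid UTF-8 input is treated as a failed validation instead of slipping through.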

    Now, to a wholly Unicode-unrelated matter of validating names. I notice you don't allow spaces in names. Fear the day when "Abraham Van Helsing" and "Osama bin Laden" try to sign up. Then, you don't allow periods. What about "V. S. Achuthanandan"? People call him "Vee Es", because "Velikkakathu Sankaran" makes your mouth tired. And what about "J. K. Rowling"?

    Again, you don't allow dashes. What about "Kareem Abdul-Jabbar" and "Jean-Luc Picard"? No pro basketball or warp drives for you. And not allowing apostrophes means "Count d'Artagnan" may challenge you to a duel, and the future may belong to Skynet now because "Sarah O'Connor" failed to register. She won't be back. Your site isn't that cool.
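
If you do loosen the rule to admit names like these, one possible shape is sketched below. The function name isPlausibleName and the exact set of extra characters (space, period, hyphen, straight apostrophe) are assumptions made for illustration, not a recommendation of where to draw the line.

```php
<?php
// Hypothetical helper: a name must start with a letter and may then contain
// letters, combining marks, spaces, periods, hyphens and straight apostrophes.
function isPlausibleName(string $name): bool
{
    // The hyphen is placed last in the class so it is taken literally.
    // \A and \z anchor the match to the whole string; /u enables UTF-8 mode.
    return preg_match('/\A\p{L}[\p{L}\p{M} .\'-]*\z/u', $name) === 1;
}

$samples = [
    'Abraham Van Helsing',
    'V. S. Achuthanandan',
    'Kareem Abdul-Jabbar',
    "Count d'Artagnan",
    'R2D2',
];

foreach ($samples as $sample) {
    // Prints each name followed by "ok" or "rejected".
    echo $sample, ': ', isPlausibleName($sample) ? 'ok' : 'rejected', PHP_EOL;
}
```

Every character added to the class widens what gets through, so a pattern like this still rejects names containing digits or other symbols.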

    And what about good old Bobby Tables, a.k.a. Robert'); DROP TABLE students;--, or Elon Musk's newborn "X Æ A-12"? There, I've told you how you can match any letter or fragment thereof in any language. I'm also implying that if you allow all of the above, which is pretty much the baseline for avoiding false positives, it's probably not very different from not checking to begin with. Give "x!1യ!! O'/nul1 W0W@本@?" the freedom to use a strange name, if that's what they really want.

    Further reading: