Validate Title Case Full Name with Regex


To learn regex, I have been solving practice problems, and this is one of them. I know regex might not be the best tool for this job, and my pattern is a mess, but I liked the challenge.

Problem:

  • Names need to be in Title Case;
  • Some lowercase words inside the name are allowed as exceptions;
  • So are some names with internal capitals, e.g.: McDonald, MacDuff, D'Estoile;
  • Names with ' and - are accepted, and they can appear as o'Brien, O'brien, O'Brien, O' Brien or 'Ehu Kali;
  • No whitespace at the beginning or end of the name;
  • No more than one space between the names that make up a full name;
  • A . is accepted if it is not alone, e.g.: Dan . Ferdnand (not accepted) and Dan G. Ferdnand (accepted);
  • Numbers and symbols are not accepted;
  • However, Roman numerals are accepted and are not Title Case, e.g.: Elizabeth II;
  • Some names can stand alone, e.g.: Akihito (Prince of Japan);
  • Some special characters common in some countries are accepted, e.g.: Valeh ßlÿsgÿroğlu, Lażżru Role, Alaksiej Taraškievič
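
The purely mechanical rules in the list above (surrounding whitespace, repeated spaces, digits, a lone period) can also be checked without any regex. The following is a minimal Python sketch of just those four rules. The function name violates_basic_rules is made up for illustration, and the sketch deliberately ignores Title Case, particles and the other symbol rules.

```python
def violates_basic_rules(full_name: str) -> bool:
    """Return True if the name breaks one of the four mechanical rules."""
    if full_name != full_name.strip():
        return True   # whitespace at the beginning or end
    if "  " in full_name:
        return True   # more than one space between names
    if any(ch.isdigit() for ch in full_name):
        return True   # digits are not accepted
    if any(part == "." for part in full_name.split(" ")):
        return True   # a period standing alone
    return False


for candidate in ["Dan . Ferdnand", "Dan G. Ferdnand", "John W8", "Elizabeth II"]:
    print(repr(candidate), violates_basic_rules(candidate))
```

A pre-check like this could run before the regex, leaving the pattern to deal only with casing and allowed characters.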

Regex

The code is

^(?![ ])(?!.*(?:\d|[ ]{2}|[!$%^&*()_+|~=`\{\}\[\]:";<>?,\/]))(?:(?:e|da|do|das|dos|de|d'|la|las|el|los|l'|al|of|the|el-|al-|di|van|der|op|den|ter|te|ten|ben|ibn)\s*?|(?:[A-ZàáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųūÿýżźñçčšžÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ∂ð'][^\s]*\s*?)(?!.*[ ]$))+$

The pattern, together with a validation list, is on Regex101.

References

What I tried so far was based on these:

Not working

I wrote this regex and don't know how to make it reject the cases below, which currently match:

  • CAPITAL LETTER
  • AlTeRnAtE LeTtEr

And these don't match but should:

  • Urxan Əbűlhəsənzadə
  • İsmət Jafarov
  • Şükür Hagverdiyev
  • Űmid Abdurrahimov
  • Ġerardo Seralta
  • Ċikku Paris
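
A likely cause of both symptoms lies in the second alternative of the pattern: each word that is not a listed particle must start with a character from the hand-written class, and Ə, İ, Ş, Ű, Ġ and Ċ are not in it. The [^\s]* that follows accepts anything after that first character, which is why CAPITAL LETTER and AlTeRnAtE LeTtEr get through. The small Python check below (standard library only; the helper name initial_category is hypothetical) prints the Unicode general category of each failing initial, to show what a category-based test would see instead of an enumerated list.

```python
import unicodedata

# Words from the list above whose first letter the hand-written class misses.
words = ["Əbűlhəsənzadə", "İsmət", "Şükür", "Űmid", "Ġerardo", "Ċikku"]


def initial_category(word: str) -> str:
    """Return the Unicode general category of the word's first character."""
    # NFC folds a base letter plus combining mark into one code point where possible.
    normalized = unicodedata.normalize("NFC", word)
    return unicodedata.category(normalized[0])


for word in words:
    print(word, initial_category(word))
```

Every initial here comes back as Lu (uppercase letter), so a test based on that category would accept them without naming each character.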

Question

Is there a way to optimize this Regex (monster)?

And how do I fix the problems listed under Not working?

P.S.: The list of example names for validation can be found at the Regex101 link.


Solution

  • Brief

    Since you're learning regex and haven't specified a flavour, I've chosen PCRE, as it is widely supported.


    Code

    See this regex in use here

    (?(DEFINE)
        (?# Definitions )
        (?<valid_nameChars>[\p{L}\p{Nl}])
        (?<valid_nonNameChars>[^\p{L}\p{Nl}\p{Zs}])
        (?<valid_startFirstName>(?![a-z])[\p{L}'])
        (?<valid_upperChar>(?![a-z])\p{L})
        (?<valid_nameSeparatorsSoft>[\p{Pd}'])
        (?<valid_nameSeparatorsHard>\p{Zs})
        (?<valid_nameSeparators>(?&valid_nameSeparatorsSoft)|(?&valid_nameSeparatorsHard))
        (?# Invalid combinations )
        (?<invalid_startChar>^[\p{Zs}a-z])
        (?<invalid_endChar>.*[^\p{L}\p{Nl}.\p{C}]$)
        (?<invalid_unaccompaniedSymbol>.*(?&valid_nameSeparatorsHard)(?&valid_nonNameChars)(?&valid_nameSeparatorsHard))
        (?<invalid_overTwoUpper>(?:(?&valid_nameChars)*\p{Lu}){3})
        (?<invalid>(?&invalid_startChar)|(?&invalid_endChar)|(?&invalid_unaccompaniedSymbol)|(?&invalid_overTwoUpper))
        (?# Valid combinations )
        (?<valid_name>(?:(?:(?&valid_nameChars)|(?&valid_nameSeparatorsSoft))*(?&valid_nameChars)+(?:(?&valid_nameChars)|(?&valid_nameSeparatorsSoft))*)+\.?)
        (?<valid_firstName>(?&valid_startFirstName)(?:\.|(?&valid_name)*))
        (?<valid_multipleName>(?&valid_firstName)(?=.*(?&valid_nameSeparators)(?&valid_upperChar))(?:(?&valid_nameSeparatorsHard)(?&valid_name))+)
        (?<valid>(?&valid_multipleName)|(?&valid_firstName))
    )
    ^(?!(?&invalid))(?&valid)$
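
The (?(DEFINE)...) block and the (?&name) calls are PCRE features; Python's built-in re module, for one, has neither. In such a flavour a similar effect can be had by composing the pattern from named string fragments. The sketch below is a deliberately reduced illustration of that idea, not a port of the pattern above: the fragment names are hypothetical, [^\W\d_] only approximates \p{L}, a leading apostrophe is not handled, and Title Case is not checked at all.

```python
import re

# Named fragments play the role of the definitions in the DEFINE block.
LETTER = r"[^\W\d_]"        # approximates \p{L}: word characters minus digits and "_"
SOFT_SEPARATOR = r"['\-]"   # apostrophe or hyphen inside a name
NAME = rf"{LETTER}+(?:{SOFT_SEPARATOR}{LETTER}+)*\.?"   # e.g. "Jean-Michel" or "G."
FULL_NAME = re.compile(rf"{NAME}(?: {NAME})*")          # names split by single spaces


def looks_like_full_name(text: str) -> bool:
    """Check only the character and separator structure, not the casing."""
    return FULL_NAME.fullmatch(text) is not None


for candidate in ["Jean-Michel Bozonnet", "Maria G. Silva", "Maria . Silva", "John W8"]:
    print(repr(candidate), looks_like_full_name(candidate))
```

Keeping each fragment in its own variable gives much of the readability of the DEFINE block, at the cost of a final pattern that is harder to read once the fragments are expanded.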
    

    Results

    Input

    == 1NcOrrect N4M3S ==
    CAPITAL LETTER
    AlTeRnAtE LeTtEr
    Natalia maria
    Natalia aria
    Natalia orea
    Maria dornelas
    Samuel eto'
    Miguel lasagna
    Antony1 de Home Ap*ril
    Ap*ril Willians
    Antony_ de Home Apr+il
    Ant_ony de Home Apr#il
    Antony@ de Ho@me Apr^il
    Maria  Silva
    Maria silva
    maria Silva
     Maria Silva
    Maria Silva 
    Maria / Silva
    Maria . Silva
    John W8
    
    ==Correct Names==
    Urxan Əbűlhəsənzadə
    İsmət Jafarov
    Şükür Hagverdiyev
    Űmid Abdurrahimov
    Ġerardo Seralta
    Ċikku Paris
    Hind ibn Sheik
    Colop-U-Uichikin
    Lażżru Role
    Alaksiej Taraškievič
    Petruso Husoǔski
    Sumu-la-El
    Valeh ßlÿsgÿroğlu
    'Arab al-Rashayida
    Tariq al-Hashimi
    Nabeeh el-Mady
    Tariq Al-Hashimi
    Brian O'Conner
    Maria da Silva
    Maria Silva
    Maria G. Silva
    Maria McDuffy
    Getúlio Dornelles Vargas
    Maria das Flores
    John Smith
    John D'Largy
    John Doe-Smith
    John Doe Smith
    Hector Sausage-Hausen
    Mathias d'Arras
    Martin Luther King Jr.
    Ai Wong
    Chao Chang
    Alzbeta Bara
    Marcos Assunção
    Maria da Silva e Silva
    Juscelino Kubitschek de Oliveira
    Maria da Costa e Silva
    Samuel Eto'o
    María Antonieta de las Nieves
    Eugène
    Antòny de Homé April
    àntony de Home ùpril
    Antony de Home Aprìl
    Pierre de l'Estache
    Pierre de L'Estoile
    Akihito
    Nadine Schröder
    Anna A. Møller
    D. Pedro I
    Pope Benedict XVI
    Marsibil Ragnarsdóttir
    Natanaël Morel
    Isaac De la Croix
    Jean-Michel Bozonnet
    Qutaibah Mu'tazz Abadi
    Rushd Jawna' Kassab
    Khaldun Abdul-Qahhar Sabbag
    'Awad Bashshar Asker
    Al B. Zellweger
    Gunnleif Snæ-Ulfsson
    Käre Toresson
    Sorli Ærnmundsson
    Arnkel Øystæinsson
    Ástríður Dórey
    Åsmund Kåresson
    Yahatti-Il
    Ipqu-Annunitum
    Nabu-zar-adan
    Eskopas Cañaverri
    Botolph of Langchester
    Aelfhun the Cantrell
    Fraco di Natale
    Fraco Di Natale
    Iván de Luca
    Iván De Luca
    Man'nah
    Atabala Aüamusalü
    Ramiz Ağasəfalu
    Dadaş Aghakhanov
    Fÿrxad Mübarizlı
    Vaclaǔ Šupa
    Yakiv Volacič
    Flor Van Vaerenbergh
    Flor van Vaerenbergh
    Edwin van der Sar
    Husein Ekmečić
    Álvaro Guimarães Alencar
    Phone U Yaza Arkar
    Seocan MacGhille
    X'wat'e Tlekadugovy
    Albert-Jan Bootsveld
    Maurits-jan Kuipers op den Kollenstaart
    Elco ter Hoek
    Robbert te Poele
    Aad ten Have
    'Ehu Kali
    Ho'opa'a Loni
    Aukanai'i Mahi'ai
    Kalman ben Tal El
    Żytomir Roszkowski
    K'awai
    
    ==EXTRA== only if possible, strange ones
    Maol-Moire Mac'IlleBhuidh
    Tòmas MacIlleChruim
    Aindreas MacIllEathain
    Eanruig MacGilleBhreac
    Peadar MacGilleDhonaghart
    Maolmhuire MacGill-Eain
    Eanruig MacGilleBhreac
    Wim van 't Plasman
    

    Output

    Note: Shown below are only the strings that matched from the above Input

    Urxan Əbűlhəsənzadə
    İsmət Jafarov
    Şükür Hagverdiyev
    Űmid Abdurrahimov
    Ġerardo Seralta
    Ċikku Paris
    Hind ibn Sheik
    Colop-U-Uichikin
    Lażżru Role
    Alaksiej Taraškievič
    Petruso Husoǔski
    Sumu-la-El
    Valeh ßlÿsgÿroğlu
    'Arab al-Rashayida
    Tariq al-Hashimi
    Nabeeh el-Mady
    Tariq Al-Hashimi
    Brian O'Conner
    Maria da Silva
    Maria Silva
    Maria G. Silva
    Maria McDuffy
    Getúlio Dornelles Vargas
    Maria das Flores
    John Smith
    John D'Largy
    John Doe-Smith
    John Doe Smith
    Hector Sausage-Hausen
    Mathias d'Arras
    Martin Luther King Jr.
    Ai Wong
    Chao Chang
    Alzbeta Bara
    Marcos Assunção
    Maria da Silva e Silva
    Juscelino Kubitschek de Oliveira
    Maria da Costa e Silva
    Samuel Eto'o
    María Antonieta de las Nieves
    Eugène
    Antòny de Homé April
    àntony de Home ùpril
    Antony de Home Aprìl
    Pierre de l'Estache
    Pierre de L'Estoile
    Akihito
    Nadine Schröder
    Anna A. Møller
    D. Pedro I
    Pope Benedict XVI
    Marsibil Ragnarsdóttir
    Natanaël Morel
    Isaac De la Croix
    Jean-Michel Bozonnet
    Qutaibah Mu'tazz Abadi
    Rushd Jawna' Kassab
    Khaldun Abdul-Qahhar Sabbag
    'Awad Bashshar Asker
    Al B. Zellweger
    Gunnleif Snæ-Ulfsson
    Käre Toresson
    Sorli Ærnmundsson
    Arnkel Øystæinsson
    Ástríður Dórey
    Åsmund Kåresson
    Yahatti-Il
    Ipqu-Annunitum
    Nabu-zar-adan
    Eskopas Cañaverri
    Botolph of Langchester
    Aelfhun the Cantrell
    Fraco di Natale
    Fraco Di Natale
    Iván de Luca
    Iván De Luca
    Man'nah
    Atabala Aüamusalü
    Ramiz Ağasəfalu
    Dadaş Aghakhanov
    Fÿrxad Mübarizlı
    Vaclaǔ Šupa
    Yakiv Volacič
    Flor Van Vaerenbergh
    Flor van Vaerenbergh
    Edwin van der Sar
    Husein Ekmečić
    Álvaro Guimarães Alencar
    Phone U Yaza Arkar
    Seocan MacGhille
    X'wat'e Tlekadugovy
    Albert-Jan Bootsveld
    Maurits-jan Kuipers op den Kollenstaart
    Elco ter Hoek
    Robbert te Poele
    Aad ten Have
    'Ehu Kali
    Ho'opa'a Loni
    Aukanai'i Mahi'ai
    Kalman ben Tal El
    Żytomir Roszkowski
    K'awai
    Maol-Moire Mac'IlleBhuidh
    Tòmas MacIlleChruim
    Aindreas MacIllEathain
    Eanruig MacGilleBhreac
    Peadar MacGilleDhonaghart
    Maolmhuire MacGill-Eain
    Eanruig MacGilleBhreac
    Wim van 't Plasman
    

    Explanation

    I used a (?(DEFINE)...) block to create named definitions; you can read each definition to see how it works. Throughout, I use \p{...}, where ... names a Unicode character category (e.g. \p{L} is any letter from any language). Not every regex flavour supports this, but where it is available it makes the pattern much simpler, which is why I used it.
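
To make one of the definitions concrete: invalid_overTwoUpper appears to be tried only at the start of the string, since the negative lookahead sits right after ^. Read that way, it rejects a name whose leading run of letters contains three or more uppercase letters, which is what rules out CAPITAL LETTER and AlTeRnAtE LeTtEr while leaving later words such as MacGilleBhreac alone. The Python sketch below restates that single rule with the standard unicodedata module; the function names are hypothetical and the reading of the pattern is an interpretation, not something the pattern documents.

```python
import unicodedata


def is_name_char(ch: str) -> bool:
    """Rough equivalent of valid_nameChars: any letter (L*) or letter-number (Nl)."""
    category = unicodedata.category(ch)
    return category.startswith("L") or category == "Nl"


def starts_with_three_uppercase(name: str) -> bool:
    """True if the leading run of name characters holds 3 or more uppercase letters."""
    uppercase = 0
    for ch in name:
        if not is_name_char(ch):
            break   # the run ends at a space, hyphen, apostrophe, period, ...
        if unicodedata.category(ch) == "Lu":
            uppercase += 1
    return uppercase >= 3


for candidate in ["CAPITAL LETTER", "AlTeRnAtE LeTtEr", "Eanruig MacGilleBhreac"]:
    print(repr(candidate), starts_with_three_uppercase(candidate))
```

Because only the first run of letters is inspected, Roman numerals such as the XVI in Pope Benedict XVI are not affected by this rule.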

    If you need anything else explained, don't hesitate to ask me and I'll do my best, but regex101 should be able to explain anything you're wondering about regex.