Tags: javascript, php, regex, regex-group, regex-look-ahead

RegEx to match either words separated by dash or just a single word


The requirement is to match people's last names, with a dash separating each last name from the next.

The base RegEx I am using for this is this one:

(?=\S*[-])([a-zA-ZÑñÁáÉéÍíÓóÚúÄäËëÏïÖöÜüÀàÈèÌìÒòÙù'-]+)

Basically I am limiting it to Latin alphabet characters, including some accented characters.

This works perfectly fine if I use examples like:

  • Pérez-González
  • Domínguez-Díaz
  • Güemez-Martínez

But I forgot to account for the case where the person has only one last name.

I tried doing the following.

((?=\S*[-])([\ a-zA-ZÑñÁáÉéÍíÓóÚúÄäËëÏïÖöÜüÀàÈèÌìÒòÙù'-]+))|([A-Za-zÑñÁáÉéÍíÓóÚúÄäËëÏïÖöÜüÀàÈèÌìÒòÙù']+)

I added an escaped space (\ ) to the allowed characters of the first alternative, and I added an "or" branch for a single word without spaces.

And while it works for some cases there are 2 issues.

  1. I don't think it's the most optimal RegEx for a use case like this.
  2. I stumbled upon the specific case with people who have complex last names.

Regarding point 2, I refer to something like:

  • Johnson-De Sosa

The RegEx matches it, but it no longer respects the dash as a separator.

I am not sure how to handle this.

Also since I added the space it no longer respects the requirement for the dash between words.

What I am thinking is to limit the number of spaces inside a last name, for example allowing at most 2 or 3 spaces per last name, so that examples like:

  • Pérez-De la Cruz - this works with my RegEx
  • Pérez De la Cruz-González - this doesn't

can be valid matches.
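
The difference between those two examples appears to come from the lookahead: `(?=\S*[-])` only searches for a dash inside the current run of non-whitespace characters, so it cannot see a dash that comes after a space. The sketch below checks this with the second pattern from the question, copied verbatim; `attempt` and `firstMatch` are hypothetical names introduced only for this illustration.

```javascript
// The second attempt from the question, copied verbatim.
const attempt = /((?=\S*[-])([\ a-zA-ZÑñÁáÉéÍíÓóÚúÄäËëÏïÖöÜüÀàÈèÌìÒòÙù'-]+))|([A-Za-zÑñÁáÉéÍíÓóÚúÄäËëÏïÖöÜüÀàÈèÌìÒòÙù']+)/;

// Hypothetical helper: returns the text of the first match, or null.
function firstMatch(input) {
  const m = input.match(attempt);
  return m ? m[0] : null;
}

// The dash sits in the first whitespace-free chunk ("Pérez-De"), so the
// lookahead succeeds at index 0 and the class, which allows spaces,
// consumes the rest of the string.
console.log(firstMatch('Pérez-De la Cruz'));          // "Pérez-De la Cruz"

// The first chunk is "Pérez", with no dash before the space, so the
// lookahead fails at index 0 and the single-word alternative wins.
console.log(firstMatch('Pérez De la Cruz-González')); // "Pérez"
```

Because the pattern is not anchored with `^` and `$`, a partial match such as "Pérez" still counts as a match, which is easy to miss in browser validation.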

I am no pro at RegEx, so some help would be greatly appreciated.

UPDATE

I failed to mention that I need to be able to use this with JavaScript. PHP could be useful too, but I am doing some browser validation and the patterns need to be compatible.


Solution

  • Logically, you should match one or more letters, then allow a single occurrence of your chosen delimiting characters before allowing another string of one or more letters.

    PHP Code:

    $names = [
        'Pérez-González',
        'Domínguez-Díaz',
        'Güemez-Martínez',
        'Johnson-De Sosa',
        'Pérez-De la Cruz',
        'smith',
        'Pérez De la Cruz-González',
        'de Gal-O\'Connell',
        'Johnson--Johnson'
    ];
    
    foreach ($names as $name) {
        echo "$name is " . (!preg_match("~^\pL+(?:[- ']\pL+)*$~u", $name) ? 'in' : '') . "valid\n";
    }
    

    JavaScript Code:

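    const names = [
        'Pérez-González',
        'Domínguez-Díaz',
        'Güemez-Martínez',
        'Johnson-De Sosa',
        'Pérez-De la Cruz',
        'smith',
        'Pérez De la Cruz-González',
        'de Gal-O\'Connell',
        'Johnson--Johnson'
    ];

    // Letters, then any number of "single delimiter + letters" groups.
    // The u flag is required for \p{L} to mean "any Unicode letter".
    const pattern = /^\p{L}+(?:[- ']\p{L}+)*$/u;

    // for...of iterates the values; for...in would iterate the index keys.
    for (const name of names) {
        console.log(name + " is " + (pattern.test(name) ? '' : 'in') + "valid");
    }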
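
One detail worth checking when porting the pattern between PHP and JavaScript is the trailing `u` flag on the JavaScript regex. As far as I know, `\p{L}` is only treated as a Unicode property escape when that flag is present; without it, the same text is read as the literal characters `p{L}`. A small sketch (`withFlag` and `withoutFlag` are hypothetical names):

```javascript
// Same pattern body, with and without the u flag.
const withFlag = /^\p{L}+$/u;
const withoutFlag = /^\p{L}+$/;

console.log(withFlag.test('Pérez'));    // true: every character is a letter
console.log(withoutFlag.test('Pérez')); // false: \p{L} is not a letter class here
console.log(withoutFlag.test('p{L}'));  // true: the pattern matches its own literal text
```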

    This will only allow a single delimiter between sequences of letters. It will fail if someone's name is "Suzy 'Ng" because that has a space followed by an apostrophe (two consecutive delimiters). I don't know if such a name is possible or real; I just want to make the limitation clear.

    No lookarounds are necessary.
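
If the dash needs to stay distinct from spaces, as the question's idea of "at most 2 or 3 spaces" per last name suggests, the same "letters, single delimiter, letters" building block can be nested. This is only a sketch of one possible variation, not part of the answer above: `MAX_SPACES`, `WORD`, `LAST_NAME`, `LAST_NAMES`, and `isValidLastNames` are hypothetical names, and the limit of 3 spaces is an assumption taken from the question.

```javascript
// Hypothetical variation; the limit below is an assumption from the question.
const MAX_SPACES = 3; // at most 3 spaces inside one last name

// One word: letters, optionally joined by single apostrophes (O'Connell).
const WORD = String.raw`\p{L}+(?:'\p{L}+)*`;
// One last name: a word plus up to MAX_SPACES more, each after one space.
const LAST_NAME = `${WORD}(?: ${WORD}){0,${MAX_SPACES}}`;
// Whole input: one or more last names, separated by exactly one dash.
const LAST_NAMES = new RegExp(`^${LAST_NAME}(?:-${LAST_NAME})*$`, 'u');

function isValidLastNames(input) {
  return LAST_NAMES.test(input);
}

console.log(isValidLastNames('Pérez De la Cruz-González')); // true
console.log(isValidLastNames('smith'));                     // true
console.log(isValidLastNames('Johnson--Johnson'));          // false: two dashes in a row
console.log(isValidLastNames('a b c d e'));                 // false: 4 spaces in one last name
```

Building the pattern from named pieces keeps the limit in one place, and like the answer's pattern it needs no lookarounds.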