regex regex-group regex-negation pcre2

More name parsing outliers


Names can be so difficult to parse into first, middle, last, and suffix.

This group of names (saved at regex.com) is giving me a migraine.

The desired parse is actually /^(.*)(\b[Vv][ao]n\b\s\w+|\b[Dd][eu]\s\b\w+)(.*)/, which groups 'De La'. I also want to make sure that 'La name' is included and grouped properly, so I focused on the difference between 'De La name' and 'La name' to make sure the logic works.

I'm also not sure how to incorporate (De La \w+)|(La \w+) into the rest of the regex.

TIA

** Update (per @lemon's request) **

The name string Emile La Sére should return (Emile) (La Sére) without losing the diacritical on the "e"

Justin De Witt Bowersock should return (Justin) (De Witt) (Bowersock)

Monica De La Cruz should return (Monica) (De La Cruz)

Robert M. La Follette should return (Robert M.) (La Follette) or ideally (Robert) (M.) (La Follette)

Henry St. John should return (Henry) (St. John)

Edward St. Loe Livermore should return (Edward) (St. Loe) (Livermore)

Oscar L. Auf der Heide should return (Oscar) (L.) (Auf der Heide)

I've been able to successfully parse these in various groupings. I don't know if it is possible to parse the whole range in a single pattern.

The main pattern that partially works is (^.*)\b([Vv][ao]n\s\w+|[Dd][ue]\s\w+|[Dd]e\s[Ll]a\s\w+|St\.\s\w+)\s*(.*). However, the crossover between 'De Witt', '[Dd]e [Ll]a Cruz' and '[Ll]a Follette' is giving me a headache.

Also I am a novice regex wizard so there's that.

** Update 2 ** This pattern from @The fourth bird is almost perfect. I dressed it up with a couple of additions to catch the previously unmentioned outliers, so it's almost airtight (assuming there are no other pattern outliers I've missed).

** Update **

Thanks to @The fourth bird this pattern is the one that works.


Solution

  • As you already pointed out, names can be really difficult to parse. For a good read on the topic, see Falsehoods Programmers Believe About Names.

    For the provided example data, you might use:

    ^(.*?)\b((?:[Vv][ao]n|(?:[Dd][eu]\s+)?La|[Dd][eu]|St\.|Auf\s+der)\s+\p{L}+)(.*)
    
    • ^ Start of string
    • (.*?) Capture group 1, match any character as few times as possible (lazy)
    • \b A word boundary
    • ( Capture group 2
      • (?: Non capture group for the alternatives
        • [Vv][ao]n Match V or v, then a or o, then n
        • | Or
        • (?:[Dd][eu]\s+)?La Optionally match D or d, then e or u, then 1+ whitespace chars, followed by La
        • | Or
        • [Dd][eu] Match D or d, then e or u
        • | Or
        • St\. Match St.
        • | Or
        • Auf\s+der Match Auf der with 1+ whitespace chars in between
      • ) Close the non capture group
      • \s+ Match 1+ whitespace chars
      • \p{L}+ Match any letter 1+ times
    • ) Close group 2
    • (.*) Capture group 3, match the rest of the line (can be empty)
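The order of the alternatives matters for the 'De La' versus 'De Witt' crossover: the optional 'De ' followed by 'La' is tried before the bare [Dd][eu] alternative. A minimal sketch of that effect, using reduced versions of the pattern above (the names laFirst and deFirst are made up for this illustration):

```javascript
// Same order as in the pattern above: optional "De " + "La" is tried first.
const laFirst = /^(.*?)\b((?:(?:[Dd][eu]\s+)?La|[Dd][eu])\s+\p{L}+)(.*)/u;

// Hypothetical reordering for comparison: the bare [Dd][eu] alternative is tried first.
const deFirst = /^(.*?)\b((?:[Dd][eu]|(?:[Dd][eu]\s+)?La)\s+\p{L}+)(.*)/u;

const name = "Monica De La Cruz";

console.log(laFirst.exec(name)[2]); // "De La Cruz"
console.log(deFirst.exec(name)[2]); // "De La" (and "Cruz" ends up in group 3)

// "De Witt" still works with the original order, because the optional
// "De " part is given up again when "La" does not follow it.
console.log(laFirst.exec("Justin De Witt Bowersock")[2]); // "De Witt"
```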

    See a regex demo.

    When using JavaScript, include the u flag so that \p{L} is treated as a Unicode property escape:

    const regex = /^(.*?)\b((?:[Vv][ao]n|(?:[Dd][eu]\s+)?La|[Dd][eu]|St\.|Auf\s+der)\s+\p{L}+)(.*)/gmu;
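As a rough usage sketch, the pattern can be run against the names from the question one string at a time. The helper splitName is hypothetical (it is not part of the question or the answer), and because each name is a separate string, only the u flag is used here:

```javascript
// Pattern copied from above; no g or m flag, since each name is its own string.
const nameRegex = /^(.*?)\b((?:[Vv][ao]n|(?:[Dd][eu]\s+)?La|[Dd][eu]|St\.|Auf\s+der)\s+\p{L}+)(.*)/u;

// Hypothetical helper: returns the trimmed, non-empty capture groups,
// or null when none of the particles is found in the name.
function splitName(name) {
  const match = nameRegex.exec(name);
  if (match === null) return null;
  return match
    .slice(1)                       // drop the full match, keep groups 1..3
    .map((part) => part.trim())     // groups 1 and 3 carry surrounding spaces
    .filter((part) => part !== ""); // group 3 is empty when nothing follows
}

const samples = [
  "Emile La Sére",
  "Justin De Witt Bowersock",
  "Monica De La Cruz",
  "Robert M. La Follette",
  "Henry St. John",
  "Edward St. Loe Livermore",
  "Oscar L. Auf der Heide",
];

for (const name of samples) {
  console.log(name, "->", splitName(name));
}
```

With this sketch, 'Robert M. La Follette' comes back as (Robert M.) (La Follette); splitting the middle initial off group 1 would need a separate step. The diacritic in 'Sére' is kept as long as the é is a single precomposed character, since a combining accent is not matched by \p{L}.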
    

    Note that \s can also match a newline.

    When using PCRE, for example, you can replace \s with \h to match only horizontal whitespace chars (no newlines); see this regex demo.
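JavaScript has no \h, so a common stand-in there is the negated class [^\S\r\n] (whitespace except carriage return and line feed). It is only an approximation of \h, since it still matches a few other vertical whitespace characters. The sketch below compares it with \s on multi-line input; buildPattern is a hypothetical helper introduced just for this comparison:

```javascript
// Hypothetical helper: builds the pattern from above with a configurable
// whitespace token, so the \s and "horizontal only" variants can be compared.
const buildPattern = (ws) =>
  String.raw`^(.*?)\b((?:[Vv][ao]n|(?:[Dd][eu]${ws}+)?La|[Dd][eu]|St\.|Auf${ws}+der)${ws}+\p{L}+)(.*)`;

const withS = new RegExp(buildPattern(String.raw`\s`), "gmu");
// Approximation of PCRE's \h: any whitespace except CR and LF.
const withH = new RegExp(buildPattern(String.raw`[^\S\r\n]`), "gmu");

// The first name is broken across two lines on purpose.
const text = "Justin De\nWitt Bowersock\nMonica De La Cruz";

const group2 = (regex) => [...text.matchAll(regex)].map((m) => m[2]);

console.log(group2(withS)); // [ 'De\nWitt', 'De La Cruz' ]  (\s crosses the line break)
console.log(group2(withH)); // [ 'De La Cruz' ]
```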