javascript, regex, regex-group, regex-negation

Refining this name parse pattern


This pattern

/^(.*?)\b((?:[Vv][ao]n|(?:[Dd][eu]\s+)?[Ll]a|[Dd][eu]|St\.|Le|Auf\s+der)\s+\p{L}+\.?)(.*)/gum

parses name tokens.

I had help deriving this pattern (ECMAScript Flavor) and have made small adjustments, but I'm stuck on the third name token in the test string.

Van H. Manning properly parses to Van H. Manning (just use trim() to remove extra space)

Lionel Van Deerlin properly parses to Lionel Van Deerlin

But Van Taylor does not parse to Van Taylor

Can this pattern be adjusted to properly parse Van Taylor along with the other instances of Van?
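
To make the problem concrete, here is a minimal reproduction. It is only a sketch: it runs the pattern on one name at a time, so the g and m flags are dropped, and the helper name parseName and the labels before/particle/after are made up for illustration.

```javascript
// The pattern from above, without the g and m flags (one name per call).
const pattern = /^(.*?)\b((?:[Vv][ao]n|(?:[Dd][eu]\s+)?[Ll]a|[Dd][eu]|St\.|Le|Auf\s+der)\s+\p{L}+\.?)(.*)/u;

// Hypothetical helper: return the three capture groups, or null on no match.
function parseName(name) {
  const m = pattern.exec(name);
  return m ? { before: m[1], particle: m[2], after: m[3] } : null;
}

for (const name of ["Van H. Manning", "Lionel Van Deerlin", "Van Taylor"]) {
  console.log(name, "=>", parseName(name));
}
// Van H. Manning     => { before: "",        particle: "Van H.",      after: " Manning" }
// Lionel Van Deerlin => { before: "Lionel ", particle: "Van Deerlin", after: "" }
// Van Taylor         => { before: "",        particle: "Van Taylor",  after: "" }
```

Van Taylor lands entirely in the second group, exactly like Van Deerlin, and that is the behavior I want to change.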

I'm still working out how this pattern works and how to understand this particular regex wizardry.

TIA

** Update **

Fool's errand though it may be, I am aiming for the best possible parse.

Per the comments, Van H. Manning is distinct because Van is a first name whereas Van Deerlin is a surname.

Similarly to Van H. Manning, Van Taylor consists of Van as a first name and Taylor as a surname.

I can see that part of the logic is that Van occurring at the beginning of the string distinguishes a first name from a surname. However, the pattern already groups Van \w+ properly, so it seems that only a small adjustment is needed.

As for Van H. Manning being parsed as Van H. Manning, I am using a conditional to handle that. It's beyond me how to cover that case in the regex along with everything else, and I've already asked for a lot of heavy lifting here.


Solution

  • I think it will get rather complicated to handle all cases because, as everybody pointed out, the first name can come either before or after the surname (last name or family name). In some countries, I believe the last name can even be derived from a parent's first name, so imagine how complicated it gets to detect the order.

    But if you want to stick to a regular expression, you could simply use your assumption that if Van is at the beginning of the string, then it's the first name. In that case, add a second alternative to your regular expression and capture the parts in separate groups. I've named the groups for easier access, compared to indexed groups. You'll then have to add some logic to check which groups are filled and which are empty.

    I also used the i flag for case-insensitive matching instead of handling case with classes such as [Dd].

    I personally think that using several regular expressions, or finding a library to handle this for you, might be a better idea, especially if you also know where the person comes from, since that would let you apply region-specific rules.

    The PCRE regex:

    /^
    (?: # Where "Van" would be the first name:
      (?<firstname_van>Van)\s(?<lastname_van>.*)
    |
      # Other cases: the first name is probably first, but not sure.
      (?<firstname>.*?)\s*
      (?<lastname>
        \b
        (?:
          (?<!^)V[ao]n
          |(?:D[eu]\s+)?La
          |D[eu]
          |St\.
          |Le
          |Auf\s+der
        )
        \s+\p{L}+\.?
      )
      \h*
      (?:
        (?<senority>(?:[JS]r\.?|[IVX]+))
        |
        (?<more>.*)
      )
    )
    $/gumix
    

    The JavaScript version, to be enhanced:

    const regexp = /^(?:(?<firstname_van>Van)\s(?<lastname_van>.*)|(?<firstname>.*?)\s*(?<lastname>\b(?:(?<!^)V[ao]n|(?:D[eu]\s+)?La|D[eu]|St\.|Le|Auf\s+der)\s+\p{L}+\.?)[ \t]*(?:(?<senority>(?:[JS]r\.?|[IVX]+))|(?<more>.*)))$/gumi;
    
    const input = `Van H. Manning 
    Lionel Van Deerlin
    Van Taylor
    Emile La Sére
    George A. La Dow
    Gilbert De La Matyr
    Robert M. La Follette
    William Leroy La Follette
    Robert M. La Follette Sr.
    Robert M. La Follette Jr.
    Charles M. La Follette
    Monica De La Cruz
    David A. De Armond
    Justin De Witt Bowersock
    De Witt C. Giddings
    Julien de Lallande Poydras
    Henry St. John
    Edward St. Loe Livermore
    Oscar L. Auf der Heide
    Kika de la Garza
    Francis Celeste Le Blond
    Robert Le Roy Livingston`;
    
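    let i = 1;
    let match; // declared explicitly so the loop also works in strict mode
    // With the g flag, each exec() call resumes after the previous match.
    while ((match = regexp.exec(input)) !== null) {
      console.log(`Match ${i++}`, match.groups);
    }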
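
The "logic to see which group is filled or empty" could look roughly like the sketch below. It reuses the same pattern, but without the g and m flags because it handles one name per call. The helper name splitName and the output keys first, last, suffix and rest are my own inventions for illustration, not part of any library.

```javascript
// Same pattern as the JavaScript version above, minus the g and m flags,
// because this sketch matches a single name at a time.
const nameRegexp = /^(?:(?<firstname_van>Van)\s(?<lastname_van>.*)|(?<firstname>.*?)\s*(?<lastname>\b(?:(?<!^)V[ao]n|(?:D[eu]\s+)?La|D[eu]|St\.|Le|Auf\s+der)\s+\p{L}+\.?)[ \t]*(?:(?<senority>(?:[JS]r\.?|[IVX]+))|(?<more>.*)))$/ui;

// Hypothetical helper: collapse the alternative groups into one record.
function splitName(line) {
  const m = nameRegexp.exec(line.trim());
  if (m === null) return null; // no known particle and no leading "Van"
  const g = m.groups;
  // First alternative: "Van" at the start is treated as the first name.
  if (g.firstname_van !== undefined) {
    return { first: g.firstname_van, last: g.lastname_van };
  }
  // Second alternative: a particle marks the start of the surname.
  const result = { first: g.firstname, last: g.lastname };
  if (g.senority !== undefined) result.suffix = g.senority;
  if (g.more) result.rest = g.more; // skipped when undefined or empty
  return result;
}

console.log(splitName("Van Taylor"));
// { first: "Van", last: "Taylor" }
console.log(splitName("Lionel Van Deerlin"));
// { first: "Lionel", last: "Van Deerlin" }
console.log(splitName("Robert M. La Follette Sr."));
// { first: "Robert M.", last: "La Follette", suffix: "Sr." }
```

Note that with this assumption Van H. Manning comes out as first "Van" and last "H. Manning", so the middle initial would still need separate handling.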