regex regex-group knime

REGEX Name and any surname


In the example below, I want to capture 2 groups with a regex:

Name FirstSurname SecondSurname ...

The first group would be Name

The second would be FirstSurname SecondSurname ...

^(\w+)(.*)$   - would capture all
\w+           - would make n groups (number of words). 

I want only 2 groups: the first name in one, and everything that follows in the other.

Any help?


Solution

  • First, as someone with punctuation in my given name :-) PLEASE don't use \w to try to match names :-) … both - and ' are not uncommon.

    Using Perl, for example:

      if ("Bruce-Robert Fenn Pocock" =~ /^(\w+)(.*)$/) { print "First: $1    Rest: $2" }
    
      → First: Bruce    Rest: -Robert Fenn Pocock
    

    Perhaps just group all non-space characters, then skip the first occurrence of whitespace:

      if ("Bruce-Robert Fenn Pocock" =~ /^(\S+)\s*(.*)$/) { print "First: $1    Rest: $2" }
    
      → First: Bruce-Robert    Rest: Fenn Pocock
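
    The same pattern carries over to other regex engines. As a minimal sketch in Python (the helper name split_name is hypothetical, and stripping surrounding whitespace first is an assumption about how the input should be cleaned):

```python
import re

# Same pattern as the Perl example above: one run of non-space characters,
# optional whitespace, then everything that is left.
NAME_RE = re.compile(r"^(\S+)\s*(.*)$")

def split_name(full_name):
    """Hypothetical helper: return (first_token, rest), or None if blank."""
    match = NAME_RE.match(full_name.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)

for name in ["Bruce-Robert Fenn Pocock", "De'shawna Cummings"]:
    print(split_name(name))
# → ('Bruce-Robert', 'Fenn Pocock')
# → ("De'shawna", 'Cummings')
```

    A single-word input comes back with an empty second group, so callers should probably check for that before treating it as a surname.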
    

    Of course, if you run across people with middle names in your dataset, there's no way to tell them apart from matronym-patronym pairs or multi-part last names.

    I hope/assume you don't have honorifics in your input, either.

    First: Don         Rest: Juan de la Mancha
         *** wrong: Don is honorific
    First: Diego       Rest: de la Vega
    First: John        Rest: Jacob Smith
         *** wrong: Jacob is probably a middle name
    First: De'shawna   Rest: Cummings
    First: Wernher     Rest: von Braun
    First: Oscar       Rest: Vazquez-Oliverez
    

    Ultimately, the only way to accurately break down a name into an honorific, given name, middle name(s), surnames (matronym, patronym), and suffix(es), is to ask.

    (E.g., in my own name, "Fenn" is considered a "middle name" in Anglo circles, while in Latino circles it is interpreted as a matronym.)

    Honorifics and suffixes can often be guessed at from a list, but military titles and doctoral suffixes, for example, make for a long list ("Dr John Doe, Pharm.D", "Maj. Gen. Thomas Ts'o"), and the entries are not unambiguous (e.g. "Don" is both a short form of "Donald" and an honorific).
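
    To illustrate the list-based guess and why it is unreliable, here is a rough Python sketch. The HONORIFICS set and the strip_honorifics helper are hypothetical and far too small for real data, and "Don Smith" is a made-up input, not a name from any dataset:

```python
# Hypothetical, deliberately tiny list; a real one would be much longer.
HONORIFICS = {"dr", "mr", "mrs", "ms", "maj", "gen", "don"}

def strip_honorifics(full_name):
    """Peel leading tokens found in HONORIFICS; return (honorifics, rest)."""
    tokens = full_name.split()
    found = []
    # Always keep at least one token, so a bare "Don" is left alone.
    while len(tokens) > 1 and tokens[0].rstrip(".").lower() in HONORIFICS:
        found.append(tokens.pop(0))
    return found, " ".join(tokens)

print(strip_honorifics("Maj. Gen. Thomas Ts'o"))
# → (['Maj.', 'Gen.'], "Thomas Ts'o")
print(strip_honorifics("Don Smith"))
# → (['Don'], 'Smith')
```

    The second result is exactly the ambiguity described above: if this "Don" is short for "Donald", the list lookup has just thrown away the given name.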

    PS. Lovely article here:

    http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/