In the example below, I want to make 2 groups in a regex:
Name FirstSurname SecondSurname ...
The first group would be Name
The second would be FirstSurname SecondSurname ...
^(\w+)(.*)$ - would capture all
\w+ - would make n groups (number of words).
I want only 2 groups: the first name in one, and anything that follows in the other.
Any help?
First, as someone with punctuation in my given name :-) PLEASE don't use \w to try to match names :-) … both - and ' are not uncommon.
Using Perl, for example:
if ("Bruce-Robert Fenn Pocock" =~ /^(\w+)(.*)$/) { print "First: $1 Rest: $2" }
→ First: Bruce Rest: -Robert Fenn Pocock
Perhaps just group all non-space characters, then skip the first occurrence of whitespace:
if ("Bruce-Robert Fenn Pocock" =~ /^(\S+)\s*(.*)$/) { print "First: $1 Rest: $2" }
→ First: Bruce-Robert Rest: Fenn Pocock
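If you would rather not use a regex at all, Perl's built-in split with a limit of 2 gives the same two pieces. This is only a minimal sketch; first_and_rest is a made-up helper name, not something from your code, and it assumes Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first whitespace-delimited token of a
# name and everything that follows it.
sub first_and_rest {
    my ($name) = @_;
    # split ' ' skips leading whitespace; the limit of 2 means at most
    # two fields come back, so the rest of the string stays in one piece.
    my ($first, $rest) = split ' ', $name, 2;
    # A one-word or empty input leaves a field undefined; use '' instead.
    return ($first // '', $rest // '');
}

my ($first, $rest) = first_and_rest("Bruce-Robert Fenn Pocock");
print "First: $first Rest: $rest\n";
# → First: Bruce-Robert Rest: Fenn Pocock
```

Like the \S+ regex, this only looks at whitespace, so it has the same blind spots for middle names and honorifics.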
Of course, if you run across people with middle names in your dataset, there's no way to tell them apart from matronym-patronym pairs or multi-part last names.
I hope/assume you don't have honorifics in your input, either.
First: Don Rest: Juan de la Mancha
*** wrong: Don is honorific
First: Diego Rest: de la Vega
First: John Rest: Jacob Smith
*** wrong: Jacob is probably a middle name
First: De'shawna Rest: Cummings
First: Wernher Rest: von Braun
First: Oscar Rest: Vazquez-Oliverez
Ultimately, the only way to accurately break down a name into an honorific, given name, middle name(s), surnames (matronym, patronym), and suffix(es), is to ask.
(E.g., in my own name, the "Fenn" is considered a "middle name" in Anglo circles; in Latino circles, it's interpreted as a matronym.)
Honorifics and suffixes can often be guessed at from a list, but e.g. military titles and doctoral suffixes make for a long list ("Dr John Doe, Pharm.D", "Maj. Gen. Thomas Ts'o"), and the entries are not unambiguous (e.g. "Don" is both a short form of "Donald" and an honorific).
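To illustrate the guess-from-a-list approach, here is a rough sketch. The %HONORIFIC table and the strip_honorifics name are made up for this example, and the list is deliberately tiny; a real one would need many more entries and would still guess wrong sometimes.

```perl
use strict;
use warnings;

# Assumed, very incomplete list of honorifics (lower case, no trailing dot).
my %HONORIFIC = map { $_ => 1 } qw(dr mr mrs ms maj gen don);

# Hypothetical helper: peel leading honorifics off a name.
sub strip_honorifics {
    my ($name) = @_;
    my @words = split ' ', $name;
    my @titles;
    # Remove leading words while they look like honorifics, but always
    # leave at least one word behind to serve as the name itself.
    while (@words > 1) {
        (my $key = lc $words[0]) =~ s/\.$//;   # "Maj." becomes "maj"
        last unless $HONORIFIC{$key};
        push @titles, shift @words;
    }
    return (join(' ', @titles), join(' ', @words));
}

my ($title, $name) = strip_honorifics("Maj. Gen. Thomas Ts'o");
print "Title: $title Name: $name\n";
# → Title: Maj. Gen. Name: Thomas Ts'o
```

The ambiguity shows up right away: given "Don Smith", this sketch reports "Don" as a title, even though it may just as well be short for "Donald".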
http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/