This pattern
/^(.*?)\b((?:[Vv][ao]n|(?:[Dd][eu]\s+)?[Ll]a|[Dd][eu]|St\.|Le|Auf\s+der)\s+\p{L}+\.?)(.*)/gum
parses name tokens.
I had help deriving this pattern (ECMAScript Flavor) and have made small adjustments, but I'm stuck on the third name token in the test string.
"Van H. Manning" properly parses to "Van H." and "Manning" (just use trim() to remove the extra space).
"Lionel Van Deerlin" properly parses to "Lionel" and "Van Deerlin".
But "Van Taylor" does not parse to "Van" and "Taylor".
Can this pattern be adjusted to properly parse "Van Taylor" along with the other instances of "Van"?
I'm still working out how this pattern works and how to understand this particular regex wizardry.
TIA
** Update **
Fool's errand though it may be, I am aiming for the best possible version of a parse.
Per the comments, "Van H. Manning" is distinct because "Van" is a first name, whereas "Van Deerlin" is a surname.
Similarly to "Van H. Manning", "Van Taylor" consists of "Van" as a first name and "Taylor" as a surname.
I can see that part of the logic is that "Van" occurring at the beginning of the string distinguishes a first name from a surname. However, the pattern already groups Van \w+ properly, so it seems like only a small adjustment is needed.
As for "Van H. Manning" being parsed as "Van H." and "Manning", I am using a conditional to handle that. It's beyond me how to regex that one with everything else, and I've already asked for a lot of heavy lifting here.
I think it will get rather complicated to handle all cases because, as everybody pointed out, the first name can come before or after the surname (last name or family name). In some countries, I think the last name can even derive from a parent's first name, so imagine how complicated it gets to detect the order.
But if you want to stick to a regular expression, you could just use your assumption that if "Van" is at the beginning of the string, then it's the first name. In this case, just add two alternatives to your regular expression and capture the parts in several groups. I've named them for easier access, compared to indexed groups. You'll then have to add some logic to see which groups are filled or empty.
I also used the i flag for case-insensitive matching instead of handling it with [Dd].
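One thing to keep in mind: the i flag is a bit broader than [Dd], because it makes every letter in the pattern case-insensitive, not only the first one. Here is a minimal sketch of the difference; the variable names and sample tokens are just for illustration:

```javascript
// Explicit character classes: only the first letter may vary in case.
const withClasses = /^[Dd][eu]$/u;
// The i flag: every letter in the pattern may vary in case.
const withFlag = /^D[eu]$/iu;

const tokens = ["De", "de", "DE", "dU"];
// Each row is [token, matches with classes, matches with the i flag].
const comparison = tokens.map((t) => [t, withClasses.test(t), withFlag.test(t)]);
console.log(comparison);
```

The same applies to the rest of the pattern: with the i flag, [IVX]+ and [JS]r will also accept lowercase letters. If that is too loose for your data, you can drop the flag and keep the explicit classes.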
I personally think that using several regular expressions, or finding a library to handle this for you, might be a better idea, especially if you also know where the person is from, which would let you apply region-specific rules.
The PCRE regex:
/^
(?:
  # Where "Van" would be the first name:
  (?<firstname_van>Van)\s(?<lastname_van>.*)
|
  # Other cases: the first name is probably first, but not sure.
  (?<firstname>.*?)\s*
  (?<lastname>
    \b
    (?:
      (?<!^)V[ao]n
      |(?:D[eu]\s+)?La
      |D[eu]
      |St\.
      |Le
      |Auf\s+der
    )
    \s+\p{L}+\.?
  )
  \h*
  (?:
    (?<senority>(?:[JS]r\.?|[IVX]+))
  |
    (?<more>.*)
  )
)
$/gumix
The JavaScript version to enhance:
const regexp = /^(?:(?<firstname_van>Van)\s(?<lastname_van>.*)|(?<firstname>.*?)\s*(?<lastname>\b(?:(?<!^)V[ao]n|(?:D[eu]\s+)?La|D[eu]|St\.|Le|Auf\s+der)\s+\p{L}+\.?)[ \t]*(?:(?<senority>(?:[JS]r\.?|[IVX]+))|(?<more>.*)))$/gumi;
const input = `Van H. Manning
Lionel Van Deerlin
Van Taylor
Emile La Sére
George A. La Dow
Gilbert De La Matyr
Robert M. La Follette
William Leroy La Follette
Robert M. La Follette Sr.
Robert M. La Follette Jr.
Charles M. La Follette
Monica De La Cruz
David A. De Armond
Justin De Witt Bowersock
De Witt C. Giddings
Julien de Lallande Poydras
Henry St. John
Edward St. Loe Livermore
Oscar L. Auf der Heide
Kika de la Garza
Francis Celeste Le Blond
Robert Le Roy Livingston`;
let i = 1;
let match; // declared explicitly so the loop also works in strict mode and ES modules
while ((match = regexp.exec(input)) !== null) {
  console.log(`Match ${i++}`, match.groups);
}
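For the "logic to see which group is filled or empty" part, something like the following could work. It is only a sketch: normalizeName and the first/last/suffix field names are my own invention, and the pattern is the same one as above, minus the g and m flags, since each name is matched on its own here.

```javascript
// Same pattern as the JavaScript version above, without the g and m flags
// because every name is matched as a separate string.
const nameRegexp = /^(?:(?<firstname_van>Van)\s(?<lastname_van>.*)|(?<firstname>.*?)\s*(?<lastname>\b(?:(?<!^)V[ao]n|(?:D[eu]\s+)?La|D[eu]|St\.|Le|Auf\s+der)\s+\p{L}+\.?)[ \t]*(?:(?<senority>(?:[JS]r\.?|[IVX]+))|(?<more>.*)))$/ui;

// Hypothetical helper: collapse the named groups into one flat object,
// or return null when the name contains none of the handled particles.
function normalizeName(line) {
  const match = nameRegexp.exec(line.trim());
  if (match === null) return null;
  const g = match.groups;

  // First alternative: "Van" at the start is treated as the first name.
  if (g.firstname_van !== undefined) {
    return { first: g.firstname_van, last: g.lastname_van.trim(), suffix: "" };
  }

  // Second alternative: anything captured in "more" is appended to the last name.
  const last = [g.lastname, g.more].filter(Boolean).join(" ").trim();
  return { first: g.firstname.trim(), last, suffix: g.senority ?? "" };
}

console.log(normalizeName("Van Taylor"));
console.log(normalizeName("Lionel Van Deerlin"));
console.log(normalizeName("Robert M. La Follette Jr."));
```

Note that "Van H. Manning" comes out as first name "Van" and last name "H. Manning" with this approach, so you would still need your conditional if you want the middle initial attached to the first name.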