I have some strings in the following format:
"John Smith The Last"
"Jane Smith The Best"
From each of these strings, I want to extract the "names" (that is, "John Smith" and "Jane Smith") as well as the "honorifics" (that is, "The Last", "The Best"), but could not find a way of achieving this.
I have tried to use the str_extract_all() and str_split() functions from the stringr package, as follows:
library(stringr)
name <- str_extract_all(my_str, boundary("word"))
This just returns a list of one element ("Jane Smith The Best").
I have also tried:
name <- str_split(my_str, " ", n=3)
This also seems to return a list of one element ("Jane" "Smith" "The Best").
I am looking for a base R or stringr solution.
Edit: added ignore.case = TRUE and a trailing space in the gsub() pattern.
One way to extract the "names" and the "honorifics" is to use gsub() with the first word of the honorific (such as the, The, der, or any other identifier) as the pattern. Here is an example, which I edited based on good suggestions from @Mark and @L Tyrone:
strings <- c(
"John Smith The Last",
"Jane Smith The Best",
"Alfonso the Warrior",
"Ferdinand the Artist-King",
"Ludwig der Bayer",
"Theodore Thea the Great",
"John the Theologian"
)
honors <- gsub(".*(the |der )", "\\1", strings, ignore.case = TRUE, perl = TRUE)
# The lookahead has no capture group, so the replacement is simply "".
names <- gsub("(?= the | der ).*", "", strings, ignore.case = TRUE, perl = TRUE)
data.frame(names, honors)
#           names          honors
# 1    John Smith        The Last
# 2    Jane Smith        The Best
# 3       Alfonso     the Warrior
# 4     Ferdinand the Artist-King
# 5        Ludwig       der Bayer
# 6 Theodore Thea       the Great
# 7          John  the Theologian
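To see what each pattern does in isolation, here is a minimal base R sketch on single strings. The variable names (x, honor, name, no_space) are only for illustration, and sub() is used instead of gsub() because each string needs just one replacement.

```r
x <- "Theodore Thea the Great"

# Greedy .* runs to the last "the " in the string. "Theodore" and "Thea"
# are skipped because their "The" is not followed by a space.
honor <- sub(".*(the |der )", "\\1", x, ignore.case = TRUE, perl = TRUE)
honor
# "the Great"

# The lookahead finds the position just before " the "; everything from
# there to the end of the string is removed.
name <- sub("(?= the | der ).*", "", x, ignore.case = TRUE, perl = TRUE)
name
# "Theodore Thea"

# Without the trailing space, the pattern also matches the "The" that
# starts "Theologian", so part of the honorific is lost.
no_space <- sub(".*(the|der)", "\\1", "John the Theologian",
                ignore.case = TRUE, perl = TRUE)
no_space
# "Theologian"
```

The last call shows why the space was added to the pattern in the edit.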
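The pattern .*(the |der ) matches everything up to and including the last "the " or "der " in the string. The trailing space keeps it from matching inside words such as "Thea" or "Theologian".
ignore.case = TRUE makes the match case-insensitive, so the, The, THE, etc. are all detected.
"\\1" replaces the match with the text captured by the first group, (the |der ), so the name is dropped and the honorific keeps its leading identifier.
(?= the | der ) is a lookahead: it matches the position that is followed by " the " or " der ". The .* after it then consumes the rest of the string, which removes the honorific and leaves the name.

This answer is inspired by https://stackoverflow.com/a/43012902/14812170 and the suggestions in the comments.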
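Since the question also allows stringr, the same split can be sketched with str_match() and two capture groups. This is an alternative under assumptions rather than a drop-in equivalent: the variable name parts is illustrative, and the lazy (.*?) splits at the first identifier, whereas the greedy gsub() call for the honorifics splits at the last one. The two only differ when a string contains more than one identifier.

```r
library(stringr)

strings <- c(
  "John Smith The Last",
  "Ludwig der Bayer",
  "Theodore Thea the Great",
  "John the Theologian"
)

# (?i) makes the match case-insensitive.
# Group 1 is the name; group 2 is the honorific, starting at its identifier.
parts <- str_match(strings, "(?i)^(.*?) ((?:the|der) .*)$")

# Column 1 of the matrix is the full match; columns 2 and 3 are the groups.
data.frame(names = parts[, 2], honors = parts[, 3])
```

Strings without an identifier give NA in both columns, which makes unmatched rows easy to spot.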