
Extracting "words" from string


I have some strings in the following format:

"John Smith The Last"
"Jane Smith The Best"

From each of these strings, I want to extract the "names" (that is, "John Smith" and "Jane Smith") as well as the "honorifics" (that is, "The Last" and "The Best"), but I could not find a way to do this.

I have tried to use the str_extract_all() and str_split() functions from the stringr package, as follows:

library(stringr)
name <- str_extract_all(my_str, boundary("word"))

This just returns a list of one element ("Jane Smith The Best").

I have also tried:

name <- str_split(my_str, " ", n=3)

This also seems to return a list of one element ("Jane" "Smith" "The Best").

I am looking for a base R or stringr solution.


Solution

  • Edit: added ignore.case = TRUE and a space to the gsub() patterns

    One way to extract the "names" and the "honorifics" is to use gsub() with a pattern built from the word that introduces the honorific, such as the, The, or der. Here is an example, edited based on good suggestions from @Mark and @L Tyrone:

    strings <- c(
      "John Smith The Last",
      "Jane Smith The Best",
      "Alfonso the Warrior",
      "Ferdinand the Artist-King",
      "Ludwig der Bayer",
      "Theodore Thea the Great",
      "John the Theologian"
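    )

    # Replace everything up to and including the last "the "/"der " with the
    # captured marker itself, which leaves only the honorific.
    honors <- gsub(".*(the |der )", "\\1", strings, ignore.case = TRUE, perl = TRUE)
    # Delete everything from the first " the "/" der " onward, which leaves the name.
    # The lookahead has no capture group, so the replacement is the empty string.
    names <- gsub("(?= the | der ).*", "", strings, ignore.case = TRUE, perl = TRUE)
    data.frame(names, honors)
    #           names          honors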
    # 1    John Smith        The Last
    # 2    Jane Smith        The Best
    # 3       Alfonso     the Warrior
    # 4     Ferdinand the Artist-King
    # 5        Ludwig       der Bayer
    # 6 Theodore Thea       the Great
    # 7          John  the Theologian
    
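    • .*(the |der ) matches everything up to and including the last "the " or "der " in the string; the parentheses capture that marker.
    • ignore.case = TRUE makes the match case-insensitive, so the, The, THE, etc. are all matched.
    • "\\1" in the replacement refers to the captured marker, so the whole match (the name plus the marker) is replaced by the marker alone, leaving only the honorific.
    • (?= the | der ) is a lookahead: it matches at the position just before " the " or " der " without consuming it, so the .* that follows removes everything from that point on and only the name remains.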
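
    Since the question also allows stringr, the same split can be sketched in a single pass with str_match(), which returns one column per capture group. This is only a sketch: the markers vector and the result name are hypothetical, and unlike the gsub() version above it splits at the first marker rather than the last, which matters only when a string contains more than one.

```r
library(stringr)

strings <- c(
  "John Smith The Last",
  "Ludwig der Bayer",
  "Theodore Thea the Great",
  "John the Theologian"
)

# Hypothetical list of words that introduce an honorific; extend as needed.
markers <- c("the", "der")

# Group 1: the name (lazy, so it stops at the first marker).
# Group 2: the marker as a whole word plus everything after it.
pattern <- regex(
  paste0("^(.*?)\\s+((?:", paste(markers, collapse = "|"), ")\\s.*)$"),
  ignore_case = TRUE
)

# str_match() returns a character matrix: column 1 is the full match,
# columns 2 and 3 are the two capture groups.
parts <- str_match(strings, pattern)
result <- data.frame(names = parts[, 2], honors = parts[, 3])
result
```

    Requiring whitespace on both sides of the marker keeps words such as "Thea" or "Theologian" from being treated as markers. A string that contains no marker at all yields NA in both columns, so such rows may need separate handling.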

    This answer is inspired by https://stackoverflow.com/a/43012902/14812170 and the suggestions in the comments.