I have the following vector of strings, and I need to match the second occurrence of the repeated words, not the first, without the surrounding spaces.
names_vec <- c("Linda Smith","Elizabeth Olsen Green Elizabeth Olsen Green",
"Eva Ferguson","Charlize Martinez White Charlize Martinez White",
"digital copy solutions digitals",
"honor security services honor")
The resulting vector should be as shown below. (Edited: my original desired output was misleading. When matching the second duplicate in "honor security services honor", the last word is the one that has to be matched and removed.)
matched_vec <- c("Elizabeth Olsen Green","Charlize Martinez White","honor security services")
I have tried different lookarounds without success. The following code matches the first occurrences, but I need the last ones, because I want to remove the last repeated words, not the first.
The code below grabs what is inside the brackets; I want the other repeated words.
library(stringr)
str_view_all(names_vec , "(\\b\\S+\\s*\\b)(?=.*\\b\\1\\b)", match = TRUE)
[Elizabeth Olsen Green] Elizabeth Olsen Green
[Charlize Martinez White] Charlize Martinez White
[honor] security services honor
The regex I used matches the FIRST occurrence of the repeated words in every string that contains duplicates, scanning from the beginning of the string.
What I expect is to match the SECOND occurrence of the repeated words, so the search has to run from right to left and match all of the second occurrences.
People will ask why I bother. The reason is that I have some company names, and the repeated word I want to remove is usually the second one, or the last one if you count from left to right.
You can use stringi::stri_reverse to reverse the characters. Then I use your regex in gsub to remove the first repetition, which is actually the last repetition in the original string, and reverse the result again.
stringi::stri_reverse(
gsub("(\\b\\S+\\s*\\b)(?=.*\\b\\1\\b)", "",
stringi::stri_reverse(names_vec), perl=TRUE) )
#[1] "Linda Smith" "Elizabeth Olsen Green "
#[3] "Eva Ferguson" "Charlize Martinez White "
#[5] "digital copy solutions digitals" "honor security services "
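Note that the output above keeps a trailing space where the repeated words were removed, and it still contains the strings that had no duplicates. A minimal sketch of a cleanup step, assuming base R's trimws is acceptable and that "matched" means "changed by the replacement" (the names cleaned and matched_vec are only illustrative):

```r
library(stringi)

names_vec <- c("Linda Smith", "Elizabeth Olsen Green Elizabeth Olsen Green",
               "Eva Ferguson", "Charlize Martinez White Charlize Martinez White",
               "digital copy solutions digitals",
               "honor security services honor")

# Reverse, remove the first occurrence of each repeated word, reverse back
cleaned <- stri_reverse(
  gsub("(\\b\\S+\\s*\\b)(?=.*\\b\\1\\b)", "",
       stri_reverse(names_vec), perl = TRUE))

# Strip the whitespace left behind by the removal
cleaned <- trimws(cleaned)

# Keep only the strings that actually contained a repetition
matched_vec <- cleaned[cleaned != names_vec]
matched_vec
```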
Or just use sub:
sub("(\\b\\S.*\\b)(.*)\\b\\1\\b", "\\1\\2", names_vec)
#[1] "Linda Smith" "Elizabeth Olsen Green "
#[3] "Eva Ferguson" "Charlize Martinez White "
#[5] "digital copy solutions digitals" "honor security services "
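The trailing space in this output can also be avoided inside the pattern itself. The following is only a sketch of one possible variant, not the pattern used above: it assumes perl = TRUE so that the lazy quantifier .*? is available, and it consumes the whitespace in front of the repeated part with \\s*. Like sub in general, it replaces a single match per string.

```r
names_vec <- c("Linda Smith", "Elizabeth Olsen Green Elizabeth Olsen Green",
               "Eva Ferguson", "Charlize Martinez White Charlize Martinez White",
               "digital copy solutions digitals",
               "honor security services honor")

# Group 1: the repeated text (longest candidate is tried first)
# Group 2: whatever sits between the two occurrences (lazy)
# \\s* consumes the whitespace directly before the second occurrence
dedup <- sub("(\\b\\S.*\\b)(.*?)\\s*\\b\\1\\b", "\\1\\2", names_vec, perl = TRUE)
dedup
```

Strings without a repetition, such as "digital copy solutions digitals", are returned unchanged because the final \\b prevents "digital" from matching inside "digitals".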