Search code examples
rregexstringtextreplace

Replace string if it matches part of another string


I have a column of people's last names and their initials, in some instances a person may be listed twice, once with a single initial (e.g. SMITH C) and once with two initials (e.g. SMITH CD), I want to replace all instances of SMITH CD with SMITH C.

An example

df <- data.frame(
  id = 1:7,
  names = c("JOHN S", "JOHN SW", "JOHNSON S",
            "SMITH C", "SMITH WC", "SMITH CD",
            "HANK K"))
> df
  id     names
1  1    JOHN S
2  2   JOHN SW
3  3 JOHNSON S
4  4   SMITH C
5  5  SMITH WC
6  6  SMITH CD
7  7    HANK K

I can do this manually for a single name e.g.:

df$names <- gsub("SMITH CD", "SMITH C", df$names)

> df
  id     names
1  1    JOHN S
2  2   JOHN SW
3  3 JOHNSON S
4  4   SMITH C
5  5  SMITH WC
6  6   SMITH C
7  7    HANK K

But my real dataset contains ~4000 names, therefore I want to be able to:

  1. programmatically identify cases where last names and the first initial match, and
  2. make all the replacements in one go

For the small example df above the result would be:

> df
  id     names
1  1    JOHN S
2  2    JOHN S
3  3 JOHNSON S
4  4   SMITH C
5  5  SMITH WC
6  6   SMITH C
7  7    HANK K

Any help would be much appreciated.


Solution

  • Create a grouping column by removing the letter at the end when there are two letter words, then modify the names by selecting the element having the minimum number of characters

    library(dplyr)
    library(stringr)
    df %>% 
        group_by(grp = str_remove(names, "(?<= [A-Z])[A-Z]$")) %>% 
        mutate(names = names[which.min(nchar(names))]) %>%
        ungroup %>%
        select(-grp)
    

    -output

    # A tibble: 7 x 2
    #     id names    
    #  <int> <chr>    
    #1     1 JOHN S   
    #2     2 JOHN S   
    #3     3 JOHNSON S
    #4     4 SMITH C  
    #5     5 SMITH WC 
    #6     6 SMITH C  
    #7     7 HANK K