I have a column of people's last names and their initials, in some instances a person may be listed twice, once with a single initial (e.g. SMITH C) and once with two initials (e.g. SMITH CD), I want to replace all instances of SMITH CD with SMITH C.
An example
df <- data.frame(
id = 1:7,
names = c("JOHN S", "JOHN SW", "JOHNSON S",
"SMITH C", "SMITH WC", "SMITH CD",
"HANK K"))
> df
id names
1 1 JOHN S
2 2 JOHN SW
3 3 JOHNSON S
4 4 SMITH C
5 5 SMITH WC
6 6 SMITH CD
7 7 HANK K
I can do this manually for a single name e.g.:
df$names <- gsub("SMITH CD", "SMITH C", df$names)
> df
id names
1 1 JOHN S
2 2 JOHN SW
3 3 JOHNSON S
4 4 SMITH C
5 5 SMITH WC
6 6 SMITH C
7 7 HANK K
But my real dataset contains ~4000 names, therefore I want to be able to:
For the small example df above the result would be:
> df
id names
1 1 JOHN S
2 2 JOHN S
3 3 JOHNSON S
4 4 SMITH C
5 5 SMITH WC
6 6 SMITH C
7 7 HANK K
Any help would be much appreciated.
Create a grouping column by removing the letter at the end when there are two letter words, then modify the names
by selecting the element having the minimum number of characters
library(dplyr)
library(stringr)
df %>%
group_by(grp = str_remove(names, "(?<= [A-Z])[A-Z]$")) %>%
mutate(names = names[which.min(nchar(names))]) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 7 x 2
# id names
# <int> <chr>
#1 1 JOHN S
#2 2 JOHN S
#3 3 JOHNSON S
#4 4 SMITH C
#5 5 SMITH WC
#6 6 SMITH C
#7 7 HANK K