I have a dataframe with strings of names:
df <- data.frame(names = c("George Orwell; Ayn Rand; Adam Smith", "George Orwell; Rand; Orwell"))
df
names
1 George Orwell; Ayn Rand; Adam Smith
2 George Orwell; Rand; Orwell
I'd like to remove all the repeated names, so that each string contains only unique names. Where there is only a last name, I'd like to keep it if it's unique but remove it if the full name appears in the same string. So the result for the df above would be:
names
1 George Orwell; Ayn Rand; Adam Smith
2 George Orwell; Rand
I am able to keep only unique values with:
sapply(strsplit(df$names, ";\\s*"), function(z) paste(unique(z), collapse = "; "))
[1] "George Orwell; Ayn Rand; Adam Smith" "George Orwell; Rand; Orwell"
but this keeps "Orwell" in df$names[2].
You were close. We can use gsub() to strip everything but the last name, then paste() only the parts that duplicated() does not flag. This also works with middle names (but will probably fail with two-part surnames without hyphens).
strsplit(x, ';\\s+') |> sapply(\(.) paste(.[!duplicated(gsub('.+\\s', '', .))], collapse='; '))
# [1] "George Orwell; Ayn Rand; Adam Smith" "George Orwell; Rand"
# [3] "Michael J Fox; Rand"
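To see why this works, here is a minimal sketch that unpacks the one-liner for a single string from the question. The variable names (parts, surnames, dup, result) are only illustrative and not part of the original code.

```r
# One element from the question, split into individual names
parts <- strsplit("George Orwell; Rand; Orwell", ";\\s+")[[1]]
# "George Orwell" "Rand" "Orwell"

# Strip everything up to the last whitespace, leaving only the final word
surnames <- gsub(".+\\s", "", parts)
# "Orwell" "Rand" "Orwell"

# duplicated() flags the second and later occurrences of each value
dup <- duplicated(surnames)
# FALSE FALSE TRUE

# Keep the original entries whose surname has not been seen before
result <- paste(parts[!dup], collapse = "; ")
result
# [1] "George Orwell; Rand"
```

Note that the entries are filtered by surname but the original, untouched strings are what get pasted back together.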
Data:
x <- c("George Orwell; Ayn Rand; Adam Smith", "George Orwell; Rand; Orwell",
"Michael J Fox; Rand; Fox")
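One caveat: duplicated() only flags later occurrences, so the one-liner above keeps whichever entry comes first. If a bare surname precedes the full name ("Orwell; George Orwell"), the bare surname survives and the full name is dropped. Below is a rough, order-independent sketch, assuming that a "bare surname" is any single-word entry. The helper name dedupe_names is hypothetical, and the fourth test string is added here for illustration only.

```r
# Hypothetical helper: drop exact repeats, then drop a single-word entry
# when a multi-word name in the same string ends with that word.
dedupe_names <- function(s) {
  parts    <- unique(trimws(strsplit(s, ";")[[1]]))
  last     <- sub(".*\\s", "", parts)     # last word of each entry
  is_bare  <- !grepl("\\s", parts)        # entries consisting of one word
  has_full <- last %in% last[!is_bare]    # a multi-word name shares this surname
  paste(parts[!(is_bare & has_full)], collapse = "; ")
}

x <- c("George Orwell; Ayn Rand; Adam Smith",
       "George Orwell; Rand; Orwell",
       "Michael J Fox; Rand; Fox",
       "Orwell; George Orwell")           # illustrative: bare surname first

res <- vapply(x, dedupe_names, character(1), USE.NAMES = FALSE)
res
# [1] "George Orwell; Ayn Rand; Adam Smith" "George Orwell; Rand"
# [3] "Michael J Fox; Rand"                 "George Orwell"
```

Unlike the one-liner, this sketch keeps two different full names that share a surname (for example "George Orwell; Sonia Orwell"), which may or may not be what you want. It has the same limitation with unhyphenated two-part surnames.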