Tags: r, dataframe, split, sapply

Removing a string where it exists elsewhere in the column in R


I have a dataframe with strings of names:

df <- data.frame(names = c("George Orwell; Ayn Rand; Adam Smith", "George Orwell; Rand; Orwell"))
df
                                names
1 George Orwell; Ayn Rand; Adam Smith
2         George Orwell; Rand; Orwell

I'd like to remove all the repeated names, so that each string contains only unique names. Where only a last name is given, I'd like to keep it if it's unique but remove it if the full name also appears in the same string. So the result for the df above would be:

                                names
1 George Orwell; Ayn Rand; Adam Smith
2                 George Orwell; Rand

I am able to keep only unique values with:

sapply(strsplit(df$names, ";\\s*"), function(z) paste(unique(z), collapse = "; "))
[1] "George Orwell; Ayn Rand; Adam Smith" "George Orwell; Rand; Orwell"   

but this keeps "Orwell" in df$names[2].


Solution

  • You were close. Use gsub to strip everything but the last name from each part, then paste together only the parts that duplicated does not flag. This also works with middle names, but will probably fail with two-part surnames that are not hyphenated.

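    # x is defined under "Data" below; the native pipe and \(p) lambda need R >= 4.1
    # Split on ";", reduce each part to its last name, keep the first part per last name
    strsplit(x, ';\\s+') |>
      sapply(\(p) paste(p[!duplicated(gsub('.+\\s', '', p))], collapse = '; '))
    # [1] "George Orwell; Ayn Rand; Adam Smith" "George Orwell; Rand"
    # [3] "Michael J Fox; Rand"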
    

    Data:

    x <- c("George Orwell; Ayn Rand; Adam Smith", "George Orwell; Rand; Orwell", 
    "Michael J Fox; Rand; Fox")