Search code examples
rduplicatesunique

Removing duplicate entries while retaining unique information


I have a dataset with multiple rows, which is a database of contact details. Some entries are repeated several times, but contain different information. Example:

> example
     Title             Name              Email         Phone      
[1,] "Chair"           "Oswald Gruber"   "[email protected]" "+33 12345"
[2,] "Respondent"      "Oswald Gruber"   NA            "+44 54321"
[3,] "Comm.mngr"       "Kaspar Villiger" "[email protected]" "+1 123456"
[4,] "Investment mngr" "Markus Urben"    "[email protected]" NA         
[5,] "Responsible"     "Markus Urben"    "[email protected]" "+1 33333"

So what I'd like is to have each person mentioned only once, but without losing the info. So, if I merge two first rows in this example, I'd like to retain the email address and both phone numbers, but only one kind of title.

I would appreciate if anybody would suggest any solutions.


Solution

  • And a data.table solution:

    library(data.table)
    mrg <- function(x) paste(unique(x[!is.na(x)]),collapse=", ")
    setDT(example)[,list(Title=head(Title,1), Email=mrg(Email), Phone=mrg(Phone)), by="Name"]
    
    #               Name           Title       Email                Phone
    # 1:   Oswald Gruber           Chair [email protected] +33 12345, +44 54321
    # 2: Kaspar Villiger       Comm.mngr [email protected]            +1 123456
    # 3:    Markus Urben Investment mngr [email protected]             +1 33333
    

    This returns the first Title in the natural order, which may or may not be what you want. It also consolidates emails properly and removes NAs.