Tags: r, string, dataframe, data.table, strsplit

Merge columns in data.frame after removal of duplicate strings


I have a data.frame, data, of character vectors as follows.

x <- c("kal, Kon, Jor, Kara", "Bruce, Helena, Martha, Terry", "connor, oliver, Roy",  
       "Alan, Guy, Simon, Kyle")
y <- c("Mon, Cir, John, Jor", "Damian, Terry, Jason", "Mia, Roy", "John, Cary")
data <- data.frame(x,y, stringsAsFactors=FALSE)

I am trying to concatenate the strings in columns x and y into a new column z. Before concatenating the strings in a row, I want to remove duplicate words and sort them; the words are separated by a comma and a space. I am able to achieve this as follows.

x <- strsplit(data$x, split=", ")
y <- strsplit(data$y, split=", ")
data$z <- sapply(seq_along(x), function(i) paste(sort(union(x[[i]], y[[i]])), 
                                                collapse=", "))

Is there a faster way to do this without creating the intermediate lists, maybe using data.table?


Solution

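  • To build on your idea, you can do the following without creating intermediate lists. Note that this removes duplicates but keeps the words in order of first appearance; wrap the unique(...) call in sort() if you need them sorted: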

    data$z<-apply(data,1,function(vec){
                            paste(unique(strsplit(paste(vec[1],vec[2],sep=", "),", ")[[1]]),collapse=", ")
                          })
    
    > data
                                 x                    y                                           z
    1          kal, Kon, Jor, Kara  Mon, Cir, John, Jor         kal, Kon, Jor, Kara, Mon, Cir, John
    2 Bruce, Helena, Martha, Terry Damian, Terry, Jason Bruce, Helena, Martha, Terry, Damian, Jason
    3          connor, oliver, Roy             Mia, Roy                    connor, oliver, Roy, Mia
    4       Alan, Guy, Simon, Kyle           John, Cary          Alan, Guy, Simon, Kyle, John, Cary
    

    Although slower, base R is not that bad: on @akrun's 3e4-row dataset, cath() takes roughly 1.4 times as long as akrun2():

    >  microbenchmark(cath(), akrun2(), unit='relative', times=100L)
    Unit: relative
         expr      min       lq     mean   median       uq      max neval cld
       cath() 1.429732 1.425991 1.427143 1.427015 1.435986 1.360235   100   b
     akrun2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100  a
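
Since the question also asks for the words to be sorted, here is a minimal base R sketch that removes duplicates and sorts in one pass per row. The helper name merge_sorted is hypothetical (it does not come from the answer above), and this is one possible way to write it rather than the benchmarked cath() or akrun2() code:

```r
# Hypothetical helper (not part of the original answer): merge two
# comma-separated strings, drop duplicate words, and sort what remains.
merge_sorted <- function(a, b, sep = ", ") {
  # Split both strings on the literal separator and flatten to one vector.
  words <- unlist(strsplit(c(a, b), split = sep, fixed = TRUE))
  paste(sort(unique(words)), collapse = sep)
}

# Sample data from the question.
x <- c("kal, Kon, Jor, Kara", "Bruce, Helena, Martha, Terry",
       "connor, oliver, Roy", "Alan, Guy, Simon, Kyle")
y <- c("Mon, Cir, John, Jor", "Damian, Terry, Jason", "Mia, Roy", "John, Cary")
data <- data.frame(x, y, stringsAsFactors = FALSE)

# Apply the helper row by row; USE.NAMES = FALSE keeps the result unnamed.
data$z <- mapply(merge_sorted, data$x, data$y, USE.NAMES = FALSE)
```

sort() follows the collation rules of the current locale, so rows that mix lower-case and capitalised words (such as "kal" and "Kon") may be ordered differently on different systems.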