I have a data.frame, data, of character vectors as follows.
x <- c("kal, Kon, Jor, Kara", "Bruce, Helena, Martha, Terry", "connor, oliver, Roy",
"Alan, Guy, Simon, Kyle")
y <- c("Mon, Cir, John, Jor", "Damian, Terry, Jason", "Mia, Roy", "John, Cary")
data <- data.frame(x,y, stringsAsFactors=FALSE)
I am trying to concatenate the strings in the two columns x and y into a new column z. Before concatenating the strings in a row, I want to remove the duplicates and sort the words, which are separated by ", ". I am able to achieve this as follows.
x <- strsplit(data$x, split=", ")
y <- strsplit(data$y, split=", ")
# For each row: take the union of both word lists, sort it, and paste it back together
data$z <- sapply(seq_along(x), function(i) {
  paste(sort(union(x[[i]], y[[i]])), collapse=", ")
})
Is there a faster way to do this without creating the intermediate lists, maybe using data.table?
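Building on your idea, you can do this without creating the intermediate lists. Note that this version keeps the words in order of first appearance; wrap unique() in sort() if you need them sorted: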
data$z <- apply(data, 1, function(vec) {
  # Join x and y into one string, split on ", ", drop duplicates, paste back
  paste(unique(strsplit(paste(vec[1], vec[2], sep=", "), ", ")[[1]]), collapse=", ")
})
> data
                             x                    y                                           z
1          kal, Kon, Jor, Kara  Mon, Cir, John, Jor         kal, Kon, Jor, Kara, Mon, Cir, John
2 Bruce, Helena, Martha, Terry Damian, Terry, Jason Bruce, Helena, Martha, Terry, Damian, Jason
3          connor, oliver, Roy             Mia, Roy                    connor, oliver, Roy, Mia
4       Alan, Guy, Simon, Kyle           John, Cary          Alan, Guy, Simon, Kyle, John, Cary
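If you need the sorted order asked for in the question, here is a minimal sketch of one way to get it. This is my own variation and is not part of the benchmark in this answer. The helper name merge_words is made up for illustration, and mapply() is used instead of apply() only so that the data.frame is not coerced to a matrix first.

```r
# Rebuild the example data from the question.
x <- c("kal, Kon, Jor, Kara", "Bruce, Helena, Martha, Terry", "connor, oliver, Roy",
       "Alan, Guy, Simon, Kyle")
y <- c("Mon, Cir, John, Jor", "Damian, Terry, Jason", "Mia, Roy", "John, Cary")
data <- data.frame(x, y, stringsAsFactors = FALSE)

# merge_words is a hypothetical helper (not from the original post):
# split both strings on ", ", drop duplicates, sort, and paste back together.
merge_words <- function(a, b) {
  words <- unlist(strsplit(c(a, b), split = ", ", fixed = TRUE))
  paste(sort(unique(words)), collapse = ", ")
}

# Apply the helper to each pair of elements from the two columns.
data$z <- mapply(merge_words, data$x, data$y, USE.NAMES = FALSE)
print(data$z)
```

Keep in mind that sort() on character vectors depends on the locale, so the relative order of upper- and lower-case words (for example kal and Kara) may differ between machines.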
Although slower, base R holds up reasonably well. On @akrun's 3e4-row dataset it takes roughly 1.43 times as long as the data.table version:
> microbenchmark(cath(), akrun2(), unit='relative', times=100L)
Unit: relative
     expr      min       lq     mean   median       uq      max neval cld
   cath() 1.429732 1.425991 1.427143 1.427015 1.435986 1.360235   100   b
 akrun2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100  a
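To get a rough feel for the timing on your own machine, something like the sketch below may help. It rests on an assumption: the exact construction of the 3e4-row dataset is not shown here, so I simply repeat the four example rows 7500 times. The wrapper name cath mirrors the label in the benchmark above, and system.time() stands in for microbenchmark so that no extra package is needed.

```r
# Assumption: the larger dataset is just the four example rows repeated
# 7500 times (3e4 rows); the data used in the benchmark above may differ.
x <- c("kal, Kon, Jor, Kara", "Bruce, Helena, Martha, Terry", "connor, oliver, Roy",
       "Alan, Guy, Simon, Kyle")
y <- c("Mon, Cir, John, Jor", "Damian, Terry, Jason", "Mia, Roy", "John, Cary")
big <- data.frame(x = rep(x, 7500), y = rep(y, 7500), stringsAsFactors = FALSE)

# Row-wise apply() approach from this answer, wrapped in a function.
# It keeps words in order of first appearance (no sorting).
cath <- function(d) {
  apply(d[, c("x", "y")], 1, function(vec) {
    paste(unique(strsplit(paste(vec[1], vec[2], sep = ", "), ", ")[[1]]),
          collapse = ", ")
  })
}

# system.time() gives a single-run timing only; treat it as a rough guide.
timing <- system.time(z <- cath(big))
print(timing[["elapsed"]])
```

A single run is noisy, so for a real comparison it is better to use microbenchmark with several repetitions, as in the output above.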