I am trying to find the common words between two columns for each row of a data frame. For example, my input is:
C1 | C2
Roy goes to Japan | Roy goes to Australia
I go to Japan | He goes to Japan
And I need a column appended as
C1 | C2 | Result
Roy goes to Japan | Roy goes to Australia | Roy goes to
I go to Japan | He goes to Japan | to Japan
I tried intersect, but it gives me the intersection between C1 and C2 as whole columns, not between the words of C1 and C2 in each row. I guess I'll have to use something from stringr or stringi, but I'm not sure what. Also, my dataset is huge, so something fast would be nice.
You could split each string on whitespace and then use intersect to find the common words.
df$result <- mapply(function(x, y) paste0(intersect(x, y), collapse = " "),
                    strsplit(df$C1, '\\s'), strsplit(df$C2, '\\s'))
df
# C1 C2 result
#1 Roy goes to Japan Roy goes to Australia Roy goes to
#2 I go to Japan He goes to Japan to Japan
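Since the question asks for something fast, here is a minimal base R sketch of the same split-and-intersect idea wrapped in a function. The name common_words and its sep argument are made up for illustration. It assumes both columns are character vectors of equal length and that words are separated by a single literal space, which lets strsplit use fixed = TRUE instead of a regular expression; that is usually cheaper, but it is worth timing on your own data.

```r
# Hypothetical helper (not part of the original answer):
# returns the words shared by a[i] and b[i], in the order they appear in a[i].
common_words <- function(a, b, sep = " ") {
  # fixed = TRUE splits on the literal separator and skips the regex engine
  a_words <- strsplit(a, sep, fixed = TRUE)
  b_words <- strsplit(b, sep, fixed = TRUE)
  vapply(
    seq_along(a_words),
    function(i) paste(intersect(a_words[[i]], b_words[[i]]), collapse = " "),
    character(1)
  )
}

df <- data.frame(
  C1 = c("Roy goes to Japan", "I go to Japan"),
  C2 = c("Roy goes to Australia", "He goes to Japan"),
  stringsAsFactors = FALSE
)
df$result <- common_words(df$C1, df$C2)
```

Matching is case-sensitive, and a row with no common words gets an empty string.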
You could also do this with tidyverse:
library(tidyverse)
df %>%
  mutate(result = map2_chr(str_split(C1, '\\s'), str_split(C2, '\\s'),
                           ~str_c(intersect(.x, .y), collapse = " ")))
data
df <- structure(list(C1 = c("Roy goes to Japan", "I go to Japan"),
C2 = c("Roy goes to Australia", "He goes to Japan")), row.names = c(NA,
-2L), class = "data.frame")