
Find the intersection between strings in 2 columns


I am trying to find the words common to two columns, row by row, in a data frame. For example, my input is:

C1                | C2
Roy goes to Japan | Roy goes to Australia 
I go to Japan     | He goes to Japan

And I need a result column appended, like this:

C1                | C2                    | Result
Roy goes to Japan | Roy goes to Australia | Roy goes to
I go to Japan     | He goes to Japan      | to Japan

I tried intersect, but it gives me the intersection between the C1 and C2 columns as a whole, not between the corresponding elements of C1 and C2 in each row. I guess I'll have to use something from stringr or stringi, but I'm not sure what. Also, my dataset is huge, so something fast would be nice.


Solution

  • You could split the string on whitespace and then use intersect to find the common words.

    # Split each row into words, then intersect the two word vectors pairwise.
    df$result <- mapply(function(x, y) paste0(intersect(x, y), collapse = " "),
                        strsplit(df$C1, '\\s'), strsplit(df$C2, '\\s'))
    df
    #                 C1                    C2      result
    #1 Roy goes to Japan Roy goes to Australia Roy goes to
    #2     I go to Japan      He goes to Japan    to Japan
    

    You could also do this with the tidyverse:

    library(tidyverse)
    df %>%
      mutate(result = map2_chr(str_split(C1, '\\s'), str_split(C2, '\\s'), 
                               ~str_c(intersect(.x, .y), collapse = " ")))
    

    data

    df <- structure(list(C1 = c("Roy goes to Japan", "I go to Japan"), 
        C2 = c("Roy goes to Australia", "He goes to Japan")), row.names = c(NA, 
    -2L), class = "data.frame")
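
As a self-contained check of the base R approach above, here is a minimal sketch that wraps it in a helper. The name `common_words` is hypothetical (it is not part of the original answer), and splitting on `\\s+` with `trimws` is an added assumption so that repeated or leading spaces do not produce empty tokens. It also shows why calling intersect on the two columns directly does not help: that compares whole strings, not words.

```r
# Rebuild the example data from the answer so the sketch runs on its own.
df <- data.frame(
  C1 = c("Roy goes to Japan", "I go to Japan"),
  C2 = c("Roy goes to Australia", "He goes to Japan"),
  stringsAsFactors = FALSE
)

# Column-level intersect compares whole strings, so nothing matches here.
intersect(df$C1, df$C2)

# Hypothetical helper (not from the original answer): row-wise common words.
common_words <- function(a, b) {
  # Assumption: split on runs of whitespace to avoid empty tokens.
  a_words <- strsplit(trimws(a), "\\s+")
  b_words <- strsplit(trimws(b), "\\s+")
  # intersect() keeps the order of its first argument and drops duplicates;
  # the comparison is case-sensitive.
  mapply(function(x, y) paste(intersect(x, y), collapse = " "),
         a_words, b_words, USE.NAMES = FALSE)
}

df$result <- common_words(df$C1, df$C2)
df$result
```

Rows with no words in common come back as an empty string rather than NA. Whether this is fast enough for a very large dataset would need to be measured on the real data.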