Search code examples
rstring-matchingdata-manipulation

R Exact match strings in two columns


I have a data frame of the following form:

Column1 = c('Elephant,Starship Enterprise,Cat','Random word','Word','Some more words, Even more words')
Column2=c('Rat,Starship Enterprise,Elephant','Ocean','No','more')
d1 = data.frame(Column1,Column2)

enter image description here

What I want to do is to look for and count the exact match of words in column 1 and column 2. Each column can have multiple words separated by a comma.

For example in row 1, we see there are two common words a) Starship Enterprise and b) Elephant. However, in row 4, even though the word "more" appears in both columns, the exact string (Some more words and Even more words) do not appear. The expected output would be something like this.

enter image description here

Any help will be appreciated.


Solution

  • Split columns on comma and count the intersection of words

    mapply(function(x, y) length(intersect(x, y)), 
            strsplit(d1$Column1, ","), strsplit(d1$Column2, ","))
    #[1] 2 0 0 0
    

    Or a tidyverse way

    library(tidyverse)
    d1 %>%
      mutate(Common = map2_dbl(Column1, Column2, ~ 
          length(intersect(str_split(.x, ",")[[1]], str_split(.y, ",")[[1]]))))
    
    
    #                           Column1                          Column2 Common
    #1 Elephant,Starship Enterprise,Cat Rat,Starship Enterprise,Elephant      2
    #2                      Random word                            Ocean      0
    #3                             Word                               No      0
    #4 Some more words, Even more words                             more      0