r, string, text, iteration, corpus

r compare text in two columns by row


I would like to compare the text in column X1 with the text in column X2 and produce a list of words that appear in X1 but not in X2, and vice versa. For example:

df <- data.frame("X1" = c("the fox ate grapes", "the cat ate"), "X2" = c("the fox ate watermelon", "the cat ate backwards"))

I'm trying to generate columns such as X3 - grapes watermelon, X4 - backwards.

The data frame has hundreds of rows, and the text in some cells is up to 50 words or so.


Solution

  • I don't understand how you want to organize the output in X3 and X4, but maybe this helps:

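    library(magrittr)  # provides %>%
    library(stringr)   # provides str_split()
    # Pool the words of each column across all rows, keeping each word once
    words_x1 <- (df$X1 %>% paste(collapse = " ") %>% str_split(" "))[[1]] %>% unique()
    words_x2 <- (df$X2 %>% paste(collapse = " ") %>% str_split(" "))[[1]] %>% unique()
    
    # Words found in only one of the two columns
    c(words_x1[!(words_x1 %in% words_x2)], words_x2[!(words_x2 %in% words_x1)])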
    

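    I think what you want to achieve is something like this. Note that I am using a tibble, as it did not seem to work with a data.frame (probably because, before R 4.0, data.frame() converts strings to factors by default, and strsplit() does not accept factors).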

    library(dplyr)
    library(purrr)
    
    df <- tibble(
      X1 = c("the fox ate grapes", "the cat ate"),
      X2 = c("the fox ate watermelon", "the cat ate backwards")
    )
    # Return the words that appear in only one of the two strings
    myfunction <- function(x1, x2) {
      w1 <- strsplit(x1, " ")[[1]]  # split each string into words
      w2 <- strsplit(x2, " ")[[1]]
      c(w1[!(w1 %in% w2)], w2[!(w2 %in% w1)])
    }
    
    map2(df$X1, df$X2, myfunction)
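
If the goal is to store the result as a new column rather than a list, one possible sketch in base R is below. It assumes a single column X3 holding the differing words joined by spaces, which is only one reading of the question. The helper name diff_words is made up for illustration, and it uses setdiff(), which drops repeated words, unlike the %in% version above.

```r
# Hypothetical helper: words that occur in exactly one of the two strings,
# joined back into a single space-separated string.
diff_words <- function(x1, x2) {
  w1 <- strsplit(x1, " ", fixed = TRUE)[[1]]
  w2 <- strsplit(x2, " ", fixed = TRUE)[[1]]
  paste(c(setdiff(w1, w2), setdiff(w2, w1)), collapse = " ")
}

df <- data.frame(
  X1 = c("the fox ate grapes", "the cat ate"),
  X2 = c("the fox ate watermelon", "the cat ate backwards"),
  stringsAsFactors = FALSE  # keep the text as character, not factor
)

# One result string per row; USE.NAMES = FALSE drops the names mapply() adds
df$X3 <- mapply(diff_words, df$X1, df$X2, USE.NAMES = FALSE)
df$X3
# [1] "grapes watermelon" "backwards"
```

Setting stringsAsFactors = FALSE makes this behave the same on R versions before and after 4.0. To keep the two directions apart, the two setdiff() calls could be returned separately and assigned to X3 and X4.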