
How do I find differing words in two strings, sentence-wise?


I am comparing two similar texts: x1 is the model text and x2 is the text with mistakes (e.g. misspellings, extra characters). I am trying to remove from x2 the words found in both texts, so that only the differing words remain. Since my actual text is not in English, I cannot use a dictionary.

What I have tried is to step through each character of x1 and, if the same character appears in x2, delete it from x2 and move on to the next character of x1.

Code I've been working on:

x1 <- "This is a test. Weather is fine. What do I do? I am clueless this coding. Let’s do this as soon as possible." 
x2 <- "This text is a test. This weather is fine. What id I do? I am clueless thius coding. Ley’s do ythis as soon as possiblke."

library(tidyverse)
x1 <- str_split(x1, "(?<=\\.)\\s")
x1 <- lapply(x1,tolower)
x2 <- str_split(x2, "(?<=\\.)\\s")
x2 <- lapply(x2,tolower)

delete_a_from_b <- function(a,b) {

  a_as_list <- str_remove_all(a,"word") %>% 
    str_split(boundary("character")) %>% unlist

  b_n <- nchar(b)

  b_as_list <- str_remove_all(b,"word") %>% 
    str_split(boundary("character")) %>% unlist

  previous_j <-1

  for (i in 1:length(a_as_list)) {
    if(previous_j > length(b_as_list)) 
      break
    for (j in previous_j:length(b_as_list)){
      if(a_as_list[[i]]==b_as_list[[j]]){
        b_as_list[[j]] <- ""
        previous_j <- j+1
        break
      }
    }
  }

  print(paste0(b_as_list,collapse = ""))
  paste0(b_as_list,collapse = "")
}

x3 <- delete_a_from_b(x1,x2)
x3 <- strsplit(x3,"\\s")

Output:

x3
[[1]]
 [1] "text"       "this"       "i"          "i"          "d?am"       "clueless"   "thius"      "coing.\"," 
 [9] "\"ley’s"    "dythsssoon" "as"         "possibk"   

What I want as a result is: 'text' 'this' 'id' 'thius' 'ley’s' 'ythis' 'possiblke'


Solution

  • I take it you want to compare the two strings x1 and x2 sentence by sentence, although that is not entirely clear from the question. The previous solutions do not take this into account. Try this:

    First, split both strings into sentences:

    x1_sentences <- unlist(strsplit(tolower(x1), split = "[.?!] "))
    x2_sentences <- unlist(strsplit(tolower(x2), split = "[.?!] "))
    length(x1_sentences)  == length(x2_sentences) # Make sure same number of resulting sentences
    

    Then, for each sentence, split both sentence strings into words and show the words that differ:

    for (i in seq_along(x1_sentences)) {
      x1_vector <- unlist(strsplit(x1_sentences[i], split = "[ ]"))
      x2_vector <- unlist(strsplit(x2_sentences[i], split = "[ ]"))
      # Order matters: setdiff(a, b) returns the elements of a that are not in b
      print(setdiff(x2_vector, x1_vector))
    }
    

    This prints the following, which you can easily collect into a new vector:

    [1] "text"
    [1] "this"
    [1] "id"
    [1] "thius"
    [1] "ley’s"      "ythis"      "possiblke."
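
One possible way to collect these per-sentence differences into a single vector is sketched below. The function name `diff_words_by_sentence` and its helper `split_words` are hypothetical and not part of the original answer. The sketch assumes both texts split into the same number of sentences, and it additionally strips trailing punctuation from each word so that the last word comes out as "possiblke" rather than "possiblke.". The example strings use ASCII apostrophes in place of the typographic ones in the question.

```r
# Hypothetical helper (not from the original answer): returns, as one
# character vector, the words of each sentence in `observed` that do not
# occur in the matching sentence of `model`.
diff_words_by_sentence <- function(model, observed) {
  split_words <- function(sentence) {
    words <- unlist(strsplit(sentence, split = "\\s+"))
    # Assumption: trailing punctuation is not part of the word
    words <- gsub("[.?!,;:]+$", "", words)
    words[nzchar(words)]
  }

  # Same sentence split as in the answer above
  m <- unlist(strsplit(tolower(model), split = "[.?!] "))
  o <- unlist(strsplit(tolower(observed), split = "[.?!] "))
  stopifnot(length(m) == length(o))  # sentences must line up one-to-one

  per_sentence <- Map(
    function(ms, os) setdiff(split_words(os), split_words(ms)),
    m, o
  )
  unlist(per_sentence, use.names = FALSE)
}

x1 <- "This is a test. Weather is fine. What do I do? I am clueless this coding. Let's do this as soon as possible."
x2 <- "This text is a test. This weather is fine. What id I do? I am clueless thius coding. Ley's do ythis as soon as possiblke."

result <- diff_words_by_sentence(x1, x2)
print(result)
```

Note that `setdiff` treats each sentence as a set of words: it ignores word order and repeated words, so a word that is wrong in one position but also appears correctly elsewhere in the same sentence would not be reported.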