
R Find similar sentences in texts


I have a problem for which I'm struggling to find a solution, or even an approach.

I have some model sentences, e.g.

model_sentences = data.frame("model_id" = c("model_id_1", "model_id_2"), "model_text" = c("Company x had 3000 employees in 2016.",
                                                                                          "Google makes 300 dollar in revenue in 2018."))

and some texts

data = data.frame("id" = c("id1", "id2"), "text" = c("Company y is expected to employ 2000 employees in 2020. This is an increase of 10%. Some stupid sentences.",
                                                     "Amazon´s revenue is 400 dollar in 2020. That is twice as much as last year."))

and I would like to extract sentences from those texts which are similar to the model sentences.

My desired result would look something like this:

result = data.frame("id" = c("id1", "id2"), "model_id" = c("model_id_1", "model_id_2"), "sentence_from_data" = c("Company y is expected to employ 2000 employees in 2020.", "Amazon´s revenue is 400 dollar in 2020."), "score" = c(0.5, 0.4))

Perhaps it is possible to compute some kind of 'similarity_score'.

I use this function to split texts by sentence:

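library(stringi)  # provides stri_trim_both()

split_by_sentence <- function(text) {
  # Split on whitespace that follows four alphanumeric characters and a ".", "!" or "?"
  result <- unlist(strsplit(text, "(?<=[[:alnum:]]{4}[?!.])\\s+", perl = TRUE))

  # Trim surrounding whitespace and drop empty pieces
  result <- stri_trim_both(result)
  result <- result[nchar(result) > 0]

  # Return a single empty string when nothing is left
  if (length(result) == 0)
    result <- ""

  return(result)
}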

But I have no idea how to compare each sentence to a model sentence. I'd be glad for any suggestions.


Solution

  • Check out the stringdist package.

    Example:

    library(stringdist)
    mysent = "This is a sentence"
    # stringdist() is vectorized, so every model sentence is compared to mysent in one call
    stringdist(as.character(model_sentences$model_text), mysent, method = "jaccard")
    

    This returns the Jaccard distance between mysent and each model_text value. The smaller the value, the more similar the two sentences are under the chosen distance measure.
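
To get from single distances to the table asked for in the question, the two pieces can be combined: split each text into sentences, compare every sentence with every model sentence, and keep the closest pair. The following is only a minimal sketch under a few assumptions that are not in the original post: the helper names split_sentences and best_matches are hypothetical, the strings are lower-cased before comparison, q = 2 (character bigrams) is used instead of the stringdist default, and the score is reported as 1 minus the Jaccard distance so that larger means more similar. The sample data is copied from the question, with a plain apostrophe in "Amazon's".

```r
library(stringdist)

model_sentences <- data.frame(
  model_id   = c("model_id_1", "model_id_2"),
  model_text = c("Company x had 3000 employees in 2016.",
                 "Google makes 300 dollar in revenue in 2018."),
  stringsAsFactors = FALSE
)

data <- data.frame(
  id   = c("id1", "id2"),
  text = c("Company y is expected to employ 2000 employees in 2020. This is an increase of 10%. Some stupid sentences.",
           "Amazon's revenue is 400 dollar in 2020. That is twice as much as last year."),
  stringsAsFactors = FALSE
)

# Same splitting rule as in the question, written with base R only (hypothetical helper).
split_sentences <- function(text) {
  parts <- unlist(strsplit(text, "(?<=[[:alnum:]]{4}[?!.])\\s+", perl = TRUE))
  parts <- trimws(parts)
  parts[nchar(parts) > 0]
}

# Hypothetical helper: for each text, keep the sentence/model pair with the smallest distance.
# method and q are passed on to stringdistmatrix(); q = 2 is an assumption, not a recommendation.
best_matches <- function(data, model_sentences, method = "jaccard", q = 2) {
  rows <- lapply(seq_len(nrow(data)), function(i) {
    sentences <- split_sentences(data$text[i])
    if (length(sentences) == 0) return(NULL)

    # Distance matrix: one row per sentence, one column per model sentence.
    d <- stringdistmatrix(tolower(sentences),
                          tolower(model_sentences$model_text),
                          method = method, q = q)

    # Position (row, column) of the smallest distance.
    idx <- arrayInd(which.min(d), dim(d))
    s <- idx[1, 1]
    m <- idx[1, 2]

    data.frame(id = data$id[i],
               model_id = model_sentences$model_id[m],
               sentence_from_data = sentences[s],
               score = 1 - d[s, m],  # similarity = 1 - Jaccard distance
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

result <- best_matches(data, model_sentences)
print(result)
```

Whether bigrams, lower-casing, or the Jaccard measure are the right choices depends on the data; other values of method accepted by stringdist (for example "cosine" or "lv") can be passed in the same way, and a minimum score could be used to drop texts that match no model sentence well. Note also that the splitting rule from the question only breaks after four alphanumeric characters followed by ".", "!" or "?", so "This is an increase of 10%. Some stupid sentences." is treated as a single sentence.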