If I have a set of sentences and want to extract the duplicates, I can proceed as in the following example:

sentences<-c("So there I was at the mercy of three monstrous trolls",
         "Today is my One Hundred and Eleventh birthday",
         "I'm sorry I brought this upon you, my",
         "So there I was at the mercy of three monstrous trolls",
         "Today is my One Hundred and Eleventh birthday",
         "I'm sorry I brought this upon you, my")


which returns:

[1] "So there I was at the mercy of three monstrous trolls"
[2] "Today is my One Hundred and Eleventh birthday"        
[3] "I'm sorry I brought this upon you, my"

But in my case the sentences are only similar to each other (due to typos, for example), and I would like to pick out the ones that are most similar to each other. For example:

sentences<-c("So there I was at the mercy of three monstrous trolls",
             "Today is my One Hundred and Eleventh birthday",
             "I'm sorry I brrrought this upon, my",
             "So there I was at mercy of three monstrous troll",
             "Today is One Hundred Eleventh birthday",
             "I'm sorry I brought this upon you, my")

In this example, I would like to select one sentence from each of the following pairs:

I'm sorry I brought this upon you, my
I'm sorry I brrrought this upon, my

Today is One Hundred Eleventh birthday
Today is my One Hundred and Eleventh birthday

So there I was at the mercy of three monstrous trolls
So there I was at mercy of three monstrous troll

The levenshteinSim function in the RecordLinkage package could help me: called on pairs of sentences, it returns values near 1 for the most similar ones. I could write a double for loop and select those pairs of sentences whose Levenshtein similarity is greater than some threshold (e.g., 0.7). But isn't there a simpler way of doing this?


  • You could calculate an approximate string distance matrix using adist, which is based on a generalized Levenshtein distance, and then do hierarchical clustering on it using hclust.

    ld  <- adist(tolower(sentences))
    hc <- hclust(as.dist(ld))
    data.frame(x=sentences, cl=cutree(hc, h=10))
    #                                                       x cl
    # 1 So there I was at the mercy of three monstrous trolls  1
    # 2         Today is my One Hundred and Eleventh birthday  2
    # 3                   I'm sorry I brrrought this upon, my  3
    # 4      So there I was at mercy of three monstrous troll  1
    # 5                Today is One Hundred Eleventh birthday  2
    # 6                 I'm sorry I brought this upon you, my  3
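
    Since the goal is to keep one sentence from each group, the cluster labels can be reused for that. A minimal sketch, assuming that keeping the first sentence encountered in every cluster is acceptable (the names cl and representatives are only illustrative):

    ```r
    sentences <- c("So there I was at the mercy of three monstrous trolls",
                   "Today is my One Hundred and Eleventh birthday",
                   "I'm sorry I brrrought this upon, my",
                   "So there I was at mercy of three monstrous troll",
                   "Today is One Hundred Eleventh birthday",
                   "I'm sorry I brought this upon you, my")

    # Same steps as above: edit distances, then complete-linkage clustering
    ld <- adist(tolower(sentences))
    hc <- hclust(as.dist(ld))
    cl <- cutree(hc, h = 10)

    # Keep the first sentence that appears in each cluster
    representatives <- sentences[!duplicated(cl)]
    representatives
    ```

    Another selection rule (for instance, the longest sentence per cluster) could be substituted for the !duplicated(cl) step.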

    To find an appropriate value for h in cutree we may plot the dendrogram.

    plot(hc)                    # draw the dendrogram first
    abline(h=10, col=2, lty=2)  # dashed red line at the chosen cut height
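
    If a similarity threshold such as the 0.7 mentioned in the question is preferred over a cut height, the same distance matrix can be rescaled and filtered without an explicit double loop. This is only a sketch: it assumes similarity is defined as 1 minus the edit distance divided by the length of the longer string, which, as far as I know, is how levenshteinSim normalises. The names len, sim and pairs are illustrative.

    ```r
    sentences <- c("So there I was at the mercy of three monstrous trolls",
                   "Today is my One Hundred and Eleventh birthday",
                   "I'm sorry I brrrought this upon, my",
                   "So there I was at mercy of three monstrous troll",
                   "Today is One Hundred Eleventh birthday",
                   "I'm sorry I brought this upon you, my")

    ld  <- adist(sentences)                 # pairwise edit distances
    len <- nchar(sentences)
    sim <- 1 - ld / outer(len, len, pmax)   # assumed normalisation to [0, 1]

    # Index pairs (row < col) whose similarity exceeds the threshold
    pairs <- which(sim > 0.7 & upper.tri(sim), arr.ind = TRUE)

    data.frame(a   = sentences[pairs[, "row"]],
               b   = sentences[pairs[, "col"]],
               sim = sim[pairs])
    ```

    The upper.tri mask keeps each pair once and drops the diagonal, so a sentence is never matched with itself.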

