Using stringdist in R


Let's say I have the following words:

word1 = 'john lennon'
word2 = 'john lenon'
word3 = 'lennon john'

It's fairly clear that these three strings refer to the same person. Running the following code:

> library(stringdist)
> stringdist('john lennon', 'john lenon', method = 'jw')
[1] 0.06363636
> stringdist('john lennon', 'lennon john', method = 'qgram')
[1] 0
> stringdist('john lennon', 'lennon john', method = 'jw')
[1] 0.33
> stringdist('john lennon', 'john lenon', method = 'qgram')
[1] 1

It's clear that qgram works better in this example, but that holds only for this case. My question is: how can I combine these two methods?

jw gives better results but can't catch reversed words (in my case, name-surname versus surname-name). Any advice?


Solution

  • I had an idea that seems computationally costly, but it gives quite good results.

    word1 = 'john lennon'
    word2 = 'john lenon'
    word3 = 'lennon john'
    

    First, remove the spaces:

    word1b = gsub(' ','',word1)
    word2b = gsub(' ','',word2)
    word3b = gsub(' ','',word3)
    

    Then sort the characters of each string alphabetically:

    word1c = paste(sort(unlist(strsplit(word1b, ""))), collapse = "")
    word2c = paste(sort(unlist(strsplit(word2b, ""))), collapse = "")
    word3c = paste(sort(unlist(strsplit(word3b, ""))), collapse = "")
    

    Finally, apply the jw method:

    stringdist(word1c,word2c,method = 'jw')
    [1] 0.03333333
    stringdist(word1c,word3c,method = 'jw')
    [1] 0
    stringdist(word2c,word3c,method = 'jw')
    [1] 0.03333333
    

    The results are satisfactory. Drawback: sorting the characters discards their order, so short words can give unwanted matches (any two anagrams end up at distance 0).
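
One possible way to soften that drawback is to sort whole words instead of single characters, so the letter order inside each word is kept. The sketch below is only an illustration: `char_sort_key`, `token_sort_key` and `name_dist` are hypothetical helper names, not part of stringdist, and the word-level variant is an assumption rather than something proposed in the answer above. Only `stringdist::stringdist(..., method = "jw")` is taken from the package.

```r
# Hypothetical helpers; these names are illustrative, not part of stringdist.

# Character-level key, mirroring the steps above:
# drop the spaces, then sort the remaining letters.
char_sort_key <- function(x) {
  vapply(strsplit(gsub(" ", "", x, fixed = TRUE), ""),
         function(chars) paste(sort(chars), collapse = ""),
         character(1))
}

# Word-level key (an assumption, not from the answer):
# sort whole words, keeping the letter order inside each word.
token_sort_key <- function(x) {
  vapply(strsplit(trimws(x), "[[:space:]]+"),
         function(words) paste(sort(words), collapse = " "),
         character(1))
}

# jw distance computed on the word-sorted keys.
# Needs the stringdist package, but only when the function is called.
name_dist <- function(a, b) {
  stringdist::stringdist(token_sort_key(a), token_sort_key(b), method = "jw")
}

token_sort_key(c("john lennon", "lennon john"))  # both become "john lennon"
char_sort_key(c("amy", "may"))                   # both become "amy"
token_sort_key(c("amy", "may"))                  # stay "amy" and "may"
```

With stringdist installed, `name_dist('john lennon', 'lennon john')` should be 0 because both keys are identical, while `name_dist('john lennon', 'john lenon')` is the plain jw distance from the question, since those two strings are already in sorted word order. The trade-off is that word-level sorting only helps when the difference is a reordering of whole words.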