Search code examples
rstring-comparisonedit-distance

Similarity scores based on string comparison in R (edit distance)


I am trying to assign similarity score based on comparison between 2 strings. Is there a function for the same in R. I am aware of such a function in SAS by the name of SPEDIS. Please let me know if there is such a function in R.


Solution

  • The function adist computes the Levenshtein edit distance between two strings. This can be transformed into a similarity metric as 1 - (Levenshtein edit distance / longer string length).

    The levenshteinSim function in the RecordLinkage package also does this directly, and might be faster than adist.

    library(RecordLinkage)
    > levenshteinSim("apple", "apple")
    [1] 1
    > levenshteinSim("apple", "aaple")
    [1] 0.8
    > levenshteinSim("apple", "appled")
    [1] 0.8333333
    > levenshteinSim("appl", "apple")
    [1] 0.8
    

    ETA: Interestingly, while levenshteinDist in the RecordLinkage package appears to be slightly faster than adist, levenshteinSim is considerably slower than either. Using the rbenchmark package:

    > benchmark(levenshteinDist("applesauce", "aaplesauce"), replications=100000)
                                             test replications elapsed relative
    1 levenshteinDist("applesauce", "aaplesauce")       100000   4.012        1
      user.self sys.self user.child sys.child
    1     3.583    0.452          0         0
    > benchmark(adist("applesauce", "aaplesauce"), replications=100000)
                                   test replications elapsed relative user.self
    1 adist("applesauce", "aaplesauce")       100000   4.277        1     3.707
      sys.self user.child sys.child
    1    0.461          0         0
    > benchmark(levenshteinSim("applesauce", "aaplesauce"), replications=100000)
                                            test replications elapsed relative
    1 levenshteinSim("applesauce", "aaplesauce")       100000   7.206        1
      user.self sys.self user.child sys.child
    1      6.49    0.743          0         0
    

    This overhead is due simply to the code for levenshteinSim, which is just a wrapper around levenshteinDist:

    > levenshteinSim
    function (str1, str2) 
    {
        return(1 - (levenshteinDist(str1, str2)/pmax(nchar(str1), 
            nchar(str2))))
    }
    

    FYI: if you are always comparing two strings rather than vectors, you can create a new version that uses max instead of pmax and shave ~25% off the running time:

    mylevsim = function (str1, str2) 
    {
        return(1 - (levenshteinDist(str1, str2)/max(nchar(str1), 
            nchar(str2))))
    }
    > benchmark(mylevsim("applesauce", "aaplesauce"), replications=100000)
                                      test replications elapsed relative user.self
    1 mylevsim("applesauce", "aaplesauce")       100000   5.608        1     4.987
      sys.self user.child sys.child
    1    0.627          0         0
    

    Long story short- there is little difference between adist and levenshteinDist in terms of performance, though the former is preferable if you don't want to add package dependencies. How you turn it into a similarity measure does have a bit of an effect on performance.