Search code examples
rlevenshtein-distance

Levenshtein edit distance with separator for multi-character units


I have searched through R functions adist, agrep, match, and stringdist, but have not found a method to calculate edit distance with a separator.

Existing edit distance:

“that” & ”fat” = 2  i.e., adist("that","fat")

Desired function would use a separator to denote multi-character units:

“th.a.t” & ”f.a.t” = 1

Solution

  • Levenshtein distance is easy to implement, so just calculate it. Here's a quick no guarantee version of the Wagner-Fischer algorithm (see Wikipedia)

    vecLeven <- function(s, t) {
      d <- matrix(0, nrow = length(s) + 1, ncol=length(t) + 1)
      d[, 1] <- (1:nrow(d)) - 1
      d[1,] <- (1:ncol(d))-1
      for (i in 1:length(s))  {
        for (j in 1:length(t)) {
          d[i+1, j+1] <- min(
            d[i, j+1] + 1, # deletion
            d[i+1, j] + 1, # insertion
            d[i, j] + if (s[i] == t[j]) 0 else 1 # substitution
          )
        }
      }
    
        d[nrow(d), ncol(d)]
    }
    
    sepLeven <- function(s, t, sep=".") {
      mapply(vecLeven, 
             strsplit(s, sep, fixed=TRUE), 
             strsplit(t, sep, fixed=TRUE))
    }
    
    sepLeven(c("th.a.t", "t.e.s.t"), c("f.a.t", "f.e.t"), sep=".")
    # output: [1] 1 2