Search code examples
rdata.tabledplyrstringdist

Computing the Levenshtein ratio of each element of a data.table with each value of a reference table and merge with maximum ratio


I have a data.table dt with 3 columns:

  • id
  • name as string
  • threshold as num

A sample is:

dt <- <- data.table(nid = c("n1","n2", "n3", "n4"), rname = c("apple", "pear", "banana", "kiwi"), maxr = c(0.5, 0.8, 0.7, 0.6))

nid | rname  | maxr
n1  | apple  |  0.5
n2  | pear   |  0.8
n3  | banana |  0.7
n4  | kiwi   |  0.6

I have a second table dt.ref with 2 columns:

  • id
  • name as string

A sample is:

dt.ref <- <- data.table(cid = c("c1", "c2", "c3", "c4", "c5", "c6"), cname = c("apple", "maple", "peer", "dear", "bonobo", "kiwis"))

cid | cname
c1  | apple
c2  | maple
c3  | peer
c4  | dear
c5  | bonobo
c6  | kiwis

For each rname of dt, I would like to compute the Levenshtein ratio with each cname of dt.ref as such:

Lr = 1 - (stringdist(cname, rname, method = "lv") / pmax(nchar(cname),nchar(rname)))

Then, I would like to find max(Lr) over the cname for each rname of dt and get as an output the following data.table:

nid | rname  | maxr | maxLr | cid
n1  | apple  |  0.5 | 1     | c1
n2  | pear   |  0.8 | 0.75  | c3
n2  | pear   |  0.8 | 0.75  | c4
n3  | banana |  0.7 | 0.33  | c5
n4  | kiwi   |  0.6 | 0.8   | c6

Basically, we take dt and add 2 columns, the maximum Levenshtein ratio and the corresponding cid, knowing that ties are all added, 1 per row as for n2.

I use data.table but the solution can use dplyr or any other package.


Solution

  • You can try something like this:

    f1 <- function(x, y) {
      require(stringdist)
      require(matrixStats)
      dis  <- stringdistmatrix(x, y, method = "lv")
      mat <- sapply(nchar(y), function(i) pmax(i, nchar(x)))
      r <- 1 - dis / mat
      w <- apply(r, 1, function(x) which(x==max(x)))
      m <- rowMaxs(r)
      list(m = m, w = w)
    }
    
    r <- f1(dt[[2]], dt.ref[[2]])
    r
    $m
    [1] 1.0000000 0.7500000 0.3333333 0.8000000
    
    $w
    $w[[1]]
    [1] 1
    
    $w[[2]]
    [1] 3 4
    
    $w[[3]]
    [1] 5
    
    $w[[4]]
    [1] 6
    
    
    dt[, maxLr := r$m ]
    #dtnew <- dt[rep(1:.N, sapply(r$w, length)),]
    dtnew <- dt[rep(1:.N, lengths(r$w),] # thanks to Frank
    dtnew[, cid := dt.ref[unlist(r$w), 1]]
    

    Results:

    dtnew
       nid  rname maxr     maxLr cid
    1:  n1  apple  0.5 1.0000000  c1
    2:  n2   pear  0.8 0.7500000  c3
    3:  n2   pear  0.8 0.7500000  c4
    4:  n3 banana  0.7 0.3333333  c5
    5:  n4   kiwi  0.6 0.8000000  c6