R: producing a list of near matches with stringdist and stringdistmatrix


I discovered the excellent package "stringdist" and now want to use it to compute string distances. In particular, I have a set of words and want to print out the near matches, where a "near match" is defined by a metric such as the Levenshtein distance.

I have extremely slow working code in a shell script, and I was able to load stringdist and produce a matrix of distances. Now I want to boil that matrix down to a smaller one that contains only the near matches, i.e. where the distance is non-zero but less than some threshold.

library(stringdist)
kp <- c('leaflet','leafletr','lego','levenshtein-distance','logo')
kpm <- stringdistmatrix(kp,useNames="strings",method="lv")
> kpm
                     leaflet leafletr lego levenshtein-distance
leafletr                   1                                   
lego                       5        6                          
levenshtein-distance      16       16   18                     
logo                       6        7    1                   19
m = as.matrix(kpm)
close = apply(m, 1, function(x) x>0 & x<5)
>  close
                     leaflet leafletr  lego levenshtein-distance  logo
 leaflet                FALSE     TRUE FALSE                FALSE FALSE
 leafletr                TRUE    FALSE FALSE                FALSE FALSE
 lego                   FALSE    FALSE FALSE                FALSE  TRUE
 levenshtein-distance   FALSE    FALSE FALSE                FALSE FALSE
 logo                   FALSE    FALSE  TRUE                FALSE FALSE

OK, now I have a (big) dist. How do I reduce it to a list where the output would be something like

leafletr,leaflet,1
logo,lego,1

for only those cases where the metric is non-zero and less than n=5? I found "apply()", which lets me do the test; now I need to work out how to use its result.

The problem is not specific to stringdist and stringdistmatrix and is very elementary R, but still I'm stuck. I suspect the answer involves subset(), but I don't know how to transform a "dist" into something else.


Solution

  • You can do this:

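    library(reshape2)
    lower <- m
    lower[upper.tri(lower, diag = TRUE)] <- NA  # blank out the diagonal and the mirrored upper half
    d <- melt(lower, na.rm = TRUE)
    out <- subset(d, value > 0 & value < 5)
    

    Here, melt brings the matrix into long form (two columns with the string names and one column with the value). Because the matrix is symmetric, every pair appears twice, once as (a, b) and once as (b, a). unique() would not remove these mirrored rows, since they differ in their first two columns, so we set the upper triangle to NA and melt with na.rm = TRUE to keep each pair once.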
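
    The same long-form reduction can also be sketched in base R alone, without reshape2. This is only an illustration: the names idx and near are made up here, and it assumes the matrix was built from the question's kp vector. which(..., arr.ind = TRUE) returns the row/column positions of the matching cells, and lower.tri() restricts the test to one half of the symmetric matrix.

    ```r
    library(stringdist)

    kp <- c('leaflet', 'leafletr', 'lego', 'levenshtein-distance', 'logo')
    m  <- as.matrix(stringdistmatrix(kp, useNames = "strings", method = "lv"))

    # Positions of cells below the diagonal with a non-zero distance under 5.
    # lower.tri() excludes the diagonal and keeps each unordered pair once.
    idx <- which(lower.tri(m) & m > 0 & m < 5, arr.ind = TRUE)

    # Long form: one row per near-matching pair (hypothetical column names)
    near <- data.frame(
      a     = rownames(m)[idx[, "row"]],
      b     = colnames(m)[idx[, "col"]],
      value = m[idx]
    )

    # One comma-separated line per pair, in the format the question asks for
    cat(paste(near$a, near$b, near$value, sep = ","), sep = "\n")
    # leafletr,leaflet,1
    # logo,lego,1
    ```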

    Another way is to use dplyr (since all the cool kids are using dplyr with pipes now):

    library(dplyr)
    library(reshape2)
    library(magrittr)
    
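    # distinct() keeps mirrored (a, b)/(b, a) rows, so blank out the upper triangle instead
    out <- replace(m, upper.tri(m, diag = TRUE), NA) %>%
      melt(na.rm = TRUE) %>% filter(value > 0 & value < 5)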
    

    This second option is probably faster but I have not really timed it.
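
    The speed question is easy to check on your own data. Below is a rough sketch, not a benchmark: the random six-letter strings are hypothetical test data, and it times only the step where the two options differ, subset() versus filter() on the same melted data frame.

    ```r
    library(stringdist)
    library(reshape2)
    library(dplyr)

    set.seed(1)
    # Hypothetical test data: up to 500 random six-letter lowercase strings
    words <- unique(replicate(500, paste(sample(letters, 6, replace = TRUE), collapse = "")))
    m <- as.matrix(stringdistmatrix(words, useNames = "strings", method = "lv"))

    long <- melt(m)  # long form: Var1, Var2, value

    # Time the base-R and dplyr filtering steps on identical input
    print(system.time(a <- subset(long, value > 0 & value < 5)))
    print(system.time(b <- filter(long, value > 0 & value < 5)))
    ```

    Timings from a single run like this are noisy, so repeat the calls a few times before drawing a conclusion.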