I discovered the excellent package "stringdist" and now want to use it to compute string distances. In particular, I have a set of words and want to print out near matches, where a "near match" is defined by some metric such as the Levenshtein distance.
I have extremely slow working code in a shell script, and I was able to load stringdist and produce a matrix of distances. Now I want to boil that matrix down to a smaller one that contains only the near matches, i.e. where the metric is non-zero but less than some threshold.
library(stringdist)
kp <- c('leaflet','leafletr','lego','levenshtein-distance','logo')
kpm <- stringdistmatrix(kp, useNames="strings", method="lv")
> kpm
                     leaflet leafletr lego levenshtein-distance
leafletr                   1
lego                       5        6
levenshtein-distance      16       16   18
logo                       6        7    1                   19
m = as.matrix(kpm)
close = apply(m, 1, function(x) x>0 & x<5)
> close
                     leaflet leafletr  lego levenshtein-distance  logo
leaflet                FALSE     TRUE FALSE                FALSE FALSE
leafletr                TRUE    FALSE FALSE                FALSE FALSE
lego                   FALSE    FALSE FALSE                FALSE  TRUE
levenshtein-distance   FALSE    FALSE FALSE                FALSE FALSE
logo                   FALSE    FALSE  TRUE                FALSE FALSE
OK, now I have a (big) dist. How do I reduce it to a list, where the output would be something like
leafletr,leaflet,1
logo,lego,1
for only those cases where the metric is non-zero and less than n=5? I found "apply()", which lets me do the test; now I need to sort out how to use its result.
The problem is not specific to stringdist and stringdistmatrix and is very elementary R, but still I'm stuck. I suspect the answer involves subset(), but I don't know how to transform a "dist" into something else.
You can do this:
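library(reshape2)
lower <- replace(m, upper.tri(m, diag = TRUE), NA)  # keep each pair only once
d <- melt(lower, na.rm = TRUE)
out <- subset(d, value > 0 & value < 5)
Here, melt brings the matrix into long form (two columns with the string names and one column with the value). Since m is symmetric, each pair would otherwise show up twice, as (a, b) and as (b, a), so we first set the upper triangle and the diagonal to NA and let melt drop those entries with na.rm = TRUE.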
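The same reduction can also be sketched in base R, without reshape2, by asking which() for the positions below the diagonal that pass the test. This is only a minimal sketch: so that it runs without extra packages, it assumes base R's adist() (generalized Levenshtein distance) as a stand-in for stringdistmatrix(), and the column names a, b and dist are made up for illustration.

```r
# Assumption: adist() from base R (utils) stands in for
# stringdistmatrix(kp, method = "lv") so the sketch needs no extra packages.
kp <- c('leaflet', 'leafletr', 'lego', 'levenshtein-distance', 'logo')
m <- adist(kp)
dimnames(m) <- list(kp, kp)

# Positions below the diagonal (each pair once) that pass the threshold test
idx <- which(lower.tri(m) & m > 0 & m < 5, arr.ind = TRUE)

# Illustrative column names; not part of the original answer
out <- data.frame(a = rownames(m)[idx[, "row"]],
                  b = colnames(m)[idx[, "col"]],
                  dist = m[idx])

# Print as comma-separated lines, one pair per line
write.table(out, sep = ",", quote = FALSE, row.names = FALSE, col.names = FALSE)
```

With the example words this prints the two lines asked for in the question, leafletr,leaflet,1 and logo,lego,1. With a real stringdist result, m would simply be as.matrix(kpm).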
Another way is to use dplyr (since all the cool kids are using dplyr with pipes now):
library(dplyr)      # also provides the %>% pipe
library(reshape2)
out <- m %>%
  replace(upper.tri(m, diag = TRUE), NA) %>%  # keep each pair only once
  melt(na.rm = TRUE) %>%
  filter(value > 0 & value < 5)
This second option is probably faster but I have not really timed it.
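If the dist really is big, it may be worth skipping as.matrix() altogether, since a "dist" object already stores just the lower triangle as a plain vector. The sketch below works on that vector directly. It is untimed, so any speed or memory benefit is an assumption; adist() is again used as a base R stand-in for stringdistmatrix(), and near, row_idx and col_idx are hypothetical names.

```r
# Assumption: adist() from base R stands in for stringdistmatrix();
# as.dist() then gives an object of the same "dist" class.
kp <- c('leaflet', 'leafletr', 'lego', 'levenshtein-distance', 'logo')
m <- adist(kp)
dimnames(m) <- list(kp, kp)
d <- as.dist(m)

n <- attr(d, "Size")
lab <- attr(d, "Labels")
v <- as.vector(d)   # lower triangle, stored column by column

# Row and column index of every entry of v, in storage order:
# (2,1), (3,1), ..., (n,1), (3,2), ..., (n,n-1)
col_idx <- rep(seq_len(n - 1), times = (n - 1):1)
row_idx <- unlist(lapply(seq_len(n - 1), function(k) (k + 1):n))

# Keep only the near matches and look up their labels
keep <- which(v > 0 & v < 5)
near <- data.frame(a = lab[row_idx[keep]],
                   b = lab[col_idx[keep]],
                   dist = v[keep])
```

Because only the lower triangle is ever touched, each pair appears once and no de-duplication step is needed.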