I have a very large dataset, which looks like this.
I have two types of data frames
ref=c("cake","brownies")
and my experimental data.frame
expr=c("cak","cakee","cake", "rownies","browwnies")
I want to match the ref and expr data.frames and find the levenstein distance between them. The output could look like this...
ref expr distance
cake cak 1
cake cakee 1
cake cake 0
cake rownies ...
after I have measured their levenstein distance I want to cluster any string that has distance less than 3 to one cluster and my data to maybe look like
ref expr distance cluster
cake cak 1 1
cake cakee 1 1
cake cake 0 1
brownies rownies 1 2
brownies browwnies 1 2
any help or advice on how to move on is appreciate it. At the moment I am trying a lot of R packages to find the distance between data.frame such as
library("DescTools")
but they do not seem to work well.
Here are 2 ways I'd approach it, one that's strictly supervised and more manual, and another that takes a less supervised route. The package stringdist
has a bunch of different distance metrics, where "lv"
is Levenshtein. I added an additional observation "poundcake" to test with a word that's too far from the reference words.
Get a matrix of the distances between each experimental string and one of the reference strings. This could have issues if you have 2 similar reference strings, or if an experimental word is equally close to 2 references, but it works for this simple case. Then reshape the matrix into a data frame, and count along reference words to get cluster numbers. Filter for cases where the distance is less than your threshold.
library(dplyr)
library(stringdist)
max_dist <- 3
ref <- c("cake", "brownies")
expr <- c("cak", "cakee", "cake", "poundcake", "rownies","browwnies")
mtx <- stringdistmatrix(ref, expr, method = "lv", useNames = "strings")
mtx
#> cak cakee cake poundcake rownies browwnies
#> cake 1 1 0 5 6 8
#> brownies 8 7 7 8 1 1
df1 <- as.data.frame(mtx) %>%
tibble::rownames_to_column("ref") %>%
tidyr::pivot_longer(-ref, names_to = "expr", values_to = "dist") %>%
mutate(clust = as.numeric(forcats::as_factor(ref))) # could also use data.table::rleid
df1 %>%
filter(dist <= max_dist)
#> # A tibble: 5 × 4
#> ref expr dist clust
#> <chr> <chr> <dbl> <dbl>
#> 1 cake cak 1 1
#> 2 cake cakee 1 1
#> 3 cake cake 0 1
#> 4 brownies rownies 1 2
#> 5 brownies browwnies 1 2
This might work for more complex cases. I've used it for correcting the spelling of people's names, where I have an incomplete set of correct labels to work from. Combine all the words into 1 vector, get a distance matrix (this time it will be square), then create clusters from hierarchical clustering using the threshold as the height to cut the tree. You can then match the reference for each word to get labels for the clusters.
The downside here is that you have rows for reference words that weren't experimental—note for example that "brownies" was never spelled correctly in the experimental strings, but now you have that observation.
all_words <- c(ref, expr)
hc <- hclust(stringdistmatrix(all_words, method = "lv", useNames = "strings"))
df2 <- data.frame(word = c(ref, expr),
clust = cutree(hc, h = max_dist)) %>%
mutate(r = ref[clust])
df2 %>%
filter(!is.na(r))
#> word clust r
#> 1 cake 1 cake
#> 2 brownies 2 brownies
#> 3 cak 1 cake
#> 4 cakee 1 cake
#> 5 cake 1 cake
#> 6 rownies 2 brownies
#> 7 browwnies 2 brownies