I am trying to make a way of presenting human-input words in a way that makes their groupings more easily recognisable as referring to the same thing. Essentially a spellchecker. I have gotten as far as making a large matrix (the actual one is 250 * 250 ish). The code for this matrix is identical to the reproducible example given below. (I have populated this with a random word generator, the actual values make much more sense but are confidential)
strings <- c("domineering","curl","axiomatic","root","gratis","secretary","lopsided","cumbersome","oval","mighty","thaw","troubled","furniture","round","soak","callous","melted","wealthy","sweltering","verdant","fence","eyes","ugliest","card","quickest","harm","brake","alarm","report","glue","eyes","hollow","quince","pack","twig","knot")
matrix <- stringdistmatrix(strings, strings, useNames = TRUE)
Now I want to create a new table with two variables, the first column must contain pairs of elements of 'strings' that satisfy the condition that their string-distance was lower than some number lets say for this example (stringdist<7, nonzero), the second column must contain the stringdist. Also the table should not show the reflection of the results present in the matrix e.g. (oval, curl: 3), (curl, oval: 3).
I've got a feeling that this will require an apply
function of some sort but I haven't a clue.
Cheers.
The following solution based in the tidyverse should do the trick.
Please note that the last line is in there to make for an easy view of the result. I would not think it to be necessary for your purposes. If you do want to keep it, I would advise to incorporate it in the initial making of 'pair'.
library(stringdist)
library(dplyr)
library(tibble)
library(tidyr)
library(purrr)
library(stringr)
matrix %>%
as_tibble() %>%
mutate(X = colnames(.), .before = 1) %>%
pivot_longer(-X) %>%
filter(value %in% 1:7) %>%
transmute(pair = map2(X, name, ~ sort(c(.x, .y))),
stringDist = value) %>%
distinct(pair, stringDist) %>%
mutate(pair = map_chr(pair, ~ str_c(., collapse = '_')))
# A tibble: 451 x 2
# pair stringDist
# <chr> <dbl>
# 1 domineering_sweltering 6
# 2 curl_root 4
# 3 curl_gratis 6
# 4 curl_secretary 7
# 5 cumbersome_curl 7
# 6 curl_oval 3
# 7 curl_mighty 6
# 8 curl_thaw 4
# 9 curl_troubled 6
# 10 curl_furniture 7