Search code examples
rstringdist

How to return a list of pairs of strings from a large matrix that mutually satisfy a maximum stringdistance criterion?


I am trying to make a way of presenting human-input words in a way that makes their groupings more easily recognisable as referring to the same thing. Essentially a spellchecker. I have gotten as far as making a large matrix (the actual one is 250 * 250 ish). The code for this matrix is identical to the reproducible example given below. (I have populated this with a random word generator, the actual values make much more sense but are confidential)

strings <- c("domineering","curl","axiomatic","root","gratis","secretary","lopsided","cumbersome","oval","mighty","thaw","troubled","furniture","round","soak","callous","melted","wealthy","sweltering","verdant","fence","eyes","ugliest","card","quickest","harm","brake","alarm","report","glue","eyes","hollow","quince","pack","twig","knot")

matrix <- stringdistmatrix(strings, strings, useNames = TRUE)

Now I want to create a new table with two variables, the first column must contain pairs of elements of 'strings' that satisfy the condition that their string-distance was lower than some number lets say for this example (stringdist<7, nonzero), the second column must contain the stringdist. Also the table should not show the reflection of the results present in the matrix e.g. (oval, curl: 3), (curl, oval: 3).

I've got a feeling that this will require an apply function of some sort but I haven't a clue.

Cheers.


Solution

  • The following solution based in the tidyverse should do the trick.

    Please note that the last line is in there to make for an easy view of the result. I would not think it to be necessary for your purposes. If you do want to keep it, I would advise to incorporate it in the initial making of 'pair'.

    library(stringdist)
    library(dplyr)
    library(tibble)
    library(tidyr)
    library(purrr)
    library(stringr)
    
    matrix %>%
      as_tibble() %>%
      mutate(X = colnames(.), .before = 1) %>%
      pivot_longer(-X) %>%
      filter(value %in% 1:7) %>%
      transmute(pair = map2(X, name, ~ sort(c(.x, .y))),
                stringDist = value) %>%
      distinct(pair, stringDist) %>%
      mutate(pair = map_chr(pair, ~ str_c(., collapse = '_')))
    
    # A tibble: 451 x 2
    #   pair                   stringDist
    #   <chr>                       <dbl>
    # 1 domineering_sweltering          6
    # 2 curl_root                       4
    # 3 curl_gratis                     6
    # 4 curl_secretary                  7
    # 5 cumbersome_curl                 7
    # 6 curl_oval                       3
    # 7 curl_mighty                     6
    # 8 curl_thaw                       4
    # 9 curl_troubled                   6
    # 10 curl_furniture                 7