fuzzy joining a column with a list


The data is as follows:

library(fuzzyjoin)
nr <- c(1,2)
col2 <- c("b","a")

dat <- cbind.data.frame(
  nr, col2
)

thelist <- list(
aa=c(1,2,3),
bb=c(1,2,3)
)

I would like to do the following:

stringdist_left_join(dat, thelist, by = "col2", method = "lcs", max_dist = 1)

But this (unsurprisingly) gives an error:

Error in `group_by_prepare()`:
! Must group by variables found in `.data`.
* Column `col` is not found.
Run `rlang::last_error()` to see where the error occurred.

What would be the best way to do this?

Desired output:

nr col2 thelist list_col
1  b    bb      c(1,2,3)
2  a    aa      c(1,2,3)

Solution

  • This is a bit of a hack. Not sure if there is a more elegant solution.

    Build a tibble with one row per list element: the names of the list go in a column named "col2" and the elements themselves go in a list-column. Then fuzzy join the data against this tibble on "col2". From the resulting `out` data.frame, you can rename or drop the columns you don't need.

    library(fuzzyjoin)
    library(tidyr)
    
    dat <- data.frame(
      nr = c(1,2), col2 = c("b","a")
    )
    
    thelist <- list(
      aa=c(1,2,3),
      bb=c(1,2,3,4)
    )
    
    # create data.frame with list info 
    a <- tibble(col2 = names(thelist), value = thelist)
    a
    # A tibble: 2 x 2
      col2  value       
      <chr> <named list>
    1 aa    <dbl [3]>   
    2 bb    <dbl [4]>   
    
    # merge data
    out <- stringdist_left_join(dat, a, by = "col2", method = "lcs", max_dist = 1)
    out
      nr col2.x col2.y      value
    1  1      b     bb 1, 2, 3, 4
    2  2      a     aa    1, 2, 3
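
To get from `out` to the layout shown in the desired output, one option is simply to rename the columns. This is a minimal sketch that assumes the target column names from the question (`col2`, `thelist`, `list_col`) and uses the example data from this answer; `result` is a name introduced here only for illustration.

```r
library(fuzzyjoin)
library(tibble)

dat <- data.frame(nr = c(1, 2), col2 = c("b", "a"))

thelist <- list(
  aa = c(1, 2, 3),
  bb = c(1, 2, 3, 4)
)

# One row per list element: names in col2, elements in a list-column
a <- tibble(col2 = names(thelist), value = thelist)

# Fuzzy join on col2; the overlapping column comes back as col2.x / col2.y
out <- stringdist_left_join(dat, a, by = "col2", method = "lcs", max_dist = 1)

# Rename to the column names used in the question's desired output
result <- out
names(result) <- c("nr", "col2", "thelist", "list_col")
```

If the matched name is not needed, `result$thelist <- NULL` drops it and leaves only the list-column.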