
Very fast fuzzy string matching in R


I have a dataset of 40,000 rows x 4 columns and I need to compare each column against itself in order to find the closest result, i.e. the minimum Levenshtein distance. The idea is to get an "almost duplicate" for every row. I have calculated this with "adist", but it seems too slow. For example, for a single column, comparing 5,000 rows against the whole 40,000-row dataset takes almost 2 hours. That makes 8 hours for the 4 columns, and 32 hours for the entire dataset. Is there any faster way to achieve the same result? I need it to run in 1 or 2 hours if possible. This is an example of what I have done so far:


    #vector example
    a<-as.character(c("hello","allo","hola"))
    b<-as.character(c("hello","allo","hola"))
    
    #execution time
    start_time <- Sys.time()
    
    #Matrix with distance
    dist.name<-adist(a,b, partial = TRUE, ignore.case = TRUE)
    
    #time elapsed
    end_time <- Sys.time()
    end_time - start_time
    
    Output:
    Time difference of 5.873202 secs
    
    #result
    dist.name
          [,1] [,2] [,3]
    [1,]    0    4    5
    [2,]    2    0    2
    [3,]    5    4    0

Desired output (the minimum distance for every row, excluding the comparison of a row with itself), but I need it faster:

    [1,] 4
    [2,] 2
    [3,] 4
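
This output can be derived from the adist matrix above by ignoring the diagonal (each row compared to itself) and taking the row minimums, e.g.:

    #ignore the zero distance of each row to itself
    diag(dist.name) <- NA
    apply(dist.name, 1, min, na.rm = TRUE)
    [1] 4 2 4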

Solution

  • You could try the stringdist package.

    It's written in C, uses parallel processing, and offers various distance metrics, including the Levenshtein distance.

    library(stringdist)
    
    a<-as.character(c("hello","allo","hola"))
    b<-as.character(c("hello","allo","hola"))
    
    start_time <- Sys.time()
    res <- stringdistmatrix(a,b, method = "lv")
    end_time <- Sys.time()
    
    > end_time - start_time
    Time difference of 0.006981134 secs
    > res
         [,1] [,2] [,3]
    [1,]    0    2    3
    [2,]    2    0    3
    [3,]    3    3    0
    
    
    # exclude self-comparisons on the diagonal, then take each row's minimum
    diag(res) <- NA
    apply(res, 1, FUN = min, na.rm = TRUE)
    [1] 2 2 3
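
    For the full dataset (40,000 rows x 4 columns), a complete distance matrix would have 1.6 billion entries per column, so it may be more practical to process each column in chunks and keep only the per-row minimum. The sketch below is untested and makes some assumptions: nearest_dist, df, and the chunk/thread settings are placeholders, the columns are character vectors, and the nthread argument (available in recent stringdist versions) controls the parallelism.

    # minimum Levenshtein distance to any other row of the same column,
    # computed chunk by chunk to keep memory bounded
    nearest_dist <- function(x, chunk_size = 1000, nthread = 4) {
      n <- length(x)
      out <- numeric(n)
      chunks <- split(seq_len(n), ceiling(seq_len(n) / chunk_size))
      for (rows in chunks) {
        # distances from this chunk of rows to every row of the column
        d <- stringdistmatrix(x[rows], x, method = "lv", nthread = nthread)
        # drop the comparison of each row with itself
        d[cbind(seq_along(rows), rows)] <- NA
        out[rows] <- apply(d, 1, min, na.rm = TRUE)
      }
      out
    }

    # e.g. one result vector per column (df is a placeholder data frame):
    # res <- lapply(df[1:4], nearest_dist)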