Search code examples
rstring-matchingsimilarity

R: count how many letters match from beginning of string


I am writing some bigger ML script to detect synonyms and abbreviations in database. One of the metrics of similarity of words is how many of first letters in two strings match. So i have 2 vectors:

v1 <- c("rejtan", "reiki","rejon")
v2 <- c("rejtan", "rejtan", "beiki")

and I want to have this result (% of letters from the beggining of a word that match):

       rejtan     reiki rejon
rejtan      1 0.3333333   0.5
rejtan      1 0.3333333   0.5
beiki       0 0.0000000   0.0

I came up with this function:

count.first.character.matches <- function(vec1,vec2) { 
  sapply(X = vec1 , FUN= function(x) {
    sapply(X = vec2, FUN = function(y) {
      ny <- nchar(y)
      nx <- nchar(x)
      shorter_length <- ifelse(nx > ny, nx, ny)
      matches <- sum(sapply( 1:shorter_length, FUN=function(i,x,y) {  substr(x,1,i) == substr(y,1,i)}, x,y ))
      matches / shorter_length
    })
  })

My question is: How one can improve performance of this function? I have set of 65K of vector pairs, each 700-1K words, so I end up with calculating this metric a lot, and according to Rprof this takes approx. 25% of time.


Solution

  • Using your approach as is, there are some things you could change to make it more efficient.

    1) nchar is a function that, unlike length, has to compute the number of characters of its arguments and not get an attribute. You re computing nchar for "v2" for every "v1" but, also, nchar for "v1" for every "v2". You could put nchar(x) outside of the second sapply or, even better, utilize the vectorised nature of nchar and compute everything once and before any sapply. In particular, having

    x = replicate(1e3, paste0(sample(letters, sample(4:10, 1), TRUE), collapse = "")) 
    y = replicate(1e3, paste0(sample(letters, sample(4:10, 1), TRUE), collapse = ""))
    

    instead of

    system.time({
        nx = nchar(x)
        ny = nchar(y)
    })
    #user  system elapsed 
    #   0       0       0
    

    you use

    system.time({
    sapply(x, function(X) 
                sapply(y, function(Y) {
                            nX = nchar(X)
                            nY = nchar(Y)
                          }))
    })  
    #user  system elapsed 
    #8.08    0.00    8.27
    

    2)substring is vectorised, so the third sapply can be avoided. (Also, in checking single characters of a string, strsplit might be faster and, being itself vectorised, can be computed outside of any loop.)

    3)A block of if else is faster than ifelse when comparing 'length == 1' vectors. That is completely minor of course, but after two nested sapplys it adds extra computational time without needing to:

    microbenchmark::microbenchmark(replicate(1e4, if(2 < 3 && 5 > 3) 1 else 0), 
                                   replicate(1e4, ifelse(2 < 3 && 5 > 3, 1, 0)))
    #Unit: milliseconds
    #                                           expr      min       lq   median       uq      max neval
    # replicate(10000, if (2 < 3 && 5 > 3) 1 else 0) 14.22543 14.85759 15.09545 15.78781 56.84884   100
    # replicate(10000, ifelse(2 < 3 && 5 > 3, 1, 0)) 29.77642 31.44824 36.20305 37.85782 65.72473   100
    

    So, having these in mind:

    OP2 = function(v1, v2) 
    { 
        nc1 = nchar(v1)
        nc2 = nchar(v2)
        sv2 = seq_along(v2)
    
        sapply(seq_along(v1), 
               function(i) {
                  sapply(sv2, 
                         function(j) {
                           len = if(nc1[[i]] > nc2[[j]]) nc1[[i]] else nc2[[j]]
                           ind = seq_len(len)
                           sum(substring(v1[[i]], 1, ind) == substring(v2[[j]], 1, ind)) / len
                         })
               })
    }
    

    And compare with yours:

    set.seed(007)          
    v1b = replicate(1e2, paste0(sample(letters[1:5], sample(4:10, 1), TRUE), collapse = ""))
    v2b = replicate(1e2, paste0(sample(letters[1:5], sample(4:10, 1), TRUE), collapse = ""))
    
    sum(count.first.character.matches(v1b, v2b) != OP2(v1b, v2b))
    #[1] 0
    microbenchmark::microbenchmark(count.first.character.matches(v1b, v2b), OP2(v1b, v2b), times = 20)
    #Unit: milliseconds
                                        expr      min       lq   median       uq       max neval
    # count.first.character.matches(v1b, v2b) 932.2840 949.3697 969.6321 985.2237 1081.2882    20
    #                           OP2(v1b, v2b) 161.7503 185.1102 192.3019 197.5060  213.6818    20
    

    Another idea, besides your approach, could be (after changing "OP2" to keep the shortest length):

    ff = function(x, y)
    {
        sx = strsplit(x, "", fixed = TRUE)
        sy = strsplit(y, "", fixed = TRUE)
        array(mapply(function(X, Y) { 
                       slen = seq_len(min(length(X), length(Y)))
                       wh = X[slen] == Y[slen]
                       if(all(wh)) return(1) else (which.min(wh) - 1) / length(slen)
                     }, 
                     rep(sx, each = length(sy)), sy), 
              c(length(x), length(y)), list(y, x))
    }  
    sum(ff(v1b, v2b) != OP2(v1b, v2b))
    #[1] 0
    microbenchmark::microbenchmark(ff(v1b, v2b), OP2(v1b, v2b), times = 20)
    #Unit: milliseconds
    #          expr       min        lq    median        uq      max neval
    #  ff(v1b, v2b)  72.72661  80.43703  85.85113  89.16066 110.5722    20
    # OP2(v1b, v2b) 165.13991 168.15051 176.01596 182.11389 213.9557    20