Tags: r, performance, join, position, match

Millions of tiny matches in R: need performance


I have a vector of one million words called WORDS and a list of 9 million objects called SENTENCES. Each object in the list is a sentence, represented by a vector of 10 to 50 words. Here is an example:

head(WORDS)
[1] "aba" "accra" "ada" "afrika" "afrikan" "afula" "aggamemon"

SENTENCES[[1]]
[1] "how" "to" "interpret" "that" "picture"

I want to convert every sentence in my list into a numeric vector whose elements correspond to the positions of the sentence's words in the big WORDS vector. I already know how to do it with this command:

convert <- function(sentence){
  return(which(WORDS %in% sentence))
}

SENTENCES_NUM <- lapply(SENTENCES, convert)

The problem is that it takes far too long. My RStudio session blows up even though my computer has 16 GB of RAM. So the question is: do you have any ideas for speeding up the computation?


Solution

  • fastmatch, a small package by a member of R Core, hashes the lookup table, so the initial and especially subsequent searches are faster.

    What you are really doing is making a factor with predefined levels common to all the sentences. The slow step in his C code is sorting the factor levels, which you can avoid by supplying the (unique) list of factor levels to his fast version of the factor function.

    If you just want the integer positions, you can easily convert from factor to integer: many do this inadvertently.
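    For instance (a minimal base-R sketch using the question's WORDS and SENTENCES, not the fast factor version mentioned above), the integer codes of the factor are exactly the positions:

    # the codes of a factor with levels = WORDS are the positions in WORDS;
    # any word not present in WORDS would come back as NA
    sentence <- SENTENCES[[1]]
    as.integer(factor(sentence, levels = WORDS))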

    You don't actually need a factor at all for what you want, just match. Your code also generates a logical vector and then recalculates the positions from it; match goes straight to the positions.
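    The difference is easy to see on a toy example (WORDS_demo and the sentence below are made up for illustration): which(WORDS %in% sentence) returns the matching positions sorted and de-duplicated, while match(sentence, WORDS) returns one position per word, in sentence order.

    WORDS_demo <- c("how", "interpret", "picture", "that", "to")
    sentence   <- c("to", "interpret", "that", "picture", "to")

    which(WORDS_demo %in% sentence)   # 2 3 4 5   (sorted, duplicates collapsed)
    match(sentence, WORDS_demo)       # 5 2 4 3 5 (one position per word, in order)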

    library(fastmatch)
    library(microbenchmark)
    
    WORDS <- read.table("https://dotnetperls-controls.googlecode.com/files/enable1.txt", stringsAsFactors = FALSE)[[1]]
    
    words_factor <- as.factor(WORDS)
    
    # generate 100 sentences of between 5 and 15 words:
    SENTENCES <- lapply(1:100, function(i) sample(WORDS, size = sample(5:15, size = 1)))
    
    bench_fun <- function(fun)
      lapply(SENTENCES, fun)
    
    # poster's slow solution:
    hg_convert <- function(sentence)
      return(which(WORDS %in% sentence))
    
    jw_convert_match <- function(sentence) 
      match(sentence, WORDS)
    
    jw_convert_match_factor <- function(sentence) 
      match(sentence, words_factor)
    
    jw_convert_fastmatch <- function(sentence) 
      fmatch(sentence, WORDS)
    
    jw_convert_fastmatch_factor <- function(sentence)
      fmatch(sentence, words_factor)
    
    message("starting benchmark one")
    print(microbenchmark(bench_fun(hg_convert),
                         bench_fun(jw_convert_match),
                         bench_fun(jw_convert_match_factor),
                         bench_fun(jw_convert_fastmatch),
                         bench_fun(jw_convert_fastmatch_factor),
                         times = 10))
    
    # now again with big samples
    # generating the SENTENCES is quite slow...
    SENTENCES <- lapply(1:1e6, function(i) sample(WORDS, size = sample(5:15, size = 1)))
    message("starting benchmark two, compare with factor vs vector of words")
    print(microbenchmark(bench_fun(jw_convert_fastmatch),
                         bench_fun(jw_convert_fastmatch_factor),
                         times = 10))
    

    I put this on https://gist.github.com/jackwasey/59848d84728c0f55ef11

    The results don't format very well here; suffice it to say that fastmatch, with or without factor input, is dramatically faster.

    # starting benchmark one
    Unit: microseconds
                                       expr         min          lq         mean      median          uq         max neval
                      bench_fun(hg_convert)  665167.953  678451.008  704030.2427  691859.576  738071.699  777176.143    10
                bench_fun(jw_convert_match)  878269.025  950580.480  962171.6683  956413.486  990592.691 1014922.639    10
         bench_fun(jw_convert_match_factor) 1082116.859 1104331.677 1182310.1228 1184336.810 1198233.436 1436600.764    10
            bench_fun(jw_convert_fastmatch)     203.031     220.134     462.1246     289.647     305.070    2196.906    10
     bench_fun(jw_convert_fastmatch_factor)     251.474     300.729    1351.6974     317.439     362.127   10604.506    10
    
    # starting benchmark two, compare with factor vs vector of words
    Unit: seconds
                                       expr      min       lq     mean   median       uq      max neval
            bench_fun(jw_convert_fastmatch) 3.066001 3.134702 3.186347 3.177419 3.212144 3.351648    10
     bench_fun(jw_convert_fastmatch_factor) 3.012734 3.149879 3.281194 3.250365 3.498593 3.563907    10
    

    And therefore I wouldn't go to the trouble of a parallel implementation just yet.
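    To plug this straight into the original objects, the whole conversion would be (a sketch assuming WORDS and SENTENCES as defined in the question; fmatch builds its hash table for WORDS on the first call and reuses it for all later sentences):

    library(fastmatch)

    # same signature as match(), but hashed lookups into WORDS
    convert_fast <- function(sentence) fmatch(sentence, WORDS)
    SENTENCES_NUM <- lapply(SENTENCES, convert_fast)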