I have a vector of one million words called WORDS. I also have a list of 9 million objects called SENTENCES. Each object in the list is a sentence, represented by a vector of 10-50 words. Here is an example:
head(WORDS)
[1] "aba" "accra" "ada" "afrika" "afrikan" "afula" "aggamemon"
SENTENCES[[1]]
[1] "how" "to" "interpret" "that" "picture"
I want to convert every sentence in my list into a numeric vector whose elements correspond to the positions of the sentence's words in the big WORDS vector. I know how to do it with this command:
convert <- function(sentence) {
  return(which(WORDS %in% sentence))
}
SENTENCES_NUM <- lapply(SENTENCES, convert)
The problem is that it takes far too long: RStudio crashes even though my machine has 16 GB of RAM. So the question is: do you have any ideas to speed up this computation?
fastmatch, a small package by a member of the R core team, hashes the lookup table so that the initial and especially the subsequent searches are faster.
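As a minimal sketch of what that means in practice (assuming WORDS and SENTENCES as defined in the question), fmatch is a drop-in replacement for match: the first call builds and caches a hash table on WORDS, and later calls against the same WORDS vector reuse it.
library(fastmatch)
fmatch(SENTENCES[[1]], WORDS)  # first call hashes WORDS, then returns each word's position (NA if absent)
fmatch(SENTENCES[[2]], WORDS)  # reuses the cached hash, so this lookup is much cheaper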
What you are really doing is making a factor with predefined levels common to all the sentences. The slow step in his C code is sorting the factor levels, which you can avoid by providing the (unique) list of factor levels to his fast version of the factor function.
If you just want the integer positions, you can easily convert from factor to integer: many do this inadvertently.
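A minimal base-R sketch of that idea (using the ordinary factor function rather than the fast version mentioned above; the helper name is just for illustration, and it assumes the entries of WORDS are unique):
sentence_to_int <- function(sentence) {
  # predefined levels avoid re-deriving and sorting them for every sentence;
  # as.integer() then gives each word's position within WORDS (NA if absent)
  as.integer(factor(sentence, levels = WORDS))
}
sentence_to_int(SENTENCES[[1]])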
You don't actually need a factor at all for what you want, just match. Your code also generates a logical vector and then recalculates the positions from it, whereas match goes straight to the positions.
library(fastmatch)
library(microbenchmark)
WORDS <- read.table("https://dotnetperls-controls.googlecode.com/files/enable1.txt", stringsAsFactors = FALSE)[[1]]
words_factor <- as.factor(WORDS)
# generate 100 sentences of between 5 and 15 words:
SENTENCES <- lapply(1:100, function(i) sample(WORDS, size = sample(5:15, size = 1)))
bench_fun <- function(fun)
  lapply(SENTENCES, fun)
# poster's slow solution:
hg_convert <- function(sentence)
  return(which(WORDS %in% sentence))
jw_convert_match <- function(sentence)
  match(sentence, WORDS)
jw_convert_match_factor <- function(sentence)
  match(sentence, words_factor)
jw_convert_fastmatch <- function(sentence)
  fmatch(sentence, WORDS)
jw_convert_fastmatch_factor <- function(sentence)
  fmatch(sentence, words_factor)
message("starting benchmark one")
print(microbenchmark(bench_fun(hg_convert),
                     bench_fun(jw_convert_match),
                     bench_fun(jw_convert_match_factor),
                     bench_fun(jw_convert_fastmatch),
                     bench_fun(jw_convert_fastmatch_factor),
                     times = 10))
# now again with big samples
# generating the SENTENCES is quite slow...
SENTENCES <- lapply(1:1e6, function(i) sample(WORDS, size = sample(5:15, size = 1)))
message("starting benchmark two, compare with factor vs vector of words")
print(microbenchmark(bench_fun(jw_convert_fastmatch),
                     bench_fun(jw_convert_fastmatch_factor),
                     times = 10))
I put this on https://gist.github.com/jackwasey/59848d84728c0f55ef11
Suffice it to say that fastmatch, with or without factor input, is dramatically faster.
# starting benchmark one
Unit: microseconds
                                   expr         min          lq         mean      median          uq         max neval
                  bench_fun(hg_convert)  665167.953  678451.008  704030.2427  691859.576  738071.699  777176.143    10
            bench_fun(jw_convert_match)  878269.025  950580.480  962171.6683  956413.486  990592.691 1014922.639    10
     bench_fun(jw_convert_match_factor) 1082116.859 1104331.677 1182310.1228 1184336.810 1198233.436 1436600.764    10
        bench_fun(jw_convert_fastmatch)     203.031     220.134     462.1246     289.647     305.070    2196.906    10
 bench_fun(jw_convert_fastmatch_factor)     251.474     300.729    1351.6974     317.439     362.127   10604.506    10
# starting benchmark two, compare with factor vs vector of words
Unit: seconds
                                   expr      min       lq     mean   median       uq      max neval
        bench_fun(jw_convert_fastmatch) 3.066001 3.134702 3.186347 3.177419 3.212144 3.351648    10
 bench_fun(jw_convert_fastmatch_factor) 3.012734 3.149879 3.281194 3.250365 3.498593 3.563907    10
And therefore I wouldn't go to the trouble of a parallel implementation just yet.
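For completeness, a minimal sketch of how this would look applied to the original problem (assuming WORDS and SENTENCES are defined as in the question; fmatch returns NA for any word that is not in WORDS):
library(fastmatch)
# hash WORDS once (on the first call), then look up every sentence against it
SENTENCES_NUM <- lapply(SENTENCES, fmatch, table = WORDS)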