I have an issue when using the tm package and parallel computation in R and I'm not sure if I'm doing something silly or if it is a bug.
I created a small reproducible example:
# Load the libraries
library(tm)
library(snow)
# Create a Document Term Matrix
test_sentence = c("this is a test", "this is another test")
test_corpus = VCorpus(VectorSource(test_sentence))
test_TM = DocumentTermMatrix(test_corpus)
# Define a simple function that returns the matrix for the i-th document
test_function = function(i, TM){ TM[i, ] }
If I run a simple lapply on this example, I get what I expected without any problem:
# This returns the expected list containing the rows of the Matrix
res1 = lapply(1:2, test_function, test_TM)
But if I run it in parallel I get the error:
first error: incorrect number of dimensions
# This should return the same thing as the lapply above, but instead it stops with an error
cl = makeCluster(2)
res2 = parLapply(cl, 1:2, test_function, test_TM)
stopCluster(cl)
The problem is that the worker nodes do not automatically have the tm
package loaded. Loading the package is necessary, however, because it defines the [
method for the relevant object class.
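One way to check this diagnosis is to ask each worker whether tm is on its search path before and after loading it. This is a minimal sketch, assuming the snow and tm packages from the question are installed; the names loaded_before and loaded_after are only illustrative.

```r
library(snow)

# Start two local socket workers, as in the question
cl <- makeCluster(2, type = "SOCK")

# Ask every worker whether tm is attached in its own R session
loaded_before <- clusterEvalQ(cl, "package:tm" %in% search())

# Attach tm on every worker, then ask again
clusterEvalQ(cl, library(tm))
loaded_after <- clusterEvalQ(cl, "package:tm" %in% search())

stopCluster(cl)

print(unlist(loaded_before))
print(unlist(loaded_after))
```

On a default setup I would expect the first check to report that tm is not attached on the workers and the second to report that it is, but a startup file that loads tm could change the first result.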
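The code below starts a cluster with two nodes, loads the tm package in all nodes, exports all objects in the workspace to all nodes, runs the function in parallel, and stops the cluster:
cl <- makeCluster(rep("localhost",2), type="SOCK")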
clusterEvalQ(cl, library(tm))
clusterExport(cl, list=ls())
res <- parLapply(cl, as.list(1:2), test_function, test_TM)
stopCluster(cl)
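The same idea can be sketched with the parallel package that ships with R, in place of snow. This is an assumption-laden variant rather than part of the original answer: it assumes tm is installed, and it skips clusterExport because test_function and test_TM are passed to parLapply as arguments and are therefore sent to the workers anyway. The names res_seq and res_par are illustrative.

```r
library(tm)
library(parallel)

# Rebuild the objects from the question
test_sentence <- c("this is a test", "this is another test")
test_corpus <- VCorpus(VectorSource(test_sentence))
test_TM <- DocumentTermMatrix(test_corpus)
test_function <- function(i, TM) TM[i, ]

# Start two workers and attach tm on each of them
cl <- makeCluster(2)
invisible(clusterEvalQ(cl, library(tm)))

# Run the function in parallel; test_TM travels as an extra argument
res_par <- parLapply(cl, 1:2, test_function, test_TM)
stopCluster(cl)

# Compare against the sequential result
res_seq <- lapply(1:2, test_function, test_TM)
print(all.equal(res_seq, res_par))
```

If the comparison reports differences, inspecting the class and dimensions of each element of res_par is a reasonable next step.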