
Incorrect number of dimensions - parallel R computation


I have an issue when using the tm package and parallel computation in R and I'm not sure if I'm doing something silly or if it is a bug.

I created a small reproducible example:

# Load the libraries
library(tm)
library(snow)

# Create a Document Term Matrix
test_sentence = c("this is a test", "this is another test")
test_corpus = VCorpus(VectorSource(test_sentence))
test_TM = DocumentTermMatrix(test_corpus)

# Define a simple function that returns the matrix for the i-th document
test_function = function(i, TM){ TM[i, ] }

If I run a simple lapply on this example, I get what I expect without any problem:

# This returns the expected list containing the rows of the Matrix
res1 = lapply(1:2, test_function, test_TM)

But if I run it in parallel I get the error:

first error: incorrect number of dimensions

# This should return the same thing as the lapply above, but instead it stops with an error
cl = makeCluster(2)
res2 = parLapply(cl, 1:2, test_function, test_TM)
stopCluster(cl)

Solution

  • The problem is that the worker nodes do not automatically have the tm package loaded. Loading the package on each node is necessary because it defines the [ method for the relevant object class; without it, the TM[i, ] call in test_function fails on the workers.

    The code below does the following:

    1. start a cluster
    2. load the tm package in all nodes
    3. export all objects to all nodes
    4. run the function
    5. stop the cluster

    cl <- makeCluster(rep("localhost",2), type="SOCK")
    clusterEvalQ(cl, library(tm))
    clusterExport(cl, list=ls())
    res <- parLapply(cl, as.list(1:2), test_function, test_TM)
    stopCluster(cl)
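To check that the fix behaves as described, here is a minimal sketch that compares the parallel result with the sequential one. It assumes tm is installed and uses the base parallel package instead of snow (an assumption on my part; both expose makeCluster, clusterEvalQ, parLapply and stopCluster under the same names). The names loaded_after, res_par, res_seq and same are made up for this illustration.

```r
library(tm)
library(parallel)  # assumption: base R 'parallel' used in place of snow

# Same toy data and function as in the question
test_sentence <- c("this is a test", "this is another test")
test_corpus <- VCorpus(VectorSource(test_sentence))
test_TM <- DocumentTermMatrix(test_corpus)
test_function <- function(i, TM) TM[i, ]

cl <- makeCluster(2)
res_par <- tryCatch({
  # Load tm on every worker so the `[` method for the matrix class is available
  clusterEvalQ(cl, library(tm))

  # Confirm that each worker now has the tm namespace loaded
  loaded_after <- unlist(clusterEvalQ(cl, "tm" %in% loadedNamespaces()))

  # test_function and test_TM are passed as arguments, so they are shipped
  # to the workers with the call itself
  parLapply(cl, 1:2, test_function, test_TM)
}, finally = stopCluster(cl))  # always release the workers

# Sequential reference result
res_seq <- lapply(1:2, test_function, test_TM)

# Compare the two results row by row as dense matrices
same <- all(mapply(
  function(a, b) isTRUE(all.equal(as.matrix(a), as.matrix(b))),
  res_par, res_seq
))
```

Since test_function and test_TM are handed to parLapply directly, the clusterExport step should not be strictly needed for this particular example. It becomes relevant when the function refers to global objects that are not passed as arguments.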