Tags: r, foreach, cluster-computing, snow

Passing information between threads (foreach with %dopar%)


I'm using the doSNOW package to parallelize tasks that differ in length. When one thread finishes, I want

  • some information generated by earlier threads to be passed to the next thread
  • the next thread to start immediately (load balancing, as in clusterApplyLB)

It works single-threaded (see makeCluster(spec = 1)).

#Register Snow and doSNOW
require(doSNOW)

#CHANGE spec to 4 or more, to see what my problem is
registerDoSNOW(cl <- makeCluster(spec=1,type="SOCK",outfile=""))

numbersProcessed <- c() # init processed vector
x <- foreach(i = 1:10, .export = "numbersProcessed") %dopar% {

    #Do working stuff
    cat(format(Sys.time(), "%X"),": ","Starting",i,"(Numbers processed so far:",numbersProcessed, ")\n")
    Sys.sleep(time=i)

    #Appends this number to general vector
    numbersProcessed <- append(numbersProcessed,i)

    cat(format(Sys.time(), "%X"),": ","Ending",i,"\n")
    cat("--------------------\n")
}

#End it all
stopCluster(cl)

Now change spec in makeCluster to 4. The output looks something like this:

[..]
Type: EXEC 
18:12:21 :  Starting 9 (Numbers processed so far: 1 5 )
18:12:23 :  Ending 6 
--------------------
Type: EXEC 
18:12:23 :  Starting 10 (Numbers processed so far: 2 6 )
18:12:25 :  Ending 7 

At 18:12:21, thread 9 knew that threads 1 and 5 had been processed. Two seconds later, thread 6 ends. The next thread should know about at least 1, 5 and 6, right? But thread 10 only knows about 2 and 6.

I realized this has something to do with the number of workers specified in makeCluster: 9 ends up knowing about 1, 5 and 9 (1, 1 + 4, 1 + 4 + 4), and 10 about 2, 6 and 10 (2, 2 + 4, 2 + 4 + 4).
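The 1, 5, 9 and 2, 6, 10 pattern can be reproduced without a cluster. The following is only a sketch of the arithmetic, assuming the tasks are handed to the 4 workers in order and wrap around; the real dispatch order depends on timing. The names n_workers and assignment are made up for illustration. Each worker process keeps its own private copy of numbersProcessed, so a task only sees numbers appended by earlier tasks on the same worker.

```r
n_workers <- 4
tasks <- 1:10

# Assumption: task k goes to worker ((k - 1) %% n_workers) + 1.
worker_id <- rep_len(seq_len(n_workers), length(tasks))

# Group the task numbers by the worker that would run them.
assignment <- split(tasks, worker_id)

print(assignment[["1"]])  # tasks seen by worker 1's private copy
print(assignment[["2"]])  # tasks seen by worker 2's private copy
```

Under this assumption worker 1 runs tasks 1, 5 and 9, and worker 2 runs tasks 2, 6 and 10, which matches the log above.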

Is there a better way to pass "processed" information on to later generations of threads?

Bonus points: Is there a way to print to the master node during parallel processing without the "Type: EXEC" etc. messages from the snow package? :)

Thanks! Marc


Solution

  • My bad. Damn.

    I thought foreach with %dopar% was load-balanced. It isn't, which makes my question obsolete: nothing can be executed on the host side while the parallel loop is running. That also explains why global variables are only modified on the worker side and never reach the host.
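Since assignments made inside the loop body stay on the worker, one way to at least collect per-task information on the host is to return it from each iteration and let foreach combine the results. This is a minimal sketch, not a way to pass data between running tasks; the variable name processed, the 2-worker cluster and the shortened sleep times are assumptions for illustration.

```r
library(doSNOW)

cl <- makeCluster(2, type = "SOCK")
registerDoSNOW(cl)

# Each iteration returns a one-row data frame; .combine = rbind stacks
# the rows on the host once the results come back.
processed <- foreach(i = 1:6, .combine = rbind) %dopar% {
    Sys.sleep(i / 10)                           # stand-in for real work
    data.frame(task = i, pid = Sys.getpid())    # pid identifies the worker process
}

stopCluster(cl)

print(processed)
```

The host only sees this information after the loop has finished, so later tasks still cannot read what earlier tasks produced.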