Tags: r, memory, big-data, ff, snow

How to work with a large multi-type data frame using snow in R?


I have a large data.frame of 20M lines. This data frame is not only numeric; there are character columns as well. Using a divide-and-conquer approach, I want to split this data frame and process the parts in parallel with the snow package (the parLapply function, specifically). The problem is that the nodes run out of memory, because the data frame parts are held in RAM. I looked for a package to help with this and found only one that handles a multi-type data.frame: the ff package. But that package brings another problem: splitting an ffdf does not give the same result as splitting a common data.frame, so parLapply cannot be run on the pieces.

Do you know of other packages for this goal? bigmemory only supports matrices.
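For reference, the divide-and-conquer pattern the question describes can be sketched as follows. The toy data frame, chunk count, and per-chunk work are illustrative assumptions; with 20M rows, each worker holds its chunks fully in RAM, which is where the memory problem arises.

```r
library(parallel)

n.cores <- 2  # illustrative; use the number of cores available

# a small mixed-type data.frame standing in for the 20M-line one
df <- data.frame(id    = 1:10000,
                 label = sample(letters, 10000, replace = TRUE),
                 value = rnorm(10000),
                 stringsAsFactors = FALSE)

# split into one ordinary data.frame per core
chunks <- split(df, cut(seq_len(nrow(df)), n.cores, labels = FALSE))

cl <- makeCluster(n.cores)
res <- parLapply(cl, chunks, function(chunk) {
  # placeholder for the real per-chunk work
  summary(chunk$value)
})
stopCluster(cl)
```

This works for mixed-type data frames because `split` on a plain data.frame returns a list of plain data.frames, which `parLapply` can iterate over; the same is not true of an ffdf.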


Solution

  • I've benchmarked some ways of splitting a data frame and parallelising the work to see how effective they are with large data frames. This may help you deal with your 20M-line data frame without requiring another package.

    The results are here. The description of each option is below.

    This suggests that for large data frames the best option is the following (not quite the fastest, but it has a progress bar):

    library(parallel)   # makePSOCKcluster
    library(doSNOW)
    library(itertools)
    library(tictoc)     # tic()/toc() timing
    
    # if the data size on the cores exceeds available memory, increase the chunk factor
    chunk.factor <- 1 
    chunk.num <- kNoCores * chunk.factor
    tic()
    # init the cluster
    cl <- makePSOCKcluster(kNoCores)
    registerDoSNOW(cl)
    # init the progress bar
    pb <- txtProgressBar(max = 100, style = 3)
    progress <- function(n) setTxtProgressBar(pb, n)
    opts <- list(progress = progress)
    # conduct the parallelisation
    travel.queries <- foreach(m=isplitRows(coord.table, chunks=chunk.num),
                              .combine='cbind',
                              .packages=c('httr','data.table'),
                              .export=c("QueryOSRM_dopar", "GetSingleTravelInfo"), 
                              .options.snow = opts) %dopar% {
                                QueryOSRM_dopar(m,osrm.url,int.results.file)
                              }
    # close progress bar
    close(pb)
    # stop cluster
    stopCluster(cl) 
    toc()
    

    Note that

    • coord.table is the data frame/table
    • kNoCores ( = 25 in this case) is the number of cores
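    The `isplitRows` call above is what keeps per-worker memory bounded: it yields blocks of rows lazily, so each task receives only its slice of the table rather than the whole thing. A small illustration (the toy table is an assumption):

    ```r
    library(itertools)
    library(iterators)  # provides nextElem()

    x <- data.frame(a = 1:10, b = letters[1:10])

    # an iterator over roughly equal row blocks
    it <- isplitRows(x, chunks = 3)

    nextElem(it)  # first block of rows, an ordinary data.frame
    nextElem(it)  # second block
    ```

    Inside the foreach loop, each `m` is one such block, so raising `chunk.num` (via `chunk.factor`) shrinks the largest piece any worker must hold.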

      1. Distributed memory. Sends coord.table to all nodes.
      2. Shared memory. Shares coord.table with the nodes.
      3. Shared memory with cuts. Shares a subset of coord.table with the nodes.
      4. doPar with cuts. Sends a subset of coord.table to the nodes.
      5. SNOW with cuts and progress bar. Sends a subset of coord.table to the nodes.
      6. Option 5 without the progress bar.
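    A rough sketch of the mechanical difference between sending the whole table (option 1) and sending cuts (options 4-6), assuming option 1 exports the full table with `clusterExport` while the cut options dispatch pre-split chunks:

    ```r
    library(parallel)

    x <- data.frame(v = rnorm(1000))  # illustrative stand-in for coord.table
    cl <- makeCluster(2)

    # option-1 style: every worker holds a full copy of x
    clusterExport(cl, "x")

    # options-4-6 style: each task sees only one chunk,
    # bounding per-worker memory to roughly nrow(x)/chunk.num rows
    chunks <- split(x, rep(1:4, length.out = nrow(x)))
    res <- clusterApply(cl, chunks, nrow)

    stopCluster(cl)
    ```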

    More information about the other options I compared can be found here.

    Some of these answers might suit you, although they don't relate to a distributed parLapply, and I've included some of them in my benchmarking options.