Tags: r, dplyr, parallel-processing, multidplyr

Send different dplyr::mutate cols to different cores with multidplyr?


I have a function that I'm applying to different sets of coordinates to create four new columns in my tibble. This function has a pretty long start-up time (it loads the genome into RAM, converts the tibble to GRanges, and retrieves sequences) but is relatively fast after that, so there's not much difference between 100 and 1,000,000 sequences. Is there any way to send each col in the mutate to a different core so they can be processed at the same time? I thought about using pivot_longer and then group + partition, but this got me thinking about whether there was a different way to accomplish this. A multi_mutate of sorts?
(I don't actually expect the multidplyr partition/collect to save that much time in my case, given the small cost of additional coordinates, but if I could avoid the time cost of pivoting, which is still relatively small, and the mess in my code, that'd be cool.)


Solution

  • I know you were looking for an existing package, but I couldn't find one. Other similar questions (like here or here) don't appear to provide a package either.

    However, what about hacking it out yourself? Look at this example with furrr.

    ### libraries
    library(dplyr)
    library(furrr)
    
    ### data compliant with your example
    d <- replicate(8, rnorm(100))
    colnames(d) <- apply(expand.grid(letters[1:2], 1:4), 1, paste0, collapse = "")
    d <- as_tibble(d)
    
    ### a function that takes more than a second to finish
    long_f <- function(x1, x2){
      
      Sys.sleep(1)
      x1+x2
      
    }
    
    ### multimutate!
    # Each named expression is evaluated in its own future; the results become new columns.
    # furrr_options() replaces future_options(), which was deprecated in furrr 0.2.0.
    multimutate <- function(.data, ..., .options = furrr_options()){
      
      dots <- enquos(..., .named = TRUE)
      .data[names(dots)] <- future_map(dots, ~rlang::eval_tidy(., data = .data, env = parent.frame()), .options = .options)
      .data
      
    }
    
    
    # no future plan set: the four calls run sequentially
    tictoc::tic()
    d %>%
      multimutate(c1 = long_f(a1,b1), 
                  c2 = long_f(a2,b2),
                  c3 = long_f(a3,b3), 
                  c4 = long_f(a4,b4))  
    tictoc::toc()
    # 4.34 sec elapsed
    
    # future plan set: the four calls run in parallel R sessions
    # (multisession replaces the deprecated multiprocess plan)
    plan(multisession)
    tictoc::tic()
    d %>%
      multimutate(c1 = long_f(a1,b1), 
                  c2 = long_f(a2,b2),
                  c3 = long_f(a3,b3), 
                  c4 = long_f(a4,b4),
                  .options = furrr_options(globals = "long_f"))  
    tictoc::toc()
    # 1.59 sec elapsed
    

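    It needs some testing, I guess, and it could be improved, for example by supporting the same methods that are available for mutate. But it's a start.

    Notice that I had to pass the options explicitly (globals = "long_f"), presumably because long_f only appears inside the captured expressions, so it is not picked up automatically and has to be exported to the workers by name.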
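On the "needs some testing" point, one cheap check is to compare the result against plain mutate under a sequential plan, where no workers are involved. This is only a sketch: it assumes furrr 0.2.0 or later (for furrr_options()), redefines multimutate in a slightly simplified form so the snippet runs on its own, and uses add_cols as a hypothetical fast stand-in for the slow function.

```r
library(dplyr)
library(furrr)

# Same idea as multimutate() above, redefined so this snippet is self-contained.
multimutate <- function(.data, ..., .options = furrr_options()) {
  dots <- enquos(..., .named = TRUE)
  .data[names(dots)] <- future_map(
    dots,
    ~ rlang::eval_tidy(.x, data = .data),
    .options = .options
  )
  .data
}

# Sequential plan: everything runs in the current session, so no globals
# need to be exported for this correctness check.
plan(sequential)

d <- tibble(a1 = 1:5, b1 = 6:10, a2 = 11:15, b2 = 16:20)

# Hypothetical stand-in for the slow function.
add_cols <- function(x1, x2) x1 + x2

expected <- d %>% mutate(c1 = add_cols(a1, b1), c2 = add_cols(a2, b2))
actual   <- d %>% multimutate(c1 = add_cols(a1, b1), c2 = add_cols(a2, b2))

# Compare column by column (names and values).
identical(as.list(actual), as.list(expected))
```

If the comparison holds, the same call can be repeated under a parallel plan with the function name passed through globals, to confirm that the parallel path returns the same columns.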