
multidplyr error with pmap_dfr: Error: Element 5 is not a vector (environment)


[This is also reported on the multidplyr GitHub page.]

I'm trying to use multidplyr_0.0.0.9000 with dplyr_0.7.4.9000 and pmap_dfr from purrr_0.2.4.9000. The following code (without using multidplyr) works fine:

grid1 = as_tibble(expand.grid(m1 = c(1:10), m2 = c(20:30)))
retstuff = function(m1, m2) { return(tribble(~m3, ~m4, m1+1, m2+2)) }
pmap_dfr(grid1, retstuff)

When I try to partition the grid with multidplyr:

grid2 = partition(grid1, m1)
pmap_dfr(grid2, retstuff)

I get the following error from pmap_dfr(): Error: Element 5 is not a vector (environment)

I also get the following warning from partition(), which is also reported on GitHub: group_indices_.grouped_df ignores extra arguments. I'm not sure whether that's related.


Solution

  • There are a few issues:

    • You need to load any necessary packages (beyond dplyr) on each node.
    • You need to copy your function to each node.
    • You can only call dplyr verbs on the partitioned data frame, so you need to wrap the pmap_dfr call in dplyr::do.

    With those changes, it works:

    library(tidyverse)
    library(multidplyr)
    
    grid1 <- as_tibble(expand.grid(m1 = c(1:10), m2 = c(20:30)))
    retstuff <- function(m1, m2) { 
        tribble(   ~m3,    ~m4, 
                m1 + 1, m2 + 2)
    }
    
    grid2 <- partition(grid1, m1)
    #> Initialising 7 core cluster.
    #> Warning: group_indices_.grouped_df ignores extra arguments
    cluster_library(grid2, 'tidyverse')    # load packages on each node
    cluster_copy(grid2, retstuff)    # copy function to each node
    
    grid2 %>% do(pmap_dfr(., retstuff))    # wrap call in dplyr::do
    #> Source: party_df [110 x 3]
    #> Groups: m1
    #> Shards: 7 [11--22 rows]
    #> 
    #> # S3: party_df
    #>       m1    m3    m4
    #>    <int> <dbl> <dbl>
    #>  1     9    10    22
    #>  2     9    10    23
    #>  3     9    10    24
    #>  4     9    10    25
    #>  5     9    10    26
    #>  6     9    10    27
    #>  7     9    10    28
    #>  8     9    10    29
    #>  9     9    10    30
    #> 10     9    10    31
    #> # ... with 100 more rows
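
The same three requirements show up when using base R's parallel package directly, which may make the multidplyr calls above easier to follow. This is a minimal sketch under stated assumptions, not a description of what multidplyr does internally: the two-worker cluster, the use of data.frame in place of tribble, and the names pieces, results and combined are all illustrative choices.

```r
library(parallel)

grid1 <- expand.grid(m1 = 1:10, m2 = 20:30)

# Same arithmetic as retstuff() above, but returning a base data frame
# (assumption: avoids needing tibble on the workers)
retstuff <- function(m1, m2) data.frame(m3 = m1 + 1, m4 = m2 + 2)

cl <- makeCluster(2)  # worker count is an arbitrary choice for this sketch

# 1. Load packages on each worker (stats is only a stand-in here)
invisible(clusterEvalQ(cl, library(stats)))

# 2. Copy the function to each worker
clusterExport(cl, "retstuff", envir = environment())

# 3. Split the rows by m1 and process each piece on a worker
pieces <- split(grid1, grid1$m1)
results <- parLapply(cl, pieces, function(piece) {
  out <- do.call(rbind, Map(retstuff, piece$m1, piece$m2))
  cbind(m1 = piece$m1, out)
})
stopCluster(cl)

# Gather the per-worker results back into one data frame
combined <- do.call(rbind, results)
```

Skipping step 1 or 2 makes the worker fail because it cannot find the package or the function, which is the same class of problem the cluster_library and cluster_copy calls address.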
    

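    For this particular case, however, the parallel version is not the fastest option. In the benchmark below, multidplyr with pmap_dfr runs about three times faster than plain pmap_dfr (median of roughly 120 ms versus 395 ms), but plain dplyr::mutate is far faster still (median under 8 ms) and much easier to write: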

    grid1 %>% mutate(m3 = m1 + 1, m4 = m2 + 2)
    #> # A tibble: 110 x 4
    #>       m1    m2    m3    m4
    #>    <int> <int> <dbl> <dbl>
    #>  1     1    20     2    22
    #>  2     2    20     3    22
    #>  3     3    20     4    22
    #>  4     4    20     5    22
    #>  5     5    20     6    22
    #>  6     6    20     7    22
    #>  7     7    20     8    22
    #>  8     8    20     9    22
    #>  9     9    20    10    22
    #> 10    10    20    11    22
    #> # ... with 100 more rows
    
    all.equal(grid2 %>% do(pmap_dfr(., retstuff)) %>% collect, 
              grid1 %>% mutate(m3 = m1 + 1, m4 = m2 + 2) %>% select(-m2))
    #> [1] TRUE
    
    microbenchmark::microbenchmark(
        multidplyr_pmap = grid2 %>% do(pmap_dfr(., retstuff)) %>% collect(), 
        multidplyr_mutate = grid2 %>% mutate(m3 = m1 + 1, m4 = m2 + 2) %>% collect(),
        pmap = grid1 %>% pmap_dfr(retstuff),
        mutate = grid1 %>% mutate(m3 = m1 + 1, m4 = m2 + 2) %>% select(-m2)
    )
    #> Unit: milliseconds
    #>               expr        min        lq       mean    median         uq       max neval
    #>    multidplyr_pmap 113.896646 117.18365 122.656286 119.75652 125.874450 182.53330   100
    #>  multidplyr_mutate  12.419918  12.84528  16.271337  13.68441  15.092482 177.77372   100
    #>               pmap 372.512544 387.49371 397.844622 394.71971 402.640281 551.78633   100
    #>             mutate   7.014426   7.49689   8.499588   7.66554   8.654478  32.22647   100
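
One likely reason for the gap is that the pmap approach calls the function once per row and binds 110 one-row results together, whereas mutate operates on whole columns at once. The following base R sketch illustrates that difference in shape only; it makes no timing claims, and the names grid, row_wise and vectorised are introduced here for illustration.

```r
grid <- expand.grid(m1 = 1:10, m2 = 20:30)

# Row-wise: build a one-row data frame per row, then bind all of them together
row_wise <- do.call(
  rbind,
  Map(function(m1, m2) data.frame(m3 = m1 + 1, m4 = m2 + 2), grid$m1, grid$m2)
)

# Vectorised: compute each output column in a single operation
vectorised <- data.frame(m3 = grid$m1 + 1, m4 = grid$m2 + 2)

# Both approaches produce the same values in the same row order
stopifnot(all(row_wise$m3 == vectorised$m3),
          all(row_wise$m4 == vectorised$m4))
```

When the per-row function is simple arithmetic like this, the vectorised form is usually the one to reach for first; partitioning across workers tends to pay off only when each call does substantial work.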