multidplyr error with pmap_dfr: Error: Element 5 is not a vector (environment)

[This is also reported on the multidplyr GitHub page.]

I'm trying to use multidplyr_0.0.0.9000 with dplyr_0.7.4.9000 and pmap_dfr from purrr_0.2.4.9000. The following code (without using multidplyr) works fine:

grid1 = as_tibble(expand.grid(m1 = c(1:10), m2 = c(20:30)))
retstuff = function(m1, m2) { return(tribble(~m3, ~m4, m1+1, m2+2)) }
pmap_dfr(grid1, retstuff)

When I try to partition the grid with multidplyr:

grid2 = partition(grid1, m1)
pmap_dfr(grid2, retstuff)

I get the following error from pmap_dfr(): Error: Element 5 is not a vector (environment)

I also get the following warning from partition(), which has also been reported on GitHub: group_indices_.grouped_df ignores extra arguments. I'm not sure whether that's related.

Solution

A few issues:

You need to load any necessary packages (beyond dplyr) on each node,
You need to copy your function to each node, and
You can only call dplyr verbs on the partitioned data frame, so you need to wrap the pmap_dfr() call in dplyr::do().

after which it works:

library(tidyverse)
library(multidplyr)

grid1 <- as_tibble(expand.grid(m1 = c(1:10), m2 = c(20:30)))
retstuff <- function(m1, m2) { 
    tribble(   ~m3,    ~m4, 
            m1 + 1, m2 + 2)
}

grid2 <- partition(grid1, m1)
#> Initialising 7 core cluster.
#> Warning: group_indices_.grouped_df ignores extra arguments
cluster_library(grid2, 'tidyverse')    # load packages on each node
cluster_copy(grid2, retstuff)    # copy function to each node

grid2 %>% do(pmap_dfr(., retstuff))    # wrap call in dplyr::do
#> Source: party_df [110 x 3]
#> Groups: m1
#> Shards: 7 [11--22 rows]
#> 
#> # S3: party_df
#>       m1    m3    m4
#>    <int> <dbl> <dbl>
#>  1     9    10    22
#>  2     9    10    23
#>  3     9    10    24
#>  4     9    10    25
#>  5     9    10    26
#>  6     9    10    27
#>  7     9    10    28
#>  8     9    10    29
#>  9     9    10    30
#> 10     9    10    31
#> # ... with 100 more rows

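...but for this particular case multidplyr is not really worth it. In the benchmark below, the partitioned pmap_dfr call is about three times faster than the plain one (median roughly 120 ms versus 395 ms), yet plain dplyr::mutate is much faster still (roughly 8 ms) and a lot easier to write: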

grid1 %>% mutate(m3 = m1 + 1, m4 = m2 + 2)
#> # A tibble: 110 x 4
#>       m1    m2    m3    m4
#>    <int> <int> <dbl> <dbl>
#>  1     1    20     2    22
#>  2     2    20     3    22
#>  3     3    20     4    22
#>  4     4    20     5    22
#>  5     5    20     6    22
#>  6     6    20     7    22
#>  7     7    20     8    22
#>  8     8    20     9    22
#>  9     9    20    10    22
#> 10    10    20    11    22
#> # ... with 100 more rows
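
The reason mutate wins is that it operates on whole columns at once, whereas pmap_dfr calls retstuff (and therefore tribble) once per row and then binds 110 tiny data frames together. If you need to keep the logic in a function, a rough sketch of a vectorised version could look like this (retstuff_vec is a name I made up for illustration, it is not part of the original code):

```r
library(tibble)

grid1 <- as_tibble(expand.grid(m1 = 1:10, m2 = 20:30))

# Hypothetical vectorised counterpart of retstuff(): it takes whole
# columns and returns one output row per input row in a single call.
retstuff_vec <- function(m1, m2) {
  tibble(m3 = m1 + 1, m4 = m2 + 2)
}

# One call for the entire grid instead of one call per row.
res <- retstuff_vec(grid1$m1, grid1$m2)
dim(res)
```

This only works when the body of the function is itself vectorised (here it is plain arithmetic). If each row genuinely needs its own computation, the pmap_dfr approach is still the way to go.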

all.equal(grid2 %>% do(pmap_dfr(., retstuff)) %>% collect(), 
          grid1 %>% mutate(m3 = m1 + 1, m4 = m2 + 2) %>% select(-m2))
#> [1] TRUE

microbenchmark::microbenchmark(
    multidplyr_pmap = grid2 %>% do(pmap_dfr(., retstuff)) %>% collect(), 
    multidplyr_mutate = grid2 %>% mutate(m3 = m1 + 1, m4 = m2 + 2) %>% collect(),
    pmap = grid1 %>% pmap_dfr(retstuff),
    mutate = grid1 %>% mutate(m3 = m1 + 1, m4 = m2 + 2) %>% select(-m2)
)
#> Unit: milliseconds
#>               expr        min        lq       mean    median         uq       max neval
#>    multidplyr_pmap 113.896646 117.18365 122.656286 119.75652 125.874450 182.53330   100
#>  multidplyr_mutate  12.419918  12.84528  16.271337  13.68441  15.092482 177.77372   100
#>               pmap 372.512544 387.49371 397.844622 394.71971 402.640281 551.78633   100
#>             mutate   7.014426   7.49689   8.499588   7.66554   8.654478  32.22647   100
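
As for where the original error comes from: pmap_dfr expects a list (or data frame) in which every element is a vector to iterate over. A partitioned data frame is not a regular data frame, and my assumption is that the object it hands to pmap_dfr is a list of internal bookkeeping fields, one of which is an environment. The sketch below reproduces that kind of failure without multidplyr, using a made-up list (fake_party and its handle field are purely illustrative, not multidplyr's actual structure):

```r
library(purrr)

# Hypothetical stand-in for a partitioned data frame: a plain list whose
# last field is an environment rather than a column of values.
fake_party <- list(m1 = 1:2, m2 = 3:4, handle = new.env())

# pmap() treats every element of its input as a vector to iterate over,
# so the environment makes it fail. tryCatch() captures the message.
msg <- tryCatch(
  {
    pmap(fake_party, function(...) NULL)
    NA_character_   # only reached if no error was raised
  },
  error = function(e) conditionMessage(e)
)
msg
```

The exact wording of the message depends on the purrr version, but it should complain that one element is not a vector, much like the error in the question. Wrapping the call in do() avoids this because do() hands each worker an ordinary data frame.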