Tags: r, dplyr, tidyverse, magrittr, purrr

Implementing map() on a column of nested data frames


I am teaching myself the purrr package from the R tidyverse and am having trouble using map() on a column of nested data frames. Could someone explain what I'm missing?

Using the built-in ChickWeight dataset as an example, I can easily get the number of chicks observed at each time point under diet #1 if I first filter for diet #1, like so:

library(tidyverse) 
ChickWeight %>%
  filter(Diet == 1) %>% 
  group_by(Time) %>% 
  summarise(counts = n_distinct(Chick))

This is great, but I would like to do it for every diet at once, and I thought nesting the data and iterating over it with map() would be a good approach. This is what I did:

example <- ChickWeight %>% 
  nest(-Diet) 

Calling map() directly on the nested column then gives what I'm aiming for:

map(example$data, ~ .x %>% group_by(Time) %>% summarise(counts = n_distinct(Chick))) 

However, when I try to run the same command inside a pipe, so that the result is stored in a new column of the nested data frame, it fails:

example %>% 
   mutate(counts = map(data, ~ .x %>% group_by(Time) %>%  summarise(counts = n_distinct(Chick))))
Error in eval(substitute(expr), envir, enclos) : 
  variable 'Chick' not found

Why does this occur?


I also tried it on the data frame split into a list, and that didn't work either:

ChickWeight %>% 
  split(.$Diet) %>% 
  map(data, ~ .x %>% group_by(Time) %>%  summarise(counts = n_distinct(Chick)))

Solution

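  • Because you're using dplyr non-standard evaluation (NSE) inside another dplyr NSE call (summarise() inside mutate()), dplyr gets confused about which environment to search for Chick. It's probably a bug, really, but you can avoid it with the .data pronoun (new in the development version at the time of writing, and released in dplyr 0.7.0), which tells dplyr explicitly to look up Chick in the data being summarised: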

    library(tidyverse)
    
    ChickWeight %>% 
        nest(-Diet) %>% 
        mutate(counts = map(data, 
                            ~.x %>% group_by(Time) %>% 
                                summarise(counts = n_distinct(.data$Chick))))
    #> # A tibble: 4 × 3
    #>     Diet               data            counts
    #>   <fctr>             <list>            <list>
    #> 1      1 <tibble [220 × 3]> <tibble [12 × 2]>
    #> 2      2 <tibble [120 × 3]> <tibble [12 × 2]>
    #> 3      3 <tibble [120 × 3]> <tibble [12 × 2]>
    #> 4      4 <tibble [118 × 3]> <tibble [12 × 2]>
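
    For comparison, here is a minimal sketch of the same pipeline for more recent releases. It assumes dplyr 0.7.0 or later and tidyr 1.0.0 or later, where (as far as I can tell) the bare column name resolves correctly inside the nested summarise() and nest() expects a named argument. The as_tibble() call, the object names nested and flat, and the final unnest() step are additions for illustration, not part of the original answer:

    ```r
    library(tidyverse)

    # Assumption: convert to a plain tibble first so that ChickWeight's extra
    # groupedData classes do not get in the way of nesting.
    nested <- as_tibble(ChickWeight) %>%
        nest(data = -Diet) %>%
        mutate(counts = map(data,
                            ~ .x %>%
                                group_by(Time) %>%
                                summarise(counts = n_distinct(Chick))))

    # Flatten the per-diet summaries back into one row per Diet/Time pair.
    flat <- nested %>%
        select(Diet, counts) %>%
        unnest(counts)
    ```

    If this runs as expected, flat should hold the same Diet, Time and counts columns as a single group_by(Diet, Time) summary, which is a quick way to check the nested version.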
    

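    To pipe a list into map(), omit map()'s first argument: the pipe already supplies the list as the object to iterate over, so you only pass the function. (The trailing .[[1]] just extracts the first element, diet 1, for display.)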

    ChickWeight %>% 
        split(.$Diet) %>% 
        map(~ .x %>% group_by(Time) %>%  summarise(counts = n_distinct(Chick))) %>% .[[1]]
    
    #> # A tibble: 12 × 2
    #>     Time counts
    #>    <dbl>  <int>
    #> 1      0     20
    #> 2      2     20
    #> 3      4     19
    #> 4      6     19
    #> 5      8     19
    #> 6     10     19
    #> 7     12     19
    #> 8     14     18
    #> 9     16     17
    #> 10    18     17
    #> 11    20     17
    #> 12    21     16
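
    To make the role of the pipe explicit, the sketch below writes the same call with and without %>%. The names count_chicks, by_diet, explicit, piped and combined are illustrative only. Because x %>% f(y) is evaluated as f(x, y), the piped attempt in the question, map(data, ~ ...), effectively became map(<the list>, data, ~ ...), so map() took data to be the function to apply:

    ```r
    library(tidyverse)

    # Hypothetical helper: number of distinct chicks at each time point.
    count_chicks <- function(df) {
        df %>% group_by(Time) %>% summarise(counts = n_distinct(Chick))
    }

    # One data frame per diet, named "1" to "4".
    by_diet <- as_tibble(ChickWeight) %>% split(.$Diet)

    explicit <- map(by_diet, count_chicks)      # list given as the first argument
    piped    <- by_diet %>% map(count_chicks)   # the pipe supplies it instead

    # Optionally stack the list into one data frame, keeping the diet as a column.
    combined <- bind_rows(piped, .id = "Diet")
    ```

    Both calls should return the same list of four summaries; bind_rows() with .id is one way to get back to a single data frame if a list is not what you need downstream.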
    

    A simpler option would be to just group by both columns:

    ChickWeight %>% group_by(Diet, Time) %>% summarise(counts = n_distinct(Chick))
    
    #> Source: local data frame [48 x 3]
    #> Groups: Diet [?]
    #> 
    #>      Diet  Time counts
    #>    <fctr> <dbl>  <int>
    #> 1       1     0     20
    #> 2       1     2     20
    #> 3       1     4     19
    #> 4       1     6     19
    #> 5       1     8     19
    #> 6       1    10     19
    #> 7       1    12     19
    #> 8       1    14     18
    #> 9       1    16     17
    #> 10      1    18     17
    #> # ... with 38 more rows