Search code examples
rnesteddplyrtibble

Applying simple function via across within nested data on each group


Background

Given nested data, I would like to apply a simple function using across on an arbitrary selection of columns. Using across I want to iterate over the selection of columns passed to one argument of the function and keep the second argument constant.


Example

# Using across within nested data frame

# Gapminder data from gapminder package
library("tidyverse")
data("gapminder", package = "gapminder")

# Sample function
sample_function <- function(.data, var_a, var_b) {
    var_a <- enquo(var_a)
    var_b <- enquo(var_b)
    .data %>%
        mutate(some_res = log(!!var_a) + !!var_b) %>%
        pull(some_res)
}


# Basic example, not working
gapminder %>%
    group_by(country, continent) %>%
    nest() %>%
    mutate(sample_res = map(
        .x = data,
        .f = across(
            .cols = vars(year, lifeExp, pop),
            .fns = ~ sample_function(var_a = .x),
            var_b = gdpPercap
        )
    )) %>%
    unnest(sample_res)

The example fails with the following error:

Error: Problem with mutate() input sample_res. x Must subset columns with a valid subscript vector. x Subscript has the wrong type quosures. ℹ It must be numeric or character. ℹ Input sample_res is map(...). ℹ The error occured in group 1: country = "Afghanistan", continent = "Asia". Run rlang::last_error() to see where the error occurred.

Desired results

I can iterate over the selected columns always passing a different argument in var_a. In this case the values reflect year, lifeExp and gdpPercap variables.

gapminder %>%
    group_by(country, continent) %>%
    nest() %>%
    mutate(
        res_year = map(.x = data, 
                       .f = sample_function, var_a = year, var_b = gdpPercap),
        res_lifeExp = map(.x = data, 
                          .f = sample_function, var_a = lifeExp, 
                          var_b = gdpPercap),
        res_pop = map(.x = data, 
                      .f = sample_function, var_a = pop, var_b = gdpPercap)
    )

Sought solution

The solution obtained in desired results is rather impractical and error-prone as forces new row for each variable. I would like to find a combination of using across and map so I can run different variations of the mapping function only by adding variables to across.


Solution

  • Final update (using nest_by & across)

    Inspired by @Brunos answer, I revised my approach to use nest_by / rowwise instead of map (which is, I guess, the new recommended way of wrangling nested tibbles).

    The result of my original answer can be easily reproduced using nest_by:

    gapminder %>%
      nest_by(country, continent) %>%
      mutate(sample_res = list(transmute(data,
                                         across(c(year, lifeExp, pop),
                                                ~ sample_function(data, var_a = .x, var_b = gdpPercap))
      ))
      ) 
    

    However, it returns one list-column containing tibbles. If the output were normal vectors we could just remove sample_res = list() and new columns would be added to your existing tibble. However, in this example the output of each new column is a list-column containing vectors. I have not managed to produce this output in one call to mutate(across(...)).

    It is possible though to use unnest and then another call to summarise(across(...)) to get the job done.

    gapminder %>%
      nest_by(country, continent) %>%
      mutate(sample_res = list(transmute(data,
                                 across(c(year, lifeExp, pop),
                                        ~ sample_function(data, var_a = .x, var_b = gdpPercap))
                          ))
             ) %>% 
      unnest(cols = sample_res) %>%
      summarise(across(c(year, lifeExp, pop), list, .names = "res_{col}"))
    



    Original answer (using group_by, nest, map & across)

    You misspecified sample_function in your across call. It should be

    function(x) sample_function(.x, var_a = x, var_b = gdpPercap)
    

    instead of

    ~ sample_function(var_a = .x),
                    var_b = gdpPercap
    

    Since you are nesting map and mutate(across(...)), I prefer to have at least one "normal" anonymous functions instead of the lamda ~ notation. Otherwise, things can get confusing with two .xs.

    Further across should be called inside its own separate mutate.

    This should work:

    library("tidyverse")
    data("gapminder", package = "gapminder")
    
    # Sample function
    sample_function <- function(.data, var_a, var_b) {
      var_a <- enquo(var_a)
      var_b <- enquo(var_b)
    
      .data %>%
        mutate(some_res = log(!!var_a) + !!var_b) %>%
        pull(some_res)
    }
    
    gapminder %>%
      group_by(country, continent) %>%
      nest() %>%  
      mutate(sample_res = map(
        data,
        ~ mutate(.x, across(c(year, lifeExp, pop),
                           function(x) { 
                             sample_function(.x, var_a = x, var_b = gdpPercap)
                            }
                           )
        )
       )
      )
    #> # A tibble: 142 x 4
    #> # Groups:   country, continent [142]
    #>    country     continent data              sample_res       
    #>    <fct>       <fct>     <list>            <list>           
    #>  1 Afghanistan Asia      <tibble [12 × 4]> <tibble [12 × 4]>
    #>  2 Albania     Europe    <tibble [12 × 4]> <tibble [12 × 4]>
    #>  3 Algeria     Africa    <tibble [12 × 4]> <tibble [12 × 4]>
    #>  4 Angola      Africa    <tibble [12 × 4]> <tibble [12 × 4]>
    #>  5 Argentina   Americas  <tibble [12 × 4]> <tibble [12 × 4]>
    #>  6 Australia   Oceania   <tibble [12 × 4]> <tibble [12 × 4]>
    #>  7 Austria     Europe    <tibble [12 × 4]> <tibble [12 × 4]>
    #>  8 Bahrain     Asia      <tibble [12 × 4]> <tibble [12 × 4]>
    #>  9 Bangladesh  Asia      <tibble [12 × 4]> <tibble [12 × 4]>
    #> 10 Belgium     Europe    <tibble [12 × 4]> <tibble [12 × 4]>
    #> # … with 132 more rows
    

    Created on 2020-06-03 by the reprex package (v0.3.0)

    When using map with custom functions to loop over tibbles in a list-column it is quite helpful to build a first version outside of the loop.

    test_dat <- gapminder %>%
      nest_by(country, continent) 
    
    test_dat$data[[1]] %>% 
      mutate(across(
        c(year, lifeExp, pop),
        ~ sample_function(test_dat$data[[1]], var_a = .x, var_b = gdpPercap)
        )
        )
    

    Once this works, then the final step is to replace the object you want to loop over with .x.

    Another approach (part of original answer)

    Another approach would be to rewrite your original sample_function and include across in your mutate call there. We could make it take a string vector of variable names that will be passed to across. I might like this approach more, because its more flexible. Now you could have another list-column containing different variable names for different subsets of your data and loop over them and your data column with map2.

    library("tidyverse")
    data("gapminder", package = "gapminder")
    
    sample_function2 <- function(.data, .vars, var_b) {
      .vars <- syms(.vars)
      var_b <- enquo(var_b)
    
      .data %>%
        mutate(across(c(!!!.vars), function(y) log(y) + !!var_b))
    }
    
    
    gapminder %>%
      group_by(country, continent) %>%
      nest() %>% 
      mutate(sample_res = map(
        data,
        ~ sample_function2(.x,
                           .vars = c("year", "lifeExp", "pop"),
                           var_b = gdpPercap)
      )
      )
    
    #> # A tibble: 142 x 4
    #> # Groups:   country, continent [142]
    #>    country     continent data              sample_res       
    #>    <fct>       <fct>     <list>            <list>           
    #>  1 Afghanistan Asia      <tibble [12 × 4]> <tibble [12 × 4]>
    #>  2 Albania     Europe    <tibble [12 × 4]> <tibble [12 × 4]>
    #>  3 Algeria     Africa    <tibble [12 × 4]> <tibble [12 × 4]>
    #>  4 Angola      Africa    <tibble [12 × 4]> <tibble [12 × 4]>
    #>  5 Argentina   Americas  <tibble [12 × 4]> <tibble [12 × 4]>
    #>  6 Australia   Oceania   <tibble [12 × 4]> <tibble [12 × 4]>
    #>  7 Austria     Europe    <tibble [12 × 4]> <tibble [12 × 4]>
    #>  8 Bahrain     Asia      <tibble [12 × 4]> <tibble [12 × 4]>
    #>  9 Bangladesh  Asia      <tibble [12 × 4]> <tibble [12 × 4]>
    #> 10 Belgium     Europe    <tibble [12 × 4]> <tibble [12 × 4]>
    #> # … with 132 more rows
    

    Created on 2020-06-04 by the reprex package (v0.3.0)

    Add on (to original answer)

    As @Bruno points out that the approaches above are not in the format specified by the OP, here is an alternative solution building on my second approach above, which should yield the desired output.

    library("tidyverse")
    data("gapminder", package = "gapminder")
    
    sample_function2 <- function(.data, .vars, var_b) {
      .vars <- syms(.vars)
      var_b <- enquo(var_b)
    
      .data %>%
        transmute(across(c(!!!.vars), function(y) log(y) + !!var_b)) %>% 
        unlist()
    
    }
    
    my_vars <- c("year", "lifeExp", "pop")
    
    gapminder %>%
      group_by(country, continent) %>%
      nest() %>% 
      crossing(vars = my_vars) %>% 
      mutate(sample_res = map2(
        data,
        vars, 
        ~ sample_function2(.x,
                           .vars = .y,
                           var_b = gdpPercap)
      )
      ) %>% 
      pivot_wider(names_from = vars,
                  names_prefix = "res_",
                  values_from = sample_res) 
    
    #> # A tibble: 142 x 6
    #>    country     continent data              res_lifeExp res_pop    res_year  
    #>    <fct>       <fct>     <list>            <list>      <list>     <list>    
    #>  1 Afghanistan Asia      <tibble [12 × 4]> <dbl [12]>  <dbl [12]> <dbl [12]>
    #>  2 Albania     Europe    <tibble [12 × 4]> <dbl [12]>  <dbl [12]> <dbl [12]>
    #>  3 Algeria     Africa    <tibble [12 × 4]> <dbl [12]>  <dbl [12]> <dbl [12]>
    #>  4 Angola      Africa    <tibble [12 × 4]> <dbl [12]>  <dbl [12]> <dbl [12]>
    #>  5 Argentina   Americas  <tibble [12 × 4]> <dbl [12]>  <dbl [12]> <dbl [12]>
    #>  6 Australia   Oceania   <tibble [12 × 4]> <dbl [12]>  <dbl [12]> <dbl [12]>
    #>  7 Austria     Europe    <tibble [12 × 4]> <dbl [12]>  <dbl [12]> <dbl [12]>
    #>  8 Bahrain     Asia      <tibble [12 × 4]> <dbl [12]>  <dbl [12]> <dbl [12]>
    #>  9 Bangladesh  Asia      <tibble [12 × 4]> <dbl [12]>  <dbl [12]> <dbl [12]>
    #> 10 Belgium     Europe    <tibble [12 × 4]> <dbl [12]>  <dbl [12]> <dbl [12]>
    #> # … with 132 more rows
    

    Created on 2020-06-04 by the reprex package (v0.3.0)