
What effect does setting the attribute of a vector have in dplyr::summarize()?


I just ran into some weird behavior of dplyr where summarize kept referring to objects from a previous group.

Here is a simple reproducible example to illustrate the surprising behavior:

library(dplyr, warn.conflicts = FALSE)
tibble(x = rep(letters[1:3], times = 4),
       y = rnorm(12)) %>%
  group_by(x) %>%
  summarize(z1 = sum(y),
            z2 = {
              attr(y, "test") <- "test"
              sum(y)
            })
#> # A tibble: 3 × 3
#>   x         z1    z2
#>   <chr>  <dbl> <dbl>
#> 1 a      0.602 0.602
#> 2 b      1.22  0.602
#> 3 c     -0.310 0.602

Created on 2022-10-31 by the reprex package (v2.0.1)

I expected z1 and z2 to be identical. I don't understand why setting an attribute on the vector y means that, in later iterations, the reference to the "correct" elements of y is shadowed.

The problem can easily be fixed by using sum(.data$y) in the last line, but I would like to understand the scoping rules within the non-standard evaluation of summarize. Any pointers to helpful documentation, or explanations of why the current behavior makes sense in the tidyverse non-standard evaluation framework, are appreciated.


I am using R 4.1.1 with dplyr 1.0.7.


Solution

  • This is a scoping problem. An assignment such as attr(y, "test") <- "test" writes to the variable y inside summarize: R reads the first group's y from your data, modifies the copy, and binds the result to a local variable called y that is distinct from the y in your data frame. Because it is a local variable, name lookup finds it before the y in the passed data frame. Since the same environment is reused for subsequent groups' calculations inside summarize, this local variable persists, and every group ends up summing the first group's values.

    We can see this if we do:

    library(dplyr, warn.conflicts = FALSE)
    
    set.seed(1)
    
    tibble(x = rep(letters[1:3], times = 4),
           y = rnorm(12)) %>%
      group_by(x) %>% 
      summarize(z1 = sum(y),
                z2 = {
                  y <- y
                  sum(y)
                }) 
    #> # A tibble: 3 x 3
    #>   x         z1    z2
    #>   <chr>  <dbl> <dbl>
    #> 1 a      1.15   1.15
    #> 2 b      2.76   1.15
    #> 3 c     -0.690  1.15
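
    The attribute assignment in the question behaves like y <- y because R treats a replacement call as an ordinary assignment to the variable. A minimal base R sketch of that rule, with a plain function standing in for the summarize expression (the names y and f are illustrative only, and this is not dplyr code):

    ```r
    # A variable in the enclosing (here: global) environment,
    # standing in for the data frame column.
    y <- c(1, 2, 3)

    f <- function() {
      # `attr(y, "test") <- "test"` is evaluated roughly as
      #   y <- `attr<-`(y, "test", value = "test")
      # so it reads the outer `y` and binds a modified copy in the local frame.
      attr(y, "test") <- "test"
      # Check whether a binding named `y` now exists in this function's own frame.
      exists("y", envir = environment(), inherits = FALSE)
    }

    has_local_y <- f()
    has_local_y    # TRUE: the assignment created a local `y`
    attributes(y)  # NULL: the outer `y` was never modified
    ```

    The outer y is left untouched, so any later lookup from inside the same frame sees the local copy rather than the original.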
    

    As long as we remove the local copy of the y variable from the local frame, this doesn't happen:

    library(dplyr, warn.conflicts = FALSE)
    
    set.seed(1)
    
    tibble(x = rep(letters[1:3], times = 4),
           y = rnorm(12)) %>%
      group_by(x) %>% 
      summarize(z1 = sum(y),
                z2 = {
                  attr(y, "test") <- "test"
                  x <- sum(y)
                  rm(y)
                  x
                }) 
    #> # A tibble: 3 x 3
    #>   x         z1     z2
    #>   <chr>  <dbl>  <dbl>
    #> 1 a      1.15   1.15 
    #> 2 b      2.76   2.76 
    #> 3 c     -0.690 -0.690
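
    The persistence across groups, and the effect of rm(y), can be imitated in base R. This is a simplified model under stated assumptions, not dplyr's actual implementation: columns, groups and run are hypothetical names, columns holds the current group's data, and a single child environment is reused to evaluate the expression for every group.

    ```r
    # Hypothetical per-group data, standing in for the grouped column `y`.
    groups <- list(a = c(1, 2), b = c(10, 20), c = c(100, 200))

    # Environment holding the current group's column values.
    columns <- new.env()

    run <- function(expr) {
      # One evaluation environment, created once and reused for all groups.
      mask <- new.env(parent = columns)
      vapply(groups, function(g) {
        assign("y", g, envir = columns)  # switch to the next group's data
        eval(expr, envir = mask)         # local assignments land in `mask`
      }, numeric(1))
    }

    # The local `y` created for the first group shadows later groups' data.
    sticky <- run(quote({
      attr(y, "test") <- "test"
      sum(y)
    }))

    # Removing the local `y` before returning restores the lookup in `columns`.
    cleaned <- run(quote({
      attr(y, "test") <- "test"
      s <- sum(y)
      rm(y)
      s
    }))

    sticky   # 3 for every group
    cleaned  # 3, 30, 300
    ```

    In this model the first result repeats the first group's sum, while the second gives each group its own sum, mirroring the two summarize outputs above.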
    

    Or better still, don't write to a local variable with the same name as a variable in your data frame:

    library(dplyr, warn.conflicts = FALSE)
    
    set.seed(1)
    
    tibble(x = rep(letters[1:3], times = 4),
           y = rnorm(12)) %>%
      group_by(x) %>% 
      summarize(z1 = sum(y),
                z2 = {
                  new_y <- y
                  attr(new_y, "test") <- "test"
                  sum(new_y)
                }) 
    #> # A tibble: 3 x 3
    #>   x         z1     z2
    #>   <chr>  <dbl>  <dbl>
    #> 1 a      1.15   1.15 
    #> 2 b      2.76   2.76 
    #> 3 c     -0.690 -0.690
    

    Created on 2022-10-31 with reprex v2.0.2