Search code examples
rdataframedplyrgroup-summaries

Get group mean with multiple grouping variables and excluding own group value


I'm looking for a faster way to calculate a group mean with multiple grouping variables while excluding own group values. A thought experiment would be finding average value (e.g. price) for a county from the counties in the same state in the same year excluding own county's value. Here's a toy data set.

df <- data_frame(
  state = rep(c("AL", "CA"), each = 6),
  county = rep(letters[1:6], each = 2),
  year = rep(c(2011:2012), 6),
  value = sample.int(100, 12)
)

df %>%
  group_by(state, county, year) %>%
  summarise(q = mean(df$value[df$state == state & df$county != county & df$year == year]))

# Groups:   state, county [6]
   state county  year     q
   <chr> <chr>  <int> <dbl>
 1 AL    a       2011  56  
 2 AL    a       2012  46  
 3 AL    b       2011  50.5
 4 AL    b       2012  52  
 5 AL    c       2011  55.5
 6 AL    c       2012  29  
 7 CA    d       2011  52.5
 8 CA    d       2012  32  
 9 CA    e       2011  68.5
10 CA    e       2012  31.5
11 CA    f       2011  32  
12 CA    f       2012  42.5

The above code gives me a desired result, but when I apply this to a larger dataset (with more grouping variables) it gets really slow. Do you have any suggestion on how to speed this up?

If the original approach is incorrect, please point that out as well.


Solution

  • library(dplyr)
    
    df %>%
      group_by(state, year) %>%
      mutate(q = (sum(value) - value) / (n()-1))
    
    #> # A tibble: 12 x 5
    #> # Groups:   state, year [4]
    #>    state county  year value     q
    #>    <chr> <chr>  <int> <int> <dbl>
    #>  1 AL    a       2011    68  30.5
    #>  2 AL    a       2012    63  42  
    #>  3 AL    b       2011    53  38  
    #>  4 AL    b       2012    56  45.5
    #>  5 AL    c       2011     8  60.5
    #>  6 AL    c       2012    28  59.5
    #>  7 CA    d       2011     7  40  
    #>  8 CA    d       2012    69  41  
    #>  9 CA    e       2011    39  24  
    #> 10 CA    e       2012    79  36  
    #> 11 CA    f       2011    41  23  
    #> 12 CA    f       2012     3  74
    

    Data:

    #data_frame is deprecate!
    df <- tibble(
      state = rep(c("AL", "CA"), each = 6),
      county = rep(letters[1:6], each = 2),
      year = rep(c(2011:2012), 6),
      value = sample.int(100, 12)
    )