
Rolling mean per group in tidyverse


I aggregate data per group and calculate group means to ease visualization. Unfortunately, some of my groups are very large while others are nearly empty. I would like a rolling mean calculation to smooth the picture further. Here is similar data:

# load packages
library(haven)    # read_dta()
library(dplyr)    # %>%, group_by(), summarise()
library(ggplot2)  # ggplot()
# read dta file from github
soep <- read_dta("https://github.com/MarcoKuehne/marcokuehne.github.io/blob/main/data/SOEP/soep_lebensz_en/soep_lebensz_en.dta?raw=true")

soep %>% 
  group_by(education, sex) %>% 
  summarise(across(satisf_org, mean, na.rm = TRUE),
            n = n()) %>% 
  ggplot(aes(x = education, y = satisf_org, col = as.factor(sex))) +
  geom_point() +
  labs(title = "Mean Satisfaction per Education Level by Gender",
       x = "Education", y = "Mean Satisfaction", color = "Gender")

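[Figure: scatter plot "Mean Satisfaction per Education Level by Gender"; x-axis Education, y-axis Mean Satisfaction, points colored by gender]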

The mean satisfaction at education 8.5 for females looks like an outlier. I assume that people at neighbouring education levels are similar enough to be pooled, i.e. I want to calculate the mean satisfaction of all people at education 7, 8.5 and 9 (grouped by sex) and store it as the rolling mean at 8.5 (grouped by sex).

Starting from standard grouped means:

soep %>% 
  group_by(education, sex) %>% 
  summarise(across(satisf_org, mean, na.rm = TRUE),
            n = n())

# A tibble: 28 × 4
# Groups:   education [14]
   education sex        satisf_org     n
       <dbl> <dbl+lbl>       <dbl> <int>
 1       7   0 [male]         6.16    73
 2       7   1 [female]       6.59   113
 3       8.5 0 [male]         7.16    37
 4       8.5 1 [female]       8.56    18
 5       9   0 [male]         6.88   430
 6       9   1 [female]       7.00   633
 7      10   0 [male]         7.19   144
 8      10   1 [female]       7.36   221
 9      10.5 0 [male]         6.96  1538
10      10.5 1 [female]       7.02  1493
# … with 18 more rows
# ℹ Use `print(n = ...)` to see more rows

Here is the number that I expect for females at education 8.5:

soep %>% 
  filter(sex == 1) %>%  # only looks at females
  filter(education %in% c(7, 8.5, 9)) %>%  # take education level before and after
  summarise(mean(satisf_org)) # calculate the "rolling mean" per group 

# A tibble: 1 × 1
  `mean(satisf_org)`
               <dbl>
1               6.97

This is the kind of rolling mean per group that I expect for each education value, i.e. 6.97 instead of 8.56 for females at education 8.5.

PS: In my real data, the variable is age in years, and I usually have at least some people at every age. So the rolling window can be defined numerically (age - 1 to age + 1) instead of via lead / lag neighbours.


Solution

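  • You can group_by sex after summarising and do a rolling average of the per-level means there. Note that this gives each of the three group means equal weight regardless of n, so for females at education 8.5 it returns 7.38 rather than the pooled 6.97 from the question: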

    library(dplyr)
    library(slider)
    soep %>% 
      group_by(education, sex) %>% 
      summarise(across(satisf_org, mean, na.rm = TRUE),
                n = n()) %>% 
      group_by(sex) %>%
      mutate(rolling_mean = slide_dbl(satisf_org, mean, .before = 1, .after = 1))
    

    Output:

    # A tibble: 28 × 5
    # Groups:   sex [2]
       education sex        satisf_org     n rolling_mean
           <dbl> <dbl+lbl>       <dbl> <int>        <dbl>
     1       7   0 [male]         6.16    73         6.66
     2       7   1 [female]       6.59   113         7.57
     3       8.5 0 [male]         7.16    37         6.73
     4       8.5 1 [female]       8.56    18         7.38
     5       9   0 [male]         6.88   430         7.08
     6       9   1 [female]       7.00   633         7.64
     7      10   0 [male]         7.19   144         7.01
     8      10   1 [female]       7.36   221         7.13
     9      10.5 0 [male]         6.96  1538         7.14
    10      10.5 1 [female]       7.02  1493         7.20
    # … with 18 more rows
    # ℹ Use `print(n = ...)` to see more rows
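
If the goal is the pooled mean from the question (all people in the neighbouring levels, so large groups count more than small ones), one possible variation is to weight each group mean by its size inside the window. The following is only a sketch under some assumptions: the input rows are typed in by hand from the female rows of the summary table above (so the means are rounded), and the names `grouped`, `pooled`, `rolling_mean_pooled` and `rolling_mean_pm1` are made up for illustration. It uses `slide_sum()` for a window of neighbouring rows and `slide_index_sum()` for a numeric window of -1 to +1 around the value, as described in the PS.

```r
library(dplyr)
library(slider)

# Female rows copied by hand from the grouped summary above.
# The means are rounded to two decimals, so results are approximate.
grouped <- tibble(
  education  = c(7, 8.5, 9, 10, 10.5),
  sex        = 1,
  satisf_org = c(6.59, 8.56, 7.00, 7.36, 7.02),
  n          = c(113, 18, 633, 221, 1493)
)

pooled <- grouped %>%
  group_by(sex) %>%
  arrange(education, .by_group = TRUE) %>%  # windows need sorted rows
  mutate(
    # Neighbour window: previous row, current row, next row.
    # Windowed total of satisfaction divided by windowed head count.
    rolling_mean_pooled =
      slide_sum(satisf_org * n, before = 1, after = 1) /
      slide_sum(n, before = 1, after = 1),
    # Numeric window: all rows whose education lies within +/- 1
    # of the current value (the "-1 to +1" idea from the PS).
    rolling_mean_pm1 =
      slide_index_sum(satisf_org * n, education, before = 1, after = 1) /
      slide_index_sum(n, education, before = 1, after = 1)
  ) %>%
  ungroup()

pooled
```

With these rounded inputs, the neighbour window at education 8.5 comes out at about 6.98, in line with the 6.97 computed on the raw data in the question. The numeric window at 8.5 only covers 8.5 and 9, because 7 is more than one unit away, which is why it suits evenly spaced values such as age in years better than the education levels here. On the full data, if `satisf_org` contains missing values, the weight should probably be the count of non-missing values (for example `sum(!is.na(satisf_org))`) rather than `n()`.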