I aggregate data per group and calculate group means to ease visualization. Unfortunately, some of my groups are very large while others are nearly empty. I would like to calculate a rolling mean to smooth the picture further. Here is similar data:
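# load packages
library(haven)    # read_dta()
library(dplyr)    # %>%, group_by(), summarise()
library(ggplot2)  # ggplot(), geom_point(), labs()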
# read dta file from github
soep <- read_dta("https://github.com/MarcoKuehne/marcokuehne.github.io/blob/main/data/SOEP/soep_lebensz_en/soep_lebensz_en.dta?raw=true")
soep %>%
  group_by(education, sex) %>%
  summarise(across(satisf_org, mean, na.rm = TRUE),
            n = n()) %>%
  ggplot(aes(x = education, y = satisf_org, col = as.factor(sex))) +
  geom_point() +
  labs(title = "Mean Satisfaction per Education Level by Gender",
       x = "Education", y = "Mean Satisfaction", color = "Gender")
The mean satisfaction at education 8.5 for females looks like an outlier. I assume that people at neighbouring education levels are similar enough to be pooled, i.e. I want to calculate the mean satisfaction of all people at education 7, 8.5 and 9 (grouped by sex) and store it as the rolling mean at 8.5 (grouped by sex).
Starting from standard grouped means:
soep %>%
  group_by(education, sex) %>%
  summarise(across(satisf_org, mean, na.rm = TRUE),
            n = n())
# A tibble: 28 × 4
# Groups:   education [14]
   education sex        satisf_org     n
       <dbl> <dbl+lbl>       <dbl> <int>
 1       7   0 [male]         6.16    73
 2       7   1 [female]       6.59   113
 3       8.5 0 [male]         7.16    37
 4       8.5 1 [female]       8.56    18
 5       9   0 [male]         6.88   430
 6       9   1 [female]       7.00   633
 7      10   0 [male]         7.19   144
 8      10   1 [female]       7.36   221
 9      10.5 0 [male]         6.96  1538
10      10.5 1 [female]       7.02  1493
# … with 18 more rows
# ℹ Use `print(n = ...)` to see more rows
Here are the numbers that I expect:
soep %>%
  filter(sex == 1) %>%                         # only look at females
  filter(education %in% c(7, 8.5, 9)) %>%      # take the education level before and after
  summarise(mean(satisf_org))                  # calculate the "rolling mean" per group
# A tibble: 1 × 1
  `mean(satisf_org)`
               <dbl>
1               6.97
This is the kind of rolling mean per group that I expect per value, i.e. 6.97 instead of 8.56.
PS: In my real data, I investigate age in years and I usually have at least some people at all ages. So the rolling window can be -1 to +1 (numeric) instead of lead / lag neighbours.
You can group_by sex and do a rolling average within each group:
library(dplyr)
library(slider)
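soep %>%
  # step 1: one mean per education level and sex
  group_by(education, sex) %>%
  summarise(across(satisf_org, mean, na.rm = TRUE),
            n = n()) %>%
  # step 2: within each sex, average the previous, current and next group mean
  group_by(sex) %>%
  mutate(rolling_mean = slide_dbl(satisf_org, mean, .before = 1, .after = 1))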
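Output (note that this averages the neighbouring group means with equal weight, so the value for females at 8.5 is 7.38 rather than the pooled 6.97):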
# A tibble: 28 × 5
# Groups:   sex [2]
   education sex        satisf_org     n rolling_mean
       <dbl> <dbl+lbl>       <dbl> <int>        <dbl>
 1       7   0 [male]         6.16    73         6.66
 2       7   1 [female]       6.59   113         7.57
 3       8.5 0 [male]         7.16    37         6.73
 4       8.5 1 [female]       8.56    18         7.38
 5       9   0 [male]         6.88   430         7.08
 6       9   1 [female]       7.00   633         7.64
 7      10   0 [male]         7.19   144         7.01
 8      10   1 [female]       7.36   221         7.13
 9      10.5 0 [male]         6.96  1538         7.14
10      10.5 1 [female]       7.02  1493         7.20
# … with 18 more rows
# ℹ Use `print(n = ...)` to see more rows
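If the goal is the pooled mean of all people in the window (the 6.97 from the question), one possible approach is to weight each group mean by its group size. The following is only a sketch under some assumptions: group_means is a hypothetical toy table copied from the first rows printed above, sex is stored as plain text instead of the labelled column, and n is assumed to count only non-missing satisf_org values (with missing values you would want something like sum(!is.na(satisf_org)) as the weight). It uses slide2_dbl() for the lead / lag neighbour window and slide_index2_dbl() for the numeric -1 to +1 window mentioned in the PS.

```r
library(dplyr)
library(slider)

# Toy group means copied from the first rows of the table above
# (hypothetical stand-in for the summarised soep data)
group_means <- tibble(
  education  = c(7, 7, 8.5, 8.5, 9, 9, 10, 10),
  sex        = rep(c("male", "female"), times = 4),
  satisf_org = c(6.16, 6.59, 7.16, 8.56, 6.88, 7.00, 7.19, 7.36),
  n          = c(73, 113, 37, 18, 430, 633, 144, 221)
)

smoothed <- group_means %>%
  arrange(sex, education) %>%   # windows need ascending education within sex
  group_by(sex) %>%
  mutate(
    # neighbour-based window: previous, current and next education level,
    # each group mean weighted by its group size n
    rolling_weighted = slide2_dbl(satisf_org, n, weighted.mean,
                                  .before = 1, .after = 1),
    # value-based window: all education levels within +/- 1 of the current one
    rolling_weighted_idx = slide_index2_dbl(satisf_org, n, education,
                                            weighted.mean,
                                            .before = 1, .after = 1)
  ) %>%
  ungroup()

smoothed %>% filter(sex == "female", education == 8.5)
```

With the rounded inputs above, the neighbour-based value for females at 8.5 is (6.59 * 113 + 8.56 * 18 + 7.00 * 633) / 764, roughly 6.98, which is in line with the 6.97 computed from the raw data. The value-based window only covers education 8.5 and 9 here, because 7 is more than one unit away from 8.5, so the two columns can differ when levels are unevenly spaced.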