I have paneldata, i.e. repeated observations per household. A unit (household) is measured through time and exhibits a characteric (e.g. variable
). I can calculate a group sum per year with group_by(id, year)
. How can I have a cummulative sum over time as in the goal
column? I need the result to preserve 10 rows in this example, i.e. not collapse data to the years. How can I pick just one group-sum per year per unit to sum up?
set.seed(1234)
data <- data.frame(id = rep(100, 10),
year = c(rep(2022, 5), rep(2023, 5)),
variable = rbinom(10, 1, 0.5))
library(tidyverse)
data <- data %>%
group_by(id, year) %>%
mutate(group_sum_per_year = sum(variable))
data$goal <- c(4,4,4,4,4,7,7,7,7,7)
data
# A tibble: 10 × 5
# Groups: id, year [2]
id year variable group_sum_per_year goal
<dbl> <dbl> <int> <int> <dbl>
1 100 2022 0 4 4
2 100 2022 1 4 4
3 100 2022 1 4 4
4 100 2022 1 4 4
5 100 2022 1 4 4
6 100 2023 1 3 7
7 100 2023 0 3 7
8 100 2023 0 3 7
9 100 2023 1 3 7
10 100 2023 1 3 7
You could first create a tempory column hlp
which is equal to group_sum_per_year
for only the first entry per group.
Then you could group by id
and use cumsum on hlp
:
data %>%
group_by(id, year) %>%
mutate(group_sum_per_year = sum(variable)) %>%
mutate(hlp = if_else(1:n() == 1, group_sum_per_year, 0)) %>%
group_by(id) %>%
mutate(goal = cumsum(hlp))
# A tibble: 10 × 6
# Groups: id [1]
id year variable group_sum_per_year hlp goal
<dbl> <dbl> <int> <int> <dbl> <dbl>
1 100 2022 0 4 4 4
2 100 2022 1 4 0 4
3 100 2022 1 4 0 4
4 100 2022 1 4 0 4
5 100 2022 1 4 0 4
6 100 2023 1 3 3 7
7 100 2023 0 3 0 7
8 100 2023 0 3 0 7
9 100 2023 1 3 0 7
10 100 2023 1 3 0 7