I am kind of a beginner with R and I try to calculate the difference between rows in a dataframe based on two conditions in other columns.
My current df looks kinda like this (with alot more data and variations):
date id ... sum grouped_sum
<int> <chr> <dbl> <dbl>
202310 1 500 500
202311 1 300 600
202311 1 300 600
202311 2 200 200
202310 3 25 50
202310 3 25 50
202311 4 75 100
202310 4 75 75
202311 4 25 100
I have tried the following code with dplyr adn case_when, but received only NA instead of values in the new column. I also tried to find similar cases (also on this forum) and found some but couln't put the pieces I found together and seem to be missing some others, maybe you have got solution for my example code and what I want to achieve or something similar (preferably with dplyr).
df %>% group_by(id, date) %>%
mutate(grouped_sum = sum(sum)) %>%
mutate(
diff = case_when(
id == lag(id) & date != lag(date) ~ grouped_sum - lag(grouped_sum),
id != lag(id) & date == 202311 ~ grouped_sum,
id != lag(id) & date == 202310 ~ -grouped_sum
)
)
I have searched for and tried out many things with mostly dplyr, e.g. with pivot_wider with names date and value from grouped_sum, but R creates a column in format list, and reduces the no. of observations each time I have tried this, also I dont know how to continue with that when building differences...
What I like to have, is that the code builds the difference based on having the same id, and different dates or only one date. To summarize, in diff I would like to show the difference between the grouped_sum between 202311 and 202310, if there is only a grouped_sum for 202311 I would like to have this as my diff, if there is only a grouped_sum for 202310 (assuming there is no more value in 202311), I would like to have (-1)*grouped_sum. Given my numbers this could be, in theory, calculated as follows: => so it calculates when id is same and dates different: 600 - 500 = 100, OR if we got only a value for 202311 and one id it calculates: 600 - 0 = 600, OR if we got only a value for 202310 and one id, it calculates: 0 - 50 = -50.
I would like to have something like this for an output (edited output, hope its more clear):
date id ... sum grouped_sum diff
<int> <chr> <dbl> <dbl> <dbl>
202310 1 500 500 NA
202311 1 300 600 100
202311 1 300 600 100
202311 2 200 200 200
202310 3 25 50 -50
202310 3 25 50 -50
202311 4 75 100 25
202310 4 75 75 NA
202311 4 25 100 25
The logic is still not entirely clear, but this may help in moving forward.
I would consider grouping by id
, and then using case_when
to evaluate the sums for a given id
together.
You can check within a group (or id
) if there is any
date for 202310 or 202311 and use the appropriate grouped_sum
in those cases.
It's unclear how to handle multiple 202310 (or if that's possible) - if it is possible, the code below assumes you want the first one to be used in the difference.
library(tidyverse)
df |>
mutate(
diff = case_when(
!any(date == 202310) & any(date == 202311) ~ grouped_sum,
!any(date == 202311) & any(date == 202310) ~ -grouped_sum,
date == 202311 ~ grouped_sum - grouped_sum[date == 202310][1],
.default = NA
), .by = id
)
Output
date id sum grouped_sum diff
1 202310 1 500 500 NA
2 202311 1 300 600 100
3 202311 1 300 600 100
4 202311 2 200 200 200
5 202310 3 25 50 -50
6 202310 3 25 50 -50
7 202311 4 75 100 25
8 202310 4 75 75 NA
9 202311 4 25 100 25