Search code examples
rdplyr

How to calculate the difference between rows in a dataframe based on two conditions in other columns?


I am kind of a beginner with R and I try to calculate the difference between rows in a dataframe based on two conditions in other columns.

My current df looks kinda like this (with alot more data and variations):

date     id     ...  sum     grouped_sum
<int>   <chr>       <dbl>     <dbl>
202310    1          500       500
202311    1          300       600
202311    1          300       600
202311    2          200       200
202310    3           25        50
202310    3           25        50
202311    4           75       100
202310    4           75        75
202311    4           25       100

I have tried the following code with dplyr adn case_when, but received only NA instead of values in the new column. I also tried to find similar cases (also on this forum) and found some but couln't put the pieces I found together and seem to be missing some others, maybe you have got solution for my example code and what I want to achieve or something similar (preferably with dplyr).

     df %>% group_by(id, date) %>% 
        mutate(grouped_sum = sum(sum)) %>% 
        mutate(
        diff = case_when(
         id == lag(id) & date != lag(date) ~ grouped_sum - lag(grouped_sum),
         id != lag(id) & date == 202311 ~ grouped_sum,
         id != lag(id) & date == 202310 ~ -grouped_sum
        )
      )

I have searched for and tried out many things with mostly dplyr, e.g. with pivot_wider with names date and value from grouped_sum, but R creates a column in format list, and reduces the no. of observations each time I have tried this, also I dont know how to continue with that when building differences...

What I like to have, is that the code builds the difference based on having the same id, and different dates or only one date. To summarize, in diff I would like to show the difference between the grouped_sum between 202311 and 202310, if there is only a grouped_sum for 202311 I would like to have this as my diff, if there is only a grouped_sum for 202310 (assuming there is no more value in 202311), I would like to have (-1)*grouped_sum. Given my numbers this could be, in theory, calculated as follows: => so it calculates when id is same and dates different: 600 - 500 = 100, OR if we got only a value for 202311 and one id it calculates: 600 - 0 = 600, OR if we got only a value for 202310 and one id, it calculates: 0 - 50 = -50.

I would like to have something like this for an output (edited output, hope its more clear):


date     id     ...  sum     grouped_sum   diff
<int>   <chr>       <dbl>     <dbl>        <dbl>
202310    1          500       500         NA
202311    1          300       600         100
202311    1          300       600         100
202311    2          200       200         200
202310    3           25        50         -50
202310    3           25        50         -50
202311    4           75       100          25
202310    4           75        75          NA
202311    4           25       100          25

Solution

  • The logic is still not entirely clear, but this may help in moving forward.

    I would consider grouping by id, and then using case_when to evaluate the sums for a given id together.

    You can check within a group (or id) if there is any date for 202310 or 202311 and use the appropriate grouped_sum in those cases.

    It's unclear how to handle multiple 202310 (or if that's possible) - if it is possible, the code below assumes you want the first one to be used in the difference.

    library(tidyverse)
    
    df |>
      mutate(
        diff = case_when(
          !any(date == 202310) & any(date == 202311) ~ grouped_sum,
          !any(date == 202311) & any(date == 202310) ~ -grouped_sum,
          date == 202311 ~ grouped_sum - grouped_sum[date == 202310][1],
          .default = NA
        ), .by = id
      )
    

    Output

        date id sum grouped_sum diff
    1 202310  1 500         500   NA
    2 202311  1 300         600  100
    3 202311  1 300         600  100
    4 202311  2 200         200  200
    5 202310  3  25          50  -50
    6 202310  3  25          50  -50
    7 202311  4  75         100   25
    8 202310  4  75          75   NA
    9 202311  4  25         100   25