Search code examples
sqlrdplyrtransactions

Aggregate debit and credit amounts with different dates on a daily basis and group by accounts


I have a table with debit amounts, credit amounts, debit date, credit date and account ids. Where ever there is a debit amount entry, the credit amount would be null and vice versa. I need to have the sum of debits and credits for each day.

id Debit_date Debit_amount Credit_date Credit_amount
1 2018-10-21 20000 NA NA
1 NA NA 2018-10-21 50000
2 2019-1-2 10000 NA NA
2 2019-1-3 20000 NA NA
4 NA NA 2019-1-4 30000
1 2019-1-5 1000 NA NA

I need to get the output below:

id Trans_date Total_debit Total_credit
1 2018-10-21 20000 50000
1 2019-1-5 1000 NA
2 2019-1-2 30000 NA
4 2019-1-4 NA 30000

I tried the following code:

df_db = df %>%  group_by(id,debit_date) %>% summarise(total_debit=sum(debit_amount))
df_cr = df %>%  group_by(id,credit_date) %>% summarise(total_credit=sum(credit_amount))

I then proceed to join these two dataframes but it just blows it up since i have millions of transactions. Can anyone please guide me on how to get the data in the output above. Would really appreciate.


Solution

  • You may use coalesce to group by date:

    df %>% 
      group_by(id, Trans_date = coalesce(Debit_date, Credit_date)) %>% 
      summarise(Total_debit = sum(Debit_amount, na.rm = T),
                Total_credit = sum(Credit_amount, na.rm = T))
    
         id Trans_date Total_debit Total_credit
    1     1 2018-10-21       20000        50000
    2     1 2019-1-5          1000            0
    3     2 2019-1-2         30000            0
    4     4 2019-1-4             0        30000
    

    data (I tweaked Date of the fifth row to match expected output)

    structure(list(id = c(1L, 1L, 2L, 2L, 4L, 1L), Debit_date = c("2018-10-21", 
    NA, "2019-1-2", "2019-1-2", NA, "2019-1-5"), Debit_amount = c(20000L, 
    NA, 10000L, 20000L, NA, 1000L), Credit_date = c(NA, "2018-10-21", 
    NA, NA, "2019-1-4", NA), Credit_amount = c(NA, 50000L, NA, NA, 
    30000L, NA)), class = "data.frame", row.names = c(NA, -6L))