I have a table with debit amounts, credit amounts, debit date, credit date and account ids. Where ever there is a debit amount entry, the credit amount would be null and vice versa. I need to have the sum of debits and credits for each day.
id | Debit_date | Debit_amount | Credit_date | Credit_amount |
---|---|---|---|---|
1 | 2018-10-21 | 20000 | NA | NA |
1 | NA | NA | 2018-10-21 | 50000 |
2 | 2019-1-2 | 10000 | NA | NA |
2 | 2019-1-3 | 20000 | NA | NA |
4 | NA | NA | 2019-1-4 | 30000 |
1 | 2019-1-5 | 1000 | NA | NA |
I need to get the output below:
id | Trans_date | Total_debit | Total_credit |
---|---|---|---|
1 | 2018-10-21 | 20000 | 50000 |
1 | 2019-1-5 | 1000 | NA |
2 | 2019-1-2 | 30000 | NA |
4 | 2019-1-4 | NA | 30000 |
I tried the following code:
df_db = df %>% group_by(id,debit_date) %>% summarise(total_debit=sum(debit_amount))
df_cr = df %>% group_by(id,credit_date) %>% summarise(total_credit=sum(credit_amount))
I then proceed to join these two dataframes but it just blows it up since i have millions of transactions. Can anyone please guide me on how to get the data in the output above. Would really appreciate.
You may use coalesce
to group by date:
df %>%
group_by(id, Trans_date = coalesce(Debit_date, Credit_date)) %>%
summarise(Total_debit = sum(Debit_amount, na.rm = T),
Total_credit = sum(Credit_amount, na.rm = T))
id Trans_date Total_debit Total_credit
1 1 2018-10-21 20000 50000
2 1 2019-1-5 1000 0
3 2 2019-1-2 30000 0
4 4 2019-1-4 0 30000
data (I tweaked Date
of the fifth row to match expected output)
structure(list(id = c(1L, 1L, 2L, 2L, 4L, 1L), Debit_date = c("2018-10-21",
NA, "2019-1-2", "2019-1-2", NA, "2019-1-5"), Debit_amount = c(20000L,
NA, 10000L, 20000L, NA, 1000L), Credit_date = c(NA, "2018-10-21",
NA, NA, "2019-1-4", NA), Credit_amount = c(NA, 50000L, NA, NA,
30000L, NA)), class = "data.frame", row.names = c(NA, -6L))