Sometimes when joining or matching with dates it does not matter if two dates are only a day or two apart. The most recent example occurred to me while using Quicken financial management software. Something got messed up in my Quicken reconciliation. I now have a table of transactions from Quicken and another table of transactions downloaded from my bank. I made one data.table from the two separate tables using rbindlist
. I want to find which transactions have no counterpart in the other table. That would be easy if the dates matched as exactly as the number of dollars and cents. But sometimes the dates are off by 2 OR 3 OR 4 OR 5 days but not by 2 months. Using data.table or base R (or if I really must, tidyverse), how do I allow a match if the dates are only a few days off?
rep.exmpl <- data.table(date = as.IDate(c("2022-10-23", "2022-10-23", "2022-10-20", "2022-10-20", "2022-10-15", "2022-10-12", "2022-10-05", "2022-10-01", "2022-09-30", "2022-08-05")), amount = c(10.00, 10.00, 1.53, 1.53, 78.00, 78.00, 89.56, 65.01, 65.01, 65.01), source = c("bank", "quicken", "bank", "quicken","bank", "quicken", "bank", "bank", "quicken", "quicken"))
rep.exmpl
rep.exmpl stands for "reproducible example".
date amount source
1: 2022-10-23 10.00 bank
2: 2022-10-23 10.00 quicken
3: 2022-10-20 1.53 bank
4: 2022-10-20 1.53 quicken
5: 2022-10-15 78.00 bank
6: 2022-10-12 78.00 quicken
7: 2022-10-05 89.56 bank
8: 2022-10-01 65.01 bank
9: 2022-09-30 65.01 quicken
10: 2022-08-05 65.01 quicken
I used fsetdiff
to see what was in one source but not the other.
fsetdiff(x = rep.exmpl[source=="bank", .(date, amount)], y = rep.exmpl[source=="quicken", .(date, amount)])# in bank only
fsetdiff(x = rep.exmpl[source=="quicken", .(date, amount)], y = rep.exmpl[source=="bank", .(date, amount)])# in quicken only
Row 5 and 6 should be seen as the same since they are only 3 days apart and hence should not be shown to me as a transaction that appeared in only one source. Row 7 should be shown to me since it only appears in bank. Row 10 should be shown to me since it is clearly a different instance to row 8 and 9 without anything matching it in bank. But that is not what happens. Some of it works but much does not.
date amount
1: 2022-10-15 78.00
2: 2022-10-05 89.56
3: 2022-10-01 65.01
Row 1 and 3 should not be shown.
date amount
1: 2022-10-12 78.00
2: 2022-09-30 65.01
3: 2022-08-05 65.01
Row 1 and 2 should not be shown. Only row 3 should be shown.
How would you solve this problem elegantly? I thought of trying to create a couple of columns +/- 7 days in the Quicken table and then using rolling joins (roll = TRUE
) or perhaps fast overlap joins (foverlaps
). One of my problems is that the fuzzy date match could be backwards or forwards.
You could use roll="nearest"
:
bank <- rep.exmpl[source=="bank", .(date, amount)]
quicken <- rep.exmpl[source=="quicken", .(date, amount)]
maxdays <- 7
bank[quicken,.(bank.date=x.date,quicken.date=date,bank.amount=x.amount,quicken.amount=amount)
,on=.(amount,date)
,roll='nearest'][abs(difftime(bank.date,quicken.date,unit="days"))>maxdays|is.na(bank.date)]
# bank.date quicken.date bank.amount quicken.amount
# <IDat> <IDat> <num> <num>
#1: 2022-10-01 2022-08-05 65.01 65.01
quicken[bank,.(bank.date=date,quicken.date=x.date,bank.amount = amount,quicken.amount = x.amount)
,on=.(amount,date)
,roll='nearest'][abs(difftime(bank.date,quicken.date,unit="days"))>maxdays]
# bank.date quicken.date bank.amount quicken.amount
# <IDat> <IDat> <num> <num>
#1: 2022-10-05 <NA> 89.56 NA