Search code examples
rdata.tablefuzzy

Matching dates but providing a few days leeway using data.table in R


Sometimes when joining or matching with dates it does not matter if two dates are only a day or two apart. The most recent example occurred to me while using Quicken financial management software. Something got messed up in my Quicken reconciliation. I now have a table of transactions from Quicken and another table of transactions downloaded from my bank. I made one data.table from the two separate tables using rbindlist. I want to find which transactions have no counterpart in the other table. That would be easy if the dates matched as exactly as the number of dollars and cents. But sometimes the dates are off by 2 OR 3 OR 4 OR 5 days but not by 2 months. Using data.table or base R (or if I really must, tidyverse), how do I allow a match if the dates are only a few days off?

example

rep.exmpl <- data.table(date = as.IDate(c("2022-10-23", "2022-10-23", "2022-10-20",  "2022-10-20", "2022-10-15", "2022-10-12", "2022-10-05", "2022-10-01", "2022-09-30", "2022-08-05")), amount = c(10.00, 10.00, 1.53, 1.53, 78.00, 78.00, 89.56, 65.01, 65.01, 65.01), source = c("bank", "quicken", "bank", "quicken","bank", "quicken", "bank", "bank", "quicken", "quicken"))
rep.exmpl  

rep.exmpl stands for "reproducible example".

          date amount  source
 1: 2022-10-23  10.00    bank
 2: 2022-10-23  10.00 quicken
 3: 2022-10-20   1.53    bank
 4: 2022-10-20   1.53 quicken
 5: 2022-10-15  78.00    bank
 6: 2022-10-12  78.00 quicken
 7: 2022-10-05  89.56    bank
 8: 2022-10-01  65.01    bank
 9: 2022-09-30  65.01 quicken
10: 2022-08-05  65.01 quicken

I used fsetdiff to see what was in one source but not the other.

fsetdiff(x = rep.exmpl[source=="bank", .(date, amount)], y = rep.exmpl[source=="quicken", .(date, amount)])# in bank only
fsetdiff(x = rep.exmpl[source=="quicken", .(date, amount)], y = rep.exmpl[source=="bank", .(date, amount)])# in quicken only

Row 5 and 6 should be seen as the same since they are only 3 days apart and hence should not be shown to me as a transaction that appeared in only one source. Row 7 should be shown to me since it only appears in bank. Row 10 should be shown to me since it is clearly a different instance to row 8 and 9 without anything matching it in bank. But that is not what happens. Some of it works but much does not.

In bank only

         date amount
1: 2022-10-15  78.00
2: 2022-10-05  89.56
3: 2022-10-01  65.01  

Row 1 and 3 should not be shown.

In Quicken only

         date amount
1: 2022-10-12  78.00
2: 2022-09-30  65.01
3: 2022-08-05  65.01

Row 1 and 2 should not be shown. Only row 3 should be shown.

How would you solve this problem elegantly? I thought of trying to create a couple of columns +/- 7 days in the Quicken table and then using rolling joins (roll = TRUE) or perhaps fast overlap joins (foverlaps). One of my problems is that the fuzzy date match could be backwards or forwards.


Solution

  • You could use roll="nearest":

    bank <- rep.exmpl[source=="bank", .(date, amount)]
    quicken <- rep.exmpl[source=="quicken", .(date, amount)]
    
    maxdays <- 7
    
    bank[quicken,.(bank.date=x.date,quicken.date=date,bank.amount=x.amount,quicken.amount=amount)
         ,on=.(amount,date)
         ,roll='nearest'][abs(difftime(bank.date,quicken.date,unit="days"))>maxdays|is.na(bank.date)]
    
    #    bank.date quicken.date bank.amount quicken.amount
    #       <IDat>       <IDat>       <num>          <num>
    #1: 2022-10-01   2022-08-05       65.01          65.01
    
    quicken[bank,.(bank.date=date,quicken.date=x.date,bank.amount = amount,quicken.amount = x.amount)
            ,on=.(amount,date)
            ,roll='nearest'][abs(difftime(bank.date,quicken.date,unit="days"))>maxdays]
    
    #    bank.date quicken.date bank.amount quicken.amount
    #       <IDat>       <IDat>       <num>          <num>
    #1: 2022-10-05         <NA>       89.56             NA