Search code examples
rdatetimetimemergefuzzyjoin

R fuzzy_left_join with time


I am trying to merge two tables with the conditions "ccode" = "Ticker", "Date"="Date" and "Time"= "Timestamp". However, if there is not an exact match of "Time" it should look at "Timeint" (up to -2 minutes). As this is something I can't do with left_join - I was thinking about fuzzy_left_join.

I am encountering two problems:

  1. My data is stored on bigquery and I can do a left_join which works. However, when I try a fuzzy_left_join, I get the following error message:

"Error: All columns in a tibble must be 1d or 2d objects: * Column col is NULL

  1. I tried to make a reproducable sample (which is a bit different as it is not stored as a list of 2) and it also does not work.

Here is a reproducable sample:

library(fuzzyjoin)
library(anytime)

calls.sample <- data.frame(ccode = c("MMM", "K", "A", "CAG", "PM"),
                           Date = c(20111020, 20111021, 20120102, 20110510, 20080710),
                           Time = c("09:30:00", "14:30:00", "11:00:00", "15:30:00", "13:00:00"),
                           Timeint = c("9:28:00", "14:28:00", "10:58:00", "15:28:00", "12:58:00")
                           )


str(calls.sample$Time)
calls.sample$Time <- as_hms(as.character(calls.sample$Time))
calls.sample$Timeint <- as_hms(as.character(calls.sample$Timeint))


stocks.sample <- data.frame(Ticker = c("MMM", "K", "A", "CAG", "PM"),
                            Date = c(20111020, 20111021, 20120102, 20110510, 20080710),
                            Timestamp = c("9:28:00", "14:30:00", "11:00:00", "15:30:00", "13:00:00"),
                            OpenPrice = c(5, 1,6,7,8))

stocks.sample$Timestamp <- as_hms(as.character(stocks.sample$Timestamp))

fuzzy_left_join(
  calls.sample, stocks.sample,
  by = c(
    "ccode" = "Ticker", 
    "Date" = "Date", 
    "Time" = "Timestamp",
    "Timeint" = "Timestamp"
  ),
  match_fun = list(`==`, `==`, `<=`, `>=`)
)

I am looking for:

ccode   Date     Time    OpenPrice
1   MMM 20111020 09:30:00 5
2     K 20111021 14:30:00 1
3     A 20120102 11:00:00 6
4   CAG 20110510 15:30:00 7
5    PM 20080710 13:00:00 8

I am grateful for any help on this one :)!


Solution

  • I ran your code and it ran without error. While the result was NAs, I fixed one thing with the last two list items

    match_fun = list(`==`, `==`, `>=`, `<=`)

    and got your desired result!

    library(fuzzyjoin, quietly = TRUE); library(anytime, quietly = TRUE); library(hms, quietly = TRUE)
    #> Warning: package 'fuzzyjoin' was built under R version 3.6.3
    #> Warning: package 'anytime' was built under R version 3.6.3
    calls.sample <- data.frame(ccode = c("MMM", "K", "A", "CAG", "PM"),
                               Date = c(20111020, 20111021, 20120102, 20110510, 20080710),
                               Time = c("09:30:00", "14:30:00", "11:00:00", "15:30:00", "13:00:00"),
                               Timeint = c("9:28:00", "14:28:00", "10:58:00", "15:28:00", "12:58:00"))
    calls.sample$Time <- as_hms(as.character(calls.sample$Time))
    calls.sample$Timeint <- as_hms(as.character(calls.sample$Timeint))
    stocks.sample <- data.frame(Ticker = c("MMM", "K", "A", "CAG", "PM"),
                                Date = c(20111020, 20111021, 20120102, 20110510, 20080710),
                                Timestamp = c("9:28:00", "14:30:00", "11:00:00", "15:30:00", "13:00:00"),
                                OpenPrice = c(5, 1,6,7,8))
    stocks.sample$Timestamp <- as_hms(as.character(stocks.sample$Timestamp))
    
    fuzzy_left_join(calls.sample, stocks.sample,
                    by = c("ccode" = "Ticker", 
                           "Date" = "Date", 
                           "Time" = "Timestamp",
                           "Timeint" = "Timestamp"),
                    match_fun = list(`==`, `==`, `>=`, `<=`))
    #>   ccode   Date.x     Time  Timeint Ticker   Date.y Timestamp OpenPrice
    #> 1   MMM 20111020 09:30:00 09:28:00    MMM 20111020  09:28:00         5
    #> 2     K 20111021 14:30:00 14:28:00      K 20111021  14:30:00         1
    #> 3     A 20120102 11:00:00 10:58:00      A 20120102  11:00:00         6
    #> 4   CAG 20110510 15:30:00 15:28:00    CAG 20110510  15:30:00         7
    #> 5    PM 20080710 13:00:00 12:58:00     PM 20080710  13:00:00         8
    

    Created on 2020-10-20 by the reprex package (v0.3.0)