I want to perform a join in RStudio with 2 conditions. My goal is to replicate a SQL query like the following, but in R language:
SELECT
ticket_id,
sum(time_to_close)
FROM helpdesk_ticket ht
LEFT JOIN dim_date dd ON ht.create_date <= dd.formatted_date AND dd.formatted_date <= ht.closed_date
GROUP BY 1
Assuming that helpdesk_ticket and dim_date are two R dataframes, this is what I have tried, but it returns an error in the left_join:
result <- helpdesk_ticket %>%
left_join(dim_date, by = c("create_date" <= "formatted_date", "formatted_date" <= "closed_date")) %>%
group_by(ticket_id) %>%
summarise(time_to_close = sum(time_to_close))
Because the SQL query uses an interval join with <=
and >=
operators rather than equality =
, consider dplyr::join_by
to precompute the inequality join to be used in left_join
.
by <- join_by(
create_date <= formatted_date,
closed_date >= formatted_date,
)
result <- helpdesk_ticket %>%
left_join(dim_date, by) %>%
group_by(ticket_id) %>%
summarise(time_to_close = sum(time_to_close))
Alternatively, if both or either data sets are not too large for cartesian product, consider a cross_join
followed by a filter
:
result <- helpdesk_ticket %>%
cross_join(dim_date) %>%
filter(
create_date <= formatted_date,
closed_date >= formatted_date
)