Search code examples
rdplyrconditional-statementsleft-join

R dplyr conditional join with "join_by" not working


I have to datasets.

One includes service dates with durations for each person (id). enter image description here

Second includes these persons care episodes with their start and end date.

enter image description here

I need to make a conditional join, meaning that the the second DF should include them sum of all service durations that happened during a particular episode (between dates "start" and "end").

I tried the following:

library(tidyverse)

services =  read.csv("https://www.dropbox.com/scl/fi/04br87qn2wmpmoyaz95ri/services.csv?rlkey=vm7ib1xqne1ceawjnsmzeolf5&dl=1")
periods =  read.csv("https://www.dropbox.com/scl/fi/fow33g4ddroinspyi4xy8/periods.csv?rlkey=v9cr0thuh6pm0rc7qrhommkpe&dl=1")

left_join(periods, services, by = join_by(id, between(service_date, start, end)))

But that gave me the following error

enter image description here


Solution

  • Error

    Per the dplyr documentation, between(x, y_lower, y_upper) in join_by() checks if x falls between y_lower and y_upper. The error occurs because service_date is from the y-dataset (services). As @Friede said, you need to swap the x and y datasets.

    Solution

    We then join the results to the second data frame periods

    To get the sum, we can group_by id and then summarise the durations. Then, we could use na.omit() %>% distinct(id, .keep_all = T) to omit all ids which don't have service durations and only keep unique ids.

    library(tidyverse)
    library(dplyr)
    
    services =  read.csv("https://www.dropbox.com/scl/fi/04br87qn2wmpmoyaz95ri/services.csv?rlkey=vm7ib1xqne1ceawjnsmzeolf5&dl=1")
    periods =  read.csv("https://www.dropbox.com/scl/fi/fow33g4ddroinspyi4xy8/periods.csv?rlkey=v9cr0thuh6pm0rc7qrhommkpe&dl=1")
    
    
    periods_w_total_service_times <- left_join(
      periods,
      left_join(services, periods, by = join_by(id, between(service_date, start, end))) %>%
        group_by(id) %>%
        summarise(total_service_duration = sum(duration)),
      by = "id"
    ) %>%
      na.omit() %>% distinct(id, .keep_all = T)
    

    Result

       X       id      start        end total_service_duration
    1 33 ID100107 2016-06-14 2016-12-08               1.500000
    2 34  ID10014 2011-03-09 2011-08-28               2.000000
    3 46 ID100184 2016-10-08 2016-11-22               6.166667
    4 52 ID100218 2013-02-11 2013-08-06               1.666667
    5 57 ID100233 2011-08-15 2012-01-23               6.000000
    6 69 ID100284 2013-07-12 2014-01-01               1.500000
    

    ...

    .

    .