Search code examples
rlubridate

How to determine number of service dates and days between two snapshots in a dataframe?


I'm trying to take data from a nearly-weekly snapshot [some weeks are missing due to holidays, server issues, etc.] and determine the number of service dates between one snapshot and the next. I have a calendar of all dates of service to reference, but I have no idea where to begin with finding the number of dates in the calendar between one data snapshot and the next. I also need to determine the number of days the service actually ran [some only run Monday-Wednesday, for example] so I can calculate the total mileage each route drove during the period of the study.

I have a minimum dataset below. The actual dataset is about half a million rows for this fiscal year.

Toy Data / MWE

### Minimum Working Example
## Required Packages
library(dplyr)
library(lubridate)
library(tidyr)

## Setup the calendar of dates where we need to track mileage
calendar_mwe <- 
  tibble(date = seq.Date(ymd(20240520), ymd(20240801), by = "day")) %>% 
  mutate(day = weekdays(date),
         summer = case_when(date <= "2024-06-16" ~ 0, TRUE ~ 1)) %>% 
  filter(!day %in% c("Saturday", "Sunday"), 
         !date %in% c(
           ymd(20240527), # Memorial Day
           seq.Date(ymd(20240617), ymd(20240621), by = "day"), # Break week
           ymd(20240704) # Independence Day
         ))

## Setup a working example of raw data
# Set rand seed
set.seed(12345)

# Create raw route data
runroute_mwe <- 
  tibble(
    route_id = rep(100:119, 4),
    # Dates the snapshots were taken, including some missing weeks
    dates = rep(c(ymd(20240521), ymd(20240528), ymd(20240618), ymd(20240702)), each = 20),
    # Distance the route traveled
    route_dist = rep(runif(20, min = 0.5, max = 12), 4),
    # Number of days that route was in service [assume 1 = Mon, 2 = Mon-Tue, 3 = Mon-Wed etc.]
    route_days = rep(sample.int(5, size = 20, replace = TRUE, prob = c(.1, .1, .1, .1, .6)), 4)
  ) %>% 
  mutate( # Set a change to mileage based on dates to mimic actual data
    route_dist = case_when(
      dates == min(dates) ~ route_dist + runif(1, min = 1, max = 2),
      dates == max(dates) ~ route_dist - runif(1, min = 1, max = 2),
      TRUE ~ route_dist
    )
  )
> head(calendar_mwe)
# A tibble: 6 × 3
  date       day       summer
  <date>     <chr>      <dbl>
1 2024-05-20 Monday         0
2 2024-05-21 Tuesday        0
3 2024-05-22 Wednesday      0
4 2024-05-23 Thursday       0
5 2024-05-24 Friday         0
6 2024-05-28 Tuesday        0

> head(runroute_mwe)
# A tibble: 6 × 4
  route_id dates      route_dist route_days
     <int> <date>          <dbl>      <int>
1      100 2024-05-21      10.6           5
2      101 2024-05-21      12.4           5
3      102 2024-05-21      11.0           1
4      103 2024-05-21      12.5           4
5      104 2024-05-21       7.53          3
6      105 2024-05-21       4.20          5

Things I've Tried

From here, I can determine the number of days between each snapshot as follows:

## Number of days between snapshots
runroute_daysbtwn <- runroute_mwe %>% 
  nest(.by = dates) %>% 
  mutate(days_between = difftime(lead(dates, 1), dates, units = "day"))
> head(runroute_daysbtwn)
# A tibble: 4 × 3
  dates      data              days_between
  <date>     <list>            <drtn>      
1 2024-05-21 <tibble [20 × 3]>  7 days     
2 2024-05-28 <tibble [20 × 3]> 21 days     
3 2024-06-18 <tibble [20 × 3]> 14 days     
4 2024-07-02 <tibble [20 × 3]> NA days 

This is great-ish: It gives me the number of calendar days between each snapshot, minus the last snapshot [which is fine, since we don't have another snapshot after that one yet]. What it doesn't give me is the number of service days as defined by calendar_mwe.

So, this morning, I thought to try something else:

## Snapshot start and end dates
runroute_datesbtwn <- runroute_mwe %>% 
  nest(.by = dates) %>% 
  mutate(end_date = as.Date(lead(dates) - days(1)),
         end_date = as.Date(ifelse(is.na(end_date), dates, end_date)))

## Number of dates in calendar between the start and end dates
for (i in 1:nrow(runroute_datesbtwn)) {
  runroute_calendar <- calendar_mwe %>% 
    filter(between(date, runroute_datesbtwn$dates[i], runroute_datesbtwn$end_date[i])) %>% 
    nrow()
  
  runroute_datesbtwn$dates_between[i] <- runroute_calendar
}
> head(runroute_datesbtwn)
# A tibble: 4 × 4
  dates      data              end_date   dates_between
  <date>     <list>            <date>             <int>
1 2024-05-21 <tibble [20 × 3]> 2024-05-27             4
2 2024-05-28 <tibble [20 × 3]> 2024-06-17            14
3 2024-06-18 <tibble [20 × 3]> 2024-07-01             6
4 2024-07-02 <tibble [20 × 3]> 2024-07-02             1

This works well and gives me the number of service dates between each snapshot. The unnest()ed version of runroute_datesbtwn now looks like this:

> head(unnest(runroute_datesbtwn, everything()))
# A tibble: 6 × 6
  dates      route_id route_dist route_days end_date   dates_between
  <date>        <int>      <dbl>      <int> <date>             <int>
1 2024-05-21      100      10.6           5 2024-05-27             4
2 2024-05-21      101      12.4           5 2024-05-27             4
3 2024-05-21      102      11.0           1 2024-05-27             4
4 2024-05-21      103      12.5           4 2024-05-27             4
5 2024-05-21      104       7.53          3 2024-05-27             4
6 2024-05-21      105       4.20          5 2024-05-27             4

The Problem

Okay, so now I have the run/route information, a calendar, the start and end dates for each snapshot, and the number of service dates between each snapshot. If every run/route ran Monday-Friday, this would be all I'd need to calculate total mileage driven for each run and route. The problem comes in with one of the initial columns in runroute_mwe: route_days.

Here, we assume that if route_days = 1 then the run only operated on Mondays, if route_days = 2 then the run operated Monday-Tuesday, if route_days = 3 then the run operated Monday-Wednesday, and so forth barring dates in the calendar where no runs occurred.

For the first snapshot, it's easy to say that I just subtract 1 from route_days when I go to calculate a new column of total distance traveled [e.g. total_dist = (route_days - 1) * route_dist] since the dates between the two snapshots are Tuesday-Friday. This won't work for any of the other snapshots, however, so I feel stuck.

I feel like I need to find some way to get the number of each weekday between the snapshots, but I'm not sure how to go about that. I'd imagine a df with additional <int> columns for a count of each weekday [M-F] which would allow me to do some ridiculous case_when() statement to make my final calculations, though that all seems rather cumbersome.

EDIT: Re-wrote question, MWE, toy data, and things I've tried to hopefully be more clear with what my problem is and where I need assistance. This morning I was able to make some headway on the initial problem, but it's still not solved.


Solution

  • SOLVED

    Okay, so I managed to solve this question for myself. I did end up going the route I indicated in my last paragraph above. See code below for my solution. If anyone has a better solution for my problem, please let me know.

    ## Additional package requirement
    library(fuzzyjoin)
    
    ## Snapshot start and end dates
    runroute_datesbtwn_temp <- runroute_mwe %>% 
      nest(.by = dates) %>% 
      mutate(end_date = as.Date(lead(dates) - days(1)),
             end_date = as.Date(ifelse(is.na(end_date), dates, end_date))) %>% 
      select(start_date = dates, end_date)
    
    ## Fuzzy join calendar_mwe to runroute_datesbtwn
    runroute_calendar <- calendar_mwe %>% 
      # This gives me the calendar with identified start_date and end_date for each snapshot in runroute_mwe
      fuzzy_left_join(
        runroute_datesbtwn_temp,
        by = c(
          "date" = "start_date",
          "date" = "end_date"
        ),
        match_fun = list(`>=`, `<=`)
      ) %>% 
      # Filter out the null values
      filter(!is.na(start_date)) %>% 
      # Create a column for each day
      mutate(
        mon = ifelse(day == "Monday",    1, 0),
        tue = ifelse(day == "Tuesday",   1, 0),
        wed = ifelse(day == "Wednesday", 1, 0),
        thu = ifelse(day == "Thursday",  1, 0),
        fri = ifelse(day == "Friday",    1, 0)
      ) %>% 
      group_by(start_date, end_date, summer) %>% 
      # Determine the number of that weekday occurring in each snapshot
      summarise(
        mon = sum(mon),
        tue = sum(tue),
        wed = sum(wed),
        thu = sum(thu),
        fri = sum(fri),
        .groups = "drop"
      )
    
    > head(runroute_calendar)
    # A tibble: 4 × 8
      start_date end_date   summer   mon   tue   wed   thu   fri
      <date>     <date>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    1 2024-05-21 2024-05-27      0     0     1     1     1     1
    2 2024-05-28 2024-06-17      0     2     3     3     3     3
    3 2024-06-18 2024-07-01      1     2     1     1     1     1
    4 2024-07-02 2024-07-02      1     0     1     0     0     0
    

    From here, it was a simple case of a quick join, unnest, and mutate to get the data I actually needed:

    runroute_datesbtwn <- runroute_datesbtwn %>% 
      # Join in the new calendar with the count of weekdays per snapshot
      left_join(
        runroute_calendar,
        by = c(
          "dates" = "start_date",
          "end_date" = "end_date"
        )
      ) %>% 
      unnest(everything()) %>% 
      mutate(
        # Calculate the actual number of days the route ran
        # This was the initial problem I was trying to solve here
        route_days_tot = case_when(
          route_days == 5 ~ select(., mon:fri) %>% rowSums(na.rm = TRUE),
          route_days == 4 ~ select(., mon:thu) %>% rowSums(na.rm = TRUE),
          route_days == 3 ~ select(., mon:wed) %>% rowSums(na.rm = TRUE),
          route_days == 2 ~ select(., mon:tue) %>% rowSums(na.rm = TRUE),
          route_days == 1 ~ select(., mon)     %>% rowSums(na.rm = TRUE)
        ),
        # Calculate the total mileage
        total_mileage = route_days_tot * route_dist
      ) %>% 
      # Select columns for final output
      select(
        snapshot_date = dates,
        route_id,
        route_dist,
        route_days,
        total_mileage
      )
    
    > head(runroute_datesbtwn)
    # A tibble: 6 × 6
      snapshot_startdate snapshot_enddate route_id route_dist route_days total_mileage
      <date>             <date>              <int>      <dbl>      <int>         <dbl>
    1 2024-05-21         2024-05-27            100      10.6           5          42.3
    2 2024-05-21         2024-05-27            101      12.4           5          49.4
    3 2024-05-21         2024-05-27            102      11.0           1           0  
    4 2024-05-21         2024-05-27            103      12.5           4          37.4
    5 2024-05-21         2024-05-27            104       7.53          3          15.1
    6 2024-05-21         2024-05-27            105       4.20          5          16.8