Calculating Time Differences Between Rows Where the Category Passes Over Midnight

I have a dataset of pain scores and am trying to determine how long each person spends in a particular pain range. I initially created a variable holding the difference between the timestamps of consecutive pain assessments. The idea is to treat each pain score as constant until the next assessment. This works well until the next assessment falls after midnight. For example, if pain is scored at 20:00 and then at 02:00 the next day, the 20:00 score is assigned 6 hours, all counted against the first day, when only 4 of those hours (20:00 to midnight) belong to it. As a result, some days add up to more than 24 hours of pain scores.

I would like each day to total at most 24 hours, with the previous day's pain score carried over into the new day. Does anybody have any ideas?

I have included a dummy dataset that is similar to what I am using.

library(tidyverse)

df <- tibble(
  patient_id = c(1, 1, 1, 1, 1, 1),
  time = lubridate::as_datetime(c("2022-01-04 02:37:00", "2022-01-04 07:00:00", "2022-01-04 15:00:00", "2022-01-04 20:00:00", "2022-01-05 02:00:00", "2022-01-05 08:00:00")),
  day = c(1, 1, 1, 1, 2, 2),
  pain_score = c("None", "Mild", "Mild", "Moderate", "None", "Mild")
)

When I use the following code, it creates a new variable in which the 20:00 score on day 1 has a time difference of 6 hours (not 4):

df <- df %>% 
  arrange(patient_id, time) %>% 
  group_by(patient_id) %>% 
  mutate(time_diff_hours = as.numeric(lead(time) - time, units = 'hours'))

This gives the following output:

I would like the output to be like this:

Any help would be appreciated.

Ben

EDIT: I have expanded the original code to include an extra row to show what I am after, and added a table of the desired output. As you can see, where the interval crosses midnight it is split into two rows, each carrying the corresponding pain score.

Solution

You can use the ceiling_date and floor_date functions from lubridate to build a sequence containing every midnight (00:00) between the first and last observation.
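
As a minimal sketch of that step on its own, using a few of the timestamps from the question (the names first_midnight, last_midnight and midnights are only illustrative):

```r
library(lubridate)

# A subset of the timestamps from the question (as_datetime defaults to UTC)
time <- as_datetime(c("2022-01-04 02:37:00", "2022-01-04 20:00:00",
                      "2022-01-05 02:00:00", "2022-01-05 08:00:00"))

# First midnight after the earliest observation,
# last midnight before the latest observation
first_midnight <- min(ceiling_date(time, "day"))
last_midnight  <- max(floor_date(time, "day"))

# One element per midnight that the observations span
midnights <- seq(first_midnight, last_midnight, by = "day")
midnights
```

Here the observations cross a single midnight, so the sequence has one element. If all observations fell on the same calendar day, first_midnight would be later than last_midnight and the seq() call would need a guard, since there would be no midnight to insert.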

Then, by using fill from tidyr you can carry forward the previous values to the date at 00:00. An alternative would be to use na.locf() from the zoo package.
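
To illustrate the two fill directions, here is a small sketch on a toy tibble (x is made up for this example; its middle row stands in for an inserted midnight row):

```r
library(dplyr)
library(tidyr)

# Toy data: the middle row plays the role of an inserted 00:00 row,
# so it has no day and no pain score yet
x <- tibble(
  day        = c(1, NA, 2),
  pain_score = c("Moderate", NA, "None")
)

x_filled <- x %>%
  fill(pain_score) %>%            # default direction "down": copy the previous score
  fill(day, .direction = "up")    # take the day number from the following row

x_filled
```

The midnight row ends up with the previous score ("Moderate") but the day number of the new day (2), which is the carry-over behaviour asked for in the question.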

Finally, you can calculate the time difference between each time and the next one, as you wanted. The last row remains NA because lead() has no later time to return.

I think this satisfies the requirements you had and seems to work.

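df %>%
  # add a row at every midnight between the first and last observation
  complete(time = seq.POSIXt(min(ceiling_date(time, 'day')),
                             max(floor_date(time, 'day')), by = 'day')) %>%
  arrange(time) %>%
  # carry the previous score forward; take the day number from the next row
  fill(c(patient_id, pain_score)) %>%
  fill(day, .direction = "up") %>%
  mutate(time_diff_hours = as.numeric(lead(time) - time, units = 'hours'))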
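
The dummy data has a single patient. If the real data has several, one possible extension is to group by patient_id first so that midnights are inserted and values filled within each patient. The sketch below assumes that layout; df2 and its second patient are invented purely for illustration.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Patient 1 is the data from the question; patient 2 is hypothetical
df2 <- tibble(
  patient_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  time = as_datetime(c(
    "2022-01-04 02:37:00", "2022-01-04 07:00:00", "2022-01-04 15:00:00",
    "2022-01-04 20:00:00", "2022-01-05 02:00:00", "2022-01-05 08:00:00",
    "2022-01-04 22:00:00", "2022-01-05 03:00:00", "2022-01-05 09:00:00"
  )),
  day = c(1, 1, 1, 1, 2, 2, 1, 2, 2),
  pain_score = c("None", "Mild", "Mild", "Moderate", "None", "Mild",
                 "Severe", "Mild", "None")
)

result <- df2 %>%
  group_by(patient_id) %>%
  # midnights are computed per patient; patient_id is kept as the grouping key
  complete(time = seq(min(ceiling_date(time, "day")),
                      max(floor_date(time, "day")), by = "day")) %>%
  arrange(time, .by_group = TRUE) %>%
  fill(pain_score) %>%                 # previous score carried into the new day
  fill(day, .direction = "up") %>%     # midnight row belongs to the new day
  mutate(time_diff_hours = as.numeric(lead(time) - time, units = "hours")) %>%
  ungroup()

result
```

For patient 1, the 20:00 "Moderate" score now gets 4 hours on day 1, and the inserted midnight row carries "Moderate" for a further 2 hours on day 2. Because patient_id is the grouping variable, it does not need to be filled.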