Search code examples
rloopsdplyrdifftime

Is there a way to take the time difference in R (with inconsistent datestamp)?


I am wondering if anyone has insight on how to approach this problem. I have a dataframe with the columns: Startime, Endtime and ID (dataframe below) I have already grouped the IDs. I am trying to figure out the time difference in seconds of a specific ‘grouped by’ ID (specifically A)

Starttime                   Endtime                    ID

12/18/2019 4:06:59 PM       12/18/2019 4:07:05 PM       A
12/18/2019 4:07:26 PM       12/18/2019 4:07:28 PM       A
12/17/2019 6:48:06 PM       12/17/2019 6:48:07 PM       A
12/17/2019 6:25:16 PM       12/17/2019 6:25:22 PM       A

I went about this by using the following code within R and utilizing the dplyr package:

data <-rawdata %>% group_by(ID) %>% 
       summarize(diff = difftime(last(as.POSIXct(Endtime, format ="%m/%d/%Y %I:%M:%S %p")), 
          first(as.POSIXct(Starttime, format = "%m/%d/%Y %I:%M:%S %p" )), units = "secs"))

However, the code above only calculates the difference between start and end time but not the differences in between. For instance, the code above calculates:

12/18/2019 4:06:59 PM and 12/17/2019 6:25:22 PM

I see this as inaccurate, because it doesn’t account for when 12/17 changes to 12/18, its simply taking the first and last value within the grouped ID.

Is there a way to subgroup data within already grouped data? (further group by time: month day and year and then take its difference in seconds?

Is there a way to do this? Further subgroup the data so that when the day changes, it will know to calculate this difference as well, versus only calculating the first and last values.

 Starttime                Endtime                          ID
 12/18/2019 4:06:59PM     12/18/2019 4:07:05 PM             A
 12/18/2019 4:07:26PM     12/18/2019 4:07:28 PM             A
 12/17/2019 6:48:06PM     12/17/2019 6:48:07PM              A
 12/17/2019 6:25:16PM     12/17/2019 6:25:22PM              A

Any suggestions will help! I feel like I am stuck on this. I will continue to research. Thank you!

******Edited for clarification******

I am trying to calculate my email durations. In this example, I have already grouped by IDs.

I wish to calculate my email read time duration for the ID 'A'. The code I am currently using just calculates its start and end time in seconds.

data <-rawdata %>% group_by(ID) %>% summarize(diff = difftime(last(as.POSIXct(Endtime, format = "%m/%d/%Y %I:%M:%S %p")),first(as.POSIXct(Starttime, format = "%m/%d/%Y %I:%M:%S %p" )), units = "secs"))

However, I think this is not accurate. Overall, I am wanting the time difference by each row for a more accurate reading for me to calculate my total email read time. The output would be:

        Starttime          Endtime                                         

        12/18/2019 4:06:59PM    12/18/2019 4:07:05 PM      A        6 secs
        12/18/2019 4:07:26PM    12/18/2019 4:07:28 PM      A        1 secs
        12/17/2019 6:48:06PM    12/17/2019 6:48:07PM       A        1 sec
        12/17/2019 6:25:16PM    12/17/2019 6:25:22PM       A        6 secs

Solution

  • If you want a difference for each row, then you don't want to summarize for each ID but to mutate without grouping and remove the first and last function calls.

    Therefore, your code should look like this:

    data <- rawdata %>% 
      mutate(diff = difftime(as.POSIXct(Endtime, format = "%m/%d/%Y %I:%M:%S %p"),as.POSIXct(Starttime, format = "%m/%d/%Y %I:%M:%S %p" ), units = "secs"))
    

    But, as you are using dates, you may want to use the lubridate package, which is more flexible than base R:

    library(lubridate)
    data <- rawdata %>% 
      mutate_at(vars(Endtime, Starttime), as.POSIXct, format = "%m/%d/%Y %I:%M:%S %p") %>%
      mutate(diff = interval(Endtime, Starttime)/seconds(1))
    

    Then, if you want the total, you can group by ID and summarize the sum:

    data %>% 
      group_by(ID) %>% 
      summarize(total=sum(diff))
    

    For your next question, try to use the dput function on your data to share a reproducible example, as I stated in my comment. This is much easier for a helper if he/she can just copy-paste your code and run it.