Search code examples
rdplyrcutcumsumdifftime

Issue with dplyr and cut for time intervals


I am trying to split by data into 5 sec time intervals and group them using dplyr.
Below is my original data - I have date and time in separate columns which I have later combined using Posixct

structure(list(Date = c("10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013"), Time = c("20:06:57", "20:07:13", "20:07:25", "20:07:30", "20:08:16", "20:08:17", "20:08:26", "20:09:05", "20:09:06", "20:09:07", "20:09:37", "20:09:38", "20:09:55", "20:12:34", "20:14:15"), ID = c("M1", "M1", "M1", "M3", "M1", "M1", "M8", "M9", "M9", "M9", "M1", "M1", "M1", "M5", "M1")), .Names = c("Date", "Time", "ID"), class = "data.frame", row.names = c(NA, -15L))

Attaching my code below

data$datetime <- as.POSIXct(paste(data$Date, data$Time), format="%m/%d/%Y %H:%M:%S") 
data_order <- data %>%  arrange(datetime,ID)      
data_order$group <-  data_order  %>% group_by(by5sec=cut(datetime, breaks= "5 secs",right =T),ID) %>% group_indices() 

While some observations are grouped right , some are wrong. I have tried 2 versions - removing the "right=T" and keeping it and I got different groups , but had errors in both versions. I have also tried using as.numeric,as.posixct et all before the cut in vain

Attaching the output for both versions.Red ones are wrongly coded as 2 different groups.

****Version 1 "right = T" for cut****

enter image description here

****Version 2 "right = F" for cut****

enter image description here

Could someone please help in solving this,I have spent quite some time and its been a goose chase given my knowledge of R . All I want is 5 sec breaks for the same ID(group should change for a new ID).

Desired output

enter image description here


Solution

  • I'm not entirely clear on the output images you show. Based on your problem description, how about something like this?

    library(tidyverse);
    df %>%
        unite(datetime, 1:2, sep = " ", remove = FALSE) %>%
        mutate(
            datetime = as.POSIXct(datetime, format = "%m/%d/%Y %H:%M:%S"),
            datetime.by5sec = as.numeric(cut(datetime, "sec")) %/% 5 + 1);
    #        datetime       Date     Time ID datetime.by5sec
    #1  2013-10-30 20:06:57 10/30/2013 20:06:57 M1               1
    #2  2013-10-30 20:07:13 10/30/2013 20:07:13 M1               4
    #3  2013-10-30 20:07:25 10/30/2013 20:07:25 M1               6
    #4  2013-10-30 20:07:30 10/30/2013 20:07:30 M3               7
    #5  2013-10-30 20:08:16 10/30/2013 20:08:16 M1              17
    #6  2013-10-30 20:08:17 10/30/2013 20:08:17 M1              17
    #7  2013-10-30 20:08:26 10/30/2013 20:08:26 M8              19
    #8  2013-10-30 20:09:05 10/30/2013 20:09:05 M9              26
    #9  2013-10-30 20:09:06 10/30/2013 20:09:06 M9              27
    #10 2013-10-30 20:09:07 10/30/2013 20:09:07 M9              27
    #11 2013-10-30 20:09:37 10/30/2013 20:09:37 M1              33
    #12 2013-10-30 20:09:38 10/30/2013 20:09:38 M1              33
    #13 2013-10-30 20:09:55 10/30/2013 20:09:55 M1              36
    #14 2013-10-30 20:12:34 10/30/2013 20:12:34 M5              68
    #15 2013-10-30 20:14:15 10/30/2013 20:14:15 M1              88
    

    Explanation: datetime.by5sec gives the 5 sec bin index into which datetime falls. So the first entry sits in bin 1. The second entry is within the 4th 5 sec bin, i.e. within 20 seconds from the first entry, and so on. Here I made use of integer division %/% 5, since cut.POSIXct only allows you to bin by second as interval.


    Update

    The following reproduces your expected output:

    df %>%
        unite(datetime, 1:2, sep = " ", remove = FALSE) %>%
        group_by(ID) %>%
        mutate(
            datetime = as.POSIXct(datetime, format = "%m/%d/%Y %H:%M:%S"),
            difftime = difftime(datetime, lag(datetime, default = 0))) %>%
        ungroup() %>%
        mutate(
            group = cumsum(abs(difftime) >= 5)) %>%
        select(Date, Time, ID, datetime, group);
    ## A tibble: 15 x 5
    #   Date       Time     ID    datetime            group
    #   <chr>      <chr>    <chr> <dttm>              <int>
    # 1 10/30/2013 20:06:57 M1    2013-10-30 20:06:57     1
    # 2 10/30/2013 20:07:13 M1    2013-10-30 20:07:13     2
    # 3 10/30/2013 20:07:25 M1    2013-10-30 20:07:25     3
    # 4 10/30/2013 20:07:30 M3    2013-10-30 20:07:30     4
    # 5 10/30/2013 20:08:16 M1    2013-10-30 20:08:16     5
    # 6 10/30/2013 20:08:17 M1    2013-10-30 20:08:17     5
    # 7 10/30/2013 20:08:26 M8    2013-10-30 20:08:26     6
    # 8 10/30/2013 20:09:05 M9    2013-10-30 20:09:05     7
    # 9 10/30/2013 20:09:06 M9    2013-10-30 20:09:06     7
    #10 10/30/2013 20:09:07 M9    2013-10-30 20:09:07     7
    #11 10/30/2013 20:09:37 M1    2013-10-30 20:09:37     8
    #12 10/30/2013 20:09:38 M1    2013-10-30 20:09:38     8
    #13 10/30/2013 20:09:55 M1    2013-10-30 20:09:55     9
    #14 10/30/2013 20:12:34 M5    2013-10-30 20:12:34    10
    #15 10/30/2013 20:14:15 M1    2013-10-30 20:14:15    11
    

    Explanation: Calculate the time difference between two successive datetime entries, grouped by ID; group is then the cumulative sum of all time differences >=5.