I am trying to split my data into 5-second time intervals and group them using dplyr.
Below is my original data. The date and time are in separate columns, which I later combine using as.POSIXct:
structure(list(Date = c("10/30/2013", "10/30/2013", "10/30/2013",
"10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013",
"10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013", "10/30/2013",
"10/30/2013", "10/30/2013"), Time = c("20:06:57", "20:07:13",
"20:07:25", "20:07:30", "20:08:16", "20:08:17", "20:08:26", "20:09:05",
"20:09:06", "20:09:07", "20:09:37", "20:09:38", "20:09:55", "20:12:34",
"20:14:15"), ID = c("M1", "M1", "M1", "M3", "M1", "M1", "M8",
"M9", "M9", "M9", "M1", "M1", "M1", "M5", "M1")), .Names = c("Date",
"Time", "ID"), class = "data.frame", row.names = c(NA, -15L))
My code is below:
data$datetime <- as.POSIXct(paste(data$Date, data$Time), format="%m/%d/%Y %H:%M:%S")
data_order <- data %>% arrange(datetime,ID)
data_order$group <- data_order %>% group_by(by5sec=cut(datetime, breaks= "5 secs",right =T),ID) %>% group_indices()
While some observations are grouped correctly, others are not. I tried two versions, with and without "right = T", and got different groups, but both contained errors. I also tried applying as.numeric, as.POSIXct, etc. before the cut, in vain.
I am attaching the output for both versions. The rows in red are wrongly coded as two different groups.
Version 1, "right = T" for cut: [image]
Version 2, "right = F" for cut: [image]
Could someone please help me solve this? I have spent quite some time on it, and it has been a wild goose chase given my knowledge of R. All I want is 5-second breaks within the same ID (the group should change for a new ID).
Desired output: [image]
I'm not entirely clear on the output images you show. Based on your problem description, how about something like this?
library(dplyr)
library(tidyr)  # for unite()
df %>%
    unite(datetime, 1:2, sep = " ", remove = FALSE) %>%
    mutate(
        datetime = as.POSIXct(datetime, format = "%m/%d/%Y %H:%M:%S"),
        datetime.by5sec = as.numeric(cut(datetime, "sec")) %/% 5 + 1)
# datetime Date Time ID datetime.by5sec
#1 2013-10-30 20:06:57 10/30/2013 20:06:57 M1 1
#2 2013-10-30 20:07:13 10/30/2013 20:07:13 M1 4
#3 2013-10-30 20:07:25 10/30/2013 20:07:25 M1 6
#4 2013-10-30 20:07:30 10/30/2013 20:07:30 M3 7
#5 2013-10-30 20:08:16 10/30/2013 20:08:16 M1 17
#6 2013-10-30 20:08:17 10/30/2013 20:08:17 M1 17
#7 2013-10-30 20:08:26 10/30/2013 20:08:26 M8 19
#8 2013-10-30 20:09:05 10/30/2013 20:09:05 M9 26
#9 2013-10-30 20:09:06 10/30/2013 20:09:06 M9 27
#10 2013-10-30 20:09:07 10/30/2013 20:09:07 M9 27
#11 2013-10-30 20:09:37 10/30/2013 20:09:37 M1 33
#12 2013-10-30 20:09:38 10/30/2013 20:09:38 M1 33
#13 2013-10-30 20:09:55 10/30/2013 20:09:55 M1 36
#14 2013-10-30 20:12:34 10/30/2013 20:12:34 M5 68
#15 2013-10-30 20:14:15 10/30/2013 20:14:15 M1 88
Explanation: datetime.by5sec gives the index of the 5-second bin into which datetime falls. The first entry sits in bin 1; the second entry is in the 4th 5-second bin, i.e. within 20 seconds of the first entry, and so on. Here cut(datetime, "sec") bins by second, and integer division %/% 5 then collapses those one-second bins into 5-second bins.
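The integer-division idea can also be sketched in base R without cut. This is a minimal illustration, not the code that produced the output above: the vector names (times, elapsed, bin) and tz = "UTC" are assumptions made for the example, and the timestamps are copied from the question's data.

```r
# Times copied from the question's sample data (all on 10/30/2013)
times <- c("20:06:57", "20:07:13", "20:07:25", "20:07:30", "20:08:16",
           "20:08:17", "20:08:26", "20:09:05", "20:09:06", "20:09:07",
           "20:09:37", "20:09:38", "20:09:55", "20:12:34", "20:14:15")

# tz = "UTC" is an assumption, used only to keep the example reproducible
datetime <- as.POSIXct(paste("10/30/2013", times),
                       format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Seconds elapsed since the earliest observation (starts at 0)
elapsed <- as.numeric(difftime(datetime, min(datetime), units = "secs"))

# Integer division maps 0-4 s to bin 1, 5-9 s to bin 2, and so on
bin <- elapsed %/% 5 + 1
```

Because this sketch counts elapsed seconds from 0 rather than using the 1-based level index returned by cut, a few bin numbers differ by one from the output above. It also shows a limitation of fixed-width bins: rows 5 and 6 (20:08:16 and 20:08:17) land in bins 16 and 17 here even though they are only one second apart.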
The following reproduces your expected output:
df %>%
    unite(datetime, 1:2, sep = " ", remove = FALSE) %>%
    group_by(ID) %>%
    mutate(
        datetime = as.POSIXct(datetime, format = "%m/%d/%Y %H:%M:%S"),
        # time since the previous entry with the same ID
        difftime = difftime(datetime, lag(datetime, default = 0))) %>%
    ungroup() %>%
    mutate(
        # start a new group whenever the gap is at least 5
        group = cumsum(abs(difftime) >= 5)) %>%
    select(Date, Time, ID, datetime, group)
# A tibble: 15 x 5
# Date Time ID datetime group
# <chr> <chr> <chr> <dttm> <int>
# 1 10/30/2013 20:06:57 M1 2013-10-30 20:06:57 1
# 2 10/30/2013 20:07:13 M1 2013-10-30 20:07:13 2
# 3 10/30/2013 20:07:25 M1 2013-10-30 20:07:25 3
# 4 10/30/2013 20:07:30 M3 2013-10-30 20:07:30 4
# 5 10/30/2013 20:08:16 M1 2013-10-30 20:08:16 5
# 6 10/30/2013 20:08:17 M1 2013-10-30 20:08:17 5
# 7 10/30/2013 20:08:26 M8 2013-10-30 20:08:26 6
# 8 10/30/2013 20:09:05 M9 2013-10-30 20:09:05 7
# 9 10/30/2013 20:09:06 M9 2013-10-30 20:09:06 7
#10 10/30/2013 20:09:07 M9 2013-10-30 20:09:07 7
#11 10/30/2013 20:09:37 M1 2013-10-30 20:09:37 8
#12 10/30/2013 20:09:38 M1 2013-10-30 20:09:38 8
#13 10/30/2013 20:09:55 M1 2013-10-30 20:09:55 9
#14 10/30/2013 20:12:34 M5 2013-10-30 20:12:34 10
#15 10/30/2013 20:14:15 M1 2013-10-30 20:14:15 11
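Explanation: Calculate the time difference between successive datetime entries within each ID; group is then the cumulative count of time differences >= 5, so it increments whenever an entry comes at least 5 seconds after the previous entry with the same ID, or is the first entry for its ID.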
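A variant of the same idea with the units made explicit might look like the sketch below. It is not the answer's exact code: the gap column, the result name, and tz = "UTC" are assumptions introduced for illustration, and the first entry of each ID is flagged through is.na() instead of a default value for lag().

```r
library(dplyr)

# Sample data copied from the question
df <- data.frame(
  Date = rep("10/30/2013", 15),
  Time = c("20:06:57", "20:07:13", "20:07:25", "20:07:30", "20:08:16",
           "20:08:17", "20:08:26", "20:09:05", "20:09:06", "20:09:07",
           "20:09:37", "20:09:38", "20:09:55", "20:12:34", "20:14:15"),
  ID = c("M1", "M1", "M1", "M3", "M1", "M1", "M8", "M9", "M9", "M9",
         "M1", "M1", "M1", "M5", "M1"),
  stringsAsFactors = FALSE
)

result <- df %>%
  # tz = "UTC" is an assumption, used only to keep the example reproducible
  mutate(datetime = as.POSIXct(paste(Date, Time),
                               format = "%m/%d/%Y %H:%M:%S", tz = "UTC")) %>%
  group_by(ID) %>%
  # Gap in seconds to the previous row with the same ID (NA for the first row)
  mutate(gap = as.numeric(difftime(datetime, lag(datetime), units = "secs"))) %>%
  ungroup() %>%
  # A new group starts at the first row of an ID or after a gap of 5+ seconds
  mutate(group = cumsum(is.na(gap) | abs(gap) >= 5)) %>%
  select(Date, Time, ID, datetime, group)
```

Forcing units = "secs" keeps the comparison with 5 independent of the unit that difftime would otherwise choose automatically. On the sample data this should assign the same group numbers as the output shown above.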