My data df
looks like the following:
Row Timestamp ID
1 0020-06-29 12:14:00 B
2 0020-06-29 12:27:00 A
3 0020-06-29 12:27:22 B
4 0020-06-29 12:28:30 A
5 0020-06-29 12:43:00 B
6 0020-06-29 12:44:00 C
7 0020-06-29 12:45:00 B
8 0020-06-29 12:55:00 A
9 0020-06-29 12:57:00 C
10 0020-06-29 13:04:00 B
The Timestamp
indicates the date and time of a reading, and ID
the tag identification code.
What I am trying to do is remove any Timestamp
by the same ID
that occurs within 5 minutes of the previous Timestamp. So, although ID
A is seen in Row
2 and Row
4, since the two rows of the dataframe occur within 5 minutes of each other, we would remove Row
4 but keep Row
2 and Row
8, which for ID A occurs 18 minutes later.
Update: The first timestamp should take precedent and all subsequent timestamps should be either kept or removed from then on. So, if we have 3 timestamps corresponding to the same ID and with a time interval of 4.5 minutes and 2 minutes, respectively, between timestamp 1 and 2 and timestamp 2 and 3, I would like remove timestamp 2 and keep 1 and 3. This way the next timestamp we keep would be the one that occurs at least 5 minutes after timestamp 3, and so on.
I have tried the following:
first_date <- df$Timestamp[1:(length(df$Timestamp)-1)]
second_date <- df$Timestamp[2:length(df$Timestamp)]
second_gap <- difftime(second_date, first_date, units="mins")
dup_index <- second_gap>5 # set this as a 5-minute threshold
dup_index <- c(TRUE, dup_index)
df_cleaned <- df[dup_index, ]
But this deletes all observations within 5-minutes of each other and does not take into account the ID
. I would usually just subset
but I am working with around 180 unique ID
s.
Supposing that what I comment above does not occur, a possible solution is the following:
library(tidyverse)
library(lubridate)
elapsed <- function(x)
{
y <- abs(as.duration(x[2:length(x)] %--% x[1:(length(x)-1)]))
y >= 5*60
}
df %>%
group_split(ID) %>%
map_dfr(~ .[c(T, if (nrow(.) > 1) elapsed(.$Timestamp)),]) %>%
arrange(Row)
The output:
# A tibble: 8 × 3
Row Timestamp ID
<int> <chr> <chr>
1 1 0020-06-29 12:14:00 B
2 2 0020-06-29 12:27:00 A
3 3 0020-06-29 12:27:22 B
4 5 0020-06-29 12:43:00 B
5 6 0020-06-29 12:44:00 C
6 8 0020-06-29 12:55:00 A
7 9 0020-06-29 12:57:00 C
8 10 0020-06-29 13:04:00 B