Search code examples
rtimedata-cleaninglubridateposixct

Removing rows if they occur within a certain time of each other by a group value in R


My data df looks like the following:

Row    Timestamp            ID
1    0020-06-29 12:14:00     B 
2    0020-06-29 12:27:00     A 
3    0020-06-29 12:27:22     B  
4    0020-06-29 12:28:30     A 
5    0020-06-29 12:43:00     B 
6    0020-06-29 12:44:00     C 
7    0020-06-29 12:45:00     B 
8    0020-06-29 12:55:00     A 
9    0020-06-29 12:57:00     C 
10   0020-06-29 13:04:00     B 


   
   

The Timestamp indicates the date and time of a reading, and ID the tag identification code.

What I am trying to do is remove any Timestamp by the same ID that occurs within 5 minutes of the previous Timestamp. So, although ID A is seen in Row 2 and Row 4, since the two rows of the dataframe occur within 5 minutes of each other, we would remove Row 4 but keep Row 2 and Row 8, which for ID A occurs 18 minutes later.

Update: The first timestamp should take precedent and all subsequent timestamps should be either kept or removed from then on. So, if we have 3 timestamps corresponding to the same ID and with a time interval of 4.5 minutes and 2 minutes, respectively, between timestamp 1 and 2 and timestamp 2 and 3, I would like remove timestamp 2 and keep 1 and 3. This way the next timestamp we keep would be the one that occurs at least 5 minutes after timestamp 3, and so on.

I have tried the following:

first_date <- df$Timestamp[1:(length(df$Timestamp)-1)]
second_date <- df$Timestamp[2:length(df$Timestamp)]
second_gap <- difftime(second_date, first_date, units="mins")

dup_index <- second_gap>5 # set this as a 5-minute threshold
dup_index <- c(TRUE, dup_index)
df_cleaned <- df[dup_index, ]

But this deletes all observations within 5-minutes of each other and does not take into account the ID. I would usually just subset but I am working with around 180 unique IDs.


Solution

  • Supposing that what I comment above does not occur, a possible solution is the following:

    library(tidyverse)
    library(lubridate)
    
    elapsed <- function(x)
    {
      y <- abs(as.duration(x[2:length(x)] %--% x[1:(length(x)-1)]))
      y >= 5*60
    } 
    
    df %>% 
      group_split(ID) %>% 
      map_dfr(~ .[c(T, if (nrow(.) > 1) elapsed(.$Timestamp)),]) %>% 
      arrange(Row)
    

    The output:

    # A tibble: 8 × 3
        Row Timestamp           ID   
      <int> <chr>               <chr>
    1     1 0020-06-29 12:14:00 B    
    2     2 0020-06-29 12:27:00 A    
    3     3 0020-06-29 12:27:22 B    
    4     5 0020-06-29 12:43:00 B    
    5     6 0020-06-29 12:44:00 C    
    6     8 0020-06-29 12:55:00 A    
    7     9 0020-06-29 12:57:00 C    
    8    10 0020-06-29 13:04:00 B