Search code examples
rfilterdata-cleaningoutliers

Is it possible to filter out outliers that overlap based on time variable in R?


How do I delete outliers by checking for an overlap between ID column and date/time columns withfilter()?

For example, rows with ID =1 overlap in time as shown in the first 2 rows below,thus need to be deleted.

ID Time start Time end
1 2015-03-16 10:40:00 2015-03-16 11:10:00
1 2015-03-16 10:50:00 2015-03-16 10:59:00
2 2015-03-16 10:40:00 2015-03-16 10:45:00
1 2015-03-16 11:20:00 2015-03-16 11:28:56

Solution

  • Try this to remove any time overlaps within a group. Please test it with more data to see if it does what you want. I only tried the small sample below.

    library(tidyverse)
    library(lubridate)
    #> 
    #> Attaching package: 'lubridate'
    #> The following objects are masked from 'package:base':
    #> 
    #>     date, intersect, setdiff, union
    library(slider)
    
    tribble(
      ~id, ~start, ~end,
      1, "2015-03-16 10:40:00", "2015-03-16 11:10:00",
      1, "2015-03-16 10:50:00", "2015-03-16 10:59:00",
      1, "2015-03-16 11:09:00", "2015-03-16 11:11:00",
      2, "2015-03-16 10:40:00", "2015-03-16 10:45:00",
      1, "2015-03-16 11:20:00", "2015-03-16 11:28:56",
      1, "2015-03-16 11:27:00", "2015-03-16 11:30:56",
      2, "2015-03-16 10:44:00", "2015-03-16 11:45:00"
    ) |>
      mutate(
        start = ymd_hms(start, tz = Sys.timezone()),
        end = ymd_hms(end, tz = Sys.timezone())
      ) |>
      arrange(id, start, end) |>
      group_by(id) |>
      mutate(
        roll_start = slide_vec(start, min, .before = Inf),
        roll_end = slide_vec(end, max, .before = Inf),
        overlap = if_else((start >= lag(roll_start) & start <= lag(roll_end)) |
          (end >= lag(roll_start) & end <= lag(roll_end)), "yes", "no")
      ) |>
      filter(overlap == "no" | is.na(overlap)) |> 
      select(- c(starts_with("roll_"), overlap))
    #> # A tibble: 3 × 3
    #> # Groups:   id [2]
    #>      id start               end                
    #>   <dbl> <dttm>              <dttm>             
    #> 1     1 2015-03-16 10:40:00 2015-03-16 11:10:00
    #> 2     1 2015-03-16 11:20:00 2015-03-16 11:28:56
    #> 3     2 2015-03-16 10:40:00 2015-03-16 10:45:00
    

    Created on 2022-04-30 by the reprex package (v2.0.1)