Tags: dplyr, lubridate, anomaly-detection

Extracting co-anomalies across shared time durations in R


I need to extract co-anomalies from a data frame that already contains univariate anomalies.

# Libraries
library(dplyr)
library(lubridate)
library(stringr)

# Create input dataframe
DF <- data.frame(
  rowID = as.factor(c(1,2,3,4,5,6,7,8)),
  Start = as_datetime(c('2022-01-01 09:00:00', '2022-01-01 12:00:00', '2022-01-02 15:00:00',
                        '2022-01-02 23:30:00', '2022-01-03 00:10:00', '2022-01-29 00:10:00',
                        '2023-12-25 06:00:00', '2023-12-25 08:00:00')),
  Finish = as_datetime(c('2022-01-01 11:00:00', '2022-01-01 15:00:00','2022-01-03 01:00:00',
                         '2022-01-02 23:50:00', '2022-01-03 03:00:00', '2022-01-31 03:00:00',
                         '2023-12-25 11:00:00', '2023-12-25 12:00:00')),
  Process = c('Process1', 'Process2', 'Process1', 'Process2', 'Process3', 'Process3', 'Process3', 'Process3'),
  Anomaly = c('Y','N','Y','Y','Y', 'Y', 'Y', 'Y')
) %>%
  arrange(Start, Process) %>%
  mutate(Interval = interval(Start, Finish)) %>%
  as_tibble()

I can tag co-anomalies that occurred over time periods overlapping the process of interest (Process3).

# Declare process of interest
c <- 'Process3'

# Extract co-anomalies within and between Process3
Result <- DF %>%
  filter(int_overlaps(Interval, Interval[Process == c]) == TRUE) %>%
  mutate(coAnomaly = ifelse(Anomaly == 'Y', 'Y', 'N')) %>%
  left_join(DF, ., by = c('rowID' = 'rowID')) %>%
  select(contains('.x'), coAnomaly) %>%
  rename_with(~str_remove(., '\\.x$'))  # strip only the literal '.x' join suffix

The code correctly tags co-anomalies between Process3 and other processes, but it makes errors when comparing Process3 rows against each other.

Row 6 is an error: that anomaly doesn't co-occur with another Process3 row or with any other process.

I'm trying to correctly tag:

  1. Which Process3 rows co-occurred with other processes (Between LHS)
  2. Which other processes co-occurred with Process3 rows (Between RHS)
  3. Which Process3 rows co-occurred with other Process3 rows (Within)

Solution

  • You can try this approach using rowwise():

    left_join(
      DF,
      DF %>%
        rowwise() %>%
        # keep rows that overlap at least one *other* row of the process of interest
        filter(any(int_overlaps(Interval, DF$Interval[which(DF$rowID != rowID & DF$Process == c)]))) %>%
        mutate(coAnomaly = ifelse(Anomaly == 'Y', 'Y', 'N')) %>%
        select(rowID, coAnomaly),
      by = 'rowID'
    )
    

    Output (rows 1 to 6):

      rowID Start               Finish              Process  Anomaly Interval                                         coAnomaly
      <fct> <dttm>              <dttm>              <chr>    <chr>   <Interval>                                       <chr>    
    1 1     2022-01-01 09:00:00 2022-01-01 11:00:00 Process1 Y       2022-01-01 09:00:00 UTC--2022-01-01 11:00:00 UTC NA       
    2 2     2022-01-01 12:00:00 2022-01-01 15:00:00 Process2 N       2022-01-01 12:00:00 UTC--2022-01-01 15:00:00 UTC NA       
    3 3     2022-01-02 15:00:00 2022-01-03 01:00:00 Process1 Y       2022-01-02 15:00:00 UTC--2022-01-03 01:00:00 UTC Y        
    4 4     2022-01-02 23:30:00 2022-01-02 23:50:00 Process2 Y       2022-01-02 23:30:00 UTC--2022-01-02 23:50:00 UTC NA       
    5 5     2022-01-03 00:10:00 2022-01-03 03:00:00 Process3 Y       2022-01-03 00:10:00 UTC--2022-01-03 03:00:00 UTC NA       
    6 6     2022-01-29 00:10:00 2022-01-31 03:00:00 Process3 Y       2022-01-29 00:10:00 UTC--2022-01-31 03:00:00 UTC NA       
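
As for why the filter in the question tags row 6: a plausible explanation (a reading of how int_overlaps() behaves, not something confirmed in the question) is that it compares its two arguments element by element and recycles the shorter one, so a Process3 row can end up being compared with itself. The sketch below uses four made-up, non-overlapping intervals to illustrate this, along with the "compare each row against the others" idea that rowwise() implements above.

```r
library(lubridate)

# Four hypothetical one-hour intervals on different days (none overlap)
starts <- ymd_hms(c("2022-01-01 00:00:00", "2022-01-03 00:00:00",
                    "2022-01-05 00:00:00", "2022-01-07 00:00:00"))
x <- interval(starts, starts + hours(1))

# Pretend intervals 3 and 4 belong to the process of interest
poi_idx <- 3:4

# Elementwise comparison with recycling: x1 vs x3, x2 vs x4, x3 vs x3, x4 vs x4
recycled <- int_overlaps(x, x[poi_idx])
print(recycled)  # FALSE FALSE TRUE TRUE (rows 3 and 4 only "overlap" themselves)

# Compare each interval against the *other* intervals of interest instead
per_row <- sapply(seq_along(x), function(i) {
  any(int_overlaps(x[i], x[setdiff(poi_idx, i)]))
})
print(per_row)   # all FALSE, as expected for non-overlapping intervals
```

The second call excludes the row itself before testing, which is what the DF$rowID != rowID condition does in the answer above.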
    

    Update, following the OP's additional request to separate Between/Within, and using the new data frame:

    rbind(
      DF %>% 
        filter(Process==c) %>% 
        rowwise() %>% 
        filter(any(int_overlaps(Interval, DF$Interval[which(DF$rowID!=rowID & DF$Process == c)]))) %>% 
        mutate(coAnomaly = "within"),
      
      DF %>% 
        filter(Process!=c) %>% 
        rowwise() %>% 
        filter(any(int_overlaps(Interval, DF$Interval[which(DF$rowID!=rowID & DF$Process == c)]))) %>% 
        mutate(coAnomaly = "between")
    )
    

    Output:

      rowID Start               Finish              Process  Anomaly Interval                                         coAnomaly
      <fct> <dttm>              <dttm>              <chr>    <chr>   <Interval>                                       <chr>    
    1 7     2023-12-25 06:00:00 2023-12-25 11:00:00 Process3 Y       2023-12-25 06:00:00 UTC--2023-12-25 11:00:00 UTC within   
    2 8     2023-12-25 08:00:00 2023-12-25 12:00:00 Process3 Y       2023-12-25 08:00:00 UTC--2023-12-25 12:00:00 UTC within   
    3 3     2022-01-02 15:00:00 2022-01-03 01:00:00 Process1 Y       2022-01-02 15:00:00 UTC--2022-01-03 01:00:00 UTC between 
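
If all three categories from the question (Between LHS, Between RHS, Within) are needed in one pass, one possible sketch is to build a pairwise overlap matrix and classify each row from it. Everything here is illustrative: `toy`, `poi`, and the label strings are hypothetical stand-ins rather than the OP's data, and a row that qualifies as both Within and Between is labelled Within because case_when() takes the first match.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for DF with the same kind of columns
toy <- tibble(
  rowID   = 1:5,
  Process = c("Process1", "Process3", "Process3", "Process3", "Process2"),
  Start   = ymd_hms(c("2022-01-01 00:00:00", "2022-01-01 01:00:00",
                      "2022-02-01 00:00:00", "2022-02-01 01:00:00",
                      "2022-03-01 00:00:00")),
  Finish  = ymd_hms(c("2022-01-01 02:00:00", "2022-01-01 03:00:00",
                      "2022-02-01 02:00:00", "2022-02-01 03:00:00",
                      "2022-03-01 02:00:00"))
)
poi <- "Process3"  # process of interest

# ov[i, j] is TRUE when rows i and j share time (closed intervals)
s  <- as.numeric(toy$Start)
f  <- as.numeric(toy$Finish)
ov <- outer(s, f, "<=") & outer(f, s, ">=")
diag(ov) <- FALSE  # a row never counts as overlapping itself

is_poi     <- toy$Process == poi
hits_poi   <- rowSums(ov[, is_poi,  drop = FALSE]) > 0  # overlaps a poi row
hits_other <- rowSums(ov[, !is_poi, drop = FALSE]) > 0  # overlaps a non-poi row

res <- toy %>%
  mutate(coAnomaly = case_when(
    is_poi  & hits_poi   ~ "Within",
    is_poi  & hits_other ~ "Between LHS",
    !is_poi & hits_poi   ~ "Between RHS",
    TRUE                 ~ NA_character_
  ))
print(res)
```

With this toy data, row 1 is "Between RHS", row 2 is "Between LHS", rows 3 and 4 are "Within", and row 5 stays NA.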
    

    All types of overlaps:

    Here is another approach, which does not depend on specifying a process of interest (i.e. no need for c = "Process3").

    1. Create a small function that takes an interval and a row ID, and returns a tibble of the overlapping IDs (oID) and overlapping processes (oProcess)
    get_overlap_IDs = function(interval, id) {
      DF %>%
        filter(int_overlaps(interval, DF$Interval)) %>%  # rows sharing time with `interval`
        filter(rowID != id) %>%                          # drop the row itself
        select(oID = rowID, oProcess = Process)
    }
    
    2. Apply the function rowwise and unnest (unnest() comes from tidyr)
    library(tidyr)

    DF %>% 
      rowwise() %>%
      mutate(keys = list(get_overlap_IDs(Interval, rowID))) %>% 
      unnest(keys)
    

    Output:

      rowID Start               Finish              Process  Anomaly Interval                                         oID   oProcess
      <fct> <dttm>              <dttm>              <chr>    <chr>   <Interval>                                       <fct> <chr>   
    1 3     2022-01-02 15:00:00 2022-01-03 01:00:00 Process1 Y       2022-01-02 15:00:00 UTC--2022-01-03 01:00:00 UTC 4     Process2
    2 3     2022-01-02 15:00:00 2022-01-03 01:00:00 Process1 Y       2022-01-02 15:00:00 UTC--2022-01-03 01:00:00 UTC 5     Process3
    3 4     2022-01-02 23:30:00 2022-01-02 23:50:00 Process2 Y       2022-01-02 23:30:00 UTC--2022-01-02 23:50:00 UTC 3     Process1
    4 5     2022-01-03 00:10:00 2022-01-03 03:00:00 Process3 Y       2022-01-03 00:10:00 UTC--2022-01-03 03:00:00 UTC 3     Process1
    5 7     2023-12-25 06:00:00 2023-12-25 11:00:00 Process3 Y       2023-12-25 06:00:00 UTC--2023-12-25 11:00:00 UTC 8     Process3
    6 8     2023-12-25 08:00:00 2023-12-25 12:00:00 Process3 Y       2023-12-25 08:00:00 UTC--2023-12-25 12:00:00 UTC 7     Process3
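
Since each row of this result pairs a process with the process it overlaps, the Within/Between distinction can be read off by comparing the two columns. A minimal sketch, assuming a tibble named `pairs` (a hypothetical name) that holds the relevant columns copied from the output above, with the IDs written as plain numbers for brevity:

```r
library(dplyr)

# Stand-in for the unnested result above (only the columns needed here)
pairs <- tibble(
  rowID    = c(3, 3, 4, 5, 7, 8),
  Process  = c("Process1", "Process1", "Process2", "Process3", "Process3", "Process3"),
  oID      = c(4, 5, 3, 3, 8, 7),
  oProcess = c("Process2", "Process3", "Process1", "Process1", "Process3", "Process3")
)

# Same process on both sides means "within", otherwise "between"
labelled <- pairs %>%
  mutate(type = if_else(Process == oProcess, "within", "between"))
print(labelled)
```

Filtering `labelled` on Process or oProcess then restricts the result to a particular process of interest if one is needed.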