r dplyr time-series panel anomaly-detection

How to detect big jumps and quick reversals in panel time-series data in r

I have a time-series panel dataset which is structured in the following way:

df <- data.frame(
  year = c(2012L, 2013L, 2014L, 2012L, 2013L, 2014L),
  id = c(1L, 1L, 1L, 2L, 2L, 2L),
  c = c(11L, 13L, 13L, 16L, 15L, 15L)
)

#>   year id  c
#> 1 2012  1 11
#> 2 2013  1 26
#> 3 2014  1 13
#> 4 2012  2 16
#> 5 2013  2 15
#> 6 2014  2 19

Within each id, I would like to spot huge unusual changes and subsequent reversals in column c (i.e., the variable doubles or shrinks 2 times within one period similar to row 2 in the sample dataset and quickly reverts back to normal in the next period)

The way I would like to define the reversal to normal is to determine whether if you go back around the value (within a specified range) before the change the value goes back to for example +/- 25% of the original value).

I would like to be able to specify the period to look for reversal (e.g. 3 periods/years)

To summarize I would like to be able to detect a huge jump or shrink of two times or more within one period that also quickly reverts back to normal (25% of the original value) within a specified period (3 years).

My end goal is to determine whether there are some possible errors in the database and select all the ids that have such errors.

Solution

You can use the lag function to look backwards. It respects groups specified in a group_by.

The default for lag (or lead) is to shift to the immediate row ahead or behind the current one, but if you supply an integer as the second argument (2 in this case) it will check values further back.

lag and lead return NA when there isn't a row available since we've reached the boundary of a (grouped) dataframe, so here I'm using coalesce to force those NAs into FALSE. You could also have utilized the default argument of the original functions, but this seemed easier.

df <- data.frame(
  year = c(2012L, 2013L, 2014L, 2012L, 2013L, 2014L),
  id = c(1L, 1L, 1L, 2L, 2L, 2L),
  c = c(11L, 26L, 13L, 16L, 15L, 15L)
)

df %>%
  group_by(id) %>%
  mutate(flag = (lag(c) >= 2*lag(c, 2) | lag(c) <= 0.5*lag(c, 2)) 
                & abs(1 - c/(lag(c, 2))) < 0.25) %>%
  mutate(flag = coalesce(flag, FALSE))

# A tibble: 6 x 4
# Groups:   id [2]
   year    id     c flag 
  <int> <int> <int> <lgl>
1  2012     1    11 FALSE
2  2013     1    26 FALSE
3  2014     1    13 TRUE 
4  2012     2    16 FALSE
5  2013     2    15 FALSE
6  2014     2    15 FALSE