Search code examples
rtime-seriesmultiple-columnsmissing-data

Count 'breaks' in a sequence of columns


I have a time series associated with items. Sometimes there are missing periods in the middle. I would like to identify those observations that have a break in the time series. Take these data:

ref_no <- 1:6
X2015 <- c(2015, 2015, 2015, 2015,   NA,   NA)
X2016 <- c(2016, 2016, 2016,   NA, 2016, 2016)
X2017 <- c(2017, 2017,   NA,   NA,   NA, 2017)
X2018 <- c(2018, 2018, 2018, 2018,   NA, 2018)
X2019 <- c(2019, 2019, 2018,   NA, 2019,   NA)
X2020 <- c(2020,   NA,   NA, 2020, 2020,   NA)
df <- data.frame(ref_no, X2015, X2016, X2017, X2018, X2019, X2020)

There are breaks for ref_no items #3, #4 and #5, with #4 having two breaks (missing 2016, 2017 and then 2019).

  ref_no X2015 X2016 X2017 X2018 X2019 X2020
1      1  2015  2016  2017  2018  2019  2020
2      2  2015  2016  2017  2018  2019    NA
3      3  2015  2016    NA  2018  2018    NA
4      4  2015    NA    NA  2018    NA  2020
5      5    NA  2016    NA    NA  2019  2020
6      6    NA  2016  2017  2018    NA    NA

Can we have a variable that counts these breaks, e.g.

c <- c(0,0,1,2,1,0)

Solution

  • We may use diff to get the difference between adjacent element by each row.

    library(dplyr)
    library(tidyr)
    df %>% 
      pivot_longer(cols = -ref_no, values_drop_na = TRUE) %>% 
      group_by(ref_no) %>%
      summarise(break_count = sum(diff(value) > 1))
    

    -output

    # A tibble: 6 × 2
      ref_no break_count
       <int>       <int>
    1      1           0
    2      2           0
    3      3           1
    4      4           2
    5      5           1
    6      6           0
    

    Or using base R

    apply(df[-1], 1, \(x) sum(diff(x[complete.cases(x)]) > 1))
    [1] 0 0 1 2 1 0