Search code examples
rdplyrnarle

ignore NAs in a rolling count when using rle and grouped data


I have a dataset like so:

id <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2)
value <- c("x","x","y",NA,NA,"x",NA,"y",
           "y","y",NA,"x","x",NA,"x","x")

   id value
1   1     x
2   1     x
3   1     y
4   1  <NA>
5   1  <NA>
6   1     x
7   1  <NA>
8   1     y
9   2     y
10  2     y
11  2  <NA>
12  2     x
13  2     x
14  2  <NA>
15  2     x
16  2     x

I would like to have a rolling count for every time value changes, but skipping over NAs.

My ideal output looks like this:

   id value rolling_count
1   1     x             1
2   1     x             1
3   1     y             2
4   1  <NA>            NA
5   1  <NA>            NA
6   1     x             3
7   1  <NA>            NA
8   1     y             4
9   2     y             1
10  2     y             1
11  2  <NA>            NA
12  2     x             2
13  2     x             2
14  2  <NA>            NA
15  2     x             2
16  2     x             2

I've currently tried the following with rle() which gets me a rolling count but considers single or groups of NAs as part of the count and produces the output below. It's close but not quite there.

df %>%
  group_by(id) %>% 
  mutate(failed_rolling_count = with(rle(value)), rep(seq_along(lengths), lengths))

   id value failed_rolling_count
1   1     x                    1
2   1     x                    1
3   1     y                    2
4   1  <NA>                    3
5   1  <NA>                    3
6   1     x                    4
7   1  <NA>                    5
8   1     y                    6
9   2     y                    1
10  2     y                    1
11  2  <NA>                    2
12  2     x                    3
13  2     x                    3
14  2  <NA>                    4
15  2     x                    5
16  2     x                    5

Any tips much appreciated!


Solution

  • dplyr version 1.1.0 introduced consecutive_id(), so you don't need rle anymore. Here's an approach to add a row number, sort the NAs to the bottom, add the rolling count, and then sort back to the original row order. You can, of course, drop the row column at the end if you'd like.

    df |>
      mutate(row = row_number()) |>
      arrange(is.na(value), row) |>
      mutate(
        rolling_count = ifelse(is.na(value), NA, consecutive_id(value)),
        .by =  "id"
      ) |>
      arrange(row)
    #    id value row rolling_count
    # 1   1     x   1             1
    # 2   1     x   2             1
    # 3   1     y   3             2
    # 4   1  <NA>   4            NA
    # 5   1  <NA>   5            NA
    # 6   1     x   6             3
    # 7   1  <NA>   7            NA
    # 8   1     y   8             4
    # 9   2     y   9             1
    # 10  2     y  10             1
    # 11  2  <NA>  11            NA
    # 12  2     x  12             2
    # 13  2     x  13             2
    # 14  2  <NA>  14            NA
    # 15  2     x  15             2
    # 16  2     x  16             2