Search code examples
rdataframedplyrrun-length-encoding

Create group number for runs of non-NA values


I have the following dataframe df (dput below):

> df
   id value
1   1     1
2   2     3
3   3     2
4  NA     1
5  NA     3
6   8     4
7   9     2
8  10     1
9  NA     1
10 NA     3
11 15     2
12 16     1
13 NA     3
14 NA     4
15 NA     2
16 20     1
17 21     1
18 22     3
19 NA     2
20 NA     1
21 NA     3
22 66     4
23 67     2
24 68     1

For each consecutive run of non-NA "id", I would like to create a unique group ID (variable "id2" in the example). For the NA in between, the "id2" should be NA.

In the example, the first run of non-NA "id" (1, 2 and 3), should all belong to group 1 (id2 = 1). The NA on row 4-5 should have id2 = NA. The second run of non-NA "id" (8, 9, 10) should have id2 = 2, and so on.

The desired output should look like this:

   id value id2
1   1     1   1
2   2     3   1
3   3     2   1
4  NA     1  NA
5  NA     3  NA
6   8     4   2
7   9     2   2
8  10     1   2
9  NA     1  NA
10 NA     3  NA
11 15     2   3
12 16     1   3
13 NA     3  NA
14 NA     4  NA
15 NA     2  NA
16 20     1   4
17 21     1   4
18 22     3   4
19 NA     2  NA
20 NA     1  NA
21 NA     3  NA
22 66     4   5
23 67     2   5
24 68     1   5

As you can see, id2 shows the unique id I want for the values of id between NA. It may be that there is only a NA after the values, like for id's 1, 2 and 3. So I was wondering if anyone knows how to create a unique ID for values between NA's?


dput of df:

df <- structure(list(id = c(1, 2, 3, NA, NA, 8, 9, 10, NA, NA, 15, 
16, NA, NA, NA, 20, 21, 22, NA, NA, NA, 66, 67, 68), value = c(1, 
3, 2, 1, 3, 4, 2, 1, 1, 3, 2, 1, 3, 4, 2, 1, 1, 3, 2, 1, 3, 4, 
2, 1)), class = "data.frame", row.names = c(NA, -24L))

Solution

  • Here's a possibility, maybe not the most efficient:

    1. find the transitions between a valid id and an NA. That gives you a 1 each time we have an id which is NA but the previous one is valid.
    2. make the cumulative sum
    3. add NA where needed
    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    df <- structure(list(id = c(1, 2, 3, NA, NA, 8, 9, 10, NA, NA, 15, 
                                16, NA, NA, NA, 20, 21, 22, NA, NA, NA, 66, 67, 68), value = c(1, 
                                                                                               3, 2, 1, 3, 4, 2, 1, 1, 3, 2, 1, 3, 4, 2, 1, 1, 3, 2, 1, 3, 4, 
                                                                                               2, 1)), class = "data.frame", row.names = c(NA, -24L))
    
    df |> 
      mutate(
        id2 = as.numeric(is.na(id) & !is.na(lag(id))),
        id2 = cumsum(id2) + 1,
        id2 = ifelse(is.na(id), NA, id2)
      )
    #>    id value id2
    #> 1   1     1   1
    #> 2   2     3   1
    #> 3   3     2   1
    #> 4  NA     1  NA
    #> 5  NA     3  NA
    #> 6   8     4   2
    #> 7   9     2   2
    #> 8  10     1   2
    #> 9  NA     1  NA
    #> 10 NA     3  NA
    #> 11 15     2   3
    #> 12 16     1   3
    #> 13 NA     3  NA
    #> 14 NA     4  NA
    #> 15 NA     2  NA
    #> 16 20     1   4
    #> 17 21     1   4
    #> 18 22     3   4
    #> 19 NA     2  NA
    #> 20 NA     1  NA
    #> 21 NA     3  NA
    #> 22 66     4   5
    #> 23 67     2   5
    #> 24 68     1   5
    

    Created on 2023-03-21 with reprex v2.0.2