I have the following dataframe df (dput
below):
> df
id value
1 1 1
2 2 3
3 3 2
4 NA 1
5 NA 3
6 8 4
7 9 2
8 10 1
9 NA 1
10 NA 3
11 15 2
12 16 1
13 NA 3
14 NA 4
15 NA 2
16 20 1
17 21 1
18 22 3
19 NA 2
20 NA 1
21 NA 3
22 66 4
23 67 2
24 68 1
For each consecutive run of non-NA
"id", I would like to create a unique group ID (variable "id2" in the example). For the NA
in between, the "id2" should be NA
.
In the example, the first run of non-NA
"id" (1, 2 and 3), should all belong to group 1 (id2 = 1). The NA
on row 4-5 should have id2 = NA
. The second run of non-NA
"id" (8, 9, 10) should have id2 = 2, and so on.
The desired output should look like this:
id value id2
1 1 1 1
2 2 3 1
3 3 2 1
4 NA 1 NA
5 NA 3 NA
6 8 4 2
7 9 2 2
8 10 1 2
9 NA 1 NA
10 NA 3 NA
11 15 2 3
12 16 1 3
13 NA 3 NA
14 NA 4 NA
15 NA 2 NA
16 20 1 4
17 21 1 4
18 22 3 4
19 NA 2 NA
20 NA 1 NA
21 NA 3 NA
22 66 4 5
23 67 2 5
24 68 1 5
As you can see, id2 shows the unique id I want for the values of id between NA. It may be that there is only a NA
after the values, like for id's 1, 2 and 3. So I was wondering if anyone knows how to create a unique ID for values between NA
's?
dput
of df:
df <- structure(list(id = c(1, 2, 3, NA, NA, 8, 9, 10, NA, NA, 15,
16, NA, NA, NA, 20, 21, 22, NA, NA, NA, 66, 67, 68), value = c(1,
3, 2, 1, 3, 4, 2, 1, 1, 3, 2, 1, 3, 4, 2, 1, 1, 3, 2, 1, 3, 4,
2, 1)), class = "data.frame", row.names = c(NA, -24L))
Here's a possibility, maybe not the most efficient:
id
and an NA
. That gives you a 1
each time we have an id
which is NA
but the previous one is valid.NA
where neededlibrary(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- structure(list(id = c(1, 2, 3, NA, NA, 8, 9, 10, NA, NA, 15,
16, NA, NA, NA, 20, 21, 22, NA, NA, NA, 66, 67, 68), value = c(1,
3, 2, 1, 3, 4, 2, 1, 1, 3, 2, 1, 3, 4, 2, 1, 1, 3, 2, 1, 3, 4,
2, 1)), class = "data.frame", row.names = c(NA, -24L))
df |>
mutate(
id2 = as.numeric(is.na(id) & !is.na(lag(id))),
id2 = cumsum(id2) + 1,
id2 = ifelse(is.na(id), NA, id2)
)
#> id value id2
#> 1 1 1 1
#> 2 2 3 1
#> 3 3 2 1
#> 4 NA 1 NA
#> 5 NA 3 NA
#> 6 8 4 2
#> 7 9 2 2
#> 8 10 1 2
#> 9 NA 1 NA
#> 10 NA 3 NA
#> 11 15 2 3
#> 12 16 1 3
#> 13 NA 3 NA
#> 14 NA 4 NA
#> 15 NA 2 NA
#> 16 20 1 4
#> 17 21 1 4
#> 18 22 3 4
#> 19 NA 2 NA
#> 20 NA 1 NA
#> 21 NA 3 NA
#> 22 66 4 5
#> 23 67 2 5
#> 24 68 1 5
Created on 2023-03-21 with reprex v2.0.2