I obtained some old data which contain groups that hold only 0
s. I want to clean this dataset to use it for teaching. While some 0
s are naturally occurring in my data, it is not possible for a group to contain only 0
s. That leads me to the conclusion that observations in such a group can safely be labelled with NA
s.
Goal
I wish that 0
should be replaced with NA
only if the whole group is full of 0
.
Reprex
library(dplyr)
df <- tibble(key_1 = rep(1:2, each = 4),
key_2 = rep(letters[1:2], each = 2, times = 2),
value = c(0, 0, 0, -1, 1, 2, 1, 0))
df
#> Output
# A tibble: 8 × 3
# key_1 key_2 value
# <int> <chr> <dbl>
# 1 a 0
# 1 a 0
# 1 b 0
# 1 b -1
# 2 a 1
# 2 a 2
# 2 b 1
# 2 b 0
This data represents a situation where multiple key variables uniquely identify each group.
1a
, everything is 0
.2b
, there is one 0
.I expect 1a
to be filled with NA
s while e.g. 2b
and other groups are preserved. I expect the solution to work within a dplyr
pipeline, which is not mandatory, but it would be nice.
Desired Output
# key_1 key_2 value
# <int> <chr> <dbl>
# 1 a NA
# 1 a NA
# 1 b 0
# 1 b -1
# 2 a 1
# 2 a 2
# 2 b 1
# 2 b 0
Attempts so far
I tried using dplyr::case_when()
, and to set the normal values of the value
column as a fallback. This throws the following error:
df |> group_by(key_1, key_2) |>
mutate(value = case_when(sum(value != 0) == 0 ~ NA,
.default = value))
#> Error:
# ! `.default` must have size 1, not size 2.
If I don't specify a default, everything is NA
.
Providing some dummy value to default shows that the condition sum(value != 0) == 0
is working correctly together with group_by
.
df |> group_by(key_1, key_2) |>
mutate(value = case_when(sum(value != 0) == 0 ~ NA,
.default = "default"))
#> Output
# A tibble: 8 × 3
# Groups: key_1, key_2 [4]
# key_1 key_2 value
# <int> <chr> <chr>
# 1 a NA
# 1 a NA
# 1 b default
# 1 b default
# 2 a default
# 2 a default
# 2 b default
# 2 b default
You need to make sure you get an NA
for each row. (case_when
is a little stricter about recycling than base R. Here, your test value is length-1 for the group, and the result you specify, NA
, is length-1, but the result you want has the same length as the number of rows in the group.)
df |>
mutate(
value = case_when(all(value == 0) ~ rep(NA, n()), .default = value),
.by = c(key_1, key_2)
)
# # A tibble: 8 × 3
# key_1 key_2 value
# <int> <chr> <dbl>
# 1 1 a NA
# 2 1 a NA
# 3 1 b 0
# 4 1 b -1
# 5 2 a 1
# 6 2 a 2
# 7 2 b 1
# 8 2 b 0