Search code examples
rif-statementdplyrcasena

Fill groups that have only zeros in them with NA


I obtained some old data which contain groups that hold only 0s. I want to clean this dataset to use it for teaching. While some 0s are naturally occurring in my data, it is not possible for a group to contain only 0s. That leads me to the conclusion that observations in such a group can safely be labelled with NAs.

Goal

I wish that 0 should be replaced with NA only if the whole group is full of 0.

Reprex

library(dplyr)
df <- tibble(key_1 = rep(1:2, each = 4),
                 key_2 = rep(letters[1:2], each = 2, times = 2),
                 value = c(0, 0, 0, -1, 1, 2, 1, 0))
df
#> Output
# A tibble: 8 × 3
# key_1 key_2 value
# <int> <chr> <dbl>
#     1 a         0
#     1 a         0
#     1 b         0
#     1 b        -1
#     2 a         1
#     2 a         2
#     2 b         1
#     2 b         0

This data represents a situation where multiple key variables uniquely identify each group.

  • In group 1a, everything is 0.
  • In group 2b, there is one 0.

I expect 1a to be filled with NAs while e.g. 2b and other groups are preserved. I expect the solution to work within a dplyr pipeline, which is not mandatory, but it would be nice.

Desired Output

# key_1 key_2 value
# <int> <chr> <dbl>
#     1 a        NA
#     1 a        NA
#     1 b         0
#     1 b        -1
#     2 a         1
#     2 a         2
#     2 b         1
#     2 b         0

Attempts so far

I tried using dplyr::case_when(), and to set the normal values of the value column as a fallback. This throws the following error:

df |> group_by(key_1, key_2) |>
  mutate(value = case_when(sum(value != 0) == 0 ~ NA,
                           .default = value))

#> Error:
# ! `.default` must have size 1, not size 2.

If I don't specify a default, everything is NA.

Providing some dummy value to default shows that the condition sum(value != 0) == 0 is working correctly together with group_by.

df |> group_by(key_1, key_2) |>
  mutate(value = case_when(sum(value != 0) == 0 ~ NA,
                           .default = "default"))
#> Output
# A tibble: 8 × 3
# Groups:   key_1, key_2 [4]
# key_1 key_2 value
# <int> <chr> <chr>
#     1 a     NA
#     1 a     NA
#     1 b     default
#     1 b     default
#     2 a     default
#     2 a     default
#     2 b     default
#     2 b     default

Solution

  • You need to make sure you get an NA for each row. (case_when is a little stricter about recycling than base R. Here, your test value is length-1 for the group, and the result you specify, NA, is length-1, but the result you want has the same length as the number of rows in the group.)

    df |>
      mutate(
        value = case_when(all(value == 0) ~ rep(NA, n()), .default = value),
        .by = c(key_1, key_2)
      )
    # # A tibble: 8 × 3
    #   key_1 key_2 value
    #   <int> <chr> <dbl>
    # 1     1 a        NA
    # 2     1 a        NA
    # 3     1 b         0
    # 4     1 b        -1
    # 5     2 a         1
    # 6     2 a         2
    # 7     2 b         1
    # 8     2 b         0