I want to replace the NAs in a variable with a weighted random sample of the values within a given group. Rough sample data:
A tibble: 16 × 3
letter number code
<chr> <int> <chr>
1 a 1 w1
2 a 1 w1
3 a 1 w2
4 a 1 NA
5 a 2 x1
6 a 2 x2
7 a 2 x2
8 a 2 NA
9 b 1 y1
10 b 1 y2
11 b 1 NA
12 b 2 z1
13 b 2 z2
14 b 2 z3
15 b 2 z4
16 b 2 NA
For each letter & number group, the NA would be replaced based on the proportions that the given codes exist already in that group. The group where letter = a and number = 1, has the values w1 and w2. For each NA code in that group, it would be replaced by sampling those two options, weighted based on frequency they already appear in that group.
Given the outcome would be a random weighted value, there isn't an exact expected outcome, but in principle this is the idea:
# A tibble: 16 × 3
letter number code
<chr> <int> <chr>
1 a 1 w1
2 a 1 w1
3 a 1 w2
4 a 1 2/3 chance w1; 1/3 chance w2
5 a 2 x1
6 a 2 x2
7 a 2 x2
8 a 2 1/3 chance x1; 2/3 chance x2
9 b 1 y1
10 b 1 y2
11 b 1 50% chance either y1 or y2
12 b 2 z1
13 b 2 z2
14 b 2 z3
15 b 2 z4
16 b 2 25% chance z1, z2, z3, z4
I can do this manually, entering the options and proportions for each group:
df |>
mutate(
replaced =
case_when(
letter == "a" & number == 1 ~ replace_na(code, sample(c("w1", "w2"), 1, replace = TRUE, prob = c(2/3, 1/3))),
letter == "a" & number == 2 ~ replace_na(code, sample(c("x1", "x2"), 1, replace = TRUE, prob = c(1/3, 2/3))),
letter == "b" & number == 1 ~ replace_na(code, sample(c("y1", "y2"), 1, replace = TRUE,)),
letter == "b" & number == 2 ~ replace_na(code, sample(c("z1", "z2", "z3", "z4"), 1, replace = TRUE))
)
)
But my actual data is too large to make that a viable option. Is there a way to streamline this so the possible values and proportions per group are determined automatically? No reason it needs to be a case_when
mutate function. I've wondered if splitting into a list of dfs for each group, mapping, and then recombining might be a better approach for this.
structure(list(letter = c("a", "a", "a", "a", "a", "a", "a",
"a", "b", "b", "b", "b", "b", "b", "b", "b"), number = c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
code = c("w1", "w1", "w2", NA, "x1", "x2", "x2", NA, "y1",
"y2", NA, "z1", "z2", "z3", "z4", NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -16L))
We can use sample
on the non-missing entries by group:
library(dplyr)
df |>
group_by(letter, number) |>
mutate(code = coalesce(code, sample(x = na.omit(code), size = n(), replace = TRUE))) |>
ungroup()
# # A tibble: 16 × 3
# letter number code
# <chr> <int> <chr>
# 1 a 1 w1
# 2 a 1 w1
# 3 a 1 w2
# 4 a 1 w2
# 5 a 2 x1
# 6 a 2 x2
# 7 a 2 x2
# 8 a 2 x2
# ...
Empirically, we can check that sample
doesn't do unique()
on the vector to sample from, meaning this should work just fine:
set.seed(47)
sample(c('a', 'a', 'a', 'b'), size = 100, replace = TRUE) |> table()
# a b
# 79 21