I'm trying to find the top 3 factor levels within each group, based on an aggregating variable, and group the remaining factor levels into "other" for each group. Normally I'd use fct_lump_n for this, but I can't figure out how to make it work within each group. Here's an example, where I want to form groups based on the x variable, order the y variables based on the value of z, choose the first 3 y variables, and group the rest of y into "other":
library(tidyverse)

set.seed(50)
df <- tibble(x = factor(sample(letters[18:20], 100, replace = TRUE)),
             y = factor(sample(letters[1:10], 100, replace = TRUE)),
             z = sample(100, 100, replace = TRUE))
I've tried doing this:
df %>%
  group_by(x) %>%
  arrange(desc(z), .by_group = TRUE) %>%
  slice_head(n = 3)
which returns this:
# A tibble: 9 x 3
# Groups:   x [3]
  x     y         z
  <fct> <fct> <int>
1 r     i        95
2 r     c        92
3 r     a        88
4 s     g        94
5 s     g        92
6 s     f        92
7 t     j       100
8 t     d        93
9 t     i        81
This is basically what I want, but I'm missing an "other" row within each of r, s, and t that collects the values of z which have not been counted.
Can I use fct_lump_n for this? Or slice_head combined with grouping the excluded variables into "other"?
Tried in R 4.0.0 and tidyverse 1.3.0:
library(tidyverse)

set.seed(50)
df <- tibble(x = factor(sample(letters[18:20], 100, replace = TRUE)),
             y = factor(sample(letters[1:10], 100, replace = TRUE)),
             z = sample(100, 100, replace = TRUE))

df %>%
  group_by(x) %>%
  arrange(desc(z)) %>%
  # a = rank of each row within its x group, largest z first
  mutate(a = row_number(-z)) %>%
  # everything past the top 3 rows becomes "Other"
  mutate(y = case_when(a > 3 ~ "Other", TRUE ~ as.character(y))) %>%
  mutate(a = case_when(a > 3 ~ "Other", TRUE ~ as.character(a))) %>%
  # grouping by a as well keeps top rows that share a y value separate
  group_by(x, y, a) %>%
  summarize(z = sum(z)) %>%
  arrange(x, a) %>%
  select(-a)
Output:
# A tibble: 12 x 3
# Groups:   x, y [11]
   x     y         z
   <fct> <chr> <int>
 1 r     b        92
 2 r     j        89
 3 r     g        83
 4 r     Other   749
 5 s     i        93
 6 s     h        93
 7 s     i        84
 8 s     Other  1583
 9 t     a        99
10 t     b        98
11 t     i        95
12 t     Other  1508
Note: the variable a is used together with y to compensate for the fact that y is sampled with replacement (see rows 5 and 7 of the output). Without a, rows 5 and 7 of the output would have their z values summed. Also note that, while I tried to solve the problem as posed, I left y as character, since I assume those "Other"s are not meant to be one and the same factor level.
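On the fct_lump_n part of the question: a possible alternative, if you want the top 3 levels of y by total z within each x rather than the top 3 individual rows, is to call fct_lump_n with its w (weights) argument inside a grouped mutate. This is only a sketch and it answers a slightly different question from the code above. It assumes a forcats version that has fct_lump_n. The name lumped is just for illustration, fct_drop is my own precaution against levels that don't occur in a group, and ties in total z can keep more than 3 levels.

```r
library(dplyr)
library(forcats)

set.seed(50)
df <- tibble(x = factor(sample(letters[18:20], 100, replace = TRUE)),
             y = factor(sample(letters[1:10], 100, replace = TRUE)),
             z = sample(100, 100, replace = TRUE))

lumped <- df %>%
  group_by(x) %>%
  # Within each x, keep the 3 levels of y with the largest summed z and lump
  # the rest into "Other". fct_drop() removes levels absent from the group;
  # as.character() avoids combining factors whose levels differ by group.
  mutate(y = as.character(fct_lump_n(fct_drop(y), n = 3, w = z,
                                     other_level = "Other"))) %>%
  group_by(x, y) %>%
  summarise(z = sum(z)) %>%
  ungroup()

lumped
```

Because the lumping is weighted by z per level, rows that share a y value within a group are summed together here, which is exactly what the extra a column in the code above avoids.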