How to use fct_lump() to get the top n levels by group and put the rest in 'other'?


I'm trying to find the top 3 factor levels within each group, ranked by an aggregating variable, and lump the remaining factor levels into "other" for each group. Normally I'd use fct_lump_n for this, but I can't figure out how to make it work within each group. Here's an example, where I want to group by x, order the values of y by z, keep the first 3 values of y, and lump the rest of y into "other":

library(tidyverse)

set.seed(50)
df <- tibble(x = factor(sample(letters[18:20], 100, replace = TRUE)),
             y = factor(sample(letters[1:10], 100, replace = TRUE)),
             z = sample(100, 100, replace = TRUE))

I've tried doing this:

df %>%
  group_by(x) %>%
  arrange(desc(z), .by_group = TRUE) %>%
  slice_head(n = 3)

which returns this:

# A tibble: 9 x 3
# Groups:   x [3]
  x     y         z
  <fct> <fct> <int>
1 r     i        95
2 r     c        92
3 r     a        88
4 s     g        94
5 s     g        92
6 s     f        92
7 t     j       100
8 t     d        93
9 t     i        81

This is basically what I want, but I'm missing an 'other' row within each of r, s, and t that collects the values of z that have not been counted.

Can I use fct_lump_n for this? Or slice_head combined with lumping the excluded rows into "other"?


Solution

  • Tested with R 4.0.0 and tidyverse 1.3.0:

    library(tidyverse)

    set.seed(50)
    df <- tibble(x = factor(sample(letters[18:20], 100, replace = TRUE)),
                 y = factor(sample(letters[1:10], 100, replace = TRUE)),
                 z = sample(100, 100, replace = TRUE))
    
    df %>%
      group_by(x) %>%
      arrange(desc(z)) %>%
      # a = rank of each row within its x group, largest z first
      mutate(a = row_number(-z)) %>%
      # rows ranked below 3 get y = "Other" and share the key a = "Other"
      mutate(y = case_when(a > 3 ~ "Other", TRUE ~ as.character(y))) %>%
      mutate(a = case_when(a > 3 ~ "Other", TRUE ~ as.character(a))) %>%
      # top-3 rows stay separate (distinct a); "Other" rows collapse into one sum
      group_by(x, y, a) %>%
      summarize(z = sum(z)) %>%
      arrange(x, a) %>%
      select(-a)
    

    Output:

    # A tibble: 12 x 3
    # Groups:   x, y [11]
       x     y         z
       <fct> <chr> <int>
     1 r     b        92
     2 r     j        89
     3 r     g        83
     4 r     Other   749
     5 s     i        93
     6 s     h        93
     7 s     i        84
     8 s     Other  1583
     9 t     a        99
    10 t     b        98
    11 t     i        95
    12 t     Other  1508
    

    Note: the variable a is used alongside y because y is sampled with replacement, so the same y value can appear more than once among a group's top 3 rows (see rows 5 and 7 of the output). Without a, rows 5 and 7 would be collapsed into one row and their z values summed. Also note that I solved the problem as posed but left y as character rather than a factor, since I assume the "Other" rows of different groups are not meant to share a single factor level.
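
As for the fct_lump_n part of the question, here is a minimal sketch of how it might be applied per group. It is not the method used above: it ranks the levels of y by their summed z within each x (via the w argument) instead of ranking individual rows, so a y value that repeats within a group is counted once. It assumes forcats >= 0.5.0 (where fct_lump_n was introduced). The small data frame is made up for illustration. fct_drop() is applied first as a precaution so that levels absent from a group do not take part in the ranking, and the result is converted to character because each group ends up with its own set of levels.

```r
library(dplyr)
library(forcats)

# Hypothetical toy data (not the df from the question), chosen so the
# per-group totals are easy to check by hand.
toy <- tibble(
  x = factor(rep(c("r", "s"), each = 6)),
  y = factor(c("a", "b", "c", "d", "e", "a",
               "a", "b", "c", "d", "e", "e")),
  z = c(10, 20, 30, 40, 50, 5,
        50, 40, 30, 20, 10, 1)
)

lumped <- toy %>%
  group_by(x) %>%
  # Within each x: keep the 3 levels of y with the largest summed z,
  # lump every other level into "Other".
  mutate(y = as.character(fct_lump_n(fct_drop(y), n = 3, w = z))) %>%
  # Re-group by the lumped labels and total z for each one.
  group_by(x, y) %>%
  summarise(z = sum(z)) %>%
  ungroup()

lumped
```

With this toy data, group r lumps a and b (35 in total) and group s lumps d and e (31 in total). Note that fct_lump_n leaves the factor unchanged when only a single level would end up in "Other", and that with its default ties.method, tied levels may cause more than n levels to be kept.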