Search code examples
rdplyrsum

Partially aggregate (sum) values that are below the first n observations in dplyr


I have a dataframe like df:

id <- c("aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh", "ii", "jj")
value <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
df <- data.frame(id, value)

I want to aggregate the values that are rank below n-th position (in this case, n=5), such that, the new data looks like df_new:

id_new <- c("aa", "bb", "cc", "dd", "ee", "others") 
value_new <- c(10, 9, 8, 7, 6, 15)
df_new <- data.frame(id_new, value_new)

I have tried to combine top_n with summarise in dplyr but I do not find the way to aggregate a dataset partially.

Any idea?


Solution

  • One way could be to utilize the fct_lump_n in forcats (part of tidyverse):

    The idea is to treat id as a factor with x levels, fct_lump_n lumps all but n into the same factor. We could then use count to summarize all the others-factor-level into one. I.e.

    library(forcats)
    library(dplyr)
    
    df |>
      mutate(id = fct_lump_n(as.factor(id), n = 5, w = value, other_level = "others")) |>
      count(id, wt = value, name = "value")
    

    Or use a similar logic with dplyr alone: As we have no value-weighting in the first place, we'll need to arrange before recoding.

    library(dplyr)
    
    df |>
      arrange(desc(value)) |> 
      mutate(id = if_else(row_number() > 5, "others", id)) |>
      count(id, wt = value, name = "value")
    

    Output:

          id value
    1     aa    10
    2     bb     9
    3     cc     8
    4     dd     7
    5     ee     6
    6 others    15