I have a dataframe like df:
id <- c("aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh", "ii", "jj")
value <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
df <- data.frame(id, value)
I want to aggregate the values that are rank below n-th position (in this case, n=5), such that, the new data looks like df_new:
id_new <- c("aa", "bb", "cc", "dd", "ee", "others")
value_new <- c(10, 9, 8, 7, 6, 15)
df_new <- data.frame(id_new, value_new)
I have tried to combine top_n
with summarise
in dplyr but I do not find the way to aggregate a dataset partially.
Any idea?
One way could be to utilize the fct_lump_n
in forcats
(part of tidyverse
):
The idea is to treat id as a factor with x levels, fct_lump_n
lumps all but n
into the same factor. We could then use count
to summarize all the others-factor-level into one. I.e.
library(forcats)
library(dplyr)
df |>
mutate(id = fct_lump_n(as.factor(id), n = 5, w = value, other_level = "others")) |>
count(id, wt = value, name = "value")
Or use a similar logic with dplyr
alone: As we have no value-weighting in the first place, we'll need to arrange before recoding.
library(dplyr)
df |>
arrange(desc(value)) |>
mutate(id = if_else(row_number() > 5, "others", id)) |>
count(id, wt = value, name = "value")
Output:
id value
1 aa 10
2 bb 9
3 cc 8
4 dd 7
5 ee 6
6 others 15