I am working with the R Programming language.
Suppose I have the following data:
myFun <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
name = myFun(400)
variable = as.integer(abs(rnorm(400, 500,100)))
my_data = data.frame(name,variable)
I want to keep the top 5 rows (based on the value of "variable") and group (sum) everything else as "other". I thought of the following way to do this:
my_data <- my_data [order(-variable),]
my_data_top_5 = my_data[1:5,]
my_data_remainder = my_data[6:nrow(my_data),]
other_count = sum(my_data_remainder$variable)
other = data.frame( name = "other", variable = other_count)
final_result = rbind(my_data_top_5, other)
I think this worked - but is there a more efficient way to do this?
Thanks!
In tidyverse
, arrange
the data based on the desc
ending order of 'variable', replace
the 'name' from 6th onwards to 'other' and do a group by sum
library(dplyr)
my_data %>%
arrange(desc(variable)) %>%
group_by(name = replace(name, 6:n(), "other")) %>%
summarise(variable = sum(variable, na.rm = TRUE), .groups = 'drop')