Search code examples
rdata-manipulation

R: Creating an "Other" Variable


I am working with the R Programming language.

Suppose I have the following data:

myFun <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

name = myFun(400)

variable = as.integer(abs(rnorm(400, 500,100)))

my_data = data.frame(name,variable)

I want to keep the top 5 rows (based on the value of "variable") and group (sum) everything else as "other". I thought of the following way to do this:

my_data <- my_data [order(-variable),]

my_data_top_5 = my_data[1:5,]

my_data_remainder = my_data[6:nrow(my_data),]
other_count = sum(my_data_remainder$variable)

other = data.frame( name = "other", variable = other_count)

final_result = rbind(my_data_top_5, other)

I think this worked - but is there a more efficient way to do this?

Thanks!


Solution

  • In tidyverse, arrange the data based on the descending order of 'variable', replace the 'name' from 6th onwards to 'other' and do a group by sum

    library(dplyr)
    my_data %>%
       arrange(desc(variable)) %>%
      group_by(name = replace(name, 6:n(), "other")) %>% 
       summarise(variable = sum(variable, na.rm = TRUE), .groups = 'drop')