
How to trim X% top and bottom of a data frame (based on a column)?


I have the following data frame:

library(dplyr)  # provides %>% and arrange()

set.seed(3994)
val <- round(runif(n = 30, min = 5, max = 300), digits = 0)
cat <- rep(c("A", "B", "C"), each = 10)
date <- as.Date(sample(seq(as.Date('2000/01/01'), as.Date('2020/01/01'), by = "day"), 30))

df <- data.frame(val, cat, date)
df <- df %>%
  arrange(cat, val)

I want to trim the top X% and bottom X% of my data within each category of column cat, with the rows ranked by column val. For example, I want to remove the top 2% and bottom 2% of rows for each of the categories "A", "B", and "C".

I wrote the following code:

trimTopBottomByCategory <- function(dataframe, category_col, numeric_col, date_column,  x) {
  trimmed_dataframes <- list()
  
  categories <- unique(dataframe[[category_col]])
  for (category in categories) {
    subset_df <- dataframe[dataframe[[category_col]] == category, ]
    
    n <- nrow(subset_df)
    num_to_trim <- ceiling(x / 100 * n)
    
    sorted_subset <- subset_df[order(subset_df[[numeric_col]]), ]
    trimmed_df <- sorted_subset[(num_to_trim + 1):(n - num_to_trim), ]
    trimmed_dataframes[[category]] <- trimmed_df
  }

  trimmed_combined <- do.call(rbind, trimmed_dataframes)
  return(trimmed_combined <- trimmed_combined %>% 
             arrange(category_col, date_column))
}

My question: I believe my code does what it is supposed to, but is there a built-in method in R that does the same?

Bonus question: I don't understand why my final data is not sorted by the date column.


Solution

  • Order by cat and date rather than by cat and val. (This should also work with dplyr::arrange, but I don't want to load dplyr.)

    df <- df[with(df, order(cat, date)), ]
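
On the bonus question, a likely cause is that the function calls arrange(category_col, date_column) with the column names held as character strings. dplyr then sorts by those constant strings rather than by the columns they name, so the row order does not change. The sketch below shows one base R way to sort by columns whose names are stored in variables; the toy data frame and the name sorted are made up for illustration.

```r
# Toy data, hypothetical and for illustration only
toy <- data.frame(
  cat  = c("B", "A", "B", "A"),
  date = as.Date(c("2020-01-02", "2021-05-01", "2019-03-04", "2018-07-09")),
  val  = c(10, 20, 30, 40)
)

# Column names arrive as strings, as in the function's arguments
category_col <- "cat"
date_column  <- "date"

# [[ ]] looks up the column each string names, so order() sees the data
sorted <- toy[order(toy[[category_col]], toy[[date_column]]), ]
sorted
```

If you prefer to stay with dplyr inside the function, arrange(.data[[category_col]], .data[[date_column]]) should have the same effect.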
    

    You can use ave, whose first argument is the value column val and whose second is the grouping column cat; ave applies FUN to the values within each category. To drop the highest and lowest 2%, compute the 2% and 98% quantiles with quantile and keep only the values that lie between them. The comparison itself is logical, but because val is numeric, ave returns the result as numeric (0/1), so we wrap it in as.logical to get the logical vector ss that subsets the data frame.

    ss <- with(df, as.logical(ave(val, cat, FUN=\(x) {
      q <- quantile(x, probs=c(.02, 1 - .02))
      x >= q[1] & x <= q[2]
    })))
    
    df[ss, ]
    #    val cat       date
    # 3   81   A 2000-08-10
    # 10 188   A 2000-11-03
    # 4  171   A 2006-11-26
    # 2  182   A 2009-07-05
    # 7  173   A 2010-09-12
    # 6   54   A 2012-06-01
    # 1  227   A 2014-08-05
    # 9   95   A 2016-09-13
    # 17 219   B 2002-12-29
    # 14 221   B 2004-07-28
    # 18 225   B 2011-06-29
    # 19 191   B 2013-03-05
    # 16 236   B 2013-09-27
    # 12 117   B 2015-11-30
    # 15 131   B 2017-11-22
    # 13  92   B 2019-02-09
    # 27 251   C 2000-03-13
    # 30 160   C 2001-03-12
    # 28 112   C 2002-02-19
    # 29 174   C 2005-07-19
    # 22 248   C 2006-12-23
    # 21 176   C 2012-01-25
    # 26  85   C 2016-08-06
    # 24  56   C 2017-12-12
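
Since the question asks for a general X%, the filter above can be wrapped in a small function. This is only a sketch: the name trim_by_quantile and its arguments are hypothetical, and it assumes pct is below 50. It is applied here to the category "A" values from the Data section.

```r
# Hypothetical helper: keep rows whose value lies between the pct% and
# (100 - pct)% quantiles of its own group
trim_by_quantile <- function(data, value_col, group_col, pct) {
  p <- pct / 100
  keep <- ave(data[[value_col]], data[[group_col]], FUN = function(v) {
    q <- quantile(v, probs = c(p, 1 - p))
    as.numeric(v >= q[1] & v <= q[2])  # 0/1 flags, one per row of the group
  })
  data[as.logical(keep), , drop = FALSE]
}

# Category "A" values, copied from the Data section
a <- data.frame(
  val = c(81, 188, 171, 12, 264, 182, 173, 54, 227, 95),
  cat = "A"
)

cutoffs   <- quantile(a$val, probs = c(0.02, 0.98))  # default interpolation (type 7)
trimmed_a <- trim_by_quantile(a, "val", "cat", pct = 2)
```

With ten values per category, both cutoffs fall between an extreme value and its neighbour, so exactly one row is dropped at each end. Values tied with a cutoff are kept, because the comparison uses >= and <=.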
    

    Data:

    df <- structure(list(val = c(81, 188, 171, 12, 264, 182, 173, 54, 227, 
    95, 219, 221, 274, 78, 225, 191, 236, 117, 131, 92, 251, 160, 
    112, 265, 174, 248, 176, 42, 85, 56), cat = c("A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", 
    "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "C", "C", "C", 
    "C"), date = structure(c(11179, 11264, 13478, 13910, 14119, 14430, 
    14864, 15492, 16287, 17057, 12050, 12627, 14565, 14605, 15154, 
    15769, 15975, 16769, 17492, 17936, 11029, 11393, 11737, 12467, 
    12983, 13505, 15364, 15472, 17019, 17512), class = "Date")), row.names = c(3L, 
    10L, 4L, 5L, 8L, 2L, 7L, 6L, 1L, 9L, 17L, 14L, 20L, 11L, 18L, 
    19L, 16L, 12L, 15L, 13L, 27L, 30L, 28L, 23L, 29L, 22L, 21L, 25L, 
    26L, 24L), class = "data.frame")
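
The function in the question trims a fixed number of rows, ceiling(x / 100 * n), from each end of every category, whereas the quantile filter trims by value. The two rules can keep different rows, for example when a cutoff quantile coincides with an observed value or when there are ties. If the count-based rule is what you need, it can also be expressed with ave and rank. The following is a minimal sketch under that assumption; trim_by_count and the toy data are hypothetical names for illustration.

```r
# Hypothetical helper: drop the k smallest and k largest rows per group,
# where k = ceiling(pct / 100 * group size), as in the question's function
trim_by_count <- function(data, value_col, group_col, pct) {
  keep <- ave(data[[value_col]], data[[group_col]], FUN = function(v) {
    n <- length(v)
    k <- ceiling(pct / 100 * n)
    r <- rank(v, ties.method = "first")  # unique ranks 1..n, ties broken by position
    as.numeric(r > k & r <= n - k)       # 0/1 flags, one per row of the group
  })
  data[as.logical(keep), , drop = FALSE]
}

# Toy data, for illustration only: 10 rows in "A", 5 rows in "B"
toy <- data.frame(
  val = c(1:10, 5, 3, 9, 1, 7),
  cat = rep(c("A", "B"), times = c(10, 5))
)

res <- trim_by_count(toy, "val", "cat", pct = 5)
res
```

The original row order is preserved, so the result can be sorted afterwards by category and date as needed. Note that a group is emptied entirely when 2 * k is at least its size.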