Search code examples
rdata.tableweighted-average

Trimmed mean by group in R data.table


I have a data.table on which I want to find the weighted mean of column performance by month.

  dat <- structure(list(year = c(2014, 2015, 2016, 2017, 2018, 2019, 2020, 
                                 2021, 2014, 2015, 2016, 2017, 2018, 2019, 2020), 
                        month = c(2, 
                                  2, 2, 2, 2, 2, 2, 2, 10, 10, 10, 10, 10, 10, 10), 
                        performance = c(0.826973794097158, 
                                        0.61975709469356, 0.924350659523548, -0.183133219063708, -0.529913189565746, 
                                        -0.148531188902535, -0.0773058814083695, 1.42862504650241, 0.465498268732376, 
                                        0.148719963224136, 0.205614191281359, 0.560651497949418, -0.484408605607923, 
                                        0.875353374774486, 0.351469397380814)), 
                   row.names = c(NA, -15L), class = c("data.table", "data.frame"))

This data.table looks like following -

    year month performance
 1: 2014     2  0.82697379
 2: 2015     2  0.61975709
 3: 2016     2  0.92435066
 4: 2017     2 -0.18313322
 5: 2018     2 -0.52991319
 6: 2019     2 -0.14853119
 7: 2020     2 -0.07730588
 8: 2021     2  1.42862505
 9: 2014    10  0.46549827
10: 2015    10  0.14871996
11: 2016    10  0.20561419
12: 2017    10  0.56065150
13: 2018    10 -0.48440861
14: 2019    10  0.87535337
15: 2020    10  0.35146940

To find the weighted mean by month, I have used the following code -

setDT(dat)[, lapply(.SD, function(x) weighted.mean(x, na.rm = TRUE)), by = .(month), .SDcols = c("performance")]

and the result I am getting is -

   month performance
1:     2   0.3576029
2:    10   0.3032712

However, the weighted mean performance of month 10 should be greater than month 2 as it has more positive values.

It seems that only the month 2 of the year 2021, is weighing heavily on its performance causing it to outperform the performance of month 10. Actually, the code above is only finding the mean and NOT the weighted.mean. The result is the same if I use mean instead of weighted.mean.

setDT(dat)[, lapply(.SD, function(x) mean(x, na.rm = TRUE)), by = .(month), .SDcols = c("performance")]

and the result after using simple mean is following, which is the same as the result of weighted.mean.

   month performance
1:     2   0.3576029
2:    10   0.3032712

The desired result should give equal weight to the performance of each year so that exceptional performance in one particular year does not falsely show that the product sells wonderfully during that month each year.

Can someone point out what is wrong with my weighted mean calculation?


Solution

  • You could simply remove outliers :

    remove_outliers <- function(x, na.rm = TRUE, ...) {
      qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
      H <- 1.5 * IQR(x, na.rm = na.rm)
      y <- x
      y[x < (qnt[1] - H)] <- NA
      y[x > (qnt[2] + H)] <- NA
      y
    }
    setDT(dat)[, lapply(.SD, function(x) mean(remove_outliers(x))), by = .(month), .SDcols = c("performance")]
    
    month performance
    1:     2   0.3576029
    2:    10   0.4345511
    

    Or limit outliers, for instance to first and third quartile:

    limit_outliers <- function(x, na.rm = TRUE, ...) {
      qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
      y <- x
      y[x < (qnt[1] )] <- qnt[1]
      y[x > (qnt[2] )] <- qnt[2] 
      y
    }
    
    setDT(dat)[, lapply(.SD, function(x) mean(limit_outliers(x), na.rm = TRUE)), by = .(month), .SDcols = c("performance")]
    
    month performance
    1:     2   0.3261458
    2:    10   0.3432951