Tags: r, median, frequency-distribution, quartile

Use R to calculate median without replicating elements


I have a frequency distribution with huge counts. I want to calculate the median and the quartiles, but R complains. Here is what works for small counts:

> TABLE <- data.frame(DATA = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19), F = c(48,0,192,1152,5664,23040,77952,214272,423984,558720,267840,0,0,0,0,0,0,0,0))
> summary(rep(TABLE$DATA,TABLE$F))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   9.000  10.000   9.397  10.000  11.000

Here is what I get for huge counts:

> TABLE <- data.frame(DATA = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19), F = c(240,0,1200,9600,69600,470400,2992800,17859840,98312880,489292800,2164619760,8325820800,26865302400,68711068800,128967422400,153763315200,96770419200,26824089600,2395008000))
> summary(rep(TABLE$DAT,TABLE$F))
Error in rep(TABLE$DAT, TABLE$F) : invalid 'times' argument
In addition: Warning message:
In summary(rep(TABLE$DAT, TABLE$F)) :
  NAs introduced by coercion to integer range

This error does not surprise me, because with rep() I am trying to create an enormous vector. But I do not know how to avoid this and still calculate the median and the quartiles.


Solution

  • Rather than trying to replicate that monster just to call summary(), you can compute "weighted quantiles", using the frequencies as weights. This post has a formula. But as with most things, once you know the right term, you can find a package that already does the work!

    #install.packages("Hmisc")
    
    TABLE <- data.frame(DATA = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19), F = c(240,0,1200,9600,69600,470400,2992800,17859840,98312880,489292800,2164619760,8325820800,26865302400,68711068800,128967422400,153763315200,96770419200,26824089600,2395008000))
    
    
    Hmisc::wtd.quantile(TABLE$DATA, probs = c(0.25, 0.5, 0.75), weights = TABLE$F)
    #> 25% 50% 75% 
    #>  15  16  16
    

    Created on 2018-04-06 by the reprex package (v0.2.0).
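
If you would rather not add a dependency, the same idea can be sketched in base R with cumulative frequencies: sort the values, accumulate the counts, and take the first value whose cumulative count reaches the requested share of the total. The helper name freq_quantile below is made up for this illustration and is not part of any package. It uses the inverse-CDF definition (comparable to quantile(type = 1)), so it can differ from summary(), which interpolates, when a quantile falls exactly on a boundary between two values.

```r
# Hypothetical helper, not from the original answer: quantiles of a
# frequency table without expanding it via rep().
freq_quantile <- function(values, freq, probs = c(0.25, 0.5, 0.75)) {
  ord <- order(values)           # make sure values are ascending
  values <- values[ord]
  cum <- cumsum(freq[ord])       # cumulative counts, stored as doubles
  total <- cum[length(cum)]
  # For each probability, pick the first value whose cumulative count
  # reaches at least that fraction of the total count.
  out <- vapply(probs,
                function(p) values[which(cum >= p * total)[1]],
                numeric(1))
  names(out) <- paste0(probs * 100, "%")
  out
}

TABLE <- data.frame(
  DATA = 1:19,
  F = c(240, 0, 1200, 9600, 69600, 470400, 2992800, 17859840, 98312880,
        489292800, 2164619760, 8325820800, 26865302400, 68711068800,
        128967422400, 153763315200, 96770419200, 26824089600, 2395008000)
)

freq_quantile(TABLE$DATA, TABLE$F)
# 25% = 15, 50% = 16, 75% = 16
```

The counts stay doubles throughout, so nothing is coerced to the 32-bit integer range that rep() requires for its times argument. On the small table from the question this gives 9, 10 and 10, which agrees with the summary() output there.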