I have a frequency distribution with very large counts. I want to calculate the median and quartiles, but R throws an error. Here is what works for small counts:
> TABLE <- data.frame(DATA = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19), F = c(48,0,192,1152,5664,23040,77952,214272,423984,558720,267840,0,0,0,0,0,0,0,0))
> summary(rep(TABLE$DAT,TABLE$F))
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 9.000 10.000 9.397 10.000 11.000
Here is what I get for huge counts:
> TABLE <- data.frame(DATA = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19), F = c(240,0,1200,9600,69600,470400,2992800,17859840,98312880,489292800,2164619760,8325820800,26865302400,68711068800,128967422400,153763315200,96770419200,26824089600,2395008000))
> summary(rep(TABLE$DAT,TABLE$F))
Error in rep(TABLE$DAT, TABLE$F) : invalid 'times' argument
In addition: Warning message:
In summary(rep(TABLE$DAT, TABLE$F)) :
NAs introduced by coercion to integer range
This error does not surprise me, because with "rep" I was trying to create an enormous vector. But I do not know how to avoid this and still calculate the median and the quartiles.
Rather than trying to replicate that monster vector just to call summary(), you can compute "weighted quantiles", treating each frequency in F as the weight of its DATA value.
This post has a formula.
But as with most things, once you know the right terms you can find a package that already does the work!
#install.packages("Hmisc")
TABLE <- data.frame(DATA = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19), F = c(240,0,1200,9600,69600,470400,2992800,17859840,98312880,489292800,2164619760,8325820800,26865302400,68711068800,128967422400,153763315200,96770419200,26824089600,2395008000))
Hmisc::wtd.quantile(TABLE$DATA, probs = c(0.25, 0.5, 0.75), weights = TABLE$F)
#> 25% 50% 75%
#> 15 16 16
Created on 2018-04-06 by the reprex package (v0.2.0).
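If you would rather not add a dependency, the same idea can be sketched in base R with cumulative frequencies. This is only a minimal sketch: `freq_quantile` is a hypothetical helper name, not part of Hmisc or base R, and it assumes the simple "inverse empirical CDF" definition (the smallest value whose cumulative share of the counts reaches p). That definition does not interpolate, so it can differ from summary() or wtd.quantile() when a quantile falls exactly on the boundary between two values.

```r
# Hypothetical helper (not from any package): quantiles of a frequency table
# using the smallest value whose cumulative share of the counts reaches p.
freq_quantile <- function(values, freq, probs = c(0.25, 0.5, 0.75)) {
  ord <- order(values)                      # make sure values are ascending
  values <- values[ord]
  freq <- freq[ord]
  cum_share <- cumsum(freq) / sum(freq)     # cumulative proportion per value
  idx <- vapply(probs, function(p) which(cum_share >= p)[1], integer(1))
  stats::setNames(values[idx], paste0(probs * 100, "%"))
}

TABLE <- data.frame(
  DATA = 1:19,
  F = c(240, 0, 1200, 9600, 69600, 470400, 2992800, 17859840, 98312880,
        489292800, 2164619760, 8325820800, 26865302400, 68711068800,
        128967422400, 153763315200, 96770419200, 26824089600, 2395008000)
)

res <- freq_quantile(TABLE$DATA, TABLE$F)
print(res)
#> 25% 50% 75%
#>  15  16  16
```

The counts are stored as doubles, which represent integers of this size exactly, so cumsum() never runs into the integer limit that makes rep() fail.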