Search code examples
rdistributionmedianquantile

When values have different sampling probabilities, what is the fastest way to calculate the median?


Consider this toy example:

A teacher wants to calculate the median height of students in his class. But not all students show up to class every day so on any given day, the calculated median height might be different. Their probabilities of being in class and their heights are given in the table below. Given this information, he could estimate the expected median.

>set.seed(123)
>data1 <- data.frame(Student=c(LETTERS[1:10]), Height.cm=sort( rnorm(n=10, mean=140, sd=10)), Prob.in.class=c(1,.75,1,.5,1,1,1,.25,1,.5))

>data1

   Student Height.cm Prob.in.class
1        A  127.3494          1.00
2        B  133.1315          0.75
3        C  134.3952          1.00
4        D  135.5434          0.50
5        E  137.6982          1.00
6        F  140.7051          1.00
7        G  141.2929          1.00
8        H  144.6092          0.25
9        I  155.5871          1.00
10       J  157.1506          0.50

What is the fastest way in R to estimate the median (or arbitrary quantiles) of a distribution like this?

For my actual calculations, I'll need to estimate the medians and arbitrary quantiles on hundreds of different vectors with tens of thousands of points (and associated probabilities) each. I've seen this suggestion where the probability density function is estimated using the trapezoidal method but I'm not sure this is the best way.

Any advice you could provide would be greatly appreciated. Thanks!


Solution

  • Something like this should work, but be careful on the weight vector as shown below

    #your data
    set.seed(123)
    data1 <- data.frame(Student=c(LETTERS[1:10]), Height.cm=sort( rnorm(n=10, mean=140, sd=10)), Prob.in.class=c(1,.75,1,.5,1,1,1,.25,1,.5))
    
    #Test a known ...
    data2 <- c(1,1,1,1,1,2,3,3,3,3,3) # median clearly 2
    median(data2) #yields 2, yah... 
    
    #using weights... median should be 2 if function working right
    data3 <- data.frame(Student=c(LETTERS[1:3]), Height.cm=c(1,2,3), Prob.in.class=c(5/12,2/12,5/12))
    reldist::wtd.quantile(data3$Height.cm, q = .5, 
                      weight = data3$Prob.in.class) # yields 3, not the right answer
    
    #the wtd.quantile function does not like probabilities. 
    #multiply the weights to something greater than 1 seems to work. 
    reldist::wtd.quantile(data3$Height.cm, q = .5, weight = data3$Prob.in.class*100) # yields 2, the right answer