python numpy statistics ranking percentile

Understanding numpy percentile computation


I understand percentiles in the context of test scores, where there are many examples (e.g., your SAT score falls in the 99th percentile), but I am not sure I understand what is going on in the following context. Imagine a model outputs probabilities (on some days we have a lot of new data and output probabilities, and on some days we don't), and I want to compute the 99th percentile of those probabilities. Here are the probabilities for today:

import numpy as np

a = np.array([0, 0.2, 0.4, 0.7, 1])
p = np.percentile(a, 99)
print(p)

0.988

I don't understand how the 99th percentile is computed in this situation where there are only 5 outputted probabilities. How was the output computed? Thanks!


Solution

  • Linear interpolation is applied. You can check consistency yourself:

    import numpy as np

    a = np.array([0,0.2,0.4,0.7,1])
    
    np.sort(a)  # array([ 0. ,  0.2,  0.4,  0.7,  1. ])
    
    np.percentile(a, 75)   # 0.70
    np.percentile(a, 100)  # 1.0
    np.percentile(a, 99)   # 0.988
    
    0.70 + (1.0 - 0.70) * (99 - 75) / (100 - 75)  # 0.988
    

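    The documentation also specifies 'linear' as the default (the parameter is called interpolation in the signature below; NumPy 1.22 and later name it method):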

    numpy.percentile(a, q, axis=None, out=None, overwrite_input=False, interpolation='linear', keepdims=False)

    'linear': i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
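To make the quoted formula concrete, here is a minimal sketch of the same calculation done by hand. The helper name `linear_percentile` is hypothetical (it is not part of NumPy), and the sketch assumes the fractional index is `(n - 1) * q / 100` into the sorted array, which is consistent with the values shown above. For the 99th percentile of 5 values that index is 3.96, so the result lies 96% of the way from 0.7 to 1.0.

```python
import math

import numpy as np


def linear_percentile(values, q):
    """Hypothetical helper: q-th percentile by linear interpolation.

    Assumes a non-empty 1-D input and 0 <= q <= 100.
    """
    s = np.sort(np.asarray(values, dtype=float))
    pos = (len(s) - 1) * q / 100.0   # fractional index into the sorted array
    lo = math.floor(pos)             # index of i, the lower neighbour
    hi = min(lo + 1, len(s) - 1)     # index of j, the upper neighbour
    fraction = pos - lo              # fractional part of the index
    return s[lo] + (s[hi] - s[lo]) * fraction  # i + (j - i) * fraction


a = np.array([0, 0.2, 0.4, 0.7, 1])
print(linear_percentile(a, 99))  # compare with the NumPy result on the next line
print(np.percentile(a, 99))
```

Both printed values should agree with the 0.988 from the question, up to floating-point rounding. With only 5 data points, the 99th percentile is therefore not an observed probability but a point interpolated between the two largest values.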