python numpy statistics ranking percentile

Understanding numpy percentile computation

I understand percentiles in the context of test scores, where there are many examples (e.g. your SAT score falls in the 99th percentile), but I am not sure I understand what is going on in the following context. Imagine a model outputs probabilities (on some days we have a lot of new data and many output probabilities, and on some days we don't), and I want to compute the 99th percentile of those probabilities. Here are the probabilities for today:

import numpy as np

a = np.array([0, 0.2, 0.4, 0.7, 1])
p = np.percentile(a, 99)
print(p)

0.988

I don't understand how the 99th percentile is computed in this situation where there are only 5 outputted probabilities. How was the output computed? Thanks!

Solution

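NumPy applies linear interpolation between the two nearest data points. With 5 sorted values, the points sit at the 0th, 25th, 50th, 75th and 100th percentiles, so the 99th percentile falls between 0.7 and 1.0. You can check the consistency yourself: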

a = np.array([0,0.2,0.4,0.7,1])

np.sort(a)  # array([ 0. ,  0.2,  0.4,  0.7,  1. ])

np.percentile(a, 75)   # 0.70
np.percentile(a, 100)  # 1.0
np.percentile(a, 99)   # 0.988

# Interpolate linearly between the 75th percentile (0.70) and the 100th (1.0)
0.70 + (1.0 - 0.70) * (99 - 75) / (100 - 75)  # 0.988

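The documentation also specifies 'linear' as the default (in NumPy 1.22 and later this parameter is named method, and interpolation is the older, deprecated name):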

numpy.percentile(a, q, axis=None, out=None, overwrite_input=False, interpolation='linear', keepdims=False)

'linear': i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
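To make the "fractional part of the index" concrete, here is a minimal sketch in plain Python of how that formula could be applied by hand. The helper name percentile_linear is hypothetical (it is not part of NumPy), and the index formula q / 100 * (n - 1) is my reading of the 'linear' rule, not a copy of NumPy's implementation:

```python
import math

def percentile_linear(values, q):
    """Hypothetical helper: q-th percentile via linear interpolation."""
    s = sorted(float(v) for v in values)
    # Fractional index into the sorted data: 0 -> first item, 100 -> last item
    pos = (q / 100.0) * (len(s) - 1)
    lo = math.floor(pos)             # index of i, the lower neighbour
    hi = min(lo + 1, len(s) - 1)     # index of j, the upper neighbour
    fraction = pos - lo              # fractional part of the index
    return s[lo] + (s[hi] - s[lo]) * fraction

a = [0, 0.2, 0.4, 0.7, 1]
# q = 99 gives pos = 0.99 * 4 = 3.96, so i = 0.7, j = 1.0, fraction = 0.96
manual = percentile_linear(a, 99)
print(round(manual, 6))
```

For the array in the question this should give 0.7 + 0.3 * 0.96, which is about 0.988 and agrees with the value np.percentile(a, 99) printed above.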