python numpy pandas

Why does np.percentile return NaN for high percentiles?


This code:

print(len(my_series))
print(np.percentile(my_series, 98))
print(np.percentile(my_series, 99))

gives:

14221  # This is the series length
1644.2  # 98th percentile
nan  # 99th percentile?

Why does 98 work fine but 99 give nan?


Solution

  • np.percentile treats NaNs as very high numbers (they sort to the end of the array), so the highest percentiles land in the range occupied by NaNs. In your case, between 1 and 2 percent of your data is NaN: the 99th percentile falls among the NaNs and returns nan, while the 98th returns a number, but that number is not the 98th percentile of the valid values.

    To calculate the percentile while ignoring the NaNs, use np.nanpercentile():

    print(np.nanpercentile(my_series, 98))
    print(np.nanpercentile(my_series, 99))
    

    Edit: In newer NumPy versions, np.percentile returns nan whenever the input contains any NaN, which makes the problem immediately apparent. np.nanpercentile still works the same.
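
To check whether this is what is happening with your own data, a minimal sketch along these lines may help. The array `values` and its 1.5% share of NaNs are made up for illustration (they stand in for `my_series`), and the behavior of np.percentile shown here assumes a recent NumPy version.

```python
import numpy as np

# Hypothetical stand-in for my_series: 1000 values, 15 of them (1.5%) NaN.
rng = np.random.default_rng(0)
values = rng.normal(loc=1000.0, scale=300.0, size=1000)
values[:15] = np.nan

# Share of missing values in the data.
nan_fraction = np.isnan(values).mean()
print(nan_fraction)

# In recent NumPy versions, any NaN in the input makes the result nan.
print(np.percentile(values, 98))

# nanpercentile ignores the NaNs, which should match filtering them out first.
valid = values[~np.isnan(values)]
p99_nan_aware = np.nanpercentile(values, 99)
p99_filtered = np.percentile(valid, 99)
print(p99_nan_aware, p99_filtered)
```

If `nan_fraction` is nonzero, the NaN-aware result is the one that describes the valid values.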