Does the quantile() function in Pandas ignore NaN?


I have a DataFrame, dfAB:

import numpy as np
import pandas as pd
import random

A = [ random.randint(0,100) for i in range(10) ]
B = [ random.randint(0,100) for i in range(10) ]

dfAB = pd.DataFrame({ 'A': A, 'B': B })
dfAB

I want to know the 75th percentile of each column, so I call quantile():

dfAB.quantile(0.75)

But say I now put some NaNs into dfAB and call the function again. Obviously the result is different:

dfAB.loc[5:8] = np.nan
dfAB.quantile(0.75)

For context, when I calculated the mean of dfAB, I passed skipna=True to ignore the NaNs, because I didn't want them affecting my stats (I have quite a few of them, on purpose, and obviously setting them to zero doesn't help):

dfAB.mean(skipna=True)

So what I'm getting at is: does the quantile() function ignore NaNs, and if so, how does it handle them?


Solution

  • Yes, DataFrame.quantile appears to ignore NaN values. To illustrate, you can compare its results to np.nanpercentile, which explicitly computes "the qth percentile of the data along the specified axis, while ignoring nan values" (quoted from the docs):

    >>> dfAB
          A     B
    0   5.0  10.0
    1  43.0  67.0
    2  86.0   2.0
    3  61.0  83.0
    4   2.0  27.0
    5   NaN   NaN
    6   NaN   NaN
    7   NaN   NaN
    8   NaN   NaN
    9  27.0  70.0
    
    >>> dfAB.quantile(0.75)
    A    56.50
    B    69.25
    Name: 0.75, dtype: float64
    
    >>> np.nanpercentile(dfAB, 75, axis=0)
    array([56.5 , 69.25])
    

    As you can see, the two results are the same.
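
As a further check, here is a minimal sketch that rebuilds the frame from the output shown above (the values are copied by hand, so treat them as an assumption) and compares quantile() on the full column with quantile() on the same column after an explicit dropna(). It also shows that plain np.percentile, which does not skip NaN, returns nan for the same data. The variable names with_nans, without_nans and plain are only illustrative.

```python
import numpy as np
import pandas as pd

# Values copied from the example output above; rows 5-8 are NaN.
df = pd.DataFrame({
    "A": [5, 43, 86, 61, 2, np.nan, np.nan, np.nan, np.nan, 27],
    "B": [10, 67, 2, 83, 27, np.nan, np.nan, np.nan, np.nan, 70],
})

# quantile() called directly on the columns that still contain NaN.
with_nans = df.quantile(0.75)

# The same quantile after dropping the NaNs from each column explicitly.
without_nans = df.apply(lambda col: col.dropna().quantile(0.75))

# If quantile() skips NaN, both results should agree.
print(np.allclose(with_nans.to_numpy(), without_nans.to_numpy()))

# For contrast: np.percentile does not skip NaN, so the result is nan.
plain = np.percentile(df["A"].to_numpy(), 75)
print(np.isnan(plain))
```

If the two quantile results agree while np.percentile gives nan, that is consistent with quantile() dropping the NaNs before computing the percentile rather than treating them as values.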