Tags: python, pandas, numpy, standard-deviation

Numpy.nanstd not skipping nan values correctly for DataFrame read in through excel


I have an Excel file called 'nan_test.xlsx' which looks like this (the original has blank cells, but when printed in Python the blanks are shown as 'NaN'):

    ID  Month1  Month2  Month3
0  ABC     1.0               
1  FFF     2.0     2.0     2.0
2  XYZ             3.0     3.0

I am trying to compute the standard deviation of each row while ignoring blank/NaN values.

This is what I have so far, but the behavior is inconsistent. It computes the standard deviation correctly for the row with three numeric values and for the row with two numeric values and one NaN, but not for the row with one numeric value and two NaNs.

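import numpy as np
import pandas as pd

df = pd.read_excel('nan_test.xlsx')
print(df)
# Row-wise standard deviation over the month columns, ignoring NaN
df['std'] = np.nanstd(df.iloc[:, 1:], axis=1, ddof=1)
print(df)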

    ID  Month1  Month2  Month3
0  ABC     1.0     NaN     NaN
1  FFF     2.0     2.0     2.0
2  XYZ     NaN     3.0     3.0
    ID  Month1  Month2  Month3  std
0  ABC     1.0     NaN     NaN  NaN
1  FFF     2.0     2.0     2.0  0.0
2  XYZ     NaN     3.0     3.0  0.0

Solution

  • The NaN in the first row's std is caused by ddof=1. From the np.nanstd documentation:

    ddof : int, optional
    
    Means Delta Degrees of Freedom. The divisor used in calculations is N - ddof, 
    where N represents the number of non-NaN elements. By default ddof is zero.
    

    So N for the first row is 1 (there is only one non-NaN value), and N - ddof = 1 - 1 = 0. The divisor is 0, so std returns NaN for that row.

    You need ddof=0 for the first row to get a value instead of NaN:

    df['std'] = np.nanstd(df.iloc[:, 1:], axis=1, ddof=0)
    
    Out[416]:
        ID  Month1  Month2  Month3  std
    0  ABC     1.0     NaN     NaN  0.0
    1  FFF     2.0     2.0     2.0  0.0
    2  XYZ     NaN     3.0     3.0  0.0
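
To make the divisor argument above concrete, here is a minimal sketch that rebuilds the three month columns in memory (so no Excel file is needed) and compares ddof=1 with ddof=0. The names `values`, `n_valid`, `sample_std` and `population_std` are only illustrative and do not come from the question.

```python
import warnings

import numpy as np

# Same values as the Month1..Month3 columns in the question,
# built in memory so the example does not depend on nan_test.xlsx.
values = np.array([
    [1.0, np.nan, np.nan],
    [2.0, 2.0, 2.0],
    [np.nan, 3.0, 3.0],
])

# N from the documentation quote: the number of non-NaN values per row.
n_valid = np.sum(~np.isnan(values), axis=1)

with warnings.catch_warnings():
    # nanstd emits a RuntimeWarning when N - ddof <= 0; silence it for the demo.
    warnings.simplefilter("ignore", RuntimeWarning)
    sample_std = np.nanstd(values, axis=1, ddof=1)

population_std = np.nanstd(values, axis=1, ddof=0)

print(n_valid)         # non-NaN count per row
print(sample_std)      # first row is NaN because its divisor is 1 - 1 = 0
print(population_std)  # first row is 0.0 because its divisor is 1 - 0 = 1
```

Note that ddof=0 gives the population standard deviation rather than the sample one, so for rows whose values differ the two settings will generally produce different numbers. If the sample standard deviation is what you want, the NaN for a row with a single value is arguably the expected result, and `n_valid` can be used to spot such rows.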