I have an Excel file called 'nan_test.xlsx' which looks like this (the original has blanks, but when printed in Python the blanks are replaced with NaN):
    ID  Month1  Month2  Month3
0  ABC     1.0
1  FFF     2.0     2.0     2.0
2  XYZ             3.0     3.0
I am trying to get the standard deviation of each row while ignoring blank/NaN values.
This is what I have so far, but the behavior is inconsistent: it correctly computes the std dev for the row with 3 numeric values and for the row with 2 numeric values and one NaN, but not for the row with 2 NaNs and one numeric value.
import numpy as np
import pandas as pd

df = pd.read_excel('nan_test.xlsx')
print(df)
df['std'] = np.nanstd(df.iloc[:, 1:], axis=1, ddof=1)
print(df)
    ID  Month1  Month2  Month3
0  ABC     1.0     NaN     NaN
1  FFF     2.0     2.0     2.0
2  XYZ     NaN     3.0     3.0

    ID  Month1  Month2  Month3  std
0  ABC     1.0     NaN     NaN  NaN
1  FFF     2.0     2.0     2.0  0.0
2  XYZ     NaN     3.0     3.0  0.0
It is ddof=1 that is causing the first row's std to return NaN.
ddof : int, optional
Means Delta Degrees of Freedom. The divisor used in calculations is N - ddof,
where N represents the number of non-NaN elements. By default ddof is zero.
So N for the first row is 1 (i.e. there is only 1 non-NaN value), and N - ddof = 1 - 1 = 0. The divisor is 0, so std returns NaN.
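You can see the same thing by calling np.nanstd directly on a row with a single non-NaN value (a minimal check, nothing here beyond NumPy itself):

import numpy as np

# N = 1 non-NaN value, so N - ddof = 1 - 1 = 0: the divisor is zero and the result is NaN
# (NumPy typically also emits a "Degrees of freedom <= 0" RuntimeWarning here)
print(np.nanstd([1.0, np.nan, np.nan], ddof=1))   # nan

# With ddof=0 the divisor is N = 1, so a single value has a standard deviation of 0.0
print(np.nanstd([1.0, np.nan, np.nan], ddof=0))   # 0.0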
You need ddof=0 to get a result for the first row:
df['std'] = np.nanstd(df.iloc[:, 1:], axis=1, ddof=0)
Out[416]:
    ID  Month1  Month2  Month3  std
0  ABC     1.0     NaN     NaN  0.0
1  FFF     2.0     2.0     2.0  0.0
2  XYZ     NaN     3.0     3.0  0.0
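Note that ddof=0 computes the population standard deviation (divisor N), while ddof=1 computes the sample standard deviation (divisor N - 1); that is why a row with a single value now yields 0.0 instead of NaN.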