Given that I have a pandas Series, I want to fill the NaNs with zero if either all the values are NaN or if all the values are either zero or NaN.
For example, I would want to fill the NaNs in the following Series with zeroes.
0 0
1 0
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
But, I would not want to fillna(0) the following Series:
0 0
1 0
2 2
3 0
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
I was looking at the documentation, and it seems I could use pandas.Series.value_counts to ensure the values are only 0 and NaN, and then simply call fillna(0). In other words, I want to check whether set(s.unique().astype(str)).issubset(['0.0','nan']) holds; if it does, call fillna(0), otherwise leave the Series unchanged.
Considering how powerful pandas is, it seems like there may be a better way to do this. Does anyone have suggestions for doing this cleanly and efficiently?
Potential solution thanks to cᴏʟᴅsᴘᴇᴇᴅ
if s.dropna().eq(0).all():
    s = s.fillna(0)
You can compare with 0 and use isna to check whether the Series contains only NaNs and 0, and then fill the values with 0:
if ((s == 0) | (s.isna())).all():
    s = pd.Series(0, index=s.index)
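To make the check above concrete, here is a minimal sketch that applies it to the two example Series from the question. The function name fill_if_only_zeros_or_nan is hypothetical, and the sketch uses fillna(0) in place of building a new Series, which gives the same values when the condition holds.

```python
import numpy as np
import pandas as pd


def fill_if_only_zeros_or_nan(s: pd.Series) -> pd.Series:
    """Hypothetical helper: fill NaNs with 0 only if every value is 0 or NaN."""
    only_zero_or_nan = ((s == 0) | s.isna()).all()
    # Leave the Series untouched when any other value is present.
    return s.fillna(0) if only_zero_or_nan else s


# The two example Series from the question.
first = pd.Series([0, 0] + [np.nan] * 7)         # only zeros and NaNs
second = pd.Series([0, 0, 2, 0] + [np.nan] * 5)  # contains a 2

filled_first = fill_if_only_zeros_or_nan(first)
filled_second = fill_if_only_zeros_or_nan(second)

print(filled_first.tolist())
print(filled_second.tolist())
```

The first Series comes back with every NaN replaced, while the second is returned as-is because of the 2 at index 2.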
Or compare unique values:
if pd.Series(s.unique()).fillna(0).eq(0).all():
    s = pd.Series(0, index=s.index)
Solution from @cᴏʟᴅsᴘᴇᴇᴅ, thank you - compare the Series without NaNs by using dropna:
if s.dropna().eq(0).all():
    s = pd.Series(0, index=s.index)
Solution from the question - the values need to be converted to strings, because comparing with NaNs is problematic:
if set(s.unique().astype(str)).issubset(['0.0','nan']):
    s = pd.Series(0, index=s.index)
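The problem with comparing NaNs can be seen in a small sketch: NaN is not equal to itself, so a set-based check on the raw float values does not match, while the string version does. The variable names below are only illustrative. The last check shows a caveat of the string approach: an integer Series renders zero as '0', not '0.0'.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 0, np.nan, np.nan])

# NaN never compares equal to anything, including itself.
nan_equals_itself = np.nan == np.nan

# Without the string conversion, the NaN coming out of s.unique()
# is not matched against the NaN in the reference list.
raw_check = set(s.unique()).issubset([0.0, np.nan])

# After astype(str) the unique values are '0.0' and 'nan',
# which compare like ordinary strings.
str_check = set(s.unique().astype(str)).issubset(['0.0', 'nan'])

# Caveat: an integer Series yields '0', so the same check fails.
int_check = set(pd.Series([0, 0]).unique().astype(str)).issubset(['0.0', 'nan'])

print(nan_equals_itself, raw_check, str_check, int_check)
```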
Timings:
s = pd.Series(np.random.choice([0,np.nan], size=10000))
In [68]: %timeit ((s == 0) | (s.isna())).all()
The slowest run took 4.85 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 574 µs per loop
In [69]: %timeit pd.Series(s.unique()).fillna(0).eq(0).all()
1000 loops, best of 3: 587 µs per loop
In [70]: %timeit s.dropna().eq(0).all()
The slowest run took 4.65 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 774 µs per loop
In [71]: %timeit set(s.unique().astype(str)).issubset(['0.0','nan'])
The slowest run took 5.78 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 157 µs per loop
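For anyone who wants to rerun the comparison outside IPython, here is a rough sketch using the standard timeit module. The seed, the labels and the loop count are arbitrary choices for illustration, and the absolute numbers and even the ranking may differ between pandas versions and machines.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape of test data as above: 10000 values drawn from 0 and NaN.
rng = np.random.default_rng(0)
s = pd.Series(rng.choice([0, np.nan], size=10000))

# The four checks from this answer, keyed by a short label.
checks = {
    "eq | isna": lambda: ((s == 0) | (s.isna())).all(),
    "unique + fillna": lambda: pd.Series(s.unique()).fillna(0).eq(0).all(),
    "dropna": lambda: s.dropna().eq(0).all(),
    "unique as str": lambda: set(s.unique().astype(str)).issubset(['0.0', 'nan']),
}

# All four should agree on this data before their speed is compared.
results = {name: bool(check()) for name, check in checks.items()}

loops = 100
timings = {
    name: timeit.timeit(check, number=loops) / loops
    for name, check in checks.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.0f} µs per loop, result={results[name]}")
```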