Tags: python, pandas, conditional-statements, multiple-conditions, fillna

How to efficiently fillna(0) if a Series is all-NaN, or if the remaining non-NaN entries are all zero?


Given a pandas Series, I want to fill the NaNs with zero if either all of the values are NaN, or if every value is either zero or NaN.

For example, I would want to fill the NaNs in the following Series with zeroes.

0       0
1       0
2       NaN
3       NaN
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN

But, I would not want to fillna(0) the following Series:

0       0
1       0
2       2
3       0
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN
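
(For reference, the two examples above can be reproduced with something like the following; the names s1 and s2 are just placeholders.)

import numpy as np
import pandas as pd

# first example: only zeros and NaNs, so it should be filled
s1 = pd.Series([0, 0] + [np.nan] * 7)
# second example: contains a non-zero value (2), so it should be left alone
s2 = pd.Series([0, 0, 2, 0] + [np.nan] * 5)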

I was looking at the documentation and it seems like I could use pandas.Series.value_counts to ensure the values are only 0 and NaN, and then simply call fillna(0). In other words, I am looking to check if set(s.unique().astype(str)).issubset(['0.0','nan']), then fillna(0), and otherwise leave the Series alone.
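
For example, something along these lines (just a sketch of that idea; value_counts drops NaN by default, so an empty or all-zero index means it is safe to fill):

# every non-NaN value must be zero (vacuously true for an all-NaN Series)
if (s.value_counts().index == 0).all():
    s = s.fillna(0)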

Considering how powerful pandas is, it seems like there may be a better way to do this. Does anyone have any suggestions for doing this cleanly and efficiently?

Potential solution, thanks to cᴏʟᴅsᴘᴇᴇᴅ:

if s.dropna().eq(0).all():
    s = s.fillna(0)

Solution

  • You can compare against 0 and use isna to test whether the Series contains only NaNs and zeros, and then fill it:

    if ((s == 0) | (s.isna())).all():
        s = pd.Series(0, index=s.index)
    

    Or compare unique values:

    if pd.Series(s.unique()).fillna(0).eq(0).all():
        s = pd.Series(0, index=s.index)
    

    @cᴏʟᴅsᴘᴇᴇᴅ's solution, thank you - compare the Series with the NaNs removed by dropna:

    if s.dropna().eq(0).all():
        s = pd.Series(0, index=s.index)
    

    The solution from the question - it needs the values converted to strings, because comparing with NaNs is problematic:

    if set(s.unique().astype(str)).issubset(['0.0','nan']):
        s = pd.Series(0, index=s.index)
    

    Timings:

    s = pd.Series(np.random.choice([0,np.nan], size=10000))
    
    In [68]: %timeit ((s == 0) | (s.isna())).all()
    The slowest run took 4.85 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000 loops, best of 3: 574 µs per loop
    
    In [69]: %timeit pd.Series(s.unique()).fillna(0).eq(0).all()
    1000 loops, best of 3: 587 µs per loop
    
    In [70]: %timeit s.dropna().eq(0).all()
    The slowest run took 4.65 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000 loops, best of 3: 774 µs per loop
    
    In [71]: %timeit set(s.unique().astype(str)).issubset(['0.0','nan'])
    The slowest run took 5.78 times longer than the fastest. This could mean that an intermediate result is being cached.
    10000 loops, best of 3: 157 µs per loop
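
    If the check is needed in more than one place, it can be wrapped in a small helper - this is only a sketch, and the name conditional_fillna and the sample Series are just for illustration:

    import numpy as np
    import pandas as pd

    def conditional_fillna(s):
        # fill NaNs with 0 only when every non-NaN value is already 0
        # (uses the set-based check, the fastest variant in the timings above)
        if set(s.unique().astype(str)).issubset(['0.0', 'nan']):
            return s.fillna(0)
        return s

    s1 = pd.Series([0, 0] + [np.nan] * 7)       # only zeros and NaNs -> filled
    s2 = pd.Series([0, 0, 2, 0] + [np.nan] * 5) # contains a 2 -> returned unchanged

    print(conditional_fillna(s1).tolist())
    print(conditional_fillna(s2).tolist())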