Search code examples
pythonpandassumnan

Pandas sum of two columns - dealing with nan-values correctly


When summing two pandas columns, I want to ignore nan-values when one of the two columns is a float. However when nan appears in both columns, I want to keep nan in the output (instead of 0.0).

Initial dataframe:

Surf1     Surf2
0         0
NaN       8
8         15
NaN       NaN
16        14
15        7

Desired output:

Surf1     Surf2     Sum
0         0         0
NaN       8         8
8         15        23
NaN       NaN       NaN
16        14        30
15        7         22

Tried code: -> the code below ignores nan-values but when taking the sum of two nan-values, it gives 0.0 in the output where I want to keep it as NaN in that particular case to keep these empty values separate from values that are actually 0 after summing.

import pandas as pd
import numpy as np

data = pd.DataFrame({"Surf1": [10,np.nan,8,np.nan,16,15], "Surf2": [22,8,15,np.nan,14,7]})
print(data)

data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1)
print(data)

Solution

  • From the documentation pandas.DataFrame.sum

    By default, the sum of an empty or all-NA Series is 0.

    >>> pd.Series([]).sum() # min_count=0 is the default 0.0

    This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.

    Change your code to

    data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1, min_count=1)
    

    output

       Surf1  Surf2
    0   10.0   22.0
    1    NaN    8.0
    2    8.0   15.0
    3    NaN    NaN
    4   16.0   14.0
    5   15.0    7.0
       Surf1  Surf2   Sum
    0   10.0   22.0  32.0
    1    NaN    8.0   8.0
    2    8.0   15.0  23.0
    3    NaN    NaN   NaN
    4   16.0   14.0  30.0
    5   15.0    7.0  22.0