Search code examples
pythonpandasdataframetypescoercion

Odd behaviour of pandas DataFrame.sum when column contains string value


I have 3 pandas dataframes of survey responses that look exactly the same but are created in different ways:

import pandas as pd

df1 = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])

df2 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df2.loc[1,2] = 'hey'

df3 = pd.DataFrame(index=range(3), columns=range(3))
for i in range(3):
    for j in range(3):
        if (i,j) != (1,2):
            df3.loc[i,j] = i*3 + j + 1
        else:
            df3.loc[i,j] = 'hey'

# df1, df2, df3 look the same as below
   0  1    2
0  1  2    3
1  4  5  hey
2  7  8    9

Now, when I take sums along the columns, they all give me the same results.

sumcol1 = df1.sum()
sumcol2 = df2.sum()
sumcol3 = df3.sum()

# sumcol1, sumcol2, sumcol3 look the same as below
0    12
1    15
dtype: int64

However, when I take sums across the rows, df3 gives different result compared to df1 and df2.

Furthermore, it seems that when axis=0 the sum of a column that contains strings will not be calculated, whereas when axis=1 all row sums will be calculated with elements belonging to a column with string elements skipped.

sumrow1 = df1.sum(axis=1)
sumrow2 = df2.sum(axis=1)
sumrow3 = df3.sum(axis=1)

#sumrow1
0     3
1     9
2    15
dtype: int64

#sumrow2
0     3
1     9
2    15
dtype: int64

#sumrow3
0    0.0
1    0.0
2    0.0
dtype: float64

I have 3 questions regarding this.

  1. What causes the different behaviour between sumcol1 and sumrow1?

  2. What causes the different behaviour between sumrow1 and sumrow3?

  3. Is there a proper way to get a result that is the same as sumrow1 with df3?

Added:

  1. Is there a smart way to add only the numerical values while keeping the strings?

    • My current workaround (thanks to jpp's kind answer):

      df = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])
      df_c = df.copy()
      for col in df.select_dtypes(['object']).columns:
          df_c[col] = pd.to_numeric(df_c[col], errors='coerce')
      df['sum'] = df_c.sum(axis=1)
      
      #result
         0  1    2   sum
      0  1  2    3   6.0
      1  4  5  hey   9.0
      2  7  8    9  24.0
      

I am working with Python 3.6.6, pandas 0.23.4.


Solution

  • There are a couple of issues:

    • The main problem is your construction of df3 has all three series with dtype object, while df1 and df2 have dtype=int for the first two series.
    • Data in Pandas dataframes is organized and stored by series [column]. Therefore, type-casting is performed by series. Hence the logic for summing across "rows and columns" is necessarily different and not necessarily consistent with regards to mixed types.

    To understand what's happening with the first issue, you have to appreciate that Pandas doesn't continually check the most appropriate dtype is selected after each operation. This would be prohibitively expensive.

    You can check dtypes for yourself:

    print({'df1': df1.dtypes, 'df2': df2.dtypes, 'df3': df3.dtypes})
    
    {'df1': 0     int64
            1     int64
            2    object
          dtype: object,
    
     'df2': 0     int64
            1     int64
            2    object
          dtype: object,
    
     'df3': 0    object
            1    object
            2    object
          dtype: object}
    

    You can apply conversion selectively to df3 via an operation which checks if any null values result post-conversion:

    for col in df3.select_dtypes(['object']).columns:
        col_num = pd.to_numeric(df3[col], errors='coerce')
        if not col_num.isnull().any():  # check if any null values
            df3[col] = col_num          # assign numeric series
    
    print(df3.dtypes)
    
    0     int64
    1     int64
    2    object
    dtype: object
    

    You should then see consistent treatment. At this point, it's worth discarding your original df3: it's not documented anywhere that continual series type-checking can or should be applied after each operation.

    To disregard non-numeric values when summing across rows or columns you can force conversion via pd.to_numeric with errors='coerce':

    df = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])
    
    col_sum = df.apply(pd.to_numeric, errors='coerce').sum()
    row_sum = df.apply(pd.to_numeric, errors='coerce').sum(1)
    
    print(col_sum)
    
    0    12.0
    1    15.0
    2    12.0
    dtype: float64
    
    print(row_sum)
    
    0     6.0
    1     9.0
    2    24.0
    dtype: float64