I have 3 pandas dataframes of survey responses that look exactly the same but are created in different ways:
import pandas as pd
df1 = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])
df2 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df2.loc[1,2] = 'hey'
df3 = pd.DataFrame(index=range(3), columns=range(3))
for i in range(3):
    for j in range(3):
        if (i,j) != (1,2):
            df3.loc[i,j] = i*3 + j + 1
        else:
            df3.loc[i,j] = 'hey'
# df1, df2, df3 look the same as below
   0  1    2
0  1  2    3
1  4  5  hey
2  7  8    9
Now, when I take sums along the columns, they all give me the same results.
sumcol1 = df1.sum()
sumcol2 = df2.sum()
sumcol3 = df3.sum()
# sumcol1, sumcol2, sumcol3 look the same as below
0 12
1 15
dtype: int64
However, when I take sums across the rows, df3 gives a different result compared to df1 and df2.
Furthermore, it seems that with axis=0 the sum of a column containing strings is simply not calculated, whereas with axis=1 all row sums are calculated, with elements belonging to the string-containing column skipped.
sumrow1 = df1.sum(axis=1)
sumrow2 = df2.sum(axis=1)
sumrow3 = df3.sum(axis=1)
#sumrow1
0 3
1 9
2 15
dtype: int64
#sumrow2
0 3
1 9
2 15
dtype: int64
#sumrow3
0 0.0
1 0.0
2 0.0
dtype: float64
I have 3 questions regarding this.
1. What causes the different behaviour between sumcol1 and sumrow1?
2. What causes the different behaviour between sumrow1 and sumrow3?
3. Is there a proper way to get a result that is the same as sumrow1 with df3?
Added:
Is there a smart way to add only the numerical values while keeping the strings?
My current workaround (thanks to jpp's kind answer):
df = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])
df_c = df.copy()
for col in df.select_dtypes(['object']).columns:
    df_c[col] = pd.to_numeric(df_c[col], errors='coerce')
df['sum'] = df_c.sum(axis=1)
#result
   0  1    2   sum
0  1  2    3   6.0
1  4  5  hey   9.0
2  7  8    9  24.0
I am working with Python 3.6.6, pandas 0.23.4.
There are a couple of issues:
df3 has all three series with dtype object, while df1 and df2 have dtype=int for the first two series.
To understand what's happening with the first issue, you have to appreciate that Pandas doesn't continually check whether the most appropriate dtype is selected after each operation. This would be prohibitively expensive.
You can check the dtypes for yourself:
print({'df1': df1.dtypes, 'df2': df2.dtypes, 'df3': df3.dtypes})
{'df1': 0 int64
1 int64
2 object
dtype: object,
'df2': 0 int64
1 int64
2 object
dtype: object,
'df3': 0 object
1 object
2 object
dtype: object}
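As a quick sketch of this behaviour (my own illustration, reusing the df2 construction from the question): even overwriting the string with a number again leaves column 2 as object; only an explicit re-inference such as infer_objects restores the numeric dtype.
import pandas as pd
df2 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df2.loc[1,2] = 'hey'  # column 2 is upcast to object to accommodate the string
df2.loc[1,2] = 6      # putting a number back does not downcast the column
print(df2.dtypes[2])                  # object -- no automatic re-inference
print(df2.infer_objects().dtypes[2])  # int64 -- only after explicit re-inference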
You can apply conversion selectively to df3 via an operation which checks whether any null values result post-conversion:
for col in df3.select_dtypes(['object']).columns:
    col_num = pd.to_numeric(df3[col], errors='coerce')
    if not col_num.isnull().any():  # check if any null values resulted
        df3[col] = col_num  # assign numeric series
print(df3.dtypes)
0 int64
1 int64
2 object
dtype: object
You should then see consistent treatment. At this point, it's worth discarding your original df3: it's not documented anywhere that continual series type-checking can or should be applied after each operation.
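As a quick check (a sketch, assuming the converted df3 from above), the row sums now agree with sumrow1:
print(df3.sum(axis=1))
# 0     3
# 1     9
# 2    15
# dtype: int64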
To disregard non-numeric values when summing across rows or columns, you can force conversion via pd.to_numeric with errors='coerce':
df = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])
col_sum = df.apply(pd.to_numeric, errors='coerce').sum()
row_sum = df.apply(pd.to_numeric, errors='coerce').sum(axis=1)
print(col_sum)
0 12.0
1 15.0
2 12.0
dtype: float64
print(row_sum)
0 6.0
1 9.0
2 24.0
dtype: float64
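If you want to keep the strings and simply attach the numeric row sums (as in your workaround), the same coercion can be applied once and assigned as a new column; a compact sketch of that variant:
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])
df['sum'] = df.apply(pd.to_numeric, errors='coerce').sum(axis=1)
print(df)
#    0  1    2   sum
# 0  1  2    3   6.0
# 1  4  5  hey   9.0
# 2  7  8    9  24.0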