I'm wondering why subtracting a pandas DataFrame from a pandas Series produces such a strange result.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(10).reshape(2, 5), columns='a-b-c-d-e'.split('-'))
df.max(axis=1) - df[['b']]
What are the steps for pandas to produce the result?
b 0 1
0 NaN NaN NaN
1 NaN NaN NaN
By default, an operation between a DataFrame and a Series aligns the index of the Series with the columns of the DataFrame, then broadcasts the operation over the rows. This makes it easy to perform operations combining a DataFrame and an aggregation per column:
# let's subtract the DataFrame from its max per column
df.max(axis=0) - df[['b']]
a b c d e
0 NaN 5 NaN NaN NaN
1 NaN 0 NaN NaN NaN
Here, since you're aggregating per row, the Series is indexed by the row labels, so the default alignment on the columns no longer works. You should use rsub with the parameter axis=0:
df[['b']].rsub(df.max(axis=1), axis=0)
Output:
b
0 3
1 3
Note that using two Series would also align the values:
df.max(axis=1) - df['b']
Output:
0 3
1 3
dtype: int64
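As a quick check, here is a minimal sketch comparing the two approaches above. The names row_max, via_rsub and via_series are only illustrative; the DataFrame is the one from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(2, 5), columns=list('abcde'))
row_max = df.max(axis=1)  # Series indexed by the row labels 0 and 1

# DataFrame route: axis=0 aligns row_max on the row index
via_rsub = df[['b']].rsub(row_max, axis=0)

# Series route: both operands share the row index, so they align directly
via_series = row_max - df['b']

print(via_rsub['b'].tolist())
print(via_series.tolist())
```

Both routes compute the row maximum minus b for each row. The only difference is the type of the result: a one-column DataFrame for rsub, a Series for the Series subtraction.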
What happens with df.max(axis=1) - df[['b']]?
First, let's have a look at each operand:
# df.max(axis=1)
0 4
1 9
dtype: int64
# df[['b']]
b
0 1
1 6
Since df[['b']] is 2D (DataFrame) and df.max(axis=1) is 1D (Series), df.max(axis=1) will be used as if it were a "wide" DataFrame:
# df.max(axis=1).to_frame().T
0 1
0 4 9
There are no columns in common, thus the output is only NaNs with the union of column names ({'b'}|{0, 1} -> {'b', 0, 1}).
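A small sketch to confirm this on the DataFrame from the question (the name result is only illustrative). It checks the column labels as a set, since the sketch makes no assumption about their order:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(2, 5), columns=list('abcde'))
result = df.max(axis=1) - df[['b']]

# the Series index (0, 1) is matched against the DataFrame columns ('b')
print(set(result.columns) == {'b', 0, 1})

# no label appears on both sides, so every cell is NaN
print(result.isna().all().all())
```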
If you replace the NaNs that are used in the operation with 0, it becomes obvious how the values are used:
df[['b']].rsub(df.max(axis=1).to_frame().T, fill_value=0)
b 0 1
0 -1.0 4.0 9.0
1 -6.0 NaN NaN
Now let's check a different example in which one of the row indices has the same name as one of the selected columns:
df = pd.DataFrame(np.arange(10).reshape(2, 5),
columns=['a', 'b', 'c', 'd', 'e'],
index=['b', 1]
)
df.max(axis=1) - df[['b']]
Now the output only has 2 columns: b, the common label, and 1, the second index label in the Series ({'b', 1}|{'b'} -> {'b', 1}):
1 b
b NaN 3
1 NaN -2
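The same result can be checked column by column. This is a minimal sketch that assumes the row index ['b', 1], matching the output shown above; the names df2 and out are only illustrative.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.arange(10).reshape(2, 5),
                   columns=list('abcde'),
                   index=['b', 1])
out = df2.max(axis=1) - df2[['b']]

# 'b' exists on both sides: the row max of row 'b' (4) minus each value of column b
print(out['b'].tolist())

# 1 only exists in the Series index, so that column has no partner and stays NaN
print(out[1].isna().all())
```

Note that column b is not the row-wise difference: the single value 4 (the maximum of the row labelled 'b') is subtracted from by every value of column b, which gives 4 - 1 and 4 - 6.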