
Pandas Series subtract Pandas Dataframe strange result


I'm wondering why subtracting a pandas DataFrame from a pandas Series produces such a strange result.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(2, 5), columns='a-b-c-d-e'.split('-'))
df.max(axis=1) - df[['b']]

What steps does pandas take to produce this result?

    b   0   1
0 NaN NaN NaN
1 NaN NaN NaN

Solution

  • By default, an operation between a DataFrame and a Series aligns the Series index with the DataFrame columns and broadcasts the Series across the rows. This makes it easy to combine a DataFrame with a per-column aggregation:

    # let's subtract the DataFrame from its max per column
    df.max(axis=0) - df[['b']]
    
        a  b   c   d   e
    0 NaN  5 NaN NaN NaN
    1 NaN  0 NaN NaN NaN
    

    Here, since you're aggregating per row, the Series is indexed by row labels and the default alignment on columns no longer works. Use rsub with the parameter axis=0 to align on the index instead:

    df[['b']].rsub(df.max(axis=1), axis=0)
    

    Output:

       b
    0  3
    1  3
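
As a rough check of the row-wise subtraction above, here is a minimal sketch comparing rsub with two other spellings of the same arithmetic. The variable names (row_max, via_rsub, via_sub, via_frame) are only illustrative and do not come from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(2, 5), columns=list("abcde"))
row_max = df.max(axis=1)  # Series indexed by row labels: 0 -> 4, 1 -> 9

# Form used in the answer: reversed subtraction, aligned on the index.
via_rsub = df[["b"]].rsub(row_max, axis=0)

# Same arithmetic with sub(), then flipping the sign.
via_sub = -df[["b"]].sub(row_max, axis=0)

# Turning the Series into a one-column DataFrame named "b"
# lets the plain `-` operator align on both index and columns.
via_frame = row_max.to_frame("b") - df[["b"]]

print(via_rsub)
```

All three variants should give the same single column b. Which one reads best is a matter of taste; rsub keeps the DataFrame on the left, which can be convenient in method chains.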
    

    Note that using two Series would also align the values:

    df.max(axis=1) - df['b']
    

    Output:

    0    3
    1    3
    dtype: int64
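
To illustrate that two Series are matched by index label and not by position, here is a small sketch in which one operand is deliberately reversed. The names b_reversed, by_label and by_position are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(2, 5), columns=list("abcde"))
row_max = df.max(axis=1)         # index [0, 1], values [4, 9]
b_reversed = df["b"].iloc[::-1]  # index [1, 0], values [6, 1]

# Series - Series pairs the values by label, so the row order does not matter.
by_label = row_max - b_reversed

# Dropping to NumPy arrays discards the labels and subtracts by position.
by_position = row_max.to_numpy() - b_reversed.to_numpy()

print(by_label.sort_index())
print(by_position)
```

The label-based result is still 3 for both rows, while the positional one mixes values from different rows.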
    

    Why 3 columns with df.max(axis=1) - df[['b']]?

    First, let's have a look at each operand:

    # df.max(axis=1)
    0    4
    1    9
    dtype: int64
    
    # df[['b']]
       b
    0  1
    1  6
    

    Since df[['b']] is 2D (DataFrame) and df.max(axis=1) is 1D (Series), the index of df.max(axis=1) is matched against the columns of df[['b']], as if the Series were a "wide" DataFrame:

    # df.max(axis=1).to_frame().T
       0  1
    0  4  9
    

    There are no columns in common, so the output contains only NaNs, and its columns are the union of both sets of labels ({'b'}|{0, 1} -> {'b', 0, 1}).
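
The label union described above can be reproduced by hand. This is only a sketch of the idea, not a description of what pandas does internally; the names selected, labels and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(2, 5), columns=list("abcde"))
row_max = df.max(axis=1)  # index labels: 0, 1
selected = df[["b"]]      # column labels: 'b'

# Union of the DataFrame's column labels and the Series' index labels.
# sort=False avoids comparing the string 'b' with the integers 0 and 1.
labels = selected.columns.union(row_max.index, sort=False)

# No label appears on both sides, so every cell of the result is NaN.
result = row_max - selected

print(list(labels))
print(result.isna().all().all())
```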

    If you replace the missing values on either side of the operation with 0, it becomes obvious how the values are used:

    df[['b']].rsub(df.max(axis=1).to_frame().T, fill_value=0)
    
         b    0    1
    0 -1.0  4.0  9.0
    1 -6.0  NaN  NaN
    

    Now let's check a different example in which one of the row labels has the same name as the selected column:

    df = pd.DataFrame(np.arange(10).reshape(2, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['b', 1]
                     )
    df.max(axis=1) - df[['b']]
    

    Now the output only has 2 columns: b, the label common to both operands, and 1, the second index label of the Series ({'b', 1}|{'b'} -> {'b', 1}):

        1  b
    b NaN  3
    1 NaN -2