Search code examples
pythonpandasdataframegroup-by

Calculation problem after pandas grouping


In pandas, how to achieve, grouped by date, in each date group, each row of column A and B multiplied and then summed, and then divided by the sum of all B columns in the date group.

I have tried:

(df.groupby('date')['A','B']
    .transform(lambda x: (x['A'] * x['B']).sum())
    .div(df.groupby('date')['B'].agg('sum')))

and:

(df.groupby('date')
    .transform(lambda x: (x['A'] * x['B']).sum())
    .div(df.groupby('date')['B'].agg('sum')))

both showed:

KeyError: 'A'


Solution

  • You should use .apply() instead of .transform(), as follows:

    df.groupby('date').apply(lambda x: (x['A'] * x['B']).sum() / x['B'].sum())
    

    By using .apply(), the x variables in the lambda function will be recognised as DataFrame objects representing each group, which can each be indexed for individual columns.

    The .transform() method only treats x as a Series object representing only one column at a time.

    From this answer:

    Two major differences between `apply` and `transform`

    There are two major differences between the transform and apply groupby methods.

    • Input:
      • apply implicitly passes all the columns for each group as a DataFrame to the custom function.
      • while transform passes each column for each group individually as a Series to the custom function.
    • Output:
      • The custom function passed to apply can return a scalar, or a Series or DataFrame (or numpy array or even list).
      • The custom function passed to transform must return a sequence (a one dimensional Series, array or list) the same length as the group.

    So, transform works on just one Series at a time and apply works on the entire DataFrame at once.

    Example

    Sample data:

    import pandas as pd
    
    data = {
        'date': ['01/01/1999', '02/01/1999', '03/01/1999', '03/01/1999'],
        'A': [2, 4, 7, 4],
        'B': [5, 7, 9, 6]
    }
    
    df = pd.DataFrame(data)
    print(df)
    
             date  A  B
    0  01/01/1999  2  5
    1  02/01/1999  4  7
    2  03/01/1999  7  9
    3  03/01/1999  4  6
    

    The calculation for '03/01/1999' would be:

    ((7 * 9) + (4 * 6)) / (9 + 6) # = 5.8
    

    Calculation for each date group using .apply():

    df_ab_calc = df.groupby('date').apply(lambda x: (x['A'] * x['B']).sum() / x['B'].sum())
    print(df_ab_calc)
    
    date
    01/01/1999    2.0
    02/01/1999    4.0
    03/01/1999    5.8