In pandas, how to achieve, grouped by date, in each date group, each row of column A and B multiplied and then summed, and then divided by the sum of all B columns in the date group.
I have tried:
(df.groupby('date')['A','B']
.transform(lambda x: (x['A'] * x['B']).sum())
.div(df.groupby('date')['B'].agg('sum')))
and:
(df.groupby('date')
.transform(lambda x: (x['A'] * x['B']).sum())
.div(df.groupby('date')['B'].agg('sum')))
both showed:
KeyError: 'A'
You should use .apply()
instead of .transform()
, as follows:
df.groupby('date').apply(lambda x: (x['A'] * x['B']).sum() / x['B'].sum())
By using .apply()
, the x
variables in the lambda
function will be recognised as DataFrame
objects representing each group, which can each be indexed for individual columns.
The .transform()
method only treats x
as a Series
object representing only one column at a time.
From this answer:
Two major differences between `apply` and `transform`
There are two major differences between the
transform
andapply
groupby methods.
- Input:
apply
implicitly passes all the columns for each group as a DataFrame to the custom function.- while
transform
passes each column for each group individually as a Series to the custom function.- Output:
- The custom function passed to
apply
can return a scalar, or a Series or DataFrame (or numpy array or even list).- The custom function passed to
transform
must return a sequence (a one dimensional Series, array or list) the same length as the group.So,
transform
works on just one Series at a time andapply
works on the entire DataFrame at once.
Sample data:
import pandas as pd
data = {
'date': ['01/01/1999', '02/01/1999', '03/01/1999', '03/01/1999'],
'A': [2, 4, 7, 4],
'B': [5, 7, 9, 6]
}
df = pd.DataFrame(data)
print(df)
date A B
0 01/01/1999 2 5
1 02/01/1999 4 7
2 03/01/1999 7 9
3 03/01/1999 4 6
The calculation for '03/01/1999'
would be:
((7 * 9) + (4 * 6)) / (9 + 6) # = 5.8
Calculation for each date group using .apply()
:
df_ab_calc = df.groupby('date').apply(lambda x: (x['A'] * x['B']).sum() / x['B'].sum())
print(df_ab_calc)
date
01/01/1999 2.0
02/01/1999 4.0
03/01/1999 5.8