EDIT: Because several people have already gone this route, I don't want to get rid of arbFun - I used a mean here because it was a simple example, but it's a stand in for a much more complicated function that cannot be easily removed. It is important that the arbFun take the inputs of two separate columns and output a single value per group. Within that restriction, we can modify arbFun as we please, but any solution that gets rid of it entirely is probably not workable.
Bear with my stupidly complicated simple example here - I really tried to get an easy version of my very complicated problem.
Say I have a dataframe data
and a function arbfun
given by the following:
import pandas as pd
data = pd.DataFrame({'Label1':[1,2,2,1,1,2],
'Label2':['north', 'north', 'north', 'south', 'south', 'south'],
'A':[2,4,6,8,10,12],'B':[4,1,37,1,1,1]})
def arbFun(col1, col2):
"""
Calculates the average value of col1 for all values of (col2 == 1).
Args:
col1: A pandas Series containing the values to be averaged.
col2: A pandas Series containing the filter conditions.
Returns:
The average value of col1 for all values of (col2 == 1).
"""
df = pd.DataFrame({'Col1':col1, 'Col2':col2})
# Filter the data based on the condition
filtered_data = df[df['Col2'] == 1]
# Calculate the average
if len(filtered_data) > 0:
average = filtered_data['Col1'].mean()
else:
average = None # Return None if no data meets the condition
return average
If I run ArbFun on the whole of data, I get a value of 8.5. So far so good. But now, I want to group by my two label columns and output the results of my ArbFun, as well as some additional information.
output = data.groupby(['Label1', 'Label2']).agg(
Column_A =('A', 'sum'),
Filtered_Mean = lambda x: arbFun(x['A'], x['B'])
)
Now this should output something like
Label1,Label2,Column_A,Filtered_Mean
1,'north',2,None
2,'north',10,4
1,'south',18,9
2,'south',12,12
But instead, I get a type error: TypeError: Must provide 'func' or tuples of '(column, aggfunc)
. I have tried hard to understand where this is coming from but so far have failed. What am I doing wrong?
Another option is to use DataFrame.groupby().apply()
and return a pd.Series
:
def group_fn(g):
return pd.Series(
{"Column_A": g["A"].sum(), "Filtered_Mean": g.loc[g["B"] == 1, "A"].mean()}
)
data = data.groupby(["Label1", "Label2"], as_index=False).apply(group_fn)
print(data)
Prints:
Label1 Label2 Column_A Filtered_Mean
0 1 north 2.0 NaN
1 1 south 18.0 9.0
2 2 north 10.0 4.0
3 2 south 12.0 12.0
EDIT: If you have to use arbFun
function you can do:
def group_fn(g):
return pd.Series(
{"Column_A": g["A"].sum(), "Filtered_Mean": arbFun(g["A"], g["B"])}
)
data = data.groupby(["Label1", "Label2"], as_index=False).apply(group_fn)
print(data)