Search code examples
pythonpython-3.xpandasdataframegroup-by

Aggregating multiple columns of a pandas dataframe using custom functions


EDIT: Because several people have already gone this route, I don't want to get rid of arbFun - I used a mean here because it was a simple example, but it's a stand in for a much more complicated function that cannot be easily removed. It is important that the arbFun take the inputs of two separate columns and output a single value per group. Within that restriction, we can modify arbFun as we please, but any solution that gets rid of it entirely is probably not workable.

Bear with my stupidly complicated simple example here - I really tried to get an easy version of my very complicated problem.

Say I have a dataframe data and a function arbfun given by the following:

import pandas as pd
data = pd.DataFrame({'Label1':[1,2,2,1,1,2],
                     'Label2':['north', 'north', 'north', 'south', 'south', 'south'],
                     'A':[2,4,6,8,10,12],'B':[4,1,37,1,1,1]})

def arbFun(col1, col2):

    """
  Calculates the average value of col1 for all values of (col2 == 1).

  Args:
    col1: A pandas Series containing the values to be averaged.
    col2: A pandas Series containing the filter conditions.

  Returns:
    The average value of col1 for all values of (col2 == 1).
  """

    df = pd.DataFrame({'Col1':col1, 'Col2':col2})

    # Filter the data based on the condition
    filtered_data = df[df['Col2'] == 1]

    # Calculate the average
    if len(filtered_data) > 0:
        average = filtered_data['Col1'].mean()
    else:
        average = None  # Return None if no data meets the condition

    return average

If I run ArbFun on the whole of data, I get a value of 8.5. So far so good. But now, I want to group by my two label columns and output the results of my ArbFun, as well as some additional information.

output = data.groupby(['Label1', 'Label2']).agg(
    Column_A =('A', 'sum'),
    Filtered_Mean = lambda x: arbFun(x['A'], x['B'])
)

Now this should output something like

Label1,Label2,Column_A,Filtered_Mean
1,'north',2,None
2,'north',10,4
1,'south',18,9
2,'south',12,12

But instead, I get a type error: TypeError: Must provide 'func' or tuples of '(column, aggfunc). I have tried hard to understand where this is coming from but so far have failed. What am I doing wrong?


Solution

  • Another option is to use DataFrame.groupby().apply() and return a pd.Series:

    def group_fn(g):
        return pd.Series(
            {"Column_A": g["A"].sum(), "Filtered_Mean": g.loc[g["B"] == 1, "A"].mean()}
        )
    
    
    data = data.groupby(["Label1", "Label2"], as_index=False).apply(group_fn)
    print(data)
    

    Prints:

       Label1 Label2  Column_A  Filtered_Mean
    0       1  north       2.0            NaN
    1       1  south      18.0            9.0
    2       2  north      10.0            4.0
    3       2  south      12.0           12.0
    

    EDIT: If you have to use arbFun function you can do:

    def group_fn(g):
        return pd.Series(
            {"Column_A": g["A"].sum(), "Filtered_Mean": arbFun(g["A"], g["B"])}
        )
    
    
    data = data.groupby(["Label1", "Label2"], as_index=False).apply(group_fn)
    print(data)