Tags: python, pandas, dataframe, group-by

Pandas DataFrame aggregation with a condition


I want to aggregate a DataFrame based on a condition. I can do it in some cases, but not all. Here is an example:

import pandas as pd
import numpy as np

th=0.5
tuples = list(zip(*[
            ["foo", "foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar", "bar", "bar"],
            ["A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"],
            ["one", "two", "three", "one", "two", "three", "one", "two", "three", "one", "two", "three"],
        ]))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second", "third"])

# create test DataFrame and a simpler copy
df1 = pd.DataFrame(np.array([0.04325033, 0.17730071, 0.05833607, 0.15995565, 0.48858341,
       0.27516962, 0.26154782, 0.68857624, 0.75173469, 0.34701632,
       0.10108433, 0.32280973]), index=index, columns=["data"])
df2 = df1.reset_index()[["first", "second", "data"]].groupby(["first", "second"]).sum().copy()

# if the amount is greater or equal to the threshold copy the value
df1.loc[df1['data'] >= th, 'test'] = df1.loc[df1['data'] >= th, 'data']
df2.loc[df2['data'] >= th, 'test'] = df2.loc[df2['data'] >= th, 'data']

# if the amount is smaller than the threshold aggregate all such values for the group
df1.loc[df1['data'] < th, 'test'] = df1.loc[df1['data'] < th, 'data'].groupby(["first", "second"]).sum()
df2.loc[df2['data'] < th, 'test'] = df2.loc[df2['data'] < th, 'data'].groupby(["first"]).sum()

This gives the following output:

                        data      test
first second third                    
foo   A      one    0.043250  0.278887
             two    0.177301  0.278887
             three  0.058336  0.278887
      B      one    0.159956  0.923709
             two    0.488583  0.923709
             three  0.275170  0.923709
bar   A      one    0.261548  0.261548
             two    0.688576  0.688576
             three  0.751735  0.751735
      B      one    0.347016  0.770910
             two    0.101084  0.770910
             three  0.322810  0.770910

                  data      test
first second                    
bar   A       1.701859  1.701859
      B       0.770910  0.770910
foo   A       0.278887       NaN
      B       0.923709  0.923709

The first DataFrame is as intended, but in the second one there is NaN where the aggregated value should be. Why does the first one work while the second does not? How can I achieve this?

Bonus: I would like to have a DataFrame with n aggregation columns and apply this procedure to each of them from the innermost level to the outermost, keeping a value if it is over the threshold and aggregating all values smaller than the threshold. Is there a better solution than a for loop?


Solution

  • The problem seems to be:

    df2.loc[df2['data'] < th, 'data'].groupby(["first"]).sum()
    

    First, don't pass a single-element list to groupby; it raises a FutureWarning:

    FutureWarning: In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.

    Next, when you group by only one level, you have to broadcast the aggregated result back to all filtered rows so that it keeps the same shape (MultiIndex); use transform instead of a plain aggregation:

    >>> df2.loc[df2['data'] < th, 'data'].groupby('first').sum()
    first
    foo    0.278887
    Name: data, dtype: float64
    
    >>> df2.loc[df2['data'] < th, 'data'].groupby('first').transform('sum')
    first  second
    foo    A         0.278887
    Name: data, dtype: float64
    

    Now the levels are the same so you can set the values:

    df2.loc[df2['data'] < th, 'test'] = \
        df2.loc[df2['data'] < th, 'data'].groupby('first').transform('sum')
    print(df2)
    
    # Output
                      data      test
    first second                    
    bar   A       1.701859  1.701859
          B       0.770910  0.770910
    foo   A       0.278887  0.278887  # <- HERE
          B       0.923709  0.923709
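
Regarding the bonus question: I don't think pandas has a built-in for this cascading aggregation, so some loop over the index levels is hard to avoid, but it can stay short. A sketch of one possible interpretation: on each pass, collapse one level (innermost first) by summing the still-below-threshold rows within the remaining outer levels; a row that ends up alone below the threshold simply keeps its value. One subtlety: summing the original data of the masked rows (rather than the already-broadcast test values) avoids double-counting groups collapsed in a previous pass.

```python
import numpy as np
import pandas as pd

th = 0.5
index = pd.MultiIndex.from_tuples(
    [("foo", "A", "one"), ("foo", "A", "two"), ("foo", "A", "three"),
     ("foo", "B", "one"), ("foo", "B", "two"), ("foo", "B", "three"),
     ("bar", "A", "one"), ("bar", "A", "two"), ("bar", "A", "three"),
     ("bar", "B", "one"), ("bar", "B", "two"), ("bar", "B", "three")],
    names=["first", "second", "third"],
)
df = pd.DataFrame({"data": [0.04325033, 0.17730071, 0.05833607,
                            0.15995565, 0.48858341, 0.27516962,
                            0.26154782, 0.68857624, 0.75173469,
                            0.34701632, 0.10108433, 0.32280973]}, index=index)

levels = list(df.index.names)           # ["first", "second", "third"]
test = df["data"].copy()
# Collapse one index level per pass, from innermost to outermost.
for i in range(len(levels) - 1, 0, -1):
    mask = test < th                    # rows still below the threshold
    if not mask.any():
        break
    # Sum the ORIGINAL data of the below-threshold rows so that values
    # already broadcast in a previous pass are not double-counted.
    test.loc[mask] = (df.loc[mask, "data"]
                        .groupby(level=levels[:i])
                        .transform("sum"))
df["test"] = test
print(df)
```

For this data it reproduces the first DataFrame from the question after the innermost pass, and the second pass only re-aggregates the groups that are still below the threshold (here foo/A and bar/A/one, which have nothing left to merge with and keep their sums).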