Tags: python, pandas, dataframe, aggregate-functions

How to use column value as parameter in aggregation function in pandas?


Given a table like this:

A B C
t r 1
t r 1
n j 2
n j 2
n j 2

I would like to group by A and B and keep only the first N rows of each group, where N is the value in C.

The desired output would be:

A B C
t r 1
n j 2
n j 2

I tried the following, without success:

df.groupby(['A', 'B']).agg(lambda x: x.head(df.C))

Solution

  • You can use groupby.cumcount and boolean indexing:

    out = df[df['C'].gt(df.groupby(['A', 'B']).cumcount())]
    

    Or with a classical groupby.apply:

    (df.groupby(['A', 'B'], sort=False, as_index=False, group_keys=False)
       .apply(lambda g: g.head(g['C'].iloc[0]))
    )
    

    Output:

       A  B  C
    0  t  r  1
    2  n  j  2
    3  n  j  2
    

    Intermediates for the groupby.cumcount approach:

       A  B  C  cumcount  C > cumcount
    0  t  r  1         0          True
    1  t  r  1         1         False
    2  n  j  2         0          True
    3  n  j  2         1          True
    4  n  j  2         2         False
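
For anyone who wants to reproduce the table above, here is a minimal, self-contained sketch of the groupby.cumcount approach. The DataFrame is rebuilt from the question's sample data, and the names `position` and `mask` are only illustrative; they do not appear in the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": ["t", "t", "n", "n", "n"],
    "B": ["r", "r", "j", "j", "j"],
    "C": [1, 1, 2, 2, 2],
})

# Zero-based position of each row within its (A, B) group.
# The result is aligned with df's original index.
position = df.groupby(["A", "B"]).cumcount()

# Keep a row only while its position is still below its own C value.
mask = df["C"].gt(position)
out = df[mask]

# Show the intermediates side by side, then the filtered result.
print(df.assign(cumcount=position, keep=mask))
print(out)
```

Note that this version compares each row against its own C, whereas the groupby.apply version uses the first C of each group. The two agree when C is constant within a group, as in the question, but could differ otherwise. The cumcount version also avoids calling a Python function per group, which tends to matter on larger frames.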