python, python-polars

Adding an aggregation that excludes all except one column in a Polars Python group by


I have a big df (60M rows) and I need to do some groupings and aggregations.

Here is a sample of my data:

import polars as pl

dic = {
    'Attr1': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Attr2': ['X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y'],
    'Val1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
}
df = pl.DataFrame(dic)
df
shape: (12, 3)
┌───────┬───────┬──────┐
│ Attr1 ┆ Attr2 ┆ Val1 │
│ ---   ┆ ---   ┆ ---  │
│ str   ┆ str   ┆ i64  │
╞═══════╪═══════╪══════╡
│ A     ┆ X     ┆ 1    │
│ A     ┆ X     ┆ 2    │
│ A     ┆ Y     ┆ 3    │
│ A     ┆ Y     ┆ 4    │
│ …     ┆ …     ┆ …    │
│ C     ┆ X     ┆ 9    │
│ C     ┆ X     ┆ 10   │
│ C     ┆ Y     ┆ 11   │
│ C     ┆ Y     ┆ 12   │
└───────┴───────┴──────┘

I am trying to get the max value per Attr1 while still grouping by Attr1 and Attr2:

df.group_by(
    ['Attr1', 'Attr2'], maintain_order=True
).agg(
    pl.col('Attr1').count().alias('count'),
    pl.col('Val1').max().alias('MAX')
)
shape: (6, 4)
┌───────┬───────┬───────┬─────┐
│ Attr1 ┆ Attr2 ┆ count ┆ MAX │
│ ---   ┆ ---   ┆ ---   ┆ --- │
│ str   ┆ str   ┆ u32   ┆ i64 │
╞═══════╪═══════╪═══════╪═════╡
│ A     ┆ X     ┆ 2     ┆ 2   │
│ A     ┆ Y     ┆ 2     ┆ 4   │
│ B     ┆ X     ┆ 2     ┆ 6   │
│ B     ┆ Y     ┆ 2     ┆ 8   │
│ C     ┆ X     ┆ 2     ┆ 10  │
│ C     ┆ Y     ┆ 2     ┆ 12  │
└───────┴───────┴───────┴─────┘

But I need to get the following output instead:

shape: (6, 4)
┌───────┬───────┬───────┬─────┐
│ Attr1 ┆ Attr2 ┆ count ┆ MAX │
│ ---   ┆ ---   ┆ ---   ┆ --- │
│ str   ┆ str   ┆ u32   ┆ i64 │
╞═══════╪═══════╪═══════╪═════╡
│ A     ┆ X     ┆ 2     ┆ 4   │
│ A     ┆ Y     ┆ 2     ┆ 4   │
│ B     ┆ X     ┆ 2     ┆ 8   │
│ B     ┆ Y     ┆ 2     ┆ 8   │
│ C     ┆ X     ┆ 2     ┆ 12  │
│ C     ┆ Y     ┆ 2     ┆ 12  │
└───────┴───────┴───────┴─────┘

I could split the DF into two DFs, group each, and then merge, but I am looking for the most optimized solution for high performance and low memory cost.

Thanks in advance


Solution

  • There are two separate things to optimize for: code elegance/brevity and performance. We often assume that whichever solution has less code will also have better performance.

    Given that you're using max and not an aggregation that depends on all the values, @alexp's answer gives the best performance, using a hybrid between a group_by context and a window function. The max-of-max approach has the benefit that the second grouping (by Attr1 only) runs on the already-aggregated output of the first grouping, which is much smaller than the original data.
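
    A minimal sketch of that hybrid, assuming it combines the question's group_by.agg with a subsequent window over Attr1 (the referenced answer isn't reproduced here):

    (
        df
        .group_by(['Attr1','Attr2'], maintain_order=True)
        .agg(
            pl.col('Attr1').count().alias('count'),
            pl.col('Val1').max().alias('MAX')
        )
        # take the max of the per-(Attr1, Attr2) maxes within each Attr1 ("max of max")
        .with_columns(pl.col('MAX').max().over('Attr1'))
    )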

    In contrast, if you used only window functions, it'd look like:

    (
        df
        .select(
            'Attr1','Attr2',
            pl.col('Attr1').count().over(['Attr1','Attr2']).alias('count'),
            pl.col('Val1').max().over('Attr1').alias('MAX')
            )
        .unique(['Attr1','Attr2'])
        .sort(['Attr1','Attr2'])
    )
    

    This might look nicer because it uses a single select context, but it will perform worse since it doesn't get the max-of-max benefit. The unique at the end is needed because a window function returns all the original rows, whereas a group_by only returns the unique combinations of the grouping keys. That's another reason to combine a group_by.agg with a subsequent window function: it controls which rows end up in the result set from the outset rather than paring them down at the end.
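
    To make that row-count difference concrete, here is a quick check against the 12-row sample df from the question:

    # window keeps every original row; group_by collapses to one row per group
    windowed = df.select(pl.col('Val1').max().over(['Attr1','Attr2']))
    grouped = df.group_by(['Attr1','Attr2']).agg(pl.col('Val1').max())

    print(windowed.height)  # 12 -- same number of rows as the input
    print(grouped.height)   # 6  -- one row per (Attr1, Attr2) combination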

    Another approach is the joining of group_bys. That would look like:

    (
        df
        .group_by(['Attr1','Attr2'],maintain_order=True)
        .agg(pl.col('Attr1').count().alias('count'))
        .join(
            df
            .group_by(['Attr1'],maintain_order=True)
            .agg(pl.col('Val1').max().alias('MAX')),
            on='Attr1'
        )
    )
    

    That one looks worse: the double group_by and the added complexity of the join make it seem like it'd be slower than the elegant-looking window function. The thing is that the window function is doing the same work; it's just less visible. It also doesn't get the same optimizations that group_by gets.
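
    If memory is also a concern at 60M rows, the same joined group_bys can be written against a LazyFrame so the query is planned before anything is materialized. This is just a sketch under the assumption that the lazy engine can share work between the two passes over df; it isn't part of the benchmarks below:

    # lazy variant of the joined group_bys (not benchmarked below)
    lf = df.lazy()
    (
        lf
        .group_by(['Attr1','Attr2'], maintain_order=True)
        .agg(pl.col('Attr1').count().alias('count'))
        .join(
            lf
            .group_by(['Attr1'], maintain_order=True)
            .agg(pl.col('Val1').max().alias('MAX')),
            on='Attr1'
        )
        .collect()
    )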

    Benchmark Time

    Say we start with

    import polars as pl
    import numpy as np
    n=100_000_000
    df = pl.DataFrame({
          'Attr1' : np.random.choice(['A','B','C'],n),
          'Attr2' : np.random.choice(['X','Y','Z'],n),
          'Val1' :  np.random.randint(0,100,n),
        }).sort(['Attr1','Attr2'])
    

    and let's say we want to calculate the standard deviation instead of the max, so the max-of-max hybrid approach won't work.

    We can do a pure window:

    %%timeit
    (
        df
        .select(
            'Attr1','Attr2',
            pl.col('Attr1').count().over(['Attr1','Attr2']).alias('count'),
            pl.col('Val1').std().over('Attr1').alias('MAX')
            )
        .unique(['Attr1','Attr2'])
    
    )
    6.91 s ± 466 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    or the joined group_bys:

    %%timeit
    (
        df
        .group_by(['Attr1','Attr2'],maintain_order=True)
        .agg(
            pl.col('Attr1').count().alias('count')
            )
        .join(
            df
            .group_by(['Attr1'],maintain_order=True)
            .agg(pl.col('Val1').std().alias('MAX')),
            on='Attr1'
        )
    )
    3.81 s ± 302 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    As you can see, the joined group_bys approach is almost twice as fast as the pure window function.