Search code examples
pythonpython-polars

How to Use Aggregation Functions as an Index in a Polars DataFrame?


I have a Polars DataFrame, and I want to create a summarized view where aggregated values (e.g., unique IDs, total sends) are displayed in a format that makes comparison across months easier. Here's an example of my dataset:

My example dataframe:

import polars as pl
df = pl.DataFrame({
    "Channel": ["X", "X", "Y", "Y", "X", "X", "Y", "Y", "X", "X", "Y", "Y", "X", "X", "Y", "Y"],
    "ID": ["a", "b", "b", "a", "e", "b", "g", "h", "a", "a", "k", "a", "b", "n", "o", "p"],
    "Month": ["1", "2", "1", "2", "1", "2", "1", "2", "1", "2", "1", "2", "1", "2", "1", "2"]
})

Currently, I use the following group_by() approach to calculate the number of unique IDs and the total number of sends for each Month and Channel:

(
    df
    .group_by(
        pl.col("Month"),
        pl.col("Channel")
    )
    .agg(
        pl.col("ID").n_unique().alias("Uniques ID"),
        pl.col("ID").len().alias("Total sends")
    )
)
shape: (4, 4)
┌───────┬─────────┬────────────┬─────────────┐
│ Month ┆ Channel ┆ Uniques ID ┆ Total sends │
│ ---   ┆ ---     ┆ ---        ┆ ---         │
│ str   ┆ str     ┆ u32        ┆ u32         │
╞═══════╪═════════╪════════════╪═════════════╡
│ 1     ┆ X       ┆ 3          ┆ 4           │
│ 1     ┆ Y       ┆ 4          ┆ 4           │
│ 2     ┆ X       ┆ 3          ┆ 4           │
│ 2     ┆ Y       ┆ 3          ┆ 4           │
└───────┴─────────┴────────────┴─────────────┘

However, my actual dataset is much larger, and have more agg_functions, so I want a format that better highlights comparisons across months. Ideally, I want the output to look like this:

| Channels | agg_func     | months | months |
|----------|--------------|--------|--------|
|          |              | 1      | 2      |
| X        | Uniques ID   | 3      | 3      |
| X        | Total sends  | 4      | 4      |
| Y        | Uniques ID   | 4      | 3      |
| Y        | Total sends  | 4      | 4      |

I believe I could use .pivot() and pass the aggregation functions as part of the index. But, I'm not sure how to implement this directly without creating an auxiliary DataFrame. Any suggestions?


Solution

  • You can aggregate multiple aggregates while pivoting and then explode the lists:

    (
        df.pivot(
            on="Month",
            values="ID",
            aggregate_function=
            pl.concat_list(
                pl.element().n_unique().alias("value"),
                pl.element().len().alias("value")
            )
        )
        .with_columns(agg_func=["Uniques ID","Total sends"])
        .explode(pl.exclude("Channel"))
    )
    
    shape: (4, 4)
    ┌─────────┬─────┬─────┬─────────────┐
    │ Channel ┆ 1   ┆ 2   ┆ agg_func    │
    │ ---     ┆ --- ┆ --- ┆ ---         │
    │ str     ┆ u32 ┆ u32 ┆ str         │
    ╞═════════╪═════╪═════╪═════════════╡
    │ X       ┆ 3   ┆ 3   ┆ Uniques ID  │
    │ X       ┆ 4   ┆ 4   ┆ Total sends │
    │ Y       ┆ 4   ┆ 3   ┆ Uniques ID  │
    │ Y       ┆ 4   ┆ 4   ┆ Total sends │
    └─────────┴─────┴─────┴─────────────┘
    

    Or, you can do it with multiple pivots (one per aggregate function):

    pl.concat([
        df.pivot(
           on="Month",
           values="ID",
           aggregate_function=agg_func
        ).with_columns(
            pl.lit(agg_func_name).alias("agg_func")
        )
        for agg_func, agg_func_name in [
            (pl.element().n_unique(), "Uniques ID"), 
            (pl.element().len(), "Total sends")
        ]
    ])
    
    # alternatively group_by first and then pivot
    # pl.concat([
    #     df.group_by("Month","Channel")
    #     .agg(agg_func)
    #     .with_columns(agg_func=pl.lit(agg_func_name))
    #     for agg_func, agg_func_name in [
    #         (pl.col.ID.n_unique(), "Uniques ID"), 
    #         (pl.col.ID.len(), "Total sends")
    #     ]
    # ]).pivot(on="Month", values="ID")
    
    shape: (4, 4)
    ┌─────────┬─────┬─────┬─────────────┐
    │ Channel ┆ 1   ┆ 2   ┆ agg_func    │
    │ ---     ┆ --- ┆ --- ┆ ---         │
    │ str     ┆ u32 ┆ u32 ┆ str         │
    ╞═════════╪═════╪═════╪═════════════╡
    │ X       ┆ 3   ┆ 3   ┆ Uniques ID  │
    │ Y       ┆ 4   ┆ 3   ┆ Uniques ID  │
    │ X       ┆ 4   ┆ 4   ┆ Total sends │
    │ Y       ┆ 4   ┆ 4   ┆ Total sends │
    └─────────┴─────┴─────┴─────────────┘
    

    Of course, you can also extend your solution with unpivot and pivot

    (
        df
        .group_by("Month","Channel")
        .agg(
            pl.col("ID").n_unique().alias("Uniques ID"),
            pl.col("ID").len().alias("Total sends")
        )
        .unpivot(index=["Month","Channel"], variable_name="agg_func")
        .pivot(on="Month", values="value")
    )
    
    shape: (4, 4)
    ┌─────────┬─────────────┬─────┬─────┐
    │ Channel ┆ agg_func    ┆ 2   ┆ 1   │
    │ ---     ┆ ---         ┆ --- ┆ --- │
    │ str     ┆ str         ┆ u32 ┆ u32 │
    ╞═════════╪═════════════╪═════╪═════╡
    │ Y       ┆ Uniques ID  ┆ 3   ┆ 4   │
    │ X       ┆ Uniques ID  ┆ 3   ┆ 3   │
    │ Y       ┆ Total sends ┆ 4   ┆ 4   │
    │ X       ┆ Total sends ┆ 4   ┆ 4   │
    └─────────┴─────────────┴─────┴─────┘