Search code examples
python-polars

Groupby aggregate two columns into a dictionary in Polars


Given the following data, i'm looking to groupby and combine two columns into one, holding a dictionary. One column supplies the keys, while the values stem from another column which is aggregated into a list first.

import polars as pl

df = pl.DataFrame(
    {
        "names": ["foo", "ham", "spam", "cheese", "egg", "foo"],
        "dates": ["1", "1", "2", "3", "3", "4"],
        "groups": ["A", "A", "B", "B", "B", "C"],
    }
)

This is what i'm trying to do:

  groups                                 combined
0      A                    {'1': ['foo', 'ham']}
1      B  {'2': ['spam'], '3': ['cheese', 'egg']}
2      C                           {'4': ['foo']}

In pandas i can do this using two groupby statements, in pyspark using a set of operations around "map_from_entries" but despite various attempts i haven't figured out a way in polars.

(
    df.group_by("groups", "dates").all()
    .to_pandas()
    .groupby(["groups"])
    .apply(lambda x: dict(zip(x["dates"], x["names"])))
    .reset_index(name="combined")
)

While this works, it certainly doesn't feel right.

Alternatively, inspired by this post i've tried a number of variations similar to the following, including converting the dict to json strings among other things.

df.group_by("groups").agg(
    pl.map_groups(exprs=["dates", "names"], function=build_dict).alias("combined")
)

Solution

  • You can do this:

    print(data
        .group_by("groups", "dates").agg("names")
        .group_by("groups")
        .agg(
            pl.map_groups(["dates", "names"], 
                lambda s: pl.Series([dict(zip(s[0], s[1].to_list()))], dtype=pl.Object),
            )
        )
    )
    
    
    shape: (3, 2)
    ┌────────┬─────────────────────────────────────────┐
    │ groups ┆ dates                                   │
    │ ---    ┆ ---                                     │
    │ str    ┆ object                                  │ 
    ╞════════╪═════════════════════════════════════════╡
    │ A      ┆ {'1': ['foo', 'ham']}                   │
    │ B      ┆ {'2': ['spam'], '3': ['cheese', 'egg']} │
    │ C      ┆ {'4': ['foo']}                          │
    └────────┴─────────────────────────────────────────┘
    
    

    This is not really how you should be using DataFrames though.

    Columns of type object have very little support for the rest of the Polars API.

    There is likely a solution that lets you deal with more flattened dataframes and doesn't require you to put slow python objects in dataframes.