Given the following data, I'm looking to group by one column and combine two other columns into a single column holding a dictionary. One column supplies the keys, while the values come from the other column, which is first aggregated into a list.
import polars as pl

df = pl.DataFrame(
    {
        "names": ["foo", "ham", "spam", "cheese", "egg", "foo"],
        "dates": ["1", "1", "2", "3", "3", "4"],
        "groups": ["A", "A", "B", "B", "B", "C"],
    }
)
This is what I'm trying to get:
groups combined
0 A {'1': ['foo', 'ham']}
1 B {'2': ['spam'], '3': ['cheese', 'egg']}
2 C {'4': ['foo']}
In pandas I can do this with two groupby statements, and in PySpark with a set of operations around "map_from_entries", but despite various attempts I haven't figured out a way to do it in Polars. Here is a hybrid workaround that gets me there:
(
    df.group_by("groups", "dates").all()  # aggregate names into lists per (group, date)
    .to_pandas()
    .groupby(["groups"])
    .apply(lambda x: dict(zip(x["dates"], x["names"])))  # build {date: [names]} per group
    .reset_index(name="combined")
)
While this works, it certainly doesn't feel right.
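For reference, the PySpark route mentioned above looks roughly like this. This is only a sketch: it assumes a running Spark session and that the same data lives in a Spark DataFrame sdf.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df.to_pandas())  # same data as the Polars frame above

combined = (
    sdf.groupBy("groups", "dates")
    .agg(F.collect_list("names").alias("names"))  # names -> array per (group, date)
    .groupBy("groups")
    .agg(
        F.map_from_entries(
            F.collect_list(F.struct("dates", "names"))  # (date, names) entries
        ).alias("combined")  # map<date, array<name>> per group
    )
)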
As an alternative to that pandas round trip, and inspired by this post, I've tried a number of variations similar to the following, including converting the dict to JSON strings, among other things.
df.group_by("groups").agg(
pl.map_groups(exprs=["dates", "names"], function=build_dict).alias("combined")
)
You can do this:
print(
    df
    .group_by("groups", "dates").agg("names")  # names -> list[str] per (group, date)
    .group_by("groups")
    .agg(
        pl.map_groups(
            ["dates", "names"],
            lambda s: pl.Series([dict(zip(s[0], s[1].to_list()))], dtype=pl.Object),
        )
    )
)
shape: (3, 2)
┌────────┬─────────────────────────────────────────┐
│ groups ┆ dates │
│ --- ┆ --- │
│ str ┆ object │
╞════════╪═════════════════════════════════════════╡
│ A ┆ {'1': ['foo', 'ham']} │
│ B ┆ {'2': ['spam'], '3': ['cheese', 'egg']} │
│ C ┆ {'4': ['foo']} │
└────────┴─────────────────────────────────────────┘
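If all you need at the end is a plain Python mapping from each group to its dict, you can pull it out of that result afterwards. A small sketch, assuming you bind the expression above to a variable out instead of printing it (note the aggregated column keeps the name "dates" because no alias was given):

mapping = dict(zip(out["groups"].to_list(), out["dates"].to_list()))
# {'A': {'1': ['foo', 'ham']}, 'B': {'2': ['spam'], '3': ['cheese', 'egg']}, 'C': {'4': ['foo']}}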
This is not really how you should be using DataFrames, though. Columns of dtype object have very little support in the rest of the Polars API. There is likely a solution that lets you work with more flattened DataFrames and doesn't require putting slow Python objects into them.
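For example, one way to stay in native Polars types, a sketch rather than a definitive recipe, is to keep the per-group (date, names) pairs as a list of structs and only build Python dicts at the very edge, if you need them at all:

out = (
    df
    .group_by("groups", "dates").agg("names")  # names -> list[str] per (group, date)
    .group_by("groups")
    .agg(pl.struct("dates", "names").alias("combined"))  # list of {dates, names} structs per group
)

# Materialise plain Python dicts only when leaving Polars:
combined = {
    row["groups"]: {d["dates"]: d["names"] for d in row["combined"]}
    for row in out.to_dicts()
}
# {'A': {'1': ['foo', 'ham']}, 'B': {'2': ['spam'], '3': ['cheese', 'egg']}, 'C': {'4': ['foo']}}

That keeps the combined column as a regular nested List/Struct dtype, which the rest of the Polars API can still operate on.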