There is a polars dataframe which consists of 3 fields listed below.
df = pl.from_repr("""
┌─────────┬─────────────────────┬─────────────┐
│ user_id ┆ date                ┆ part_of_day │
│ ---     ┆ ---                 ┆ ---         │
│ i64     ┆ datetime[ns]        ┆ cat         │
╞═════════╪═════════════════════╪═════════════╡
│ 173367  ┆ 2021-08-03 00:00:00 ┆ day         │
│ 132702  ┆ 2021-10-28 00:00:00 ┆ evening     │
│ 100853  ┆ 2021-07-29 00:00:00 ┆ night       │
│ 305810  ┆ 2021-08-24 00:00:00 ┆ day         │
│ 305239  ┆ 2021-08-13 00:00:00 ┆ day         │
└─────────┴─────────────────────┴─────────────┘
""")
My task is to count the number of unique users for each week and time of day. Polars provides three grouping options: group_by, group_by_dynamic, and rolling.
It seems like group_by_dynamic would be the best fit for this task, but it does not appear to allow additional columns to be used as groupers. How can I accomplish this in Polars?
For reference, here is the code I would use to solve this problem with pandas:
import pandas as pd

(
    df
    .cast({"part_of_day": pl.String})  # Categorical -> str before pandas conversion
    .to_pandas()
    .groupby([pd.Grouper(key="date", freq="1W"), "part_of_day"])
    .user_id
    .nunique()
)
Desired result:
date        part_of_day
2021-08-01  night          1
2021-08-08  day            1
2021-08-15  day            1
2021-08-29  day            1
2021-10-31  evening        1
Name: user_id, dtype: int64
The pandas example is equivalent to some date arithmetic combined with dt.truncate. Pandas' freq="1W" labels each weekly bucket by its Sunday end date, whereas Polars' dt.truncate("1w") snaps to the preceding Monday; shifting forward one week, truncating, and stepping back one day therefore reproduces the pandas labels.
(df.with_columns(
(pl.col("date") + pl.duration(weeks=1)).dt.truncate("1w") - pl.duration(days=1)
)
.group_by("date", "part_of_day")
.agg(pl.col("user_id").n_unique())
.sort("date") # optional
)
shape: (5, 3)
┌─────────────────────┬─────────────┬─────────┐
│ date                ┆ part_of_day ┆ user_id │
│ ---                 ┆ ---         ┆ ---     │
│ datetime[μs]        ┆ cat         ┆ u32     │
╞═════════════════════╪═════════════╪═════════╡
│ 2021-08-01 00:00:00 ┆ night       ┆ 1       │
│ 2021-08-08 00:00:00 ┆ day         ┆ 1       │
│ 2021-08-15 00:00:00 ┆ day         ┆ 1       │
│ 2021-08-29 00:00:00 ┆ day         ┆ 1       │
│ 2021-10-31 00:00:00 ┆ evening     ┆ 1       │
└─────────────────────┴─────────────┴─────────┘