Search code examples
pythongroup-bypython-polars

Python Polars group by on both time and categorical values


There is a polars dataframe which consists of 3 fields listed below.

df = pl.from_repr("""
┌─────────┬─────────────────────┬─────────────┐
│ user_id ┆ date                ┆ part_of_day │
│ ---     ┆ ---                 ┆ ---         │
│ i64     ┆ datetime[ns]        ┆ cat         │
╞═════════╪═════════════════════╪═════════════╡
│ 173367  ┆ 2021-08-03 00:00:00 ┆ day         │
│ 132702  ┆ 2021-10-28 00:00:00 ┆ evening     │
│ 100853  ┆ 2021-07-29 00:00:00 ┆ night       │
│ 305810  ┆ 2021-08-24 00:00:00 ┆ day         │
│ 305239  ┆ 2021-08-13 00:00:00 ┆ day         │
└─────────┴─────────────────────┴─────────────┘
""")

My task is to calculate the number of unique users for each week and time of day. Polars provides three different types of grouping options::

  1. group_by: This option allows for simple grouping by columns or expressions.
  2. group_by_dynamic: This option only accepts a datetime column as the grouper.
  3. rolling: This option also only accepts a datetime column as the grouper.

It seems like group_by_dynamic would be the best fit for this specific task. However, it does not allow for other columns to be used as the grouper. Therefore, my question is how can I accomplish this task using Polars?

Additionally, I have included some code that I would have used to solve this problem using Pandas.

(
    df
    .cast({"part_of_day": pl.String})
    .to_pandas()
    .groupby([pd.Grouper(key="date", freq="1W"), "part_of_day"])
    .user_id
    .nunique()
)

Desired result:

date        part_of_day
2021-08-01  night          1
2021-08-08  day            1
2021-08-15  day            1
2021-08-29  day            1
2021-10-31  evening        1
Name: user_id, dtype: int64

Solution

  • The pandas example looks equivalent to some date arithmetic combined with dt.truncate

    (df.with_columns(
           (pl.col("date") + pl.duration(weeks=1)).dt.truncate("1w")  - pl.duration(days=1)
       )
       .group_by("date", "part_of_day")
       .agg(pl.col("user_id").n_unique())
       .sort("date") # optional
    )
    
    shape: (5, 3)
    ┌─────────────────────┬─────────────┬─────────┐
    │ date                | part_of_day | user_id │
    │ ---                 | ---         | ---     │
    │ datetime[μs]        | cat         | u32     │
    ╞═════════════════════╪═════════════╪═════════╡
    │ 2021-08-01 00:00:00 | night       | 1       │
    │ 2021-08-08 00:00:00 | day         | 1       │
    │ 2021-08-15 00:00:00 | day         | 1       │
    │ 2021-08-29 00:00:00 | day         | 1       │
    │ 2021-10-31 00:00:00 | evening     | 1       │
    └─────────────────────┴─────────────┴─────────┘