python, python-polars

Adding secondary group by clause using group_by_dynamic() operation in polars


I would like to group the data into hourly/daily/weekly intervals and then further group it by certain other columns. I was able to achieve the hourly/daily/weekly grouping by using the group_by_dynamic option provided by polars.

How do we add a secondary, non-datetime group_by clause to the polars dataframe when using the group_by_dynamic operation?

The sample dataframe read from csv is

df = pl.from_repr("""
┌─────────────────────┬──────────┬────────┬─────────┬──────────┐
│ Date                ┆ Item     ┆ Issue  ┆ Channel ┆ ID       │
│ ---                 ┆ ---      ┆ ---    ┆ ---     ┆ ---      │
│ datetime[μs]        ┆ str      ┆ str    ┆ str     ┆ i64      │
╞═════════════════════╪══════════╪════════╪═════════╪══════════╡
│ 2023-01-02 01:40:00 ┆ Item ABC ┆ ASDFFF ┆ Web     ┆ 32513995 │
│ 2023-01-02 02:15:00 ┆ Item ABC ┆ WERWET ┆ Web     ┆ 32513995 │
│ 2023-01-02 03:00:00 ┆ Item ABC ┆ BVRTNB ┆ Twitter ┆ 32513995 │
│ 2023-01-03 04:11:00 ┆ Item ABC ┆ VDFGVS ┆ Fax     ┆ 32513995 │
│ 2023-01-03 04:30:00 ┆ Item ABC ┆ QWEDWE ┆ Twitter ┆ 32513995 │
│ 2023-01-03 04:45:00 ┆ Item ABC ┆ BRHMNU ┆ Fax     ┆ 32513995 │
└─────────────────────┴──────────┴────────┴─────────┴──────────┘
""")

I am grouping this data in hourly intervals using polars' group_by_dynamic operation with the code snippet below.

import polars as pl


q = (
    pl.scan_csv("Test.csv", try_parse_dates=True)
    .filter(pl.col("Item") == "Item ABC")
    .group_by_dynamic("Date", every="1h", closed="right")
    .agg(pl.col("ID").count().alias("total"))
    .sort("Date")
)

df = q.collect()

This code gives me the following result:

┌─────────────────────┬───────┐
│ Date                ┆ total │
╞═════════════════════╪═══════╡
│ 2023-01-02 01:00:00 ┆ 2     │
│ 2023-01-02 02:00:00 ┆ 1     │
│ 2023-01-02 03:00:00 ┆ 1     │
│ 2023-01-03 04:00:00 ┆ 3     │
└─────────────────────┴───────┘

But I would like to further group this data by "Channel", expecting a result like:

┌─────────────────────┬─────────┬───────┐
│ Date                ┆ Channel ┆ total │
╞═════════════════════╪═════════╪═══════╡
│ 2023-01-02 01:00:00 ┆ Web     ┆ 1     │
│ 2023-01-02 02:00:00 ┆ Twitter ┆ 1     │
│ 2023-01-02 02:00:00 ┆ Web     ┆ 1     │
│ 2023-01-03 04:00:00 ┆ Fax     ┆ 2     │
│ 2023-01-03 04:00:00 ┆ Twitter ┆ 1     │

Solution

  • You can pass the additional, non-temporal grouping column via the group_by parameter of group_by_dynamic (named by in older polars versions):

    q = (
        pl.scan_csv("Test.csv", try_parse_dates=True)
        .filter(pl.col("Item") == "Item ABC")
        .group_by_dynamic("Date", every="1h", closed="right", group_by="Channel")
        .agg(pl.col("ID").count().alias("total"))
        .sort("Date")
    )