I would like to group_by the data in interval of a hourly/daily/weekly and further group by certain other clauses. I was able to acheive group_by hourly/daily/weekly basis by using group_by_dynamic option provided by polars.
How do we add a secondary non datetime group_by clause to the polars dataframe after using group_by_dynamic operation in polars?
The sample dataframe read from csv is
df = pl.from_repr("""
┌─────────────────────┬──────────┬────────┬─────────┬──────────┐
│ Date ┆ Item ┆ Issue ┆ Channel ┆ ID │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ str ┆ str ┆ i64 │
╞═════════════════════╪══════════╪════════╪═════════╪══════════╡
│ 2023-01-02 01:40:00 ┆ Item ABC ┆ ASDFFF ┆ Web ┆ 32513995 │
│ 2023-01-02 02:15:00 ┆ Item ABC ┆ WERWET ┆ Web ┆ 32513995 │
│ 2023-01-02 03:00:00 ┆ Item ABC ┆ BVRTNB ┆ Twitter ┆ 32513995 │
│ 2023-01-03 04:11:00 ┆ Item ABC ┆ VDFGVS ┆ Fax ┆ 32513995 │
│ 2023-01-03 04:30:00 ┆ Item ABC ┆ QWEDWE ┆ Twitter ┆ 32513995 │
│ 2023-01-03 04:45:00 ┆ Item ABC ┆ BRHMNU ┆ Fax ┆ 32513995 │
└─────────────────────┴──────────┴────────┴─────────┴──────────┘
""")
I am grouping this data in houlry interval using polars group_by_dynamic operation using the below code snippet.
import polars as pl
q = (
pl.scan_csv("Test.csv", try_parse_dates=True)
.filter(pl.col("Item") == "Item ABC")
.group_by_dynamic("Date", every="1h", closed="right")
.agg(pl.col("ID").count().alias("total"))
.sort("Date")
)
df = q.collect()
This code gives me result as
┌─────────────────────┬───────┐
│ Date ┆ total │
╞═════════════════════╪═══════╡
│ 2023-01-02 01:00:00 ┆ 2 │
│ 2023-01-02 02:00:00 ┆ 1 │
│ 2023-01-02 03:00:00 ┆ 1 │
│ 2023-01-05 04:00:00 ┆ 3 │
└─────────────────────┴───────┘
But i would want to further group by this data by "Channel" and expecting the result as
┌────────────-──────-─┬─────────┬───────┐
│ Date ┆ Channel ┆ total │
╞═════════════════════╪═════════╪═══════╡
│ 2023-01-02 01:00:00 ┆ Twitter ┆ 1 │
│ 2023-01-02 01:00:00 ┆ Web ┆ 1 │
│ 2023-01-02 01:00:00 ┆ Web ┆ 1 │
│ 2023-01-02 01:00:00 ┆ Twitter ┆ 1 │
│ 2023-01-03 01:00:00 ┆ Fax ┆ 2 │
│ 2023-01-11 01:00:00 ┆ Twitter ┆ 1 │
└─────────────────────┴─────────┴───────┘
You can specify group_by
q = (
pl.scan_csv("Test.csv", try_parse_dates=True)
.filter(pl.col("Item") == "Item ABC")
.group_by_dynamic("Date", every="1h", closed="right", group_by="Item")
.agg(pl.col("ID").count().alias("total"))
.sort("Date")
)