Context
I have a Polars DataFrame ("df") consisting of a temporal column ("date"), a would-be ID column ("id") and a number of numerical columns (i.e. features). The shape is (14852, 431).
The df represents data about financial transactions. The ID indicates the client; the date is the starting date of the month those transactions were carried out. The features are some "averages" (e.g. average money spent, number of transactions, etc.).
After appropriate manipulation, this df is to be fed into a Machine Learning model for training purposes.
Aim
Qualitatively, I am trying to do the following: given a restricted_df of shape (6, 431), I need to replace it with an unstacked_df of shape (1, 431 * 6). In practice, if restricted_df has a feature named "money_spent", then unstacked_df should contain columns named "money_spent_0", "money_spent_1", ..., "money_spent_5".
Approach
I sort of know the various pieces of the puzzle, even though I am admittedly new to Polars. In my opinion they are:
group_by_dynamic("date", every="180d", period="180d", group_by="id")
unstack(step=1, columns=features)
However, I can't really get them to work, at least not in an efficient manner. See below for an inefficient solution.
Issue
I reckon the main issue is that, as far as my understanding goes, after group_by_dynamic Polars expects an .agg whose expressions act on single columns, e.g. via pl.col("foo").some_function(). However, Series does not come with an unstack method, so this does not quite work.
Attempted Solution
One highly inefficient approach is to convert the aforementioned Series to a DataFrame, and then unstack this instead. However, this alone does not quite solve the problem: we merely end up with a df with the same 431 columns, where each cell contains the unstacked DataFrame as a nested object.
This is obtained via
df.group_by_dynamic("date", every="180d", period="180d", group_by="id").agg(pl.col(features).map_elements(lambda x: pl.DataFrame(x).unstack(step=1)))
Schematically, for a feature "foo", we end up with something of the form
| foo |
|col_0 1, col_1 2, ...|
Instead of the desired
|foo_0|foo_1|...|
| 1 | 2 |...|
To remedy this we could insert a to_dict() and, at the end, use the unnest function. This is obtained via
df.group_by_dynamic("date", every="180d", period="180d", group_by="id").agg(pl.col(features).map_elements(lambda x: pl.DataFrame(x).unstack(step=1).to_dict())).unnest(features)
Question
This could work, but it is obviously very inefficient and seems like overkill. Is there a more direct way to get this done?
Minimal Example
import numpy as np
import polars as pl
from datetime import date
# Generate fake data
ids = [1]*6 + [2]*6
start = date(2023, 1, 1)
end = date(2023, 12, 1)
dates = pl.date_range(start, end, "1mo", eager=True)
foos = np.arange(0, 12)
bars = np.arange(12, 24)
# Generate df
df = pl.DataFrame({"id":ids, "date":dates, "foo":foos, "bar":bars})
# Print df
print(df)
shape: (12, 4)
┌─────┬────────────┬─────┬─────┐
│ id ┆ date ┆ foo ┆ bar │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ i64 │
╞═════╪════════════╪═════╪═════╡
│ 1 ┆ 2023-01-01 ┆ 0 ┆ 12 │
│ 1 ┆ 2023-02-01 ┆ 1 ┆ 13 │
│ 1 ┆ 2023-03-01 ┆ 2 ┆ 14 │
│ 1 ┆ 2023-04-01 ┆ 3 ┆ 15 │
│ … ┆ … ┆ … ┆ … │
│ 2 ┆ 2023-09-01 ┆ 8 ┆ 20 │
│ 2 ┆ 2023-10-01 ┆ 9 ┆ 21 │
│ 2 ┆ 2023-11-01 ┆ 10 ┆ 22 │
│ 2 ┆ 2023-12-01 ┆ 11 ┆ 23 │
└─────┴────────────┴─────┴─────┘
# Group df as required
grouped_df = df.group_by_dynamic("date", every="180d", period="180d", group_by="id")
# Check group content
for _name, group in grouped_df:
    print(group)
shape: (6, 4)
┌─────┬────────────┬─────┬─────┐
│ id ┆ date ┆ foo ┆ bar │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ i64 │
╞═════╪════════════╪═════╪═════╡
│ 1 ┆ 2023-01-01 ┆ 0 ┆ 12 │
│ 1 ┆ 2023-02-01 ┆ 1 ┆ 13 │
│ 1 ┆ 2023-03-01 ┆ 2 ┆ 14 │
│ 1 ┆ 2023-04-01 ┆ 3 ┆ 15 │
│ 1 ┆ 2023-05-01 ┆ 4 ┆ 16 │
│ 1 ┆ 2023-06-01 ┆ 5 ┆ 17 │
└─────┴────────────┴─────┴─────┘
shape: (6, 4)
┌─────┬────────────┬─────┬─────┐
│ id ┆ date ┆ foo ┆ bar │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ i64 │
╞═════╪════════════╪═════╪═════╡
│ 2 ┆ 2023-07-01 ┆ 6 ┆ 18 │
│ 2 ┆ 2023-08-01 ┆ 7 ┆ 19 │
│ 2 ┆ 2023-09-01 ┆ 8 ┆ 20 │
│ 2 ┆ 2023-10-01 ┆ 9 ┆ 21 │
│ 2 ┆ 2023-11-01 ┆ 10 ┆ 22 │
│ 2 ┆ 2023-12-01 ┆ 11 ┆ 23 │
└─────┴────────────┴─────┴─────┘
# Manipulation
result = ...
# Expected output after correct manipulation
print(result)
shape: (2, 14)
┌─────┬────────────┬───────┬───────┬───┬───────┬───────┬───────┬───────┐
│ id ┆ date ┆ foo_0 ┆ foo_1 ┆ … ┆ bar_2 ┆ bar_3 ┆ bar_4 ┆ bar_5 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪════════════╪═══════╪═══════╪═══╪═══════╪═══════╪═══════╪═══════╡
│ 1 ┆ 2023-01-01 ┆ 0 ┆ 1 ┆ … ┆ 14 ┆ 15 ┆ 16 ┆ 17 │
│ 2 ┆ 2023-07-01 ┆ 6 ┆ 7 ┆ … ┆ 20 ┆ 21 ┆ 22 ┆ 23 │
└─────┴────────────┴───────┴───────┴───┴───────┴───────┴───────┴───────┘
It looks like .list.to_struct may be the missing part you need. We can use n_field_strategy="max_width" to ensure all results have the same "length", and fields= can take a callable, in this case one that adds the column name as a prefix. You can then unnest the resulting struct columns:
features = "foo", "bar"

(df.group_by_dynamic("date", group_by="id", every="6mo")
   .agg(features)
   .with_columns(
       pl.col(feature).list.to_struct(
           fields=lambda idx, feature=feature: f"{feature}_{idx}",
           n_field_strategy="max_width",
       )
       for feature in features
   )
   .unnest(*features)
)
shape: (2, 14)
┌─────┬────────────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐
│ id ┆ date ┆ foo_0 ┆ foo_1 ┆ foo_2 ┆ foo_3 ┆ foo_4 ┆ foo_5 ┆ bar_0 ┆ bar_1 ┆ bar_2 ┆ bar_3 ┆ bar_4 ┆ bar_5 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪════════════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╡
│ 1 ┆ 2023-01-01 ┆ 0 ┆ 1 ┆ 2 ┆ 3 ┆ 4 ┆ 5 ┆ 12 ┆ 13 ┆ 14 ┆ 15 ┆ 16 ┆ 17 │
│ 2 ┆ 2023-07-01 ┆ 6 ┆ 7 ┆ 8 ┆ 9 ┆ 10 ┆ 11 ┆ 18 ┆ 19 ┆ 20 ┆ 21 ┆ 22 ┆ 23 │
└─────┴────────────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
Another approach could be to reshape into long then wide with .unpivot() and .pivot():
features = "foo", "bar"

(df.group_by_dynamic("date", group_by="id", every="6mo")
   .agg(features)
   .unpivot(index=["id", "date"])
   .explode("value")
   .with_columns(
       pl.format("{}_{}", pl.col.variable,
                 pl.int_range(pl.len()).over("id", "variable"))
   )
   .pivot(on="variable", index=["id", "date"])
)