python-polars

How to achieve group_by_dynamic followed by unstack?


Context

I have a Polars DataFrame ("df") consisting of a temporal column ("date"), a would-be ID column ("id") and a number of numerical columns (i.e. features). The shape is (14852, 431).

The df represents data about financial transactions. The ID indicates the client; the date is the starting date of the month those transactions were carried out. The features are some "averages" (e.g. average money spent, number of transactions, etc.).

After appropriate manipulation, this df is to be fed into a Machine Learning model for training purposes.

Aim

Qualitatively I am trying to do the following:

  1. For each unique ID (client), create a 6-month sliding window;
  2. Restrict the df to the transactions in that period for that ID;
  3. "Unstack" its rows. That is: if a given time window contains 6 transactions for a given ID, i.e. we have a restricted_df of shape (6, 431), I would need to replace it with an unstacked_df of shape (1, 431 * 6). In practice, if restricted_df has a feature named "money_spent", then unstacked_df should contain something like "money_spent_0", "money_spent_1", ..., "money_spent_5".

Approach

I sort of know the various pieces of the puzzle, even though I am admittedly new to Polars. In my opinion they are:

  1. group_by_dynamic("date", every="180d", period="180d", group_by="id")
  2. unstack(step=1, columns=features)

However, I can't really get them to work, at least not in an efficient manner. See below for an inefficient solution.

Issue

I reckon the main issue is that, as far as my understanding goes, after group_by_dynamic Polars expects a call to .agg, whose expressions operate on single columns, e.g. via pl.col("foo").some_function(). However, Series does not come with an unstack method, so this does not quite work.

Attempted Solution

One highly inefficient approach is to convert the aforementioned Series to a DataFrame, and then unstack this instead. However, this alone does not quite solve the problem. In fact, we merely end up with a df with the same 431 columns, where each one of these contains a DataFrame (the one we unstacked) for each row.

This is obtained via

df.group_by_dynamic("date", every="180d", period="180d", group_by="id").agg(pl.col(features).map_elements(lambda x: pl.DataFrame(x).unstack(step=1)))

Schematically, for a feature "foo", we end up with something of the form

|       foo           |
|col_0 1, col_1 2, ...| 

Instead of the desired

|foo_0|foo_1|...|
|  1  |  2  |...|

To remedy this we could insert a to_dict() and, at the end, use the unnest function. This is obtained via

df.group_by_dynamic("date", every="180d", period="180d", group_by="id").agg(pl.col(features).map_elements(lambda x: pl.DataFrame(x).unstack(step=1).to_dict())).unnest(features)

Question

This could work, but is obviously very inefficient, and seems overkill to me. Is there any way we can get this thing done?

Minimal Example

import numpy as np
import polars as pl
from datetime import date

# Generate fake data
ids = [1]*6 + [2]*6
start = date(2023, 1, 1)
end = date(2023, 12, 1)
dates = pl.date_range(start, end, "1mo", eager=True)
foos = np.arange(0, 12)
bars = np.arange(12, 24)

# Generate df
df = pl.DataFrame({"id":ids, "date":dates, "foo":foos, "bar":bars})

# Print df
print(df)
shape: (12, 4)
┌─────┬────────────┬─────┬─────┐
│ id  ┆ date       ┆ foo ┆ bar │
│ --- ┆ ---        ┆ --- ┆ --- │
│ i64 ┆ date       ┆ i64 ┆ i64 │
╞═════╪════════════╪═════╪═════╡
│ 1   ┆ 2023-01-01 ┆ 0   ┆ 12  │
│ 1   ┆ 2023-02-01 ┆ 1   ┆ 13  │
│ 1   ┆ 2023-03-01 ┆ 2   ┆ 14  │
│ 1   ┆ 2023-04-01 ┆ 3   ┆ 15  │
│ …   ┆ …          ┆ …   ┆ …   │
│ 2   ┆ 2023-09-01 ┆ 8   ┆ 20  │
│ 2   ┆ 2023-10-01 ┆ 9   ┆ 21  │
│ 2   ┆ 2023-11-01 ┆ 10  ┆ 22  │
│ 2   ┆ 2023-12-01 ┆ 11  ┆ 23  │
└─────┴────────────┴─────┴─────┘

# Group df as required
grouped_df = df.group_by_dynamic("date", every="180d", period="180d", group_by="id")

# Check group content
for _name, group in grouped_df:
    print(group)

shape: (6, 4)
┌─────┬────────────┬─────┬─────┐
│ id  ┆ date       ┆ foo ┆ bar │
│ --- ┆ ---        ┆ --- ┆ --- │
│ i64 ┆ date       ┆ i64 ┆ i64 │
╞═════╪════════════╪═════╪═════╡
│ 1   ┆ 2023-01-01 ┆ 0   ┆ 12  │
│ 1   ┆ 2023-02-01 ┆ 1   ┆ 13  │
│ 1   ┆ 2023-03-01 ┆ 2   ┆ 14  │
│ 1   ┆ 2023-04-01 ┆ 3   ┆ 15  │
│ 1   ┆ 2023-05-01 ┆ 4   ┆ 16  │
│ 1   ┆ 2023-06-01 ┆ 5   ┆ 17  │
└─────┴────────────┴─────┴─────┘
shape: (6, 4)
┌─────┬────────────┬─────┬─────┐
│ id  ┆ date       ┆ foo ┆ bar │
│ --- ┆ ---        ┆ --- ┆ --- │
│ i64 ┆ date       ┆ i64 ┆ i64 │
╞═════╪════════════╪═════╪═════╡
│ 2   ┆ 2023-07-01 ┆ 6   ┆ 18  │
│ 2   ┆ 2023-08-01 ┆ 7   ┆ 19  │
│ 2   ┆ 2023-09-01 ┆ 8   ┆ 20  │
│ 2   ┆ 2023-10-01 ┆ 9   ┆ 21  │
│ 2   ┆ 2023-11-01 ┆ 10  ┆ 22  │
│ 2   ┆ 2023-12-01 ┆ 11  ┆ 23  │
└─────┴────────────┴─────┴─────┘

# Manipulation
result = ...

# Expected output after correct manipulation
print(result)

shape: (2, 14)
┌─────┬────────────┬───────┬───────┬───┬───────┬───────┬───────┬───────┐
│ id  ┆ date       ┆ foo_0 ┆ foo_1 ┆ … ┆ bar_2 ┆ bar_3 ┆ bar_4 ┆ bar_5 │
│ --- ┆ ---        ┆ ---   ┆ ---   ┆   ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64 ┆ date       ┆ i64   ┆ i64   ┆   ┆ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═════╪════════════╪═══════╪═══════╪═══╪═══════╪═══════╪═══════╪═══════╡
│ 1   ┆ 2023-01-01 ┆ 0     ┆ 1     ┆ … ┆ 14    ┆ 15    ┆ 16    ┆ 17    │
│ 2   ┆ 2023-07-01 ┆ 6     ┆ 7     ┆ … ┆ 20    ┆ 21    ┆ 22    ┆ 23    │
└─────┴────────────┴───────┴───────┴───┴───────┴───────┴───────┴───────┘

Solution

  • It looks like .list.to_struct may be the missing part you need?

    We can use n_field_strategy="max_width" to ensure all results have the same "length".

    fields= can take a callable, in this case you want to add the column name as a prefix.

    You can then unnest the resulting struct columns:

    features = "foo", "bar"
    
    (df.group_by_dynamic("date", group_by="id", every="6mo")
       .agg(features)
       .with_columns(
          pl.col(feature).list.to_struct(
             fields = lambda idx, feature=feature: f"{feature}_{idx}", 
             n_field_strategy = "max_width"
          )
          for feature in features
       )
       .unnest(*features)
    )
    
    shape: (2, 14)
    ┌─────┬────────────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐
    │ id  ┆ date       ┆ foo_0 ┆ foo_1 ┆ foo_2 ┆ foo_3 ┆ foo_4 ┆ foo_5 ┆ bar_0 ┆ bar_1 ┆ bar_2 ┆ bar_3 ┆ bar_4 ┆ bar_5 │
    │ --- ┆ ---        ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
    │ i64 ┆ date       ┆ i64   ┆ i64   ┆ i64   ┆ i64   ┆ i64   ┆ i64   ┆ i64   ┆ i64   ┆ i64   ┆ i64   ┆ i64   ┆ i64   │
    ╞═════╪════════════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╡
    │ 1   ┆ 2023-01-01 ┆ 0     ┆ 1     ┆ 2     ┆ 3     ┆ 4     ┆ 5     ┆ 12    ┆ 13    ┆ 14    ┆ 15    ┆ 16    ┆ 17    │
    │ 2   ┆ 2023-07-01 ┆ 6     ┆ 7     ┆ 8     ┆ 9     ┆ 10    ┆ 11    ┆ 18    ┆ 19    ┆ 20    ┆ 21    ┆ 22    ┆ 23    │
    └─────┴────────────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
    

    Another approach could be to reshape to long form and then back to wide with .unpivot() and .pivot():

    features = "foo", "bar"
    
    (df.group_by_dynamic("date", group_by="id", every="6mo")
       .agg(features)
       .unpivot(index=["id", "date"])
       .explode("value")
       .with_columns(pl.format("{}_{}", pl.col.variable, pl.int_range(pl.len()).over("id", "variable")))
       .pivot(on="variable", index=["id", "date"])
    )