Search code examples
python-polarsexpand

Expand/Explode Polars rows from column which is a list


I have the following code

import polars as pl
import datetime as dt
from dateutil.relativedelta import relativedelta

def get_3_month_splits(product: str) -> list[str]:
    front, start_dt, total_m = product.rsplit('.', 2)
    start_dt = dt.datetime.strptime(start_dt, '%Y%m')
    total_m  = int(total_m)
    return [f'{front}.{(start_dt+relativedelta(months=m)).strftime("%Y%m")}.3' for m in range(0, total_m, 3)]

df = pl.DataFrame({
    'product': ['CHECK.GB.202403.12', 'CHECK.DE.202506.6', 'CASH.US.202509.12'],
    'qty':     [3, 6, -3],
    'price':   [100, 102, 95],
})
print(df)

df2 = pl.DataFrame([{'product_split': split} | d
    for d in df.iter_rows(named=True) for split in get_3_month_splits(d['product'])
])
print(df2)

Essentially, I want to expand each row by splitting the product into a more detailed product_split columns, while keeping all other fields (like qty, price) the same.

The above accomplishes this, but is there a more native way to achieve this via with_columns?


Solution

  • It looks like you want to create date ranges:

    splits = pl.col("product").str.splitn(".", 4)
    start_dt = splits.struct[-2].str.to_date("%Y%m")
    total_m  = splits.struct[-1]
    
    (
        df.with_columns(
            pl.date_ranges(
                start_dt, 
                start_dt.dt.offset_by(pl.format("{}mo", total_m.first())), 
                interval = "3mo", 
                closed="left"
            )
            .over(total_m) 
            .alias("date_ranges")
        )
        .explode("date_ranges")
        .with_columns(
            pl.format("{}.{}.{}.{}", 
                splits.struct[0], 
                splits.struct[1], 
                pl.col("date_ranges").dt.to_string("%Y%m"), 
                splits.struct[-1]
            )
            .alias("product_split")
        )
    )
    
    shape: (10, 5)
    ┌────────────────────┬─────┬───────┬─────────────┬────────────────────┐
    │ product            ┆ qty ┆ price ┆ date_ranges ┆ product_split      │
    │ ---                ┆ --- ┆ ---   ┆ ---         ┆ ---                │
    │ str                ┆ i64 ┆ i64   ┆ date        ┆ str                │
    ╞════════════════════╪═════╪═══════╪═════════════╪════════════════════╡
    │ CHECK.GB.202403.12 ┆ 3   ┆ 100   ┆ 2024-03-01  ┆ CHECK.GB.202403.12 │
    │ CHECK.GB.202403.12 ┆ 3   ┆ 100   ┆ 2024-06-01  ┆ CHECK.GB.202406.12 │
    │ CHECK.GB.202403.12 ┆ 3   ┆ 100   ┆ 2024-09-01  ┆ CHECK.GB.202409.12 │
    │ CHECK.GB.202403.12 ┆ 3   ┆ 100   ┆ 2024-12-01  ┆ CHECK.GB.202412.12 │
    │ CHECK.DE.202506.6  ┆ 6   ┆ 102   ┆ 2025-06-01  ┆ CHECK.DE.202506.6  │
    │ CHECK.DE.202506.6  ┆ 6   ┆ 102   ┆ 2025-09-01  ┆ CHECK.DE.202509.6  │
    │ CASH.US.202509.12  ┆ -3  ┆ 95    ┆ 2025-09-01  ┆ CASH.US.202509.12  │
    │ CASH.US.202509.12  ┆ -3  ┆ 95    ┆ 2025-12-01  ┆ CASH.US.202512.12  │
    │ CASH.US.202509.12  ┆ -3  ┆ 95    ┆ 2026-03-01  ┆ CASH.US.202603.12  │
    │ CASH.US.202509.12  ┆ -3  ┆ 95    ┆ 2026-06-01  ┆ CASH.US.202606.12  │
    └────────────────────┴─────┴───────┴─────────────┴────────────────────┘
    

    The string processing can be done in several ways, it seems your data has a known number of fields, so we can use .str.splitn()

    Finally, we .explode() the list of dates produced by .date_ranges()