Say I have a pl.DataFrame() with 2 columns: the first column contains Date, the second List[str].
import polars as pl

df = pl.DataFrame([
    pl.Series('Date', [2000, 2001, 2002]),
    pl.Series('Ids', [
        ['a'],
        ['b', 'c'],
        ['d'],
    ])
])
Date | Ids |
---|---|
2000 | ['a'] |
2001 | ['b', 'c'] |
2002 | ['d'] |
Is it possible in Polars to accumulate the List[str] column so that each row contains its own list plus all previous lists? Like so:
Date | Ids |
---|---|
2000 | ['a'] |
2001 | ['a', 'b', 'c'] |
2002 | ['a', 'b', 'c', 'd'] |
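
As a reference for the intended semantics, here is a minimal plain-Python sketch (not Polars) of the same running concatenation. The names `ids` and `cumulative_ids` are illustrative only and simply mirror the example data above.

```python
from itertools import accumulate

# Sample data mirroring the 'Ids' column from the question.
ids = [['a'], ['b', 'c'], ['d']]

# accumulate() with list concatenation yields the running (cumulative) lists.
# `acc + cur` builds a new list at each step, so the inputs are not mutated.
cumulative_ids = list(accumulate(ids, lambda acc, cur: acc + cur))

print(cumulative_ids)
# [['a'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd']]
```

This only pins down the expected result; it says nothing about how to express it efficiently as a Polars expression.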
It is possible to use .rolling() if you set the period to the frame height.
(df.with_row_index()
   .rolling(index_column="index", period=f"{df.height}i")
   .agg(
       pl.col.Date.last(),
       pl.col.Ids.flatten()
   )
)
shape: (3, 3)
┌───────┬──────┬──────────────────────┐
│ index ┆ Date ┆ Ids                  │
│ ---   ┆ ---  ┆ ---                  │
│ u32   ┆ i64  ┆ list[str]            │
╞═══════╪══════╪══════════════════════╡
│ 0     ┆ 2000 ┆ ["a"]                │
│ 1     ┆ 2001 ┆ ["a", "b", "c"]      │
│ 2     ┆ 2002 ┆ ["a", "b", "c", "d"] │
└───────┴──────┴──────────────────────┘
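However, because it requires df.height, which is only known once the frame has been materialized, it cannot be used as written with LazyFrames.
It also seems to struggle with larger datasets, likely because every window has to gather all preceding rows.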
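
One factor that applies regardless of the method: the accumulated column itself grows roughly quadratically with the number of rows, since row k holds every element from rows 0 through k. The sketch below is plain Python and only counts elements; `accumulated_size` is a hypothetical helper, not part of Polars.

```python
def accumulated_size(list_lengths):
    """Total number of elements across all rows of the accumulated column."""
    running = 0  # length of the accumulated list at the current row
    total = 0    # sum of those lengths over all rows so far
    for n in list_lengths:
        running += n
        total += running
    return total

# The example frame has lists of length 1, 2 and 1.
print(accumulated_size([1, 2, 1]))     # 8

# 10,000 rows of single-element lists already produce about 50 million elements.
print(accumulated_size([1] * 10_000))  # 50005000
```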