
Accumulating lists in Polars


Say I have a pl.DataFrame with two columns: the first, Date, holds a year per row, and the second, Ids, has type List[str].

import polars as pl

df = pl.DataFrame([
    pl.Series('Date', [2000, 2001, 2002]),
    pl.Series('Ids', [
        ['a'], 
        ['b', 'c'], 
        ['d'], 
    ])
])
Date Ids
2000 ['a']
2001 ['b', 'c']
2002 ['d']

Is it possible in Polars to accumulate the List[str] column so that each row contains the elements of all previous lists followed by its own? Like so:

Date Ids
2000 ['a']
2001 ['a', 'b', 'c']
2002 ['a', 'b', 'c', 'd']
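
To pin down the intended semantics, the target is a running concatenation of the lists. A minimal plain-Python sketch (no Polars involved; the variable names `ids` and `expected` are illustrative only) that can serve as a reference when checking a Polars result:

```python
from itertools import accumulate

# The Ids column from the example, as plain Python lists.
ids = [["a"], ["b", "c"], ["d"]]

# Running concatenation: each entry is all previous lists plus the current one.
expected = list(accumulate(ids, lambda acc, cur: acc + cur))

print(expected)
```

Comparing `df["Ids"].to_list()` from any candidate solution against `expected` is a quick way to confirm that it accumulates in the right order.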

Solution

  • It is possible to use .rolling() over a row index if you set the period to the frame height, so that every window reaches back to the first row.

    (df.with_row_index()
       .rolling(index_column="index", period=f"{df.height}i")
       .agg(
          pl.col.Date.last(),
          pl.col.Ids.flatten()
       )
    )
    
    shape: (3, 3)
    ┌───────┬──────┬──────────────────────┐
    │ index ┆ Date ┆ Ids                  │
    │ ---   ┆ ---  ┆ ---                  │
    │ u32   ┆ i64  ┆ list[str]            │
    ╞═══════╪══════╪══════════════════════╡
    │ 0     ┆ 2000 ┆ ["a"]                │
    │ 1     ┆ 2001 ┆ ["a", "b", "c"]      │
    │ 2     ┆ 2002 ┆ ["a", "b", "c", "d"] │
    └───────┴──────┴──────────────────────┘
    

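    However, because the period is built from df.height, this does not carry over to LazyFrames, whose height is not known until the query is collected.

    It also seems to struggle on larger data: every window reaches back to the first row, so the combined length of the output lists grows roughly quadratically with the number of rows.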