
How can I reduce the amount of data in a polars DataFrame?


I have a csv file with a size of 28 GB, which I want to plot. That is obviously far too many data points, so how can I reduce the data? I would like to merge about 1000 data points into one by calculating the mean. This is the structure of my DataFrame:

df = pl.from_repr("""
┌─────────────────┬────────────┐
│ Time in seconds ┆ Force in N │
│ ---             ┆ ---        │
│ f64             ┆ f64        │
╞═════════════════╪════════════╡
│ 0.0             ┆ 2310.18    │
│ 0.0005          ┆ 2313.23    │
│ 0.001           ┆ 2314.14    │
└─────────────────┴────────────┘
""")

I thought about using group_by_dynamic and then calculating the mean of each group, but that only seems to work with datetimes. My time column, however, is a float of seconds.


Solution

  • You can group by an integer column to create groups of size N:

    In case of a group_by_dynamic on an integer column, the windows are defined by:

    “1i” # length 1

    “10i” # length 10

    We can add a row index with with_row_index() and cast it to pl.Int64 (the row index is UInt32 by default, which group_by_dynamic does not accept) to use it as the index column.

    (df.with_row_index()
       .group_by_dynamic(pl.col("index").cast(pl.Int64), every="2i")
       .agg(pl.col("Force in N").mean())
    )
    
    shape: (2, 2)
    ┌───────┬────────────┐
    │ index ┆ Force in N │
    │ ---   ┆ ---        │
    │ i64   ┆ f64        │
    ╞═══════╪════════════╡
    │ 0     ┆ 2311.705   │
    │ 2     ┆ 2314.14    │
    └───────┴────────────┘