I have a CSV file with a size of 28 GB, which I want to plot. Those are way too many data points, obviously, so how can I reduce the data? I would like to merge about 1000 data points into one by calculating the mean. This is the structure of my DataFrame:
df = pl.from_repr("""
┌─────────────────┬────────────┐
│ Time in seconds ┆ Force in N │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════════════════╪════════════╡
│ 0.0 ┆ 2310.18 │
│ 0.0005 ┆ 2313.23 │
│ 0.001 ┆ 2314.14 │
└─────────────────┴────────────┘
""")
I thought about using group_by_dynamic and then calculating the mean of each group, but this only seems to work when using datetimes? The time in seconds is given as a float, however.
You can also group by an integer column to create groups of size N. In the case of a group_by_dynamic on an integer column, the windows are defined by:

"1i"   # length 1
"10i"  # length 10
We can add a row index and cast it to pl.Int64 to use it. The output below uses a small example DataFrame with a string "force" column:

import polars as pl

df = pl.DataFrame({"force": ["A", "B", "C", "D", "E", "F", "G"]})

(df.with_row_index()   # adds a UInt32 "index" column: 0, 1, 2, ...
   .group_by_dynamic(pl.col.index.cast(pl.Int64), every="2i")
   .agg("force")       # collects each window's values into a list
)
shape: (4, 2)
┌───────┬────────────┐
│ index ┆ force │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞═══════╪════════════╡
│ 0 ┆ ["A", "B"] │
│ 2 ┆ ["C", "D"] │
│ 4 ┆ ["E", "F"] │
│ 6 ┆ ["G"] │
└───────┴────────────┘
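
Applied to your data, the same pattern with every="1000i" and a mean aggregation reduces every 1000 rows to a single point. A minimal sketch, assuming a recent Polars version and a hypothetical file name "data.csv" (scan_csv keeps the 28 GB file out of memory until collect()):

import polars as pl

downsampled = (
    pl.scan_csv("data.csv")                  # hypothetical path to the 28 GB CSV
      .with_row_index()                      # adds a UInt32 "index" column: 0, 1, 2, ...
      .group_by_dynamic(pl.col.index.cast(pl.Int64), every="1000i")
      .agg(
          pl.col("Time in seconds").mean(),  # mean time of each 1000-row window
          pl.col("Force in N").mean(),       # mean force of each 1000-row window
      )
      .collect()
)

The resulting downsampled frame has roughly 1/1000 of the original rows and can be plotted directly, e.g. with matplotlib.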