I am trying to bin values to prepare data to be later fed into a plotting library.
For this I am trying to use polars Expr.cut
. The dataframe I operate on contains different groups of values, each of these groups should be binned using different breaks. Ideally I would like to use np.linspace(BinMin, BinMax, 50)
for the breaks
argument of Expr.cut
.
I managed to make the BinMin
and BinMax
columns in the dataframe. But I can't manage to use np.linspace
to define the breaks dynamically for each row of the dataframe.
This is a minimal example of what I tried:
import numpy as np
import polars as pl
df = pl.DataFrame({"Value": [12], "BinMin": [0], "BinMax": [100]})
At this point the dataframe looks like:
┌───────┬────────┬────────┐
│ Value ┆ BinMin ┆ BinMax │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═══════╪════════╪════════╡
│ 12 ┆ 0 ┆ 100 │
└───────┴────────┴────────┘
And trying to use Expr.cut
with dynamic breaks:
df.with_columns(pl.col("Value").cut(breaks=np.linspace(pl.col("BinMin"), pl.col("BinMax"), 50)).alias("Bin"))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[10], line 1
----> 1 df.with_columns(pl.col("Value").cut(breaks=range(pl.col("BinMin"), pl.col("BinMax"))).alias("Bin"))
TypeError: 'Expr' object cannot be interpreted as an integer
I understand the error, that np.linspace
is expecting to be called with actual scalar integers, not polars Expr
. But I cannot figure out how to call it with dynamic breaks derived from the BinMin
and BinMax
columns.
Unfortunately, pl.Expr.cut
doesn't support expressions for the breaks
argument (yet), but requires a fixed sequence.
(This would be a good feature request though).
A naive solution that will work for DataFrames, but doesn't use polars' native expression API, would be to use pl.Expr.map_elements
together with the corresponding functionality in numpy.
def my_cut(x, num=50):
seq = np.linspace(x["BinMin"], x["BinMax"], num=num)
idx = np.digitize(x["Value"], seq)
return seq[idx-1:idx+1].tolist()
(
df
.with_columns(
pl.struct("Value", "BinMin", "BinMax").map_elements(my_cut).alias("Bin")
)
)
shape: (1, 4)
┌───────┬────────┬────────┬────────────────────────┐
│ Value ┆ BinMin ┆ BinMax ┆ Bin │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ list[f64] │
╞═══════╪════════╪════════╪════════════════════════╡
│ 12 ┆ 0 ┆ 100 ┆ [10.204082, 12.244898] │
└───────┴────────┴────────┴────────────────────────┘