python numpy histogram python-polars

Using dynamic cut() breaks for each row of a dataframe


I am trying to bin values to prepare data to be later fed into a plotting library.

For this I am trying to use polars Expr.cut. The dataframe I operate on contains different groups of values, and each group should be binned using different breaks. Ideally I would like to use np.linspace(BinMin, BinMax, 50) for the breaks argument of Expr.cut.

I managed to make the BinMin and BinMax columns in the dataframe. But I can't manage to use np.linspace to define the breaks dynamically for each row of the dataframe.

This is a minimal example of what I tried:

import numpy as np
import polars as pl

df = pl.DataFrame({"Value": [12], "BinMin": [0], "BinMax": [100]})

At this point the dataframe looks like:

┌───────┬────────┬────────┐
│ Value ┆ BinMin ┆ BinMax │
│ ---   ┆ ---    ┆ ---    │
│ i64   ┆ i64    ┆ i64    │
╞═══════╪════════╪════════╡
│ 12    ┆ 0      ┆ 100    │
└───────┴────────┴────────┘

And trying to use Expr.cut with dynamic breaks:

df.with_columns(pl.col("Value").cut(breaks=np.linspace(pl.col("BinMin"), pl.col("BinMax"), 50)).alias("Bin"))


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[10], line 1
----> 1 df.with_columns(pl.col("Value").cut(breaks=range(pl.col("BinMin"), pl.col("BinMax"))).alias("Bin"))

TypeError: 'Expr' object cannot be interpreted as an integer

I understand the error: np.linspace (like range, which appears in the traceback above) expects actual scalar numbers, not polars Expr objects. But I cannot figure out how to call it with dynamic breaks derived from the BinMin and BinMax columns.


Solution

  • Unfortunately, pl.Expr.cut doesn't support expressions for the breaks argument (yet), but requires a fixed sequence.

    (This would be a good feature request though).

    A naive solution that will work for DataFrames, but doesn't use polars' native expression API, would be to use pl.Expr.map_elements together with the corresponding functionality in numpy.

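    def my_cut(x, num=50):
        # x is one row as a dict with keys "Value", "BinMin", "BinMax".
        # Build this row's breaks from its own BinMin/BinMax.
        seq = np.linspace(x["BinMin"], x["BinMax"], num=num)
        # digitize returns idx such that seq[idx-1] <= Value < seq[idx].
        idx = np.digitize(x["Value"], seq)
        # Return the [lower, upper] edges of the bin containing Value.
        # Note: values outside [BinMin, BinMax) yield fewer than two edges.
        return seq[idx-1:idx+1].tolist()
    
    (
        df
        .with_columns(
            # Pack the columns into a struct so each row reaches my_cut as a dict.
            pl.struct("Value", "BinMin", "BinMax")
            .map_elements(my_cut, return_dtype=pl.List(pl.Float64))
            .alias("Bin")
        )
    )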
    
    shape: (1, 4)
    ┌───────┬────────┬────────┬────────────────────────┐
    │ Value ┆ BinMin ┆ BinMax ┆ Bin                    │
    │ ---   ┆ ---    ┆ ---    ┆ ---                    │
    │ i64   ┆ i64    ┆ i64    ┆ list[f64]              │
    ╞═══════╪════════╪════════╪════════════════════════╡
    │ 12    ┆ 0      ┆ 100    ┆ [10.204082, 12.244898] │
    └───────┴────────┴────────┴────────────────────────┘