python, scikit-learn, python-polars

Apply Scaler() on each ID on polars dataframe


I have a dataset with multiple columns and an ID column. The values for each ID can differ in magnitude, and each ID can have a different number of rows. I want to normalize the columns for each ID separately.

import polars as pl
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

df = pl.DataFrame(
{    "ID" : [1,1,2,2,3,3],
    "Values" : [1,2,3,4,5,6]}
)

If I call scaler.fit_transform on the whole dataframe, it fits one scaler to all rows, but I would like to fit and apply the scaler separately for each ID.

I tried this:

(
    df
    .with_columns(
        Value_scaled = scaler.fit_transform(df.select(pl.col("Values"))).over("ID"),
    )
)

But this fails with: AttributeError: 'numpy.ndarray' object has no attribute 'over'

I also tried using group_by():

(
    df
    .group_by(
        pl.col("ID")
    ).agg(
        scaler.fit_transform(pl.col("Values")).alias("Value_scaled")
    )
)

And I get:

TypeError: float() argument must be a string or a real number, not 'Expr'


Solution

  • Following the definition in the scikit-learn documentation, MinMaxScaler (with its default feature_range of (0, 1)) can be implemented easily using polars' native expression API.

    def min_max_scaler(x: str | pl.Expr) -> pl.Expr:
        if isinstance(x, str):
            x = pl.col(x)
        return (x - x.min()) / (x.max() - x.min())
    

    Because the function returns a polars expression, it is compatible with polars' window functions, such as pl.Expr.over, which applies the scaling separately for each ID.

    df.with_columns(min_max_scaler("Values").over("ID"))
    
    shape: (6, 2)
    ┌─────┬────────┐
    │ ID  ┆ Values │
    │ --- ┆ ---    │
    │ i64 ┆ f64    │
    ╞═════╪════════╡
    │ 1   ┆ 0.0    │
    │ 1   ┆ 1.0    │
    │ 2   ┆ 0.0    │
    │ 2   ┆ 1.0    │
    │ 3   ┆ 0.0    │
    │ 3   ┆ 1.0    │
    └─────┴────────┘