Cluster a column

I have a column I want to cluster:

df = pl.DataFrame({"values": [0.1, 0.5, 0.7, -0.2, 0.4, -0.7, 0.05]})

shape: (7, 1)
┌────────┐
│ values │
│ ---    │
│ f64    │
╞════════╡
│ 0.1    │
│ 0.5    │
│ 0.7    │
│ -0.2   │
│ 0.4    │
│ -0.7   │
│ 0.05   │
└────────┘

If a value is less than -0.5 it will be -1, zero if it is lesser than zero and 1 otherwise.

df = df.with_columns(
    pl.when(pl.col("values") <= -0.5)
    .then(-1)
    .otherwise(
        pl.when(pl.col("values") <= 0)
        .then(0.0)
        .otherwise(1)
    )
)

shape: (7, 2)
┌────────┬─────────┐
│ values ┆ literal │
│ ---    ┆ ---     │
│ f64    ┆ f64     │
╞════════╪═════════╡
│ 0.1    ┆ 1.0     │
│ 0.5    ┆ 1.0     │
│ 0.7    ┆ 1.0     │
│ -0.2   ┆ 0.0     │
│ 0.4    ┆ 1.0     │
│ -0.7   ┆ -1.0    │
│ 0.05   ┆ 1.0     │
└────────┴─────────┘

Of course if I have more bins the code will be really a mess. Any advice for the general case?

Solution

One great way to accomplish this when there numerous bins is to use a join_asof.

First, I'll modify your input data (so we can see how to handle the case when break points are <= versus strictly <).

df = pl.DataFrame({"values": pl.int_range(0, 10, eager=True) / 4 - 1.0})
df

shape: (10, 1)
┌────────┐
│ values │
│ ---    │
│ f64    │
╞════════╡
│ -1.0   │
│ -0.75  │
│ -0.5   │
│ -0.25  │
│ 0.0    │
│ 0.25   │
│ 0.5    │
│ 0.75   │
│ 1.0    │
│ 1.25   │
└────────┘

Upper value included (`<=`)

Now we'll create a DataFrame that contains our upper limit break points, and the clustered values we want. For example:

cluster_vals = (
    pl.DataFrame({
        'bin_value': [-0.5, 0.0, float('Inf')],
        'clustered_value': [-1.0, 0.0, 1.0],
    })
)
cluster_vals

>>> cluster_vals
shape: (3, 2)
┌───────────┬─────────────────┐
│ bin_value ┆ clustered_value │
│ ---       ┆ ---             │
│ f64       ┆ f64             │
╞═══════════╪═════════════════╡
│ -0.5      ┆ -1.0            │
│ 0.0       ┆ 0.0             │
│ inf       ┆ 1.0             │
└───────────┴─────────────────┘

Then we perform our join_asof. The join_asof with a strategy=forward will include the upper break points in the bin.

Note that both DataFrames in a join_asof must be pre-sorted by the asof column. (I've explicitly sorted both here, even though our clustered_vals DataFrame is already sorted - just as a reminder).

(
    df
    .sort('values')
    .join_asof(
        other=cluster_vals
              .sort('bin_value'),
        left_on='values',
        right_on='bin_value',
        strategy='forward',
    )
)

shape: (10, 3)
┌────────┬───────────┬─────────────────┐
│ values ┆ bin_value ┆ clustered_value │
│ ---    ┆ ---       ┆ ---             │
│ f64    ┆ f64       ┆ f64             │
╞════════╪═══════════╪═════════════════╡
│ -1.0   ┆ -0.5      ┆ -1.0            │
│ -0.75  ┆ -0.5      ┆ -1.0            │
│ -0.5   ┆ -0.5      ┆ -1.0            │
│ -0.25  ┆ 0.0       ┆ 0.0             │
│ 0.0    ┆ 0.0       ┆ 0.0             │
│ 0.25   ┆ inf       ┆ 1.0             │
│ 0.5    ┆ inf       ┆ 1.0             │
│ 0.75   ┆ inf       ┆ 1.0             │
│ 1.0    ┆ inf       ┆ 1.0             │
│ 1.25   ┆ inf       ┆ 1.0             │
└────────┴───────────┴─────────────────┘

Notice how our upper break points of each bin (-0.5 and 0.0) are included in the bin.

Upper value not included (strictly `<`)

If we do not want the upper value to be included in the bin, we can switch to a strategy=backward and change our bins to be the lower endpoint of our bin.

Our revised clustered values DataFrame would be:

cluster_vals = (
    pl.DataFrame({
        'bin_value': [float('-Inf'), -0.5, 0.0],
        'clustered_value': [-1.0, 0.0, 1.0],
    })
)
cluster_vals

shape: (3, 2)
┌───────────┬─────────────────┐
│ bin_value ┆ clustered_value │
│ ---       ┆ ---             │
│ f64       ┆ f64             │
╞═══════════╪═════════════════╡
│ -inf      ┆ -1.0            │
│ -0.5      ┆ 0.0             │
│ 0.0       ┆ 1.0             │
└───────────┴─────────────────┘

And our join_asof would become:

(
    df
    .sort('values')
    .join_asof(
        other=cluster_vals
              .sort('bin_value'),
        left_on='values',
        right_on='bin_value',
        strategy='backward',
    )
)

shape: (10, 3)
┌────────┬───────────┬─────────────────┐
│ values ┆ bin_value ┆ clustered_value │
│ ---    ┆ ---       ┆ ---             │
│ f64    ┆ f64       ┆ f64             │
╞════════╪═══════════╪═════════════════╡
│ -1.0   ┆ null      ┆ null            │
│ -0.75  ┆ null      ┆ null            │
│ -0.5   ┆ -0.5      ┆ -1.0            │
│ -0.25  ┆ -0.5      ┆ -1.0            │
│ 0.0    ┆ 0.0       ┆ 0.0             │
│ 0.25   ┆ 0.0       ┆ 0.0             │
│ 0.5    ┆ 0.0       ┆ 0.0             │
│ 0.75   ┆ 0.0       ┆ 0.0             │
│ 1.0    ┆ 0.0       ┆ 0.0             │
│ 1.25   ┆ 0.0       ┆ 0.0             │
└────────┴───────────┴─────────────────┘

Note how our upper break points of each bin are no longer included in the bin.

Note on `when/then/otherwise`

Just a quick note: when/then statements can be chained so that only one otherwise is needed:

df.with_columns(
    pl.when(pl.col("values") <= -0.5)
    .then(-1)
    .when(pl.col("values") <= 0)
    .then(0.0)
    .otherwise(1)
    .alias('result')
)

Cluster a column

Upper value included (<=)

Upper value not included (strictly <)

Note on when/then/otherwise

Upper value included (`<=`)

Upper value not included (strictly `<`)

Note on `when/then/otherwise`