I have a column I want to cluster:
df = pl.DataFrame({"values": [0.1, 0.5, 0.7, -0.2, 0.4, -0.7, 0.05]})
shape: (7, 1)
┌────────┐
│ values │
│ --- │
│ f64 │
╞════════╡
│ 0.1 │
│ 0.5 │
│ 0.7 │
│ -0.2 │
│ 0.4 │
│ -0.7 │
│ 0.05 │
└────────┘
If a value is less than -0.5 it will be -1, zero if it is lesser than zero and 1 otherwise.
df = df.with_columns(
pl.when(pl.col("values") <= -0.5)
.then(-1)
.otherwise(
pl.when(pl.col("values") <= 0)
.then(0.0)
.otherwise(1)
)
)
shape: (7, 2)
┌────────┬─────────┐
│ values ┆ literal │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞════════╪═════════╡
│ 0.1 ┆ 1.0 │
│ 0.5 ┆ 1.0 │
│ 0.7 ┆ 1.0 │
│ -0.2 ┆ 0.0 │
│ 0.4 ┆ 1.0 │
│ -0.7 ┆ -1.0 │
│ 0.05 ┆ 1.0 │
└────────┴─────────┘
Of course if I have more bins the code will be really a mess. Any advice for the general case?
One great way to accomplish this when there numerous bins is to use a join_asof
.
First, I'll modify your input data (so we can see how to handle the case when break points are <=
versus strictly <
).
df = pl.DataFrame({"values": pl.int_range(0, 10, eager=True) / 4 - 1.0})
df
shape: (10, 1)
┌────────┐
│ values │
│ --- │
│ f64 │
╞════════╡
│ -1.0 │
│ -0.75 │
│ -0.5 │
│ -0.25 │
│ 0.0 │
│ 0.25 │
│ 0.5 │
│ 0.75 │
│ 1.0 │
│ 1.25 │
└────────┘
<=
)Now we'll create a DataFrame that contains our upper limit break points, and the clustered values we want. For example:
cluster_vals = (
pl.DataFrame({
'bin_value': [-0.5, 0.0, float('Inf')],
'clustered_value': [-1.0, 0.0, 1.0],
})
)
cluster_vals
>>> cluster_vals
shape: (3, 2)
┌───────────┬─────────────────┐
│ bin_value ┆ clustered_value │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═══════════╪═════════════════╡
│ -0.5 ┆ -1.0 │
│ 0.0 ┆ 0.0 │
│ inf ┆ 1.0 │
└───────────┴─────────────────┘
Then we perform our join_asof
. The join_asof
with a strategy=forward
will include the upper break points in the bin.
Note that both DataFrames in a join_asof
must be pre-sorted by the asof column. (I've explicitly sorted both here, even though our clustered_vals
DataFrame is already sorted - just as a reminder).
(
df
.sort('values')
.join_asof(
other=cluster_vals
.sort('bin_value'),
left_on='values',
right_on='bin_value',
strategy='forward',
)
)
shape: (10, 3)
┌────────┬───────────┬─────────────────┐
│ values ┆ bin_value ┆ clustered_value │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞════════╪═══════════╪═════════════════╡
│ -1.0 ┆ -0.5 ┆ -1.0 │
│ -0.75 ┆ -0.5 ┆ -1.0 │
│ -0.5 ┆ -0.5 ┆ -1.0 │
│ -0.25 ┆ 0.0 ┆ 0.0 │
│ 0.0 ┆ 0.0 ┆ 0.0 │
│ 0.25 ┆ inf ┆ 1.0 │
│ 0.5 ┆ inf ┆ 1.0 │
│ 0.75 ┆ inf ┆ 1.0 │
│ 1.0 ┆ inf ┆ 1.0 │
│ 1.25 ┆ inf ┆ 1.0 │
└────────┴───────────┴─────────────────┘
Notice how our upper break points of each bin (-0.5 and 0.0) are included in the bin.
<
)If we do not want the upper value to be included in the bin, we can switch to a strategy=backward
and change our bins to be the lower endpoint of our bin.
Our revised clustered values DataFrame would be:
cluster_vals = (
pl.DataFrame({
'bin_value': [float('-Inf'), -0.5, 0.0],
'clustered_value': [-1.0, 0.0, 1.0],
})
)
cluster_vals
shape: (3, 2)
┌───────────┬─────────────────┐
│ bin_value ┆ clustered_value │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═══════════╪═════════════════╡
│ -inf ┆ -1.0 │
│ -0.5 ┆ 0.0 │
│ 0.0 ┆ 1.0 │
└───────────┴─────────────────┘
And our join_asof
would become:
(
df
.sort('values')
.join_asof(
other=cluster_vals
.sort('bin_value'),
left_on='values',
right_on='bin_value',
strategy='backward',
)
)
shape: (10, 3)
┌────────┬───────────┬─────────────────┐
│ values ┆ bin_value ┆ clustered_value │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞════════╪═══════════╪═════════════════╡
│ -1.0 ┆ null ┆ null │
│ -0.75 ┆ null ┆ null │
│ -0.5 ┆ -0.5 ┆ -1.0 │
│ -0.25 ┆ -0.5 ┆ -1.0 │
│ 0.0 ┆ 0.0 ┆ 0.0 │
│ 0.25 ┆ 0.0 ┆ 0.0 │
│ 0.5 ┆ 0.0 ┆ 0.0 │
│ 0.75 ┆ 0.0 ┆ 0.0 │
│ 1.0 ┆ 0.0 ┆ 0.0 │
│ 1.25 ┆ 0.0 ┆ 0.0 │
└────────┴───────────┴─────────────────┘
Note how our upper break points of each bin are no longer included in the bin.
when/then/otherwise
Just a quick note: when/then statements can be chained so that only one otherwise is needed:
df.with_columns(
pl.when(pl.col("values") <= -0.5)
.then(-1)
.when(pl.col("values") <= 0)
.then(0.0)
.otherwise(1)
.alias('result')
)