Search code examples
python-polars

Cluster a column


I have a column I want to cluster:

df = pl.DataFrame({"values": [0.1, 0.5, 0.7, -0.2, 0.4, -0.7, 0.05]})
shape: (7, 1)
┌────────┐
│ values │
│ ---    │
│ f64    │
╞════════╡
│ 0.1    │
│ 0.5    │
│ 0.7    │
│ -0.2   │
│ 0.4    │
│ -0.7   │
│ 0.05   │
└────────┘

If a value is less than -0.5 it will be -1, zero if it is lesser than zero and 1 otherwise.

df = df.with_columns(
    pl.when(pl.col("values") <= -0.5)
    .then(-1)
    .otherwise(
        pl.when(pl.col("values") <= 0)
        .then(0.0)
        .otherwise(1)
    )
)
shape: (7, 2)
┌────────┬─────────┐
│ values ┆ literal │
│ ---    ┆ ---     │
│ f64    ┆ f64     │
╞════════╪═════════╡
│ 0.1    ┆ 1.0     │
│ 0.5    ┆ 1.0     │
│ 0.7    ┆ 1.0     │
│ -0.2   ┆ 0.0     │
│ 0.4    ┆ 1.0     │
│ -0.7   ┆ -1.0    │
│ 0.05   ┆ 1.0     │
└────────┴─────────┘

Of course if I have more bins the code will be really a mess. Any advice for the general case?


Solution

  • One great way to accomplish this when there numerous bins is to use a join_asof.

    First, I'll modify your input data (so we can see how to handle the case when break points are <= versus strictly <).

    df = pl.DataFrame({"values": pl.int_range(0, 10, eager=True) / 4 - 1.0})
    df
    
    shape: (10, 1)
    ┌────────┐
    │ values │
    │ ---    │
    │ f64    │
    ╞════════╡
    │ -1.0   │
    │ -0.75  │
    │ -0.5   │
    │ -0.25  │
    │ 0.0    │
    │ 0.25   │
    │ 0.5    │
    │ 0.75   │
    │ 1.0    │
    │ 1.25   │
    └────────┘
    

    Upper value included (<=)

    Now we'll create a DataFrame that contains our upper limit break points, and the clustered values we want. For example:

    cluster_vals = (
        pl.DataFrame({
            'bin_value': [-0.5, 0.0, float('Inf')],
            'clustered_value': [-1.0, 0.0, 1.0],
        })
    )
    cluster_vals
    
    >>> cluster_vals
    shape: (3, 2)
    ┌───────────┬─────────────────┐
    │ bin_value ┆ clustered_value │
    │ ---       ┆ ---             │
    │ f64       ┆ f64             │
    ╞═══════════╪═════════════════╡
    │ -0.5      ┆ -1.0            │
    │ 0.0       ┆ 0.0             │
    │ inf       ┆ 1.0             │
    └───────────┴─────────────────┘
    

    Then we perform our join_asof. The join_asof with a strategy=forward will include the upper break points in the bin.

    Note that both DataFrames in a join_asof must be pre-sorted by the asof column. (I've explicitly sorted both here, even though our clustered_vals DataFrame is already sorted - just as a reminder).

    (
        df
        .sort('values')
        .join_asof(
            other=cluster_vals
                  .sort('bin_value'),
            left_on='values',
            right_on='bin_value',
            strategy='forward',
        )
    )
    
    shape: (10, 3)
    ┌────────┬───────────┬─────────────────┐
    │ values ┆ bin_value ┆ clustered_value │
    │ ---    ┆ ---       ┆ ---             │
    │ f64    ┆ f64       ┆ f64             │
    ╞════════╪═══════════╪═════════════════╡
    │ -1.0   ┆ -0.5      ┆ -1.0            │
    │ -0.75  ┆ -0.5      ┆ -1.0            │
    │ -0.5   ┆ -0.5      ┆ -1.0            │
    │ -0.25  ┆ 0.0       ┆ 0.0             │
    │ 0.0    ┆ 0.0       ┆ 0.0             │
    │ 0.25   ┆ inf       ┆ 1.0             │
    │ 0.5    ┆ inf       ┆ 1.0             │
    │ 0.75   ┆ inf       ┆ 1.0             │
    │ 1.0    ┆ inf       ┆ 1.0             │
    │ 1.25   ┆ inf       ┆ 1.0             │
    └────────┴───────────┴─────────────────┘
    

    Notice how our upper break points of each bin (-0.5 and 0.0) are included in the bin.

    Upper value not included (strictly <)

    If we do not want the upper value to be included in the bin, we can switch to a strategy=backward and change our bins to be the lower endpoint of our bin.

    Our revised clustered values DataFrame would be:

    cluster_vals = (
        pl.DataFrame({
            'bin_value': [float('-Inf'), -0.5, 0.0],
            'clustered_value': [-1.0, 0.0, 1.0],
        })
    )
    cluster_vals
    
    shape: (3, 2)
    ┌───────────┬─────────────────┐
    │ bin_value ┆ clustered_value │
    │ ---       ┆ ---             │
    │ f64       ┆ f64             │
    ╞═══════════╪═════════════════╡
    │ -inf      ┆ -1.0            │
    │ -0.5      ┆ 0.0             │
    │ 0.0       ┆ 1.0             │
    └───────────┴─────────────────┘
    

    And our join_asof would become:

    (
        df
        .sort('values')
        .join_asof(
            other=cluster_vals
                  .sort('bin_value'),
            left_on='values',
            right_on='bin_value',
            strategy='backward',
        )
    )
    
    shape: (10, 3)
    ┌────────┬───────────┬─────────────────┐
    │ values ┆ bin_value ┆ clustered_value │
    │ ---    ┆ ---       ┆ ---             │
    │ f64    ┆ f64       ┆ f64             │
    ╞════════╪═══════════╪═════════════════╡
    │ -1.0   ┆ null      ┆ null            │
    │ -0.75  ┆ null      ┆ null            │
    │ -0.5   ┆ -0.5      ┆ -1.0            │
    │ -0.25  ┆ -0.5      ┆ -1.0            │
    │ 0.0    ┆ 0.0       ┆ 0.0             │
    │ 0.25   ┆ 0.0       ┆ 0.0             │
    │ 0.5    ┆ 0.0       ┆ 0.0             │
    │ 0.75   ┆ 0.0       ┆ 0.0             │
    │ 1.0    ┆ 0.0       ┆ 0.0             │
    │ 1.25   ┆ 0.0       ┆ 0.0             │
    └────────┴───────────┴─────────────────┘
    

    Note how our upper break points of each bin are no longer included in the bin.

    Note on when/then/otherwise

    Just a quick note: when/then statements can be chained so that only one otherwise is needed:

    df.with_columns(
        pl.when(pl.col("values") <= -0.5)
        .then(-1)
        .when(pl.col("values") <= 0)
        .then(0.0)
        .otherwise(1)
        .alias('result')
    )