A standard way to achieve this in R is with cut(), and in pandas with pd.cut(). These are straightforward ways to convert numerical data into categories that can be defined on the fly (the boundaries are usually referred to as breaks). Rather than reinvent the wheel, I'll refer you to the pandas and R documentation for examples, since they are better than anything I could improvise here.
Update: Polars now has cut and qcut functions.
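For anyone on a recent Polars version, here is a minimal sketch of the built-in API (argument names as of recent releases; check the docs for your version):
import polars as pl

s = pl.Series("var1", [-1.5, -0.5, 0.5, 1.5])

# Bin by explicit break points; returns a Categorical Series
print(s.cut(breaks=[-1.0, 0.0, 1.0]))

# Bin by quantiles instead of fixed break points
print(s.qcut(quantiles=[0.25, 0.5, 0.75]))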
Original answer:
Polars does not have a cut function, per se. However, one simple (and very performant) way to cut data is using join_asof.
Let's start with this data.
import polars as pl
import numpy as np

sample_size = 10
df = pl.DataFrame(
    {
        "var1": np.random.default_rng(seed=0).normal(0, 1, sample_size),
    }
)
df
shape: (10, 1)
┌───────────┐
│ var1 │
│ --- │
│ f64 │
╞═══════════╡
│ 0.12573 │
│ -0.132105 │
│ 0.640423 │
│ 0.1049 │
│ -0.535669 │
│ 0.361595 │
│ 1.304 │
│ 0.947081 │
│ -0.703735 │
│ -1.265421 │
└───────────┘
First, we'll create a DataFrame containing our break points, along with our categorical variable. Let's say we want break points of -1, 0, and 1. We'll supply these in the constructor.
I'll also use with_row_index
to automatically generate our categorical values. (If you like, you can assign something else for a categorical variable in the constructor.)
Note that the data type of your break points must match the data type of the variable that you are cutting. Hence, in this simple example, I'm writing the first break point as -1.0, so that Polars automatically creates break_pt as Float64 rather than an integer.
break_df = pl.DataFrame(
    {
        "break_pt": [-1.0, 0, 1],
    }
).with_row_index("binned")
break_df
shape: (3, 2)
┌────────┬──────────┐
│ binned ┆ break_pt │
│ --- ┆ --- │
│ u32 ┆ f64 │
╞════════╪══════════╡
│ 0 ┆ -1.0 │
│ 1 ┆ 0.0 │
│ 2 ┆ 1.0 │
└────────┴──────────┘
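As an aside, if your break points arrive as integers, you can make the dtype explicit rather than relying on the -1.0 literal. A minimal sketch of the same construction with an explicit cast:
# Build the break-point column with an explicit Float64 dtype
break_df = pl.DataFrame(
    {
        "break_pt": pl.Series([-1, 0, 1], dtype=pl.Float64),
    }
).with_row_index("binned")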
join_asof
Now we can perform the join_asof. Note that both datasets must be sorted by the as-of keys, so we'll need to sort our DataFrame of random numbers by our continuous variable (var1) before the join. (break_df is already sorted.)
(
    df
    .sort("var1")
    .join_asof(
        break_df,
        left_on="var1",
        right_on="break_pt",
        strategy="forward",
    )
)
shape: (10, 3)
┌───────────┬────────┬──────────┐
│ var1 ┆ binned ┆ break_pt │
│ --- ┆ --- ┆ --- │
│ f64 ┆ u32 ┆ f64 │
╞═══════════╪════════╪══════════╡
│ -1.265421 ┆ 0 ┆ -1.0 │
│ -0.703735 ┆ 1 ┆ 0.0 │
│ -0.535669 ┆ 1 ┆ 0.0 │
│ -0.132105 ┆ 1 ┆ 0.0 │
│ 0.1049 ┆ 2 ┆ 1.0 │
│ 0.12573 ┆ 2 ┆ 1.0 │
│ 0.361595 ┆ 2 ┆ 1.0 │
│ 0.640423 ┆ 2 ┆ 1.0 │
│ 0.947081 ┆ 2 ┆ 1.0 │
│ 1.304 ┆ null ┆ null │
└───────────┴────────┴──────────┘
This leaves the values above the last break point (the last bin) with a binned value of null. To fill these null values with a proper binned value, we can use fill_null.
(
    df
    .sort("var1")
    .join_asof(
        break_df,
        left_on="var1",
        right_on="break_pt",
        strategy="forward",
    )
    .with_columns(pl.col("binned").fill_null(pl.col("binned").max() + 1))
)
shape: (10, 3)
┌───────────┬────────┬──────────┐
│ var1 ┆ binned ┆ break_pt │
│ --- ┆ --- ┆ --- │
│ f64 ┆ u32 ┆ f64 │
╞═══════════╪════════╪══════════╡
│ -1.265421 ┆ 0 ┆ -1.0 │
│ -0.703735 ┆ 1 ┆ 0.0 │
│ -0.535669 ┆ 1 ┆ 0.0 │
│ -0.132105 ┆ 1 ┆ 0.0 │
│ 0.1049 ┆ 2 ┆ 1.0 │
│ 0.12573 ┆ 2 ┆ 1.0 │
│ 0.361595 ┆ 2 ┆ 1.0 │
│ 0.640423 ┆ 2 ┆ 1.0 │
│ 0.947081 ┆ 2 ┆ 1.0 │
│ 1.304 ┆ 3 ┆ null │
└───────────┴────────┴──────────┘
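One side effect of the required sort is that the result comes back ordered by var1, not in the original row order. If the original order matters, one way to restore it is to tag each row with an index before sorting. A sketch (row_nr is just a name I've chosen here):
(
    df
    .with_row_index("row_nr")  # remember the original row order
    .sort("var1")
    .join_asof(
        break_df,
        left_on="var1",
        right_on="break_pt",
        strategy="forward",
    )
    .with_columns(pl.col("binned").fill_null(pl.col("binned").max() + 1))
    .sort("row_nr")  # restore the original row order
    .drop("row_nr")
)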
So how well does this perform? Let's increase our random sample to 100 million values, and expand our break list to ~6,000 break points.
sample_size = 100_000_000
df = pl.DataFrame(
    {
        "var1": np.random.default_rng(seed=0).normal(0, 1, sample_size),
    }
)
df
shape: (100000000, 1)
┌───────────┐
│ var1 │
│ --- │
│ f64 │
╞═══════════╡
│ 0.1257 │
│ -0.132105 │
│ 0.640423 │
│ 0.1049 │
│ … │
│ -0.714924 │
│ 0.269947 │
│ -2.3158 │
│ -0.383743 │
└───────────┘
break_list = [next_val / 1000 for next_val in range(-3000, 3001)]
break_df = pl.DataFrame(
    {
        "break_pt": break_list,
    }
).with_row_index("binned")
break_df
shape: (6001, 2)
┌────────┬──────────┐
│ binned ┆ break_pt │
│ --- ┆ --- │
│ u32 ┆ f64 │
╞════════╪══════════╡
│ 0 ┆ -3.0 │
│ 1 ┆ -2.999 │
│ 2 ┆ -2.998 │
│ 3 ┆ -2.997 │
│ … ┆ … │
│ 5997 ┆ 2.997 │
│ 5998 ┆ 2.998 │
│ 5999 ┆ 2.999 │
│ 6000 ┆ 3.0 │
└────────┴──────────┘
And now timing the algorithm itself...
import time

start = time.perf_counter()
(
    df
    .sort("var1")
    .join_asof(
        break_df,
        left_on="var1",
        right_on="break_pt",
        strategy="forward",
    )
    .with_columns(pl.col("binned").fill_null(pl.col("binned").max() + 1))
)
print(time.perf_counter() - start)
shape: (100000000, 3)
┌───────────┬────────┬──────────┐
│ var1 ┆ binned ┆ break_pt │
│ --- ┆ --- ┆ --- │
│ f64 ┆ i64 ┆ f64 │
╞═══════════╪════════╪══════════╡
│ -5.666706 ┆ 0 ┆ -3.0 │
│ -5.6048 ┆ 0 ┆ -3.0 │
│ -5.428571 ┆ 0 ┆ -3.0 │
│ -5.350106 ┆ 0 ┆ -3.0 │
│ … ┆ … ┆ … │
│ 5.327897 ┆ 6001 ┆ null │
│ 5.344677 ┆ 6001 ┆ null │
│ 5.386379 ┆ 6001 ┆ null │
│ 5.4829 ┆ 6001 ┆ null │
└───────────┴────────┴──────────┘
3.107058142999449
Just over 3 seconds.
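If the cut is one step in a larger pipeline, the same join also works through the lazy API, which lets Polars optimize the query as a whole. A sketch (same logic as above, untimed):
result = (
    df.lazy()
    .sort("var1")
    .join_asof(
        break_df.lazy(),
        left_on="var1",
        right_on="break_pt",
        strategy="forward",
    )
    .with_columns(pl.col("binned").fill_null(pl.col("binned").max() + 1))
    .collect()
)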
Some improvements that we can make: wrap the logic in a function, extend the break points with inf (so that no fill_null step is needed), and generate interval labels automatically.
from typing import List, Optional

def cut_dataframe(
    _df: pl.DataFrame,
    var_nm: str,
    bins: List[float],
    labels: Optional[List[str]] = None,
) -> pl.DataFrame:
    # Build the break points, appending inf so that values above the
    # last break point land in a final bin (no fill_null needed).
    cuts_df = pl.DataFrame(
        [
            pl.Series(
                name="break_pt",
                values=bins,
                dtype=pl.Float64,
            ).extend_constant(np.inf, 1)
        ]
    )
    if labels:
        cuts_df = cuts_df.with_columns(
            pl.Series(
                name="category",
                values=labels,
            )
        )
    else:
        # Default labels: half-open interval notation, e.g. "(-1.0, 1.0]"
        cuts_df = cuts_df.with_columns(
            pl.format(
                "({}, {}]",
                pl.col("break_pt").shift(1, fill_value=-np.inf),
                pl.col("break_pt"),
            ).alias("category")
        )
    cuts_df = cuts_df.with_columns(pl.col("category").cast(pl.Categorical))
    result = _df.sort(var_nm).join_asof(
        cuts_df,
        left_on=var_nm,
        right_on="break_pt",
        strategy="forward",
    )
    return result
We can now cut our DataFrame with a single call. (For the examples below, we're back to the original 10-row DataFrame.)
Below, we'll call the function without specifying any labels, allowing the function to create default labels. Notice also that we don't have to worry about our list of break points being floats; the function automatically casts the values to pl.Float64.
cut_dataframe(df, "var1", [-1, 1])
shape: (10, 3)
┌───────────┬──────────┬──────────────┐
│ var1 ┆ break_pt ┆ category │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ cat │
╞═══════════╪══════════╪══════════════╡
│ -1.265421 ┆ -1.0 ┆ (-inf, -1.0] │
│ -0.703735 ┆ 1.0 ┆ (-1.0, 1.0] │
│ -0.535669 ┆ 1.0 ┆ (-1.0, 1.0] │
│ -0.132105 ┆ 1.0 ┆ (-1.0, 1.0] │
│ 0.1049 ┆ 1.0 ┆ (-1.0, 1.0] │
│ 0.1257 ┆ 1.0 ┆ (-1.0, 1.0] │
│ 0.361595 ┆ 1.0 ┆ (-1.0, 1.0] │
│ 0.640423 ┆ 1.0 ┆ (-1.0, 1.0] │
│ 0.947081 ┆ 1.0 ┆ (-1.0, 1.0] │
│ 1.304 ┆ inf ┆ (1.0, inf] │
└───────────┴──────────┴──────────────┘
And here, we'll pass in a list of labels.
cut_dataframe(df, "var1", [-1, 1], ["low", "med", "hi"])
shape: (10, 3)
┌───────────┬──────────┬──────────┐
│ var1 ┆ break_pt ┆ category │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ cat │
╞═══════════╪══════════╪══════════╡
│ -1.265421 ┆ -1.0 ┆ low │
│ -0.703735 ┆ 1.0 ┆ med │
│ -0.535669 ┆ 1.0 ┆ med │
│ -0.132105 ┆ 1.0 ┆ med │
│ 0.1049 ┆ 1.0 ┆ med │
│ 0.1257 ┆ 1.0 ┆ med │
│ 0.361595 ┆ 1.0 ┆ med │
│ 0.640423 ┆ 1.0 ┆ med │
│ 0.947081 ┆ 1.0 ┆ med │
│ 1.304 ┆ inf ┆ hi │
└───────────┴──────────┴──────────┘
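One caveat: because cut_dataframe appends inf as a final break point, the labels list needs exactly len(bins) + 1 entries, one per interval. A quick sanity check (my addition, not part of the function above):
bins = [-1, 1]
labels = ["low", "med", "hi"]

# Two break points plus the appended inf give three intervals:
# (-inf, -1], (-1, 1], (1, inf] -- hence three labels
assert len(labels) == len(bins) + 1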
Hopefully, the above is helpful.