Polars Query Optimisation re: String Operations on Categorical Columns

Do string operations on Categorical columns cast the entire column to String first and then perform the operation, or where possible do they operate on the (presumably [much] smaller) Categorical Dictionary directly?

e.g. df.filter(pl.col('my_category').cast(pl.String).str.contains(...)) (also str.starts_with(...) and friends etc) or df.with_columns(pl.col('my_category').cast(pl.String).str.replace(...).cast(pl.Categorical))

Solution

It casts everything.

Here's a little experiment

n=1_000_000
df=pl.DataFrame([
    pl.Series('my_categories', ['applezzzzzzzzzzz']*n+
              ['bananazzzzzzzzzzz']*n+
              ['carrotzzzzzzzzzzzzz']*n, dtype=pl.Categorical)
]).sample(n, shuffle=True)


## Intuitive approach
%%timeit
df.filter(pl.col('my_categories').cast(pl.String).str.contains('pl'))
## 50.8 ms ± 7.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

## Convoluted approach that takes advantage of Categories
%%timeit
df.filter(
    pl.col('my_categories').to_physical().is_in(
        df.unique()
        .filter(pl.col('my_categories').cast(pl.String).str.contains('pl'))
        .select(pl.col('my_categories').to_physical())
        .to_series()
        )
    )
## 28.7 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

If I bump n down to 100k then the simple approach is faster with it taking 5.27ms and the other approach taking 6.18ms.

The length of the strings also impacts the timing.