Do string operations on Categorical
columns cast the entire column to String
first and then perform the operation, or where possible do they operate on the (presumably [much] smaller) Categorical Dictionary directly?
e.g. df.filter(pl.col('my_category').cast(pl.String).str.contains(...))
(also str.starts_with(...)
and friends etc) or df.with_columns(pl.col('my_category').cast(pl.String).str.replace(...).cast(pl.Categorical))
It casts everything.
Here's a little experiment
n=1_000_000
df=pl.DataFrame([
pl.Series('my_categories', ['applezzzzzzzzzzz']*n+
['bananazzzzzzzzzzz']*n+
['carrotzzzzzzzzzzzzz']*n, dtype=pl.Categorical)
]).sample(n, shuffle=True)
## Intuitive approach
%%timeit
df.filter(pl.col('my_categories').cast(pl.String).str.contains('pl'))
## 50.8 ms ± 7.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
## Convoluted approach that takes advantage of Categories
%%timeit
df.filter(
pl.col('my_categories').to_physical().is_in(
df.unique()
.filter(pl.col('my_categories').cast(pl.String).str.contains('pl'))
.select(pl.col('my_categories').to_physical())
.to_series()
)
)
## 28.7 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If I bump n
down to 100k then the simple approach is faster with it taking 5.27ms and the other approach taking 6.18ms.
The length of the strings also impacts the timing.