Search code examples
python-polars

Polars Query Optimisation re: String Operations on Categorical Columns


Do string operations on Categorical columns cast the entire column to String first and then perform the operation, or where possible do they operate on the (presumably [much] smaller) Categorical Dictionary directly?

e.g. df.filter(pl.col('my_category').cast(pl.String).str.contains(...)) (also str.starts_with(...) and friends etc) or df.with_columns(pl.col('my_category').cast(pl.String).str.replace(...).cast(pl.Categorical))


Solution

  • It casts everything.

    Here's a little experiment

    n=1_000_000
    df=pl.DataFrame([
        pl.Series('my_categories', ['applezzzzzzzzzzz']*n+
                  ['bananazzzzzzzzzzz']*n+
                  ['carrotzzzzzzzzzzzzz']*n, dtype=pl.Categorical)
    ]).sample(n, shuffle=True)
    
    
    ## Intuitive approach
    %%timeit
    df.filter(pl.col('my_categories').cast(pl.String).str.contains('pl'))
    ## 50.8 ms ± 7.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    ## Convoluted approach that takes advantage of Categories
    %%timeit
    df.filter(
        pl.col('my_categories').to_physical().is_in(
            df.unique()
            .filter(pl.col('my_categories').cast(pl.String).str.contains('pl'))
            .select(pl.col('my_categories').to_physical())
            .to_series()
            )
        )
    ## 28.7 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    If I bump n down to 100k then the simple approach is faster with it taking 5.27ms and the other approach taking 6.18ms.

    The length of the strings also impacts the timing.