python-polars

Is it semantically possible to optimize LazyFrame -> Fill Null -> Cast to Categorical?


Here is a trivial benchmark based on a real-life workload.

import gc
import time
import numpy as np
import polars as pl

df = (  # I have a dataframe like this from reading a csv.
    pl.Series(
        name="x",
        values=np.random.choice(
            ["ASPARAGUS", "BROCCOLI", ""], size=30_000_000
        ),
    )
    .to_frame()
    .with_columns(
        pl.when(pl.col("x") == "").then(None).otherwise(pl.col("x"))
    )
)

start = time.time()
df.lazy().with_columns(
    pl.col("x").cast(pl.Categorical).fill_null("MISSING")
).collect()
end = time.time()
print(f"Cast then fill_null took {end-start:.2f} seconds.")

Cast then fill_null took 0.93 seconds.

gc.collect()
start = time.time()
df.lazy().with_columns(
    pl.col("x").fill_null("MISSING").cast(pl.Categorical)
).collect()
end = time.time()
print(f"Fill_null then cast took {end-start:.2f} seconds.")

Fill_null then cast took 1.36 seconds.

(1) Am I correct to think that casting to categorical then filling null will always be faster?
(2) Am I correct to think that the result will always be identical regardless of the order?
(3) If the answers are "yes" and "yes", is it possible that someday polars will do this rearrangement automatically? Or is it actually impossible to try all these sorts of permutations in a general query optimizer?

Thanks.


Solution

    1: Yes.

    2: Somewhat. The logical categorical representation will always be the same. The physical representation depends on the order in which the string values occur. Doing fill_null before the cast means "MISSING" will be encountered earlier. But this should be seen as an implementation detail.

    3: Yes, this is something we can automatically optimize. Just today we merged something similar: https://github.com/pola-rs/polars/pull/4883