Polars expression when().then().otherwise() is slow


I noticed something in Python Polars: pl.when().then().otherwise() seems to be slow. For instance, take this DataFrame:

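from random import randint

import polars as pl

# Two columns of 30 million random unsigned 64-bit integers
df = pl.DataFrame({
    'A': [randint(1, 10**15) for _ in range(30_000_000)],
    'B': [randint(1, 10**15) for _ in range(30_000_000)],
}, schema={
    'A': pl.UInt64,
    'B': pl.UInt64,
})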

Horizontal min with pl.min_horizontal:

df.with_columns(
    pl.min_horizontal(['A', 'B']).alias('min_column')
)
92.4 ms ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And the same with when().then().otherwise():

df.with_columns(
    pl.when(
        pl.col('A') < pl.col('B')
    ).then(pl.col('A')).otherwise(pl.col('B')).alias('min_column'),
)
458 ms ± 75.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I measured the when() condition separately, and it does not seem to be the bottleneck:

df.with_columns((pl.col('A') < pl.col('B')).alias('column_comparison'))
49.2 ms ± 6.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

If I remove otherwise(), it is even slower:

df.with_columns(
    pl.when(
        pl.col('A') < pl.col('B')
    ).then(pl.col('A')).alias('min_column')
)
664 ms ± 19.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I have also tried other methods for horizontal reduction, such as pl.reduce and pl.fold, and they all seem to be much faster than when().then().

So my questions are:

  1. Is this expected behavior?
  2. Why is pl.when().then() so much slower than other expressions?
  3. In which cases should we avoid when().then().otherwise()?

Solution

  • I got some comments from the Polars developers on Discord:

    I don't see anything out of the ordinary here. Removing otherwise just means .otherwise(pl.lit(None)) is called in the background. It will have to create that column rather than using the existing one. So it will be slower. If you can write your expression as a fold it might be faster, as you have noticed with min_horizontal.

    So my conclusion: when you need to reduce several columns into one, it is better to use fold or reduce where possible, instead of when().then().

    EDIT

    Since Polars 0.20.17, when/then operations are much faster thanks to a refactoring of the if-then-else kernels. Benchmarks: https://github.com/pola-rs/polars/pull/15131. So using when().then().otherwise() where it is needed is no longer an issue.