I noticed something in Python Polars. I'm not sure, but it seems that pl.when().then().otherwise() is slow somewhere. For instance, for this DataFrame:
from random import randint

import polars as pl

df = pl.DataFrame({
    'A': [randint(1, 10**15) for _ in range(30_000_000)],
    'B': [randint(1, 10**15) for _ in range(30_000_000)],
}, schema={
    'A': pl.UInt64,
    'B': pl.UInt64,
})
Horizontal min with pl.min_horizontal:
df.with_columns(
    pl.min_horizontal(['A', 'B']).alias('min_column')
)
92.4 ms ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And the same with when().then().otherwise():
df.with_columns(
    pl.when(
        pl.col('A') < pl.col('B')
    ).then(pl.col('A')).otherwise(pl.col('B')).alias('min_column'),
)
458 ms ± 75.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I measured the when part (the comparison) on its own, and it does not seem to be the bottleneck:
df.with_columns((pl.col('A') < pl.col('B')).alias('column_comparison'))
49.2 ms ± 6.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If otherwise() is removed, it gets even slower:
df.with_columns(
    pl.when(
        pl.col('A') < pl.col('B')
    ).then(pl.col('A')).alias('min_column')
)
664 ms ± 19.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I have also tried some other methods for horizontal reduction, such as pl.reduce and pl.fold, and they all seem to be much faster than when().then().
So the questions here are about the performance of when().then().otherwise().
I've got some comments from Polars developers on Discord:
I don't see anything out of the ordinary here. Removing otherwise just means .otherwise(pl.lit(None)) is called in the background. It will have to create that column rather than using the existing one. So it will be slower. If you can write your expression as a fold it might be faster, as you have noticed with min_horizontal.
So my conclusion here: when the task is to reduce several columns into one, it is better to use the fold or reduce methods where possible, instead of when().then().
EDIT
Since Polars 0.20.17 there is a huge speedup in when/then operations, thanks to a refactoring of the if-then-else kernels. Benchmarks: https://github.com/pola-rs/polars/pull/15131 So it is no longer a problem to use if-then-else when it is necessary.