This Polars User Guide page says
Common subplan elimination - Cache subtrees/file scans that are used by multiple subtrees in the query plan.
I don't know what a subtree means in this context, but I was hoping that Polars would reuse the result of a subexpression that is evaluated multiple times within one context. The example below shows that it does not.
# polars==0.18.3
import os
import time

# Block parallel execution of Polars expressions.
# Set this before importing polars, so it is in place before the
# thread pool is created.
os.environ["POLARS_MAX_THREADS"] = "1"

import polars as pl


def compute_heavily(initial_value: pl.Expr) -> pl.Expr:
    """Waste some CPU power to compute the golden ratio."""
    result = initial_value
    for i in range(100):
        result = (1 + result) ** 0.5
    return result


# Create a dataframe
df = pl.DataFrame(pl.repeat(0.0, 10_000_000, eager=True).alias("value"))
# Case A
start_time = time.perf_counter()
df_a = (
    df.lazy().with_columns(compute_heavily(pl.col("value")).alias("result_1")).collect()
)
delta_time = time.perf_counter() - start_time
print(f"Case A: function called once: {delta_time:.3f} s.")
# Case B
start_time = time.perf_counter()
df_b = (
    df.lazy()
    .with_columns(compute_heavily(pl.col("value")).alias("result_1"))
    .with_columns(pl.col("result_1").alias("result_2"))
    .collect()
)
delta_time = time.perf_counter() - start_time
print(
    f"Case B: function called once in one context "
    f"and the result is used in an updated context: {delta_time:.3f} s."
)
# Case C
start_time = time.perf_counter()
df_c = (
    df.lazy()
    .with_columns(compute_heavily(pl.col("value")).alias("result_1"))
    .with_columns(compute_heavily(pl.col("value")).alias("result_2"))
    .collect()
)
delta_time = time.perf_counter() - start_time
print(
    f"Case C: function called twice, "
    f"but each time in a different context: {delta_time:.3f} s."
)
# Case D
start_time = time.perf_counter()
df_d = (
    df.lazy()
    .with_columns(
        (compute_heavily(pl.col("value")) + compute_heavily(pl.col("value"))).alias(
            "result_1"
        )
    )
    .collect()
)
delta_time = time.perf_counter() - start_time
print(
    f"Case D: function called twice within one context "
    f"to make one column: {delta_time:.3f} s."
)
# Case E
start_time = time.perf_counter()
df_e = (
    df.lazy()
    .with_columns(
        [
            compute_heavily(pl.col("value")).alias("result_1"),
            compute_heavily(pl.col("value")).alias("result_2"),
        ]
    )
    .collect()
)
delta_time = time.perf_counter() - start_time
print(
    f"Case E: function called twice within one context, "
    f"but each time for a different column: {delta_time:.3f} s."
)
Case A: function called once: 2.124 s.
Case B: function called once in one context and the result is used in an updated context: 2.208 s.
Case C: function called twice, but each time in a different context: 4.454 s.
Case D: function called twice within one context to make one column: 4.390 s.
Case E: function called twice within one context, but each time for a different column: 4.299 s.
Cases A, B and C work as I expected: A and B take one time unit (the function is explicitly called once), and C takes two (the function is explicitly called twice, in different contexts). Cases D and E also take two time units, but I thought they would take only one thanks to the optimization.
What does "Common subplan elimination" mean?

There now exists, as @ritchie46 alluded to, a Common Subexpression Elimination optimization, which does exactly what the OP is asking about. For example, if we run:
(
    df.lazy()
    .with_columns(
        (compute_heavily(pl.col("value")) + compute_heavily(pl.col("value"))).alias(
            "result_1"
        )
    )
    .explain()
)
# WITH_COLUMNS:
# [[(col("__POLARS_CSER_3495242331465746361")) + (col("__POLARS_CSER_3495242331465746361"))].alias("result_1"),
#  [(1.0) + ([(1.0) + ( ... (col("value"))].pow([0.5])) ... )].pow([0.5]).alias("__POLARS_CSER_3495242331465746361")]
# DF ["value"]; PROJECT */1 COLUMNS; SELECTION: "None"
# (the ~100 nested [(1.0) + (...)].pow([0.5]) levels are elided here for readability)
You can see the reference to col("__POLARS_CSER_3495242331465746361"): that is the cached value of the computation, which is evaluated only once. I'm on version 0.19.11.