Search code examples
optimizationpython-polars

Why doesn't polars reuse the already evaluated results of repeated expressions within one context?


This Polars User Guide page says

Common subplan elimination - Cache subtrees/file scans that are used by multiple subtrees in the query plan.

I don't know what does a subtree mean in this context, but I was hoping that Polars would try to reuse the result of a subexpression evaluation if it was used multiple times within one context. But the example below shows that it's not true.

# polars==0.18.3

import polars as pl
import time
import os


def compute_heavily(initial_value: pl.Expr) -> pl.Expr:
    """Waste some CPU power to compute the golden ratio"""
    result = initial_value
    for i in range(100):
        result = (1 + result) ** 0.5
    return result


# Block parallel execution of polars expressions
os.environ["POLARS_MAX_THREADS"] = "1"
# Create a dataframe
df = pl.DataFrame(pl.repeat(0.0, 10_000_000, eager=True).alias("value"))

# Case A
start_time = time.perf_counter()
df_a = (
    df.lazy().with_columns(compute_heavily(pl.col("value")).alias("result_1")).collect()
)
delta_time = time.perf_counter() - start_time
print(f"Case A: function called once: {delta_time:.3f} s.")

# Case B
start_time = time.perf_counter()
df_b = (
    df.lazy()
    .with_columns(compute_heavily(pl.col("value")).alias("result_1"))
    .with_columns((pl.col("result_1")).alias("result_2"))
    .collect()
)
delta_time = time.perf_counter() - start_time
print(
    f"Case B: function called once in one context "
    f"and the result is used in an updated context: {delta_time:.3f} s."
)

# Case C
start_time = time.perf_counter()
df_c = (
    df.lazy()
    .with_columns(compute_heavily(pl.col("value")).alias("result_1"))
    .with_columns(compute_heavily(pl.col("value")).alias("result_2"))
    .collect()
)
delta_time = time.perf_counter() - start_time
print(
    f"Case C: function called twice, "
    f"but each time in a different context: {delta_time:.3f} s."
)

# Case D
start_time = time.perf_counter()
df_d = (
    df.lazy()
    .with_columns(
        (compute_heavily(pl.col("value")) + compute_heavily(pl.col("value"))).alias(
            "result_1"
        )
    )
    .collect()
)
delta_time = time.perf_counter() - start_time
print(
    f"Case D: function called twice within one context "
    f"to make one column: {delta_time:.3f} s."
)

# Case E
start_time = time.perf_counter()
df_e = (
    df.lazy()
    .with_columns(
        [
            compute_heavily(pl.col("value").alias("result_1")),
            compute_heavily(pl.col("value")).alias("result_2"),
        ]
    )
    .collect()
)
delta_time = time.perf_counter() - start_time
print(
    f"Case E: function called twice within one context, "
    f"but each time for a different column: {delta_time:.3f} s."
)

Case A: function called once: 2.124 s.
Case B: function called once in one context and the result is used in an updated context: 2.208 s.
Case C: function called twice, but each time in a different context: 4.454 s.
Case D: function called twice within one context to make one column: 4.390 s.
Case E: function called twice within one context, but each time for a different column: 4.299 s.

Cases A, B and C work as I expected: A and B take 1 time unit (as the function is explicitly called once), and C takes 2 time units (as the function is explicitly called twice within different contexts). Cases D and E take 2 time units, but I thought they would take only 1 thanks to optimization.

  1. Is this how Polars is meant to work?
  2. What does a Common subplan elimination mean?

Solution

  • There now exists, as @ritchie46 alluded to, Common Sub Expressions which does exactly what OP is asking about. For example if we do:

    (
        df.lazy()
        .with_columns(
            (compute_heavily(pl.col("value")) + compute_heavily(pl.col("value"))).alias(
                "result_1"
            )
        )
        .explain()
    )
    # WITH_COLUMNS:
    #  [[(col("__POLARS_CSER_3495242331465746361")) + (col("__POLARS_CSER_3495242331465746361"))].alias("result_1"), [(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + ([(1.0) + (col("value"))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]))].pow([0.5]).alias("__POLARS_CSER_3495242331465746361")]
    #   DF ["value"]; PROJECT */1 COLUMNS; SELECTION: "None"
    

    You can see reference to col("__POLARS_CSER_3495242331465746361"), that's the cached value of the calc which it only does once.

    I'm on version 0.19.11.