
Declare intermediate variables in select statements in Polars


When writing select or with_columns statements in Polars, I often want to declare intermediate variables to make the code more readable. I'd also like to compute a column in a context and reuse it in another column's expression within the same context. Currently I have to chain multiple select/with_columns calls, which is inelegant. Here is a made-up example of what I would like to do:

df.with_columns(
    [
       <some expression>.alias('step_1'), # here I want step_1 to become a column in the output table
       temporary_variable = <some other expression>, # here I want this variable not to be present in the output table
       pl.col(['step_1']).some_function(temporary_variable).alias('step_2'), # this column's expression uses both the first column: 'step_1' and temporary_variable
       pl.col(...).some_other_function(temporary_variable).alias('another_column') # temporary_variable might be needed in several columns' expressions; declaring it once and reusing it makes the code shorter and more modular, and avoids copy-pasting
    ]
)

My question is: is there any way to do this in Polars?


Solution

  • You can use the walrus operator (:=, an assignment expression) to get very close to what you're after.

    Say you have this:

    df = pl.DataFrame({'a':[1,2,3], 'b':[2,3,4], 'c':[3,4,5]})
    

    You can do:

    df.with_columns(
        (stp_1 := (pl.col("a") * 2)).alias("step_1"),
        stp_1.pow(tempvar := (pl.col("b") + 1.5)).alias("step_2"),
        (pl.col("c") + tempvar).alias("another_column"),
    )
    

    Note that I intentionally named the walrus-assigned variable stp_1 to distinguish it from the alias for the column. There's no way to have the walrus also give the column its name.

    What is assigned to stp_1 and tempvar isn't data but expressions, which the engine resolves when the context runs. The call above is computationally equivalent to typing out:

    df.with_columns(
        (pl.col('a') * 2).alias('step_1'),
        (pl.col('a') * 2).pow((pl.col('b') + 1.5)).alias('step_2'),
        (pl.col('c') + (pl.col('b') + 1.5)).alias('another_column')
        )
    

    On the performance side, also remember:

    All polars expressions within a context are executed in parallel. So they cannot refer to a column that does not yet exist.

    That quote predates common subexpression elimination (CSE; the CSER in the plan below) in the lazy engine, which detects that (pl.col("a") * 2) and (pl.col('b') + 1.5) each occur more than once, computes each only once, and caches the result for reuse. You can see that with explain:

    print(df.lazy().with_columns(
        (pl.col('a') * 2).alias('step_1'),
        (pl.col('a') * 2).pow((pl.col('b') + 1.5)).alias('step_2'),
        (pl.col('c') + (pl.col('b') + 1.5)).alias('another_column')
        ).explain())
    
    simple π 6/8 ["a", "b", "c", "step_1", ... 2 other columns]
       WITH_COLUMNS:
       [col("__POLARS_CSER_0xf9e008489610e0a5").alias("step_1"), 
        col("__POLARS_CSER_0xf9e008489610e0a5")
            .pow([col("__POLARS_CSER_0x545c2bc7d61b77c5")]).alias("step_2"), 
        [(col("c").cast(Unknown(Float)))
            + (col("__POLARS_CSER_0x545c2bc7d61b77c5"))].alias("another_column")] 
         WITH_COLUMNS:
         [[(col("a")) * (2)].alias("__POLARS_CSER_0xf9e008489610e0a5"), 
          [(col("b").cast(Unknown(Float)))
            + (dyn float: 1.5)].alias("__POLARS_CSER_0x545c2bc7d61b77c5")] 
          DF ["a", "b", "c"]; PROJECT */3 COLUMNS
    

    As a tangent, instead of using alias you can also name the columns with keyword arguments, like this:

    df.with_columns(
        step_1 = (stp_1 := (pl.col('a') * 2)),
        step_2 = stp_1.pow(tempvar := ((pl.col('b') + 1.5))),
        another_column = (pl.col('c') + tempvar)
        )