When writing select or with_columns statements in Polars, I often want to declare intermediate variables to make the code more readable. I'd also like to be able to query a column in a context and reuse it in another column's expression. I am currently forced to chain multiple select/with_columns calls, which lacks elegance. Here is a fictitious example of what I would like to do:
df.with_columns(
    [
        <some expression>.alias('step_1'),  # here I want step_1 to become a column in the output table
        temporary_variable = <some other expression>,  # here I want this variable not to be present in the output table
        pl.col(['step_1']).some_function(temporary_variable).alias('step_2'),  # this column's expression uses both the first column, 'step_1', and temporary_variable
        pl.col(...).some_other_function(temporary_variable).alias('another_column')  # temporary_variable might be needed in several columns' expressions; declaring it once and reusing it makes the code shorter, more modular and avoids copy-pasting
    ]
)
My question is: is there any way to do this in Polars?
You can use the walrus operator (:=, Python's assignment expression) to get very close to what you're after.
Say you have this:
df = pl.DataFrame({'a':[1,2,3], 'b':[2,3,4], 'c':[3,4,5]})
You can do:
df.with_columns(
    (stp_1 := (pl.col("a") * 2)).alias("step_1"),
    stp_1.pow(tempvar := (pl.col("b") + 1.5)).alias("step_2"),
    (pl.col("c") + tempvar).alias("another_column"),
)
Note that I intentionally named the walrus-assigned variable stp_1 to distinguish it from the alias for the column. There's no way to have the walrus also give the column its name.
What is assigned to stp_1 and tempvar isn't data but expressions, which the engine resolves when the context runs. The code above is computationally equivalent to typing out:
df.with_columns(
    (pl.col('a') * 2).alias('step_1'),
    (pl.col('a') * 2).pow((pl.col('b') + 1.5)).alias('step_2'),
    (pl.col('c') + (pl.col('b') + 1.5)).alias('another_column')
)
On performance, also keep this in mind:
All polars expressions within a context are executed in parallel. So they cannot refer to a column that does not yet exist.
That quote predates common subexpression elimination (CSER) in the lazy engine. The optimizer now detects that (pl.col('a') * 2) and (pl.col('b') + 1.5) each occur twice (or more), computes each of them only once, and caches the result for reuse. You can see that with explain:
print(df.lazy().with_columns(
    (pl.col('a') * 2).alias('step_1'),
    (pl.col('a') * 2).pow((pl.col('b') + 1.5)).alias('step_2'),
    (pl.col('c') + (pl.col('b') + 1.5)).alias('another_column')
).explain())
simple π 6/8 ["a", "b", "c", "step_1", ... 2 other columns]
WITH_COLUMNS:
[col("__POLARS_CSER_0xf9e008489610e0a5").alias("step_1"),
col("__POLARS_CSER_0xf9e008489610e0a5")
.pow([col("__POLARS_CSER_0x545c2bc7d61b77c5")]).alias("step_2"),
[(col("c").cast(Unknown(Float)))
+ (col("__POLARS_CSER_0x545c2bc7d61b77c5"))].alias("another_column")]
WITH_COLUMNS:
[[(col("a")) * (2)].alias("__POLARS_CSER_0xf9e008489610e0a5"),
[(col("b").cast(Unknown(Float)))
+ (dyn float: 1.5)].alias("__POLARS_CSER_0x545c2bc7d61b77c5")]
DF ["a", "b", "c"]; PROJECT */3 COLUMNS
As a tangent, instead of using alias you can also name the columns with keyword arguments, like this:
df.with_columns(
    step_1 = (stp_1 := (pl.col('a') * 2)),
    step_2 = stp_1.pow(tempvar := ((pl.col('b') + 1.5))),
    another_column = (pl.col('c') + tempvar)
)