Problem
Is there some nice/efficient/best way how to get a dict of polars expressions being applied (and evaluated) on a dataframe (given a column value for match and same+other column values as a part of the expression evaluation)?
Setup
import polars as pl
pl.Config.set_fmt_str_lengths(100)
# base data
df = pl.DataFrame(
{
"a":[1,2,3],
"b":[2,3,4]
}
)
# dict I want to use to map expression based on column `a`, and customize the message based on other columns from `df`
dct = {
1: pl.format("My message about A '{}' and B '{}'", pl.col("a"), pl.col("b")),
2: pl.format("Another message having a={}'", pl.col("a"))
}
What I want to get to
exp_df = pl.DataFrame(
{
"a":[1,2,3],
"b":[2,3,4],
"message":["My message about a=1 and b=2", "Another message having a=2", "Default message having a=3, b=4"]
}
)
print(exp_df)
shape: (3, 3)
┌─────┬─────┬─────────────────────────────────┐
│ a ┆ b ┆ message │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════════════════════════════════╡
│ 1 ┆ 2 ┆ My message about a=1 and b=2 │
│ 2 ┆ 3 ┆ Another message having a=2 │
│ 3 ┆ 4 ┆ Default message having a=3, b=4 │
└─────┴─────┴─────────────────────────────────┘
What I tried
df_achieved = df.with_columns(
[
pl.col("a").map_elements(
lambda value: dct.get(
value,
pl.format("Default message having a={}, b={}", pl.col("a"), pl.col("b"))
)
).alias("message")
]
)
print(df_achieved)
shape: (3, 3)
┌─────┬─────┬───────────────────────────────────────────────────────────────┐
│ a ┆ b ┆ message │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ object │
╞═════╪═════╪═══════════════════════════════════════════════════════════════╡
│ 1 ┆ 2 ┆ String(My message about A ').str.concat_horizontal([col("a"), │
│ ┆ ┆ String(' and B '), col("b"), String(')… │
│ 2 ┆ 3 ┆ String(Another message having │
│ ┆ ┆ a=).str.concat_horizontal([col("a"), String(')]) │
│ 3 ┆ 4 ┆ String(Default message having │
│ ┆ ┆ a=).str.concat_horizontal([col("a"), String(, b=), col("b")]) │
└─────┴─────┴───────────────────────────────────────────────────────────────┘
As long as you don't have too many different messages, I suggest converting your dictionary to a when/then/otherwise
expression programmatically. (Using map_elements
can be painfully single-threaded and slow.)
For example, let's expand your example slightly:
# base data
df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})
# dict I want to use to map expression based on column `a`, and customize the message based on other columns from `df`
dct = {
1: pl.format("My message about A '{}' and B '{}'", pl.col("a"), pl.col("b")),
2: pl.format("Another message having a='{}'", pl.col("a")),
3: pl.format("Still Another message having a='{}'", pl.col("a")),
}
We can convert the dictionary to a when/then/otherwise
expression as follows:
(val_a, msg_expr), *remaining_msgs = dct.items()
when_expr = pl.when(pl.col('a') == val_a).then(msg_expr)
while remaining_msgs:
(val_a, msg_expr), *remaining_msgs = remaining_msgs
when_expr = when_expr.when(pl.col('a') == val_a).then(msg_expr)
when_expr = when_expr.otherwise(
pl.format("Default message having a='{}', b='{}'", pl.col('a'), pl.col("b"))
)
The above code does not significantly slow our performance because the Python bytecode is merely generating the expression; it is not being applied to any data.
Then to run our when/then/otherwise
expression, we simply use a with_columns
context.
(
df
.with_columns(when_expr.alias('message'))
)
shape: (4, 3)
┌─────┬─────┬─────────────────────────────────────┐
│ a ┆ b ┆ message │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════════════════════════════════════╡
│ 1 ┆ 2 ┆ My message about A '1' and B '2' │
│ 2 ┆ 3 ┆ Another message having a='2' │
│ 3 ┆ 4 ┆ Still Another message having a='3' │
│ 4 ┆ 5 ┆ Default message having a='4', b='5' │
└─────┴─────┴─────────────────────────────────────┘
This approach should work as long as you don't have a large number of different messages. In that case, we may need to use partition_by
to partition our DataFrame (by a
), attach each message separately to each partition, and then concatenate the result at the end.