Search code examples
python-polars

How to map a dict of expressions to a dataframe


Problem

Is there some nice/efficient/best way how to get a dict of polars expressions being applied (and evaluated) on a dataframe (given a column value for match and same+other column values as a part of the expression evaluation)?

Setup

import polars as pl
pl.Config.set_fmt_str_lengths(100)

# base data
df = pl.DataFrame(
    {
        "a":[1,2,3], 
        "b":[2,3,4]
    }
)

# dict I want to use to map expression based on column `a`, and customize the message based on other columns from `df`
dct = {
    1: pl.format("My message about A '{}' and B '{}'", pl.col("a"), pl.col("b")), 
    2: pl.format("Another message having a={}'", pl.col("a"))
}

What I want to get to

exp_df = pl.DataFrame(
    {
        "a":[1,2,3], 
        "b":[2,3,4], 
        "message":["My message about a=1 and b=2", "Another message having a=2", "Default message having a=3, b=4"]
    }
)
print(exp_df)
shape: (3, 3)
┌─────┬─────┬─────────────────────────────────┐
│ a   ┆ b   ┆ message                         │
│ --- ┆ --- ┆ ---                             │
│ i64 ┆ i64 ┆ str                             │
╞═════╪═════╪═════════════════════════════════╡
│ 1   ┆ 2   ┆ My message about a=1 and b=2    │
│ 2   ┆ 3   ┆ Another message having a=2      │
│ 3   ┆ 4   ┆ Default message having a=3, b=4 │
└─────┴─────┴─────────────────────────────────┘

What I tried

df_achieved = df.with_columns(
    [
        pl.col("a").map_elements(
            lambda value: dct.get(
                value,
                pl.format("Default message having a={}, b={}", pl.col("a"), pl.col("b"))
            )
        ).alias("message")
    ]
)
print(df_achieved)
shape: (3, 3)
┌─────┬─────┬───────────────────────────────────────────────────────────────┐
│ a   ┆ b   ┆ message                                                       │
│ --- ┆ --- ┆ ---                                                           │
│ i64 ┆ i64 ┆ object                                                        │
╞═════╪═════╪═══════════════════════════════════════════════════════════════╡
│ 1   ┆ 2   ┆ String(My message about A ').str.concat_horizontal([col("a"), │
│     ┆     ┆ String(' and B '), col("b"), String(')…                       │
│ 2   ┆ 3   ┆ String(Another message having                                 │
│     ┆     ┆ a=).str.concat_horizontal([col("a"), String(')])              │
│ 3   ┆ 4   ┆ String(Default message having                                 │
│     ┆     ┆ a=).str.concat_horizontal([col("a"), String(, b=), col("b")]) │
└─────┴─────┴───────────────────────────────────────────────────────────────┘

Solution

  • As long as you don't have too many different messages, I suggest converting your dictionary to a when/then/otherwise expression programmatically. (Using map_elements can be painfully single-threaded and slow.)

    For example, let's expand your example slightly:

    # base data
    df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})
    
    # dict I want to use to map expression based on column `a`, and customize the message based on other columns from `df`
    dct = {
        1: pl.format("My message about A '{}' and B '{}'", pl.col("a"), pl.col("b")),
        2: pl.format("Another message having a='{}'", pl.col("a")),
        3: pl.format("Still Another message having a='{}'", pl.col("a")),
    }
    

    We can convert the dictionary to a when/then/otherwise expression as follows:

    (val_a, msg_expr), *remaining_msgs = dct.items()
    when_expr = pl.when(pl.col('a') == val_a).then(msg_expr)
    while remaining_msgs:
        (val_a, msg_expr), *remaining_msgs = remaining_msgs
        when_expr = when_expr.when(pl.col('a') == val_a).then(msg_expr)
    
    when_expr = when_expr.otherwise(
        pl.format("Default message having a='{}', b='{}'", pl.col('a'), pl.col("b"))
    )
    

    The above code does not significantly slow our performance because the Python bytecode is merely generating the expression; it is not being applied to any data.

    Then to run our when/then/otherwise expression, we simply use a with_columns context.

    (
        df
        .with_columns(when_expr.alias('message'))
    )
    
    shape: (4, 3)
    ┌─────┬─────┬─────────────────────────────────────┐
    │ a   ┆ b   ┆ message                             │
    │ --- ┆ --- ┆ ---                                 │
    │ i64 ┆ i64 ┆ str                                 │
    ╞═════╪═════╪═════════════════════════════════════╡
    │ 1   ┆ 2   ┆ My message about A '1' and B '2'    │
    │ 2   ┆ 3   ┆ Another message having a='2'        │
    │ 3   ┆ 4   ┆ Still Another message having a='3'  │
    │ 4   ┆ 5   ┆ Default message having a='4', b='5' │
    └─────┴─────┴─────────────────────────────────────┘
    

    This approach should work as long as you don't have a large number of different messages. In that case, we may need to use partition_by to partition our DataFrame (by a), attach each message separately to each partition, and then concatenate the result at the end.