KeyError when applying with_columns iteratively over different columns when using pl.struct on Polars LazyFrame


I have an issue with Polars structs (pl.struct) combined with apply (a.k.a. map_elements) inside with_columns on a LazyFrame.

The idea is to apply custom logic to a group of values that span more than one column.

I have been able to achieve this with DataFrames. When I switch to LazyFrames, however, a KeyError is raised whenever I try to access a column in the dictionary that the struct passes to the function. I loop through the columns one by one in order to apply different functions (mapped elsewhere to the column names, but in the examples below I just use the same function for simplicity).

  • Working DataFrame implementation
import polars as pl

my_df = pl.DataFrame(
    {
        "foo": ["a", "b", "c", "d"], 
        "bar": ["w", "x", "y", "z"], 
        "notes": ["1", "2", "3", "4"]
    }
)

print(my_df)

cols_to_validate = ("foo", "bar")

def validate_stuff(value, notes):
    # Any custom logic
    if value not in ["a", "b", "x"]:
        return f"FAILED {value} - PREVIOUS ({notes})"
    else:
        return notes

for col in cols_to_validate:
    my_df = my_df.with_columns(
        pl.struct([col, "notes"]).map_elements(
            lambda row: validate_stuff(row[col], row["notes"])
        ).alias("notes")
    )

print(my_df)
  • Broken LazyFrame implementation
import polars as pl

my_lf = pl.DataFrame(
    {
        "foo": ["a", "b", "c", "d"], 
        "bar": ["w", "x", "y", "z"], 
        "notes": ["1", "2", "3", "4"]
    }
).lazy()

def validate_stuff(value, notes):
    # Any custom logic
    if value not in ["a", "b", "x"]:
        return f"FAILED {value} - PREVIOUS ({notes})"
    else:
        return notes

cols_to_validate = ("foo", "bar")

for col in cols_to_validate:
    my_lf = my_lf.with_columns(
        pl.struct([col, "notes"]).map_elements(
            lambda row: validate_stuff(row[col], row["notes"])
        ).alias("notes")
    )

print(my_lf.collect())

Note that executing each iteration individually, with the column names written out, does work, so I don't understand why the for loop breaks:

my_lf = my_lf.with_columns(
    pl.struct(["foo", "notes"]).map_elements(
        lambda row: validate_stuff(row["foo"], row["notes"])
    ).alias("notes")
)

my_lf = my_lf.with_columns(
    pl.struct(["bar", "notes"]).map_elements(
        lambda row: validate_stuff(row["bar"], row["notes"])
    ).alias("notes")
)

I have found a workaround that uses pl.col instead to get my desired result, but I would like to know whether structs can be used with LazyFrames the same way I used them with DataFrames, or whether this is actually a bug in this Polars version.

I'm using Polars 0.19.13, BTW. Thank you for your attention


Solution

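  • This is a general Python "gotcha" rather than a Polars bug; see the official Python FAQ entry "Why do lambdas defined in a loop with different values all return the same result?"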

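    It breaks because a lambda looks up col when it is called, not when it is defined. A LazyFrame does not run the lambdas until collect(), by which point the loop has finished and col is "bar" for every lambda. The first struct only contains foo and notes, so row["bar"] raises the KeyError. The DataFrame version works because each with_columns call runs eagerly, while col still holds the current column name.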

    One approach is to bind the current value of col as a default (keyword) argument:

    lambda row, col=col: validate_stuff(row[col], row["notes"])
    
    shape: (4, 3)
    ┌─────┬─────┬───────────────────────────────────┐
    │ foo ┆ bar ┆ notes                             │
    │ --- ┆ --- ┆ ---                               │
    │ str ┆ str ┆ str                               │
    ╞═════╪═════╪═══════════════════════════════════╡
    │ a   ┆ w   ┆ FAILED w - PREVIOUS (1)           │
    │ b   ┆ x   ┆ 2                                 │
    │ c   ┆ y   ┆ FAILED y - PREVIOUS (FAILED c - … │
    │ d   ┆ z   ┆ FAILED z - PREVIOUS (FAILED d - … │
    └─────┴─────┴───────────────────────────────────┘
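
The late-binding behaviour can be reproduced without Polars at all. The following is a minimal sketch, not the Polars internals: plain dicts stand in for the struct rows, and the helper names (check, build_callbacks, the rows dict) are made up for illustration. It builds the callbacks in a loop, calls them only after the loop has finished (as collect() would), and compares the original lambda with two ways of binding col early: the default argument shown above and functools.partial.

```python
from functools import partial


def validate_stuff(value, notes):
    # Same custom logic as in the question
    if value not in ["a", "b", "x"]:
        return f"FAILED {value} - PREVIOUS ({notes})"
    return notes


def check(col, row):
    # Hypothetical helper: takes the column name explicitly
    return validate_stuff(row[col], row["notes"])


def build_callbacks(cols):
    late, bound, partials = {}, {}, {}
    for col in cols:
        # Late binding: col is looked up when the lambda is called
        late[col] = lambda row: validate_stuff(row[col], row["notes"])
        # Early binding: the default argument is evaluated at definition time
        bound[col] = lambda row, col=col: validate_stuff(row[col], row["notes"])
        # Early binding: partial stores the current value of col
        partials[col] = partial(check, col)
    return late, bound, partials


# Stand-ins for the dicts that each struct would pass to the function
rows = {
    "foo": {"foo": "c", "notes": "3"},
    "bar": {"bar": "y", "notes": "3"},
}

late, bound, partials = build_callbacks(("foo", "bar"))

# Deferred call, after the loop has ended and col is "bar"
try:
    late["foo"](rows["foo"])
    late_result = "no error"
except KeyError as exc:
    late_result = f"KeyError: {exc}"

print(late_result)                      # KeyError: 'bar'
print(bound["foo"](rows["foo"]))        # FAILED c - PREVIOUS (3)
print(partials["foo"](rows["foo"]))     # FAILED c - PREVIOUS (3)
```

Either early-binding form should slot into the map_elements call in the LazyFrame loop; functools.partial avoids the extra default parameter if you prefer to keep the callback signature to a single argument.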