Standard way to use a udf in polars


import polars as pl

def udf(row):
    print(row)
    print(row[0])
    print(row[1])
    return row

df = pl.DataFrame({'a': [1], 'b': [2]})
df = df.map_rows(udf)

gives the output:

(1, 2)
1
2

But I would like to index the row by column name using the [] notation. Is there a specific reason the row arrives as a tuple by default? When I use

import polars as pl

def udf(row):
    print(row['a'])
    print(row['b'])
    return row

df = pl.DataFrame({'a': [1], 'b': [2]})
df = df.map_rows(udf)

I get

TypeError: tuple indices must be integers or slices, not str

How do I make the [] notation work for custom UDFs?


Solution

  • To start with, you should prefer native Polars expressions over custom Python functions wherever possible, since a UDF is called once per row in Python and is much slower. If you are sure you need one, here are some options.

    From the documentation of map_rows():

    • The frame-level map_rows cannot track column names (as the UDF is a black-box that may arbitrarily drop, rearrange, transform, or add new columns); if you want to apply a UDF such that column names are preserved, you should use the expression-level map_elements syntax instead.
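    • So use the expression-level map_elements() on a struct of all columns to apply the function; each row then reaches the function as a dict keyed by column name.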

    Solution 1

    import polars as pl

    def udf(row):
        # each row arrives as a dict, e.g. {'a': 1, 'b': 2}
        print(row['a'])
        print(row['b'])
        return row
    
    df = pl.DataFrame({'a': [1], 'b': [2]})
    
    df.select(pl.struct(pl.all()).map_elements(udf))
    
    Output:
    1
    2
    

    Solution 2

    You can also adjust your function to take a mapping from column names to tuple indices:

    import polars as pl

    def udf(row, cols):
        print(row)
        print(row[cols['a']])
        print(row[cols['b']])
        return row
    
    df = pl.DataFrame({'a': [1], 'b': [2]})
    # map each column name to its position in the row tuple
    cols = {v: i for i, v in enumerate(df.columns)}
    
    df = df.map_rows(lambda x: udf(x, cols))
    

    Solution 3

    • You can use the rows() method with named=True.
    • Or, as @Henry Harbeck mentioned in the comments, use iter_rows(named=True) so that the rows are not all materialized at once:

    import polars as pl

    def udf(row):
        print(row['a'])
        print(row['b'])
        return row

    df = pl.DataFrame({'a': [1], 'b': [2]})
    
    df = pl.DataFrame(udf(r) for r in df.iter_rows(named=True))