import polars as pl

def udf(row):
    print(row)
    print(row[0])
    print(row[1])
    return row

df = pl.DataFrame({'a': [1], 'b': [2]})
df = df.map_rows(udf)
gives output,
(1, 2)
1
2
but I would like to use the [] notation with column names. Is there a specific reason that the row comes as a tuple by default? When I use
def udf(row):
    print(row['a'])
    print(row['b'])
    return row

df = pl.DataFrame({'a': [1], 'b': [2]})
df = df.map_rows(udf)
I get
TypeError: tuple indices must be integers or slices, not str
How do I make the [] notation work for custom UDFs?
To start with, you should always prefer native Polars expressions over custom Python functions. But if you are sure you need a UDF, here is how to do it.
From the documentation of map_rows():
"map_rows cannot track column names (as the UDF is a black-box that may arbitrarily drop, rearrange, transform, or add new columns); if you want to apply a UDF such that column names are preserved, you should use the expression-level map_elements syntax instead."
So you can pack the columns into a struct and use map_elements() to apply the function.
Solution 1
def udf(row):
    print(row['a'])
    print(row['b'])
    return row

df = pl.DataFrame({'a': [1], 'b': [2]})
df.select(pl.struct(pl.all()).map_elements(udf))
Output:
1
2
Solution 2
You can also adjust your function so that it converts column names to indices:
def udf(row, cols):
    print(row)
    print(row[cols['a']])
    print(row[cols['b']])
    return row

df = pl.DataFrame({'a': [1], 'b': [2]})
cols = {v: i for i,v in enumerate(df.columns)}
df = df.map_rows(lambda x: udf(x, cols))
Solution 3
You can use the rows() method with named=True or, even better, .iter_rows() so the rows are not materialized at once:
def udf(row):
    print(row['a'])
    print(row['b'])
    return row

df = pl.DataFrame({'a': [1], 'b': [2]})
df = pl.DataFrame(udf(r) for r in df.iter_rows(named=True))