I have a wide dataframe, and I'm applying some custom logic to some columns to generate a new column. This works, and returns a dataframe with a single column with my desired values.
How can I get this as a new column in the original dataframe?
I tried various forms of .with_columns, but none did the trick, and without row identifiers I'm not comfortable concatenating the result back onto the original.
Any ideas?
I'm trying to solve for a generic UDF, not one that I plan to express as a polars expression.
import polars as pl

df = pl.DataFrame({
    "foo": [16, 28, 0],
    "bar": [None, 4, 17],
    "yat": [41, 174, 15],
    "tar": [None, 4, 0],
})

def udf(row: tuple[float]) -> str:
    '''This code is illustrative - meant to be a row-wise UDF'''
    return ' + '.join([f'{x}x{"^"+str(i) if i>0 else ""}' for i, x in enumerate(row) if x != 0 and x is not None])

# map_rows names its scalar output column 'map', hence the rename
df.select('foo', 'bar', 'yat', 'tar').map_rows(udf).rename({'map': 'poly'})
shape: (3, 1)
┌────────────────────────────┐
│ poly                       │
│ ---                        │
│ str                        │
╞════════════════════════════╡
│ 16x + 41x^2                │
│ 28x + 4x^1 + 174x^2 + 4x^3 │
│ 17x^1 + 15x^2              │
└────────────────────────────┘
What I want instead is:
shape: (3, 5)
┌─────┬──────┬─────┬──────┬────────────────────────────┐
│ foo ┆ bar  ┆ yat ┆ tar  ┆ poly                       │
│ --- ┆ ---  ┆ --- ┆ ---  ┆ ---                        │
│ i64 ┆ i64  ┆ i64 ┆ i64  ┆ str                        │
╞═════╪══════╪═════╪══════╪════════════════════════════╡
│ 16  ┆ null ┆ 41  ┆ null ┆ 16x + 41x^2                │
│ 28  ┆ 4    ┆ 174 ┆ 4    ┆ 28x + 4x^1 + 174x^2 + 4x^3 │
│ 0   ┆ 17   ┆ 15  ┆ 0    ┆ 17x^1 + 15x^2              │
└─────┴──────┴─────┴──────┴────────────────────────────┘
In pandas you'd do this as:
df.assign(poly = lambda x: x.apply(udf, axis=1))
df.with_columns(
    pl.struct('foo', 'bar', 'yat', 'tar')
    # each struct element reaches the lambda as a dict; pass its values to the UDF
    .map_elements(lambda x: udf(x.values()))
    .alias('poly')
)
Or, to pass every column:
df.with_columns(
    pl.struct(pl.all())
    .map_elements(lambda x: udf(x.values()))
    .alias('poly')
)
If you want to apply a function over multiple columns, you need to pack them into a struct type. This packing is free, but it is needed to satisfy the rule that every expression's input consists of a single datatype, i.e. an expression is Fn(Expr) -> Expr.
The example below uses map_elements to compute a horizontal sum, alongside the more idiomatic way to compute one.
df = pl.DataFrame({
    "foo": [1, 2],
    "bar": [.1, .2],
})

def mysum(row: tuple[float]) -> float:
    '''This code is illustrative - meant to be a row-wise UDF'''
    return sum(row)

df.with_columns(
    # horizontal sum with a custom UDF
    pl.struct("foo", "bar").map_elements(lambda x: mysum((x["foo"], x["bar"]))).alias("foo+bar (slow)"),
    # idiomatic way to do a horizontal sum
    pl.sum_horizontal("foo", "bar").alias("foo+bar (fast)")
)
If you want to do more complicated horizontal aggregations but keep the code fast (calling a Python function in map_elements is not), you can use fold. Below I show how to compute a horizontal sum with a fold.
df.with_columns(
    pl.fold(
        acc=0,
        function=lambda a, b: a + b,
        exprs=pl.all()
    ).alias("foo+bar")
)