Tags: python, python-polars

How do I add the result of an apply/map_rows as a new column in polars?


I have a wide dataframe, and I'm applying some custom logic to some columns to generate a new column. This works, and returns a dataframe with a single column with my desired values.

How can I get this as a new column in the original dataframe?

I tried various forms of .with_columns but none did the trick; and without row identifiers I don't feel at ease doing a concatenation.

Any ideas?

I'm trying to solve for a generic UDF, not one that I plan to express as a polars expression.

import polars as pl

df = pl.DataFrame({
    "foo": [16, 28, 0],
    "bar": [None, 4, 17],
    "yat": [41, 174, 15],
    "tar": [None, 4, 0],
})

def udf(row: tuple) -> str:
    '''This code is illustrative - meant to be a row-wise UDF'''
    # One term per value, skipping nulls and zeros; the exponent is the column position
    return ' + '.join([f'{x}x{"^"+str(i) if i>0 else ""}' for i, x in enumerate(row) if x is not None and x != 0])

# map_rows returns a one-column frame whose column is named "map"
df.select('foo', 'bar', 'yat', 'tar').map_rows(udf).rename({'map': 'poly'})
shape: (3, 1)
┌────────────────────────────┐
│ poly                       │
│ ---                        │
│ str                        │
╞════════════════════════════╡
│ 16x + 41x^2                │
│ 28x + 4x^1 + 174x^2 + 4x^3 │
│ 17x^1 + 15x^2              │
└────────────────────────────┘

My desired output is

shape: (3, 5)
┌─────┬──────┬─────┬──────┬────────────────────────────┐
│ foo ┆ bar  ┆ yat ┆ tar  ┆ poly                       │
│ --- ┆ ---  ┆ --- ┆ ---  ┆ ---                        │
│ i64 ┆ i64  ┆ i64 ┆ i64  ┆ str                        │
╞═════╪══════╪═════╪══════╪════════════════════════════╡
│ 16  ┆ null ┆ 41  ┆ null ┆ 16x + 41x^2                │
│ 28  ┆ 4    ┆ 174 ┆ 4    ┆ 28x + 4x^1 + 174x^2 + 4x^3 │
│ 0   ┆ 17   ┆ 15  ┆ 0    ┆ 17x^1 + 15x^2              │
└─────┴──────┴─────┴──────┴────────────────────────────┘

In pandas you'd do this as: df.assign(poly = lambda x: x.apply(udf, axis=1))

SOLVED:

(df.with_columns(
   pl.struct('foo','bar','yat','tar')
     .map_elements(lambda x: udf(x.values()))
     .alias('poly'))
)

Or, to pack every column without naming them:

(df.with_columns(
   pl.struct(pl.all())
     .map_elements(lambda x: udf(x.values()))
     .alias('poly'))
)
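
For anyone wondering why the lambda passes x.values(): with a struct column, map_elements hands each row to the Python function as a dict keyed by field name. Here is a rough plain-Python sketch of that (no polars involved; the `rows` list is a hand-written stand-in for the struct column, so treat it as a model, not as what polars runs internally):

```python
# Plain-Python model of the rows a struct column passes to the lambda.
# Assumption: each struct value arrives as a dict keyed by column name.

def udf(row) -> str:
    """Row-wise UDF from the question: skips nulls and zeros."""
    return " + ".join(
        f'{x}x{"^" + str(i) if i > 0 else ""}'
        for i, x in enumerate(row)
        if x is not None and x != 0
    )

# The three rows of the example frame, written as dicts by hand.
rows = [
    {"foo": 16, "bar": None, "yat": 41, "tar": None},
    {"foo": 28, "bar": 4, "yat": 174, "tar": 4},
    {"foo": 0, "bar": 17, "yat": 15, "tar": 0},
]

# x.values() yields the values in field order, which udf then enumerates.
poly = [udf(x.values()) for x in rows]
print(poly)
```

The strings this produces are the same ones shown in the poly column of the desired output, which is a quick way to check a UDF before wiring it into with_columns.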

Solution

  • If you want to apply a function over multiple columns, you need to pack them into a struct type. This packing is free, but it is needed to satisfy the expression rules: every expression takes a single input with a single data type. That is, an expression is Fn(Expr) -> Expr.

    The example below uses map_elements to compute a horizontal sum, followed by the more idiomatic way to compute the same thing.

    import polars as pl

    df = pl.DataFrame({
        "foo": [1, 2],
        "bar": [0.1, 0.2],
    })

    def mysum(row: tuple) -> float:
        '''This code is illustrative - meant to be a row-wise UDF'''
        return sum(row)


    df.with_columns(
        # horizontal sum with a custom UDF; x is one row of the struct, as a dict
        pl.struct("foo", "bar")
          .map_elements(lambda x: mysum((x["foo"], x["bar"])))
          .alias("foo+bar (slow)"),
        # idiomatic way to do a horizontal sum
        pl.sum_horizontal("foo", "bar").alias("foo+bar (fast)"),
    )
    

    Folds

    If you need a more complicated horizontal aggregation but want to keep the code fast (a Python function in map_elements is not), you can use fold. The example below computes a horizontal sum with a fold.

    df.with_columns(
        pl.fold(
            acc=0,
            function=lambda a, b: a + b,
            exprs=pl.all()
        ).alias("foo+bar")
    )
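
For intuition, a fold is a left reduction over the selected columns, and the function is called with whole columns, not single values, so the lambda only runs once per column. Below is a rough plain-Python model of that, with lists standing in for columns. The names `columns` and `add_columns` are made up for the sketch and are not polars API, and stretching acc to the column length is just how this model handles the scalar start value.

```python
from functools import reduce

# Columns of the small example frame, modelled as plain lists.
columns = {"foo": [1, 2], "bar": [0.1, 0.2]}

def add_columns(a, b):
    """Element-wise a + b, standing in for adding two columns."""
    return [x + y for x, y in zip(a, b)]

# Model assumption: the scalar acc=0 is stretched to the column length.
n_rows = len(columns["foo"])
acc = [0] * n_rows

# Left fold: ((acc + foo) + bar), one whole column per step.
total = reduce(add_columns, columns.values(), acc)
print(total)
```

Swapping add_columns for another element-wise operation (max, product, and so on) gives the corresponding horizontal aggregation, which is the point of using a fold over a per-row Python function.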