I've been learning how to write expression functions in polars which are wonderful for creating self-documenting chained operations. But I'm struggling with more complex functions. Let's say I want to replace a value in column bar
with the first value in column baz
when baz
is empty, over the group column foo
.
To state more clearly: I have a set of columns which form a a sorted group (in my example only foo
). I have another column bar
which may or may not have empty values. If the first value in bar
for the group is empty ('' or NULL) take the corresponding value from a another column baz
and apply to every bar
in the group. If the first value in bar
is not empty then do nothing to the group.
The below works correctly.
Initial DataFrame:
import polars as pl
df = pl.DataFrame({'foo': [1, 1, 1, 2, 2, 2, 3, 3],
'bar': ['a', 'a', 'a', None, None, None, 'c', 'c'],
'baz': ['x', None, 'q', 'z', 'r', None, 'y', 's']})
shape: (8, 3)
┌─────┬──────┬──────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪══════╪══════╡
│ 1 ┆ a ┆ x │
│ 1 ┆ a ┆ null │
│ 1 ┆ a ┆ q │
│ 2 ┆ null ┆ z │
│ 2 ┆ null ┆ r │
│ 2 ┆ null ┆ null │
│ 3 ┆ c ┆ y │
│ 3 ┆ c ┆ s │
└─────┴──────┴──────┘
Transformation I want to perform:
df = (df.with_columns(pl.col('baz').first().over(['foo']).alias('temp'))
.with_columns(pl.when((pl.col('bar') == '') | (pl.col('bar').is_null()))
.then(pl.col('temp'))
.otherwise(pl.col('bar')).alias('bar2'))
.with_columns(pl.col('bar2').alias('bar'))
.drop(['temp', 'bar2'])
)
Expected Result:
┌─────┬──────┬──────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪══════╪══════╡
│ 1 ┆ a ┆ x │
│ 1 ┆ a ┆ null │
│ 1 ┆ a ┆ q │
│ 2 ┆ z ┆ z │
│ 2 ┆ z ┆ r │
│ 2 ┆ z ┆ null │
│ 3 ┆ c ┆ y │
│ 3 ┆ c ┆ s │
└─────┴──────┴──────┘
In my actual problem, this chain would be only a subset of a larger chain, so it would be great if I could write
def update_bar() -> pl.expr:
return (#some voodoo)
and then:
df = (df.with_columns(update_bar())
.drop(['temp', 'bar2'])
)
or even
df = (df.with_columns(update_bar())
.with_columns(pl.col('bar2').alias('bar'))
.drop(['temp', 'bar2'])
)
The first two operations at the top go together so I'd really like to avoid writing two functions. Any guidance on how to go about this?
Or maybe someone even has a more clever way to accomplish what I need all this code to do? Note that foo
and bar
having matching grouping is only true in this simplified example. In my real case foo
is 3 columns and bar cannot be used as the group alone.
It seems like the task could be achieved in a single expression?
df.with_columns(
pl.when((pl.col("bar").first() == "") | pl.col("bar").first().is_null())
.then(pl.col("baz").first())
.otherwise(pl.col("bar"))
.over("foo")
.alias("new_bar")
)
shape: (8, 4)
┌─────┬──────┬──────┬─────────┐
│ foo ┆ bar ┆ baz ┆ new_bar │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str │
╞═════╪══════╪══════╪═════════╡
│ 1 ┆ a ┆ x ┆ a │
│ 1 ┆ a ┆ null ┆ a │
│ 1 ┆ a ┆ q ┆ a │
│ 2 ┆ null ┆ z ┆ z │
│ 2 ┆ null ┆ r ┆ z │
│ 2 ┆ null ┆ null ┆ z │
│ 3 ┆ c ┆ y ┆ c │
│ 3 ┆ c ┆ s ┆ c │
└─────┴──────┴──────┴─────────┘
As an expression inside a function:
def update_bar() -> pl.Expr:
bar_is_empty = pl.any_horizontal( # "easier" way to build | chains
pl.col("bar").first() == "",
pl.col("bar").first().is_null()
)
return (
pl.when(bar_is_empty)
.then(pl.col("baz").first())
.otherwise(pl.col("bar"))
.over("foo")
.alias("bar")
)
>>> df.with_columns(update_bar())
shape: (8, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪══════╡
│ 1 ┆ a ┆ x │
│ 1 ┆ a ┆ null │
│ 1 ┆ a ┆ q │
│ 2 ┆ z ┆ z │
│ 2 ┆ z ┆ r │
│ 2 ┆ z ┆ null │
│ 3 ┆ c ┆ y │
│ 3 ┆ c ┆ s │
└─────┴─────┴──────┘
There are also Frame/Expression pipe methods which may be relevant: