Search code examples
python-polars

Is there a way to put multiple steps into a polars expression function?


I've been learning how to write expression functions in polars which are wonderful for creating self-documenting chained operations. But I'm struggling with more complex functions. Let's say I want to replace a value in column bar with the first value in column baz when baz is empty, over the group column foo. To state more clearly: I have a set of columns which form a a sorted group (in my example only foo). I have another column bar which may or may not have empty values. If the first value in bar for the group is empty ('' or NULL) take the corresponding value from a another column baz and apply to every bar in the group. If the first value in bar is not empty then do nothing to the group.

The below works correctly.

Initial DataFrame:

import polars as pl

df = pl.DataFrame({'foo': [1, 1, 1, 2, 2, 2, 3, 3],
                   'bar': ['a', 'a', 'a', None, None, None, 'c', 'c'],
                   'baz': ['x', None, 'q', 'z', 'r', None, 'y', 's']})
shape: (8, 3)
┌─────┬──────┬──────┐
│ foo ┆ bar  ┆ baz  │
│ --- ┆ ---  ┆ ---  │
│ i64 ┆ str  ┆ str  │
╞═════╪══════╪══════╡
│ 1   ┆ a    ┆ x    │
│ 1   ┆ a    ┆ null │
│ 1   ┆ a    ┆ q    │
│ 2   ┆ null ┆ z    │
│ 2   ┆ null ┆ r    │
│ 2   ┆ null ┆ null │
│ 3   ┆ c    ┆ y    │
│ 3   ┆ c    ┆ s    │
└─────┴──────┴──────┘

Transformation I want to perform:

df = (df.with_columns(pl.col('baz').first().over(['foo']).alias('temp'))
        .with_columns(pl.when((pl.col('bar') == '') | (pl.col('bar').is_null()))
                      .then(pl.col('temp'))
                      .otherwise(pl.col('bar')).alias('bar2'))
        .with_columns(pl.col('bar2').alias('bar'))
        .drop(['temp', 'bar2'])
      )

Expected Result:

┌─────┬──────┬──────┐
│ foo ┆ bar  ┆ baz  │
│ --- ┆ ---  ┆ ---  │
│ i64 ┆ str  ┆ str  │
╞═════╪══════╪══════╡
│ 1   ┆ a    ┆ x    │
│ 1   ┆ a    ┆ null │
│ 1   ┆ a    ┆ q    │
│ 2   ┆ z    ┆ z    │
│ 2   ┆ z    ┆ r    │
│ 2   ┆ z    ┆ null │
│ 3   ┆ c    ┆ y    │
│ 3   ┆ c    ┆ s    │
└─────┴──────┴──────┘

In my actual problem, this chain would be only a subset of a larger chain, so it would be great if I could write

def update_bar() -> pl.expr:
   return (#some voodoo)

and then:

df = (df.with_columns(update_bar())
        .drop(['temp', 'bar2'])
     )

or even

df = (df.with_columns(update_bar())
        .with_columns(pl.col('bar2').alias('bar'))
        .drop(['temp', 'bar2'])
     )

The first two operations at the top go together so I'd really like to avoid writing two functions. Any guidance on how to go about this?

Or maybe someone even has a more clever way to accomplish what I need all this code to do? Note that foo and bar having matching grouping is only true in this simplified example. In my real case foo is 3 columns and bar cannot be used as the group alone.


Solution

  • It seems like the task could be achieved in a single expression?

    df.with_columns(
       pl.when((pl.col("bar").first() == "") | pl.col("bar").first().is_null())
         .then(pl.col("baz").first())
         .otherwise(pl.col("bar"))
         .over("foo")
         .alias("new_bar")
    )
    
    shape: (8, 4)
    ┌─────┬──────┬──────┬─────────┐
    │ foo ┆ bar  ┆ baz  ┆ new_bar │
    │ --- ┆ ---  ┆ ---  ┆ ---     │
    │ i64 ┆ str  ┆ str  ┆ str     │
    ╞═════╪══════╪══════╪═════════╡
    │ 1   ┆ a    ┆ x    ┆ a       │
    │ 1   ┆ a    ┆ null ┆ a       │
    │ 1   ┆ a    ┆ q    ┆ a       │
    │ 2   ┆ null ┆ z    ┆ z       │
    │ 2   ┆ null ┆ r    ┆ z       │
    │ 2   ┆ null ┆ null ┆ z       │
    │ 3   ┆ c    ┆ y    ┆ c       │
    │ 3   ┆ c    ┆ s    ┆ c       │
    └─────┴──────┴──────┴─────────┘
    

    As an expression inside a function:

    def update_bar() -> pl.Expr:
        bar_is_empty = pl.any_horizontal( # "easier" way to build | chains
            pl.col("bar").first() == "",
            pl.col("bar").first().is_null()
        )
       
        return (
            pl.when(bar_is_empty)
              .then(pl.col("baz").first())
              .otherwise(pl.col("bar"))
              .over("foo")
              .alias("bar")
        )
    
    >>> df.with_columns(update_bar())
    shape: (8, 3)
    ┌─────┬─────┬──────┐
    │ foo ┆ bar ┆ baz  │
    │ --- ┆ --- ┆ ---  │
    │ i64 ┆ str ┆ str  │
    ╞═════╪═════╪══════╡
    │ 1   ┆ a   ┆ x    │
    │ 1   ┆ a   ┆ null │
    │ 1   ┆ a   ┆ q    │
    │ 2   ┆ z   ┆ z    │
    │ 2   ┆ z   ┆ r    │
    │ 2   ┆ z   ┆ null │
    │ 3   ┆ c   ┆ y    │
    │ 3   ┆ c   ┆ s    │
    └─────┴─────┴──────┘
    

    There are also Frame/Expression pipe methods which may be relevant: