Search code examples
python-polars

How to assign multiple values to specific locations in a series in an expr?


In pandas, one can use logical indexing to assign items:

s = pd.Series(['a', 'b', 'c', 'd', 'e'])
idx = [True, False, True, False, True]
s[idx] = ['x', 'y', 'z']

In polars, we can do this with scatter:

s = pl.Series(['a', 'b', 'c', 'd', 'e'])
idx = pl.Series([True, False, True, False, True])
s = s.scatter(idx.arg_true(), ['x', 'y', 'z']) # arg_true() gives us integer indices

However, I'm struggling to figure out how to do this in an expression context, as I cannot do df['col'] = df['col'].scatter.... Instead, polars suggests I use with_columns:

import polars as pl
from polars import col, when

df = pl.DataFrame({
    "my_col": ['a', 'b', 'c', 'd', 'e'],
    "idx": [True, False, True, False, True]
})

new_values = ['x', 'y', 'z']
df = df.with_columns(
    when(col("idx")).then(new_values) # how to do this?
    .otherwise(col("my_col"))
)

My method of getting around this is somewhat long-winded and there must be an easier way:

s = df["my_col"].clone()
s = s.scatter(df["idx"].arg_true(), new_values)
df = df.with_columns(s.alias("my_col"))
shape: (5, 2)
┌────────┬───────┐
│ my_col ┆ idx   │
│ ---    ┆ ---   │
│ str    ┆ bool  │
╞════════╪═══════╡
│ x      ┆ true  │
│ b      ┆ false │
│ y      ┆ true  │
│ d      ┆ false │
│ z      ┆ true  │
└────────┴───────┘

Syntactically it's not horrible, but is there an easier way to simply update a series with a list of values (or other series)?


Solution

  • I'm not aware of an elegant way to convert your code using Series directly to Expressions. But we won't let that stop us.

    One underappreciated aspect of the Polars architecture is the ability to compose our own Expressions using existing Polars API Expressions. (Perhaps there should be a section in the User Guide for this.)

    So, let's create an Expression to do what we need.

    The code below may seem overwhelming at first. We'll look at examples and I'll explain how it works in detail below.

    Expression: set_by_mask

    Here's a custom Expression that will set values based on a mask. For lack of a better name I've called it set_by_mask. The Expression is a bit rough (e.g., it does zero error-checking), but it should act as a good starting point.

    Notice at the end that we will assign this function as a method of the Expr class, so that it can be used just like any other Expression (e.g., it can participate in any valid chain of Expressions, be used within a groupby, etc..)

    Much of the code below deals with "convenience" methods (e.g., allowing the mask parameters to be a list/tuple or an Expression or a Series). Later, we'll go through how the heart of the algorithm works.

    Here's the code:

    from typing import Any, Sequence
    import polars as pl
    
    
    def set_by_mask(
        self: pl.Expr,
        mask: str | Sequence[bool] | pl.Series | pl.Expr,
        values: Sequence[Any] | pl.Series | pl.Expr,
    ) -> pl.Expr:
        """
        Set values at mask locations.
    
        Parameters
        ----------
        mask
            Indices with True values are replaced with values.
    
            Sequence[bool]: list or tuple of boolean values
            str: column name of boolean Expression
            Series | Expr: Series or Expression that evaluates to boolean
    
        values
            Values to replace where mask is True
    
        Notes:
            The number of elements in values must match the number of
            True values in mask.
    
            The mask Expression/list/tuple must match the length of the
            Expression for which values are being set.
    
        """
    
        if isinstance(mask, str):
            mask = pl.col(mask)
    
        if isinstance(mask, Sequence):
            mask = pl.Series("", mask)
    
        if isinstance(values, Sequence):
            values = pl.Series("", values)
    
        if isinstance(mask, pl.Series):
            mask = pl.lit(mask)
    
        if isinstance(values, pl.Series):
            values = pl.lit(values)
    
        result = (
            self
            .sort_by(mask)
            .slice(0, mask.not_().sum())
            .append(values)
            .sort_by(mask.arg_sort())
        )
    
        return self._from_pyexpr(result._pyexpr)
    
    
    pl.Expr.set_by_mask = set_by_mask
    

    Examples

    First, let's run through some examples of how this works.

    In the example below, we'll pass a string as our mask parameter -- indicating the column name of df to be used as a mask. And we'll pass a simple Python list of string values as our values parameter.

    Remember to run the code above first before running the examples below. We need the set_by_mask function to be a method of the Expr class. (Don't worry, it's not permanent - when the Python interpreter exits, the Expr class will be restored to its original state.)

    import polars as pl
    
    df = pl.DataFrame({
        "my_col": ["a", "b", "c", "d", "e"],
        "idx": [True, False, True, False, True],
    })
    
    new_values = ("x", "y", "z")
    (
        df
        .with_columns(
            pl.col('my_col').set_by_mask("idx", new_values).alias('result')
        )
    )
    
    
    shape: (5, 3)
    ┌────────┬───────┬────────┐
    │ my_col ┆ idx   ┆ result │
    │ ---    ┆ ---   ┆ ---    │
    │ str    ┆ bool  ┆ str    │
    ╞════════╪═══════╪════════╡
    │ a      ┆ true  ┆ x      │
    │ b      ┆ false ┆ b      │
    │ c      ┆ true  ┆ y      │
    │ d      ┆ false ┆ d      │
    │ e      ┆ true  ┆ z      │
    └────────┴───────┴────────┘
    

    We see that the values for a, c, and e have been replaced, consistent with where the mask evaluated to True.

    As another example, let's pass the mask and values parameter as external Series (ignoring the idx column).

    new_values = pl.Series("", ['1', '2', '4', '5'])
    mask = pl.Series("", [True, True, False, True, True])
    (
        df
        .with_columns(
            pl.col('my_col').set_by_mask(mask, new_values).alias('result')
        )
    )
    
    shape: (5, 3)
    ┌────────┬───────┬────────┐
    │ my_col ┆ idx   ┆ result │
    │ ---    ┆ ---   ┆ ---    │
    │ str    ┆ bool  ┆ str    │
    ╞════════╪═══════╪════════╡
    │ a      ┆ true  ┆ 1      │
    │ b      ┆ false ┆ 2      │
    │ c      ┆ true  ┆ c      │
    │ d      ┆ false ┆ 4      │
    │ e      ┆ true  ┆ 5      │
    └────────┴───────┴────────┘
    

    How it works

    The heart of the algorithm is this snippet.

        result = (
            self
            .sort_by(mask)
            .slice(0, mask.not_().sum())
            .append(values)
            .sort_by(mask.arg_sort())
        )
    

    To see how it works, we'll use the first example and watch how the algorithm builds the answer.

    First, we'll need to change the with_columns in your query to a select, because the intermediate steps in the algorithm won't produce a column whose length matches the other columns, which will lead to an error.

    Here's the code that we'll run to observe the steps in the algorithm

    import polars as pl
    new_values = ("x", "y", "z")
    (
        pl.DataFrame({
            "my_col": ["a", "b", "c", "d", "e"],
            "idx": [True, False, True, False, True],
        })
        .select(
            pl.col('my_col').set_by_mask("idx", new_values).alias('result')
        )
    )
    
    shape: (5, 1)
    ┌────────┐
    │ result │
    │ ---    │
    │ str    │
    ╞════════╡
    │ x      │
    │ b      │
    │ y      │
    │ d      │
    │ z      │
    └────────┘
    

    With that in place, let's look at how the algorithm evolves.

    The Algorithm in steps

    The first step of the algorithm is to sort the original column so that non-changing values (those corresponding to a mask value of False) are sorted to the top. We'll accomplish this using the sort_by Expression, and pass mask as our sorting criterion.

    I'll change the heart of the algorithm to only these steps.

        result = (
            self
            .sort_by(mask)
        )
    

    Here's the result.

    shape: (5, 1)
    ┌────────┐
    │ result │
    │ ---    │
    │ str    │
    ╞════════╡
    │ b      │
    │ d      │
    │ a      │
    │ c      │
    │ e      │
    └────────┘
    

    In our example, values b and d are not changing and are sorted to the top; values a, c, and e are being replaced and are sorted to the bottom.

    In the next step, we'll use the slice Expression to eliminate those values that will be replaced.

        result = (
            self
            .sort_by(mask)
            .slice(0, mask.not_().sum())
        )
    
    shape: (2, 1)
    ┌────────┐
    │ result │
    │ ---    │
    │ str    │
    ╞════════╡
    │ b      │
    │ d      │
    └────────┘
    

    In the next step, we'll use the append Expression to place the new values at the bottom.

        result = (
            self
            .sort_by(mask)
            .slice(0, mask.not_().sum())
            .append(values)
        )
    
    shape: (5, 1)
    ┌────────┐
    │ result │
    │ ---    │
    │ str    │
    ╞════════╡
    │ b      │
    │ d      │
    │ x      │
    │ y      │
    │ z      │
    └────────┘
    

    Now for the tricky step: how to get the values sorted in the proper order.

    We're going to use arg_sort to accomplish this. One property of an arg_sort is that it can restore a sorted column back to its original un-sorted state.

    If we look at the values above, the non-replaced values are at the top (corresponding to a mask value of False). And the replaced values are at the bottom (corresponding to a mask value of True). This corresponds to a mask of [False, False, True, True, True].

    That, in turn, corresponds to the mask expression when it is sorted. (False sorts before True). Hence, sorting the column by the arg_sort of the mask will restore the column to correspond to the original un-sorted maskcolumn.

        result = (
            self
            .sort_by(mask)
            .slice(0, mask.not_().sum())
            .append(values)
            .sort_by(mask.arg_sort())
        )
    
    shape: (5, 1)
    ┌────────┐
    │ result │
    │ ---    │
    │ str    │
    ╞════════╡
    │ x      │
    │ b      │
    │ y      │
    │ d      │
    │ z      │
    └────────┘
    

    It subtle, but it works.

    I appreciate the above may be more than you wanted. But hopefully, it demonstrates how we can compose our own Expressions using existing Polars Expressions.