Search code examples
pythonpython-polars

Polars Replacing Values of Other groups to the Values of a Certain Group


I have the following Polars.DataFrame:

df = pl.DataFrame(
    {
        "timestamp": [1, 2, 3, 1, 2, 3],
        "var1": [1, 2, 3, 3, 4, 5], 
        "group": ["a", "a", "a", "b", "b", "b"],
        }
    )
print(df)

out:
shape: (6, 3)
┌───────────┬──────┬───────┐
│ timestamp ┆ var1 ┆ group │
│ ---       ┆ ---  ┆ ---   │
│ i64       ┆ i64  ┆ str   │
╞═══════════╪══════╪═══════╡
│ 1         ┆ 1    ┆ a     │
│ 2         ┆ 2    ┆ a     │
│ 3         ┆ 3    ┆ a     │
│ 1         ┆ 3    ┆ b     │
│ 2         ┆ 4    ┆ b     │
│ 3         ┆ 5    ┆ b     │
└───────────┴──────┴───────┘

I want to replace the values of group b with the values of group a that are having the same timestamps.

Desired output:

shape: (6, 3)
┌───────────┬──────┬───────┐
│ timestamp ┆ var1 ┆ group │
│ ---       ┆ ---  ┆ ---   │
│ i64       ┆ i64  ┆ str   │
╞═══════════╪══════╪═══════╡
│ 1         ┆ 1    ┆ a     │
│ 2         ┆ 2    ┆ a     │
│ 3         ┆ 3    ┆ a     │
│ 1         ┆ 1    ┆ b     │
│ 2         ┆ 2    ┆ b     │
│ 3         ┆ 3    ┆ b     │
└───────────┴──────┴───────┘

I have the current solution with generating a helper df:

def group_value_replacer(
    df: pl.DataFrame, 
    target_group_col: str,
    target_var: str,
    target_group: str,
    ):
    
    helper_df = df.filter(pl.col(target_group_col) == target_group)
    df = df.drop(target_var).join(
        helper_df.drop(target_group_col),
        on=["timestamp"],
        how="left",
        )
    
    return df

group_value_replacer(df, "group", "var1", "a")

out:
shape: (6, 3)
┌───────────┬───────┬──────┐
│ timestamp ┆ group ┆ var1 │
│ ---       ┆ ---   ┆ ---  │
│ i64       ┆ str   ┆ i64  │
╞═══════════╪═══════╪══════╡
│ 1         ┆ a     ┆ 1    │
│ 2         ┆ a     ┆ 2    │
│ 3         ┆ a     ┆ 3    │
│ 1         ┆ b     ┆ 1    │
│ 2         ┆ b     ┆ 2    │
│ 3         ┆ b     ┆ 3    │
└───────────┴───────┴──────┘

I want to improve the solution by using Polars.Expr: For example, is there a way for me to achieve the same operation using expressions like df.with_columns(pl.col(target_var).operationxx).


Solution

  • I think for generic solution your approach with join works fine, you could probably try something like this as well:

    • filter() to filter var1 column to leave only values where group == a
    • first() to get the value.
    • over() to limit it to certain timestamp.
    • coalesce() to fallback to actual value if value for group == a doesn't exist.
    df.with_columns(
        pl.coalesce(
            pl.col.var1.filter(pl.col.group == "a").first().over("timestamp"),
            pl.col.var1
        )
    )
    
    ┌───────────┬──────┬───────┐
    │ timestamp ┆ var1 ┆ group │
    │ ---       ┆ ---  ┆ ---   │
    │ i64       ┆ i64  ┆ str   │
    ╞═══════════╪══════╪═══════╡
    │ 1         ┆ 1    ┆ a     │
    │ 2         ┆ 2    ┆ a     │
    │ 3         ┆ 3    ┆ a     │
    │ 1         ┆ 1    ┆ b     │
    │ 2         ┆ 2    ┆ b     │
    │ 3         ┆ 3    ┆ b     │
    └───────────┴──────┴───────┘
    

    You can also skip coalesce() if you don't need a fallback.

    df.with_columns(
        pl.col.var1.filter(pl.col.group == "a").first().over("timestamp")
    )