I have the following Polars.DataFrame
:
df = pl.DataFrame(
{
"timestamp": [1, 2, 3, 1, 2, 3],
"var1": [1, 2, 3, 3, 4, 5],
"group": ["a", "a", "a", "b", "b", "b"],
}
)
print(df)
out:
shape: (6, 3)
┌───────────┬──────┬───────┐
│ timestamp ┆ var1 ┆ group │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═══════════╪══════╪═══════╡
│ 1 ┆ 1 ┆ a │
│ 2 ┆ 2 ┆ a │
│ 3 ┆ 3 ┆ a │
│ 1 ┆ 3 ┆ b │
│ 2 ┆ 4 ┆ b │
│ 3 ┆ 5 ┆ b │
└───────────┴──────┴───────┘
I want to replace the values of group b
with the values of group a
that are having the same timestamps.
Desired output:
shape: (6, 3)
┌───────────┬──────┬───────┐
│ timestamp ┆ var1 ┆ group │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═══════════╪══════╪═══════╡
│ 1 ┆ 1 ┆ a │
│ 2 ┆ 2 ┆ a │
│ 3 ┆ 3 ┆ a │
│ 1 ┆ 1 ┆ b │
│ 2 ┆ 2 ┆ b │
│ 3 ┆ 3 ┆ b │
└───────────┴──────┴───────┘
I have the current solution with generating a helper df:
def group_value_replacer(
df: pl.DataFrame,
target_group_col: str,
target_var: str,
target_group: str,
):
helper_df = df.filter(pl.col(target_group_col) == target_group)
df = df.drop(target_var).join(
helper_df.drop(target_group_col),
on=["timestamp"],
how="left",
)
return df
group_value_replacer(df, "group", "var1", "a")
out:
shape: (6, 3)
┌───────────┬───────┬──────┐
│ timestamp ┆ group ┆ var1 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞═══════════╪═══════╪══════╡
│ 1 ┆ a ┆ 1 │
│ 2 ┆ a ┆ 2 │
│ 3 ┆ a ┆ 3 │
│ 1 ┆ b ┆ 1 │
│ 2 ┆ b ┆ 2 │
│ 3 ┆ b ┆ 3 │
└───────────┴───────┴──────┘
I want to improve the solution by using Polars.Expr
: For example, is there a way for me to achieve the same operation using expressions like df.with_columns(pl.col(target_var).operationxx)
.
I think for generic solution your approach with join
works fine, you could probably try something like this as well:
filter()
to filter var1
column to leave only values where group == a
first()
to get the value.over()
to limit it to certain timestamp
.coalesce()
to fallback to actual value if value for group == a
doesn't exist.df.with_columns(
pl.coalesce(
pl.col.var1.filter(pl.col.group == "a").first().over("timestamp"),
pl.col.var1
)
)
┌───────────┬──────┬───────┐
│ timestamp ┆ var1 ┆ group │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═══════════╪══════╪═══════╡
│ 1 ┆ 1 ┆ a │
│ 2 ┆ 2 ┆ a │
│ 3 ┆ 3 ┆ a │
│ 1 ┆ 1 ┆ b │
│ 2 ┆ 2 ┆ b │
│ 3 ┆ 3 ┆ b │
└───────────┴──────┴───────┘
You can also skip coalesce()
if you don't need a fallback.
df.with_columns(
pl.col.var1.filter(pl.col.group == "a").first().over("timestamp")
)