I have 2 DataFrames:
import polars as pl
df1 = pl.DataFrame(
{
"group": ["A", "A", "A", "B", "B", "B"],
"index": [1, 3, 5, 1, 3, 8],
}
)
df2 = pl.DataFrame(
{
"group": ["A", "A", "A", "B", "B", "B"],
"index": [3, 4, 7, 2, 7, 10],
}
)
I want to cap the index
in df2
using the largest index of each group in df1
. The groups in two DataFrames are the same.
expected output for df2
:
shape: (6, 2)
┌───────┬───────┐
│ group ┆ index │
│ --- ┆ --- │
│ str ┆ i64 │
╞═══════╪═══════╡
│ A ┆ 3 │
│ A ┆ 4 │
│ A ┆ 5 │
│ B ┆ 2 │
│ B ┆ 7 │
│ B ┆ 8 │
└───────┴───────┘
You can compute the max per group over df1, then clip
df2:
out = df2.with_columns(
pl.col('index').clip(
upper_bound=df1.select(pl.col('index').max().over('group'))['index']
)
)
Output:
shape: (6, 2)
┌───────┬───────┐
│ group ┆ index │
│ --- ┆ --- │
│ str ┆ i64 │
╞═══════╪═══════╡
│ A ┆ 3 │
│ A ┆ 4 │
│ A ┆ 5 │
│ B ┆ 2 │
│ B ┆ 7 │
│ B ┆ 8 │
└───────┴───────┘
Alternatively, if the two groups are not necessarily the same in both dataframes, you could group_by.max
then align with join
:
df1 = pl.DataFrame(
{
"group": ["A", "A", "A", "B", "B", "B"],
"index": [1, 3, 5, 1, 3, 7],
}
)
df2 = pl.DataFrame(
{
"group": ["A", "A", "A", "B", "B", "B", "B"],
"index": [3, 4, 7, 2, 7, 8, 9],
}
)
out = df2.with_columns(
pl.col('index').clip(
upper_bound=df2.join(df1.group_by('group').max(), on='group')['index_right']
)
)
Output:
shape: (7, 2)
┌───────┬───────┐
│ group ┆ index │
│ --- ┆ --- │
│ str ┆ i64 │
╞═══════╪═══════╡
│ A ┆ 3 │
│ A ┆ 4 │
│ A ┆ 5 │
│ B ┆ 2 │
│ B ┆ 7 │
│ B ┆ 7 │
│ B ┆ 7 │
└───────┴───────┘