I started using polars and I am still learning the syntax.
I would like to achieve a groupby followed by a cross join (cartesian product), where I want to run a calculation between every combination within each group. (It could also be understood as a conditional join).
I got a sample dataframe below:
import polars as pl
df = pl.DataFrame(
{
"garment": ["tshirt", "tshirt", "tshirt", "pants", "pants", "pants", "socks", "socks", "socks"],
"color": ["red", "blue", "green", "yellow", "orange", "purple", "black", "white", "grey"],
"price": [100, 50, 75, 101, 51, 69, 99, 44, 32]
}
)
print(df)
I figured out how to select one group (in this instance: all pants), and how to calculate the price difference between all items.
# for one group
df_pants = df.filter(pl.col('garment') == 'pants')
df_pants.join(df_pants, how="cross").filter(
pl.col('color') != pl.col('color_right') # avoid self-joins
).with_columns(
(pl.col('price') - pl.col('price_right')).abs().alias('difference')
)
But I would like to write a polars expression that first groups the entire dataset by "garment" and then performs joins within each group; and finally runs the calculations (in this example, the price difference).
Any help would be appreciated!
Edit:
Additionally, I would like to have only unique combinations. So only "orange" and "yellow", but not "yellow" and "orange".
Update: .join_where()
has since been added.
You can put the "filter" in the join predicate.
df.join_where(df,
pl.col.garment == pl.col.garment_right,
pl.col.index < pl.col.index_right
)
You can perform a self-join on the garment
column.
If you add a row index before joining, you can use that in the filter condition to prevent "duplicates".
df = df.with_row_index()
(df.join(df, on="garment")
.filter(pl.col.index < pl.col.index_right)
)
shape: (9, 7)
┌───────┬─────────┬────────┬───────┬─────────────┬─────────────┬─────────────┐
│ index ┆ garment ┆ color ┆ price ┆ index_right ┆ color_right ┆ price_right │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ i64 ┆ u32 ┆ str ┆ i64 │
╞═══════╪═════════╪════════╪═══════╪═════════════╪═════════════╪═════════════╡
│ 0 ┆ tshirt ┆ red ┆ 100 ┆ 1 ┆ blue ┆ 50 │
│ 0 ┆ tshirt ┆ red ┆ 100 ┆ 2 ┆ green ┆ 75 │
│ 1 ┆ tshirt ┆ blue ┆ 50 ┆ 2 ┆ green ┆ 75 │
│ 3 ┆ pants ┆ yellow ┆ 101 ┆ 4 ┆ orange ┆ 51 │
│ 3 ┆ pants ┆ yellow ┆ 101 ┆ 5 ┆ purple ┆ 69 │
│ 4 ┆ pants ┆ orange ┆ 51 ┆ 5 ┆ purple ┆ 69 │
│ 6 ┆ socks ┆ black ┆ 99 ┆ 7 ┆ white ┆ 44 │
│ 6 ┆ socks ┆ black ┆ 99 ┆ 8 ┆ grey ┆ 32 │
│ 7 ┆ socks ┆ white ┆ 44 ┆ 8 ┆ grey ┆ 32 │
└───────┴─────────┴────────┴───────┴─────────────┴─────────────┴─────────────┘