Search code examples
pythondataframepython-polars

How to groupby and then cross join with polars?


I started using polars and I am still learning the syntax.

I would like to achieve a groupby followed by a cross join (cartesian product), where I want to run a calculation between every combination within each group. (It could also be understood as a conditional join).

I got a sample dataframe below:

import polars as pl

df = pl.DataFrame(
    {
        "garment": ["tshirt", "tshirt", "tshirt", "pants", "pants", "pants", "socks", "socks", "socks"],
        "color": ["red", "blue", "green", "yellow", "orange", "purple", "black", "white", "grey"],
        "price": [100, 50, 75, 101, 51, 69, 99, 44, 32]
    }
)
print(df)

I figured out how to select one group (in this instance: all pants), and how to calculate the price difference between all items.

# for one group
df_pants = df.filter(pl.col('garment') == 'pants')
df_pants.join(df_pants, on = "color", how="cross").filter(
    pl.col('color') != pl.col('color_right') # avoid self-joins
  ).with_columns(
    (pl.col('price') - pl.col('price_right')).abs().alias('difference')
)

But I would like to write a polars expression that first groups the entire dataset by "garment" and then performs joins within each group; and finally runs the calculations (in this example, the price difference).

Any help would be appreciated!

Edit:

Additionally, I would like to have only unique combinations. So only "orange" and "yellow", but not "yellow" and "orange".


Solution

  • You can perform a self-join on the garment column.

    If you add a row count before joining, you can use that in the filter condition to prevent "duplicates".

    df = df.with_row_count()
    
    (df.join(df, on="garment")
       .filter(pl.col("row_nr") < pl.col("row_nr_right"))
    )
    
    shape: (9, 7)
    ┌────────┬─────────┬────────┬───────┬──────────────┬─────────────┬─────────────┐
    │ row_nr ┆ garment ┆ color  ┆ price ┆ row_nr_right ┆ color_right ┆ price_right │
    │ ---    ┆ ---     ┆ ---    ┆ ---   ┆ ---          ┆ ---         ┆ ---         │
    │ u32    ┆ str     ┆ str    ┆ i64   ┆ u32          ┆ str         ┆ i64         │
    ╞════════╪═════════╪════════╪═══════╪══════════════╪═════════════╪═════════════╡
    │ 0      ┆ tshirt  ┆ red    ┆ 100   ┆ 1            ┆ blue        ┆ 50          │
    │ 0      ┆ tshirt  ┆ red    ┆ 100   ┆ 2            ┆ green       ┆ 75          │
    │ 1      ┆ tshirt  ┆ blue   ┆ 50    ┆ 2            ┆ green       ┆ 75          │
    │ 3      ┆ pants   ┆ yellow ┆ 101   ┆ 4            ┆ orange      ┆ 51          │
    │ 3      ┆ pants   ┆ yellow ┆ 101   ┆ 5            ┆ purple      ┆ 69          │
    │ 4      ┆ pants   ┆ orange ┆ 51    ┆ 5            ┆ purple      ┆ 69          │
    │ 6      ┆ socks   ┆ black  ┆ 99    ┆ 7            ┆ white       ┆ 44          │
    │ 6      ┆ socks   ┆ black  ┆ 99    ┆ 8            ┆ grey        ┆ 32          │
    │ 7      ┆ socks   ┆ white  ┆ 44    ┆ 8            ┆ grey        ┆ 32          │
    └────────┴─────────┴────────┴───────┴──────────────┴─────────────┴─────────────┘