Search code examples
pythondataframepython-polars

How to group_by and then cross join with polars?


I started using polars and I am still learning the syntax.

I would like to achieve a groupby followed by a cross join (cartesian product), where I want to run a calculation between every combination within each group. (It could also be understood as a conditional join).

I got a sample dataframe below:

import polars as pl

df = pl.DataFrame(
    {
        "garment": ["tshirt", "tshirt", "tshirt", "pants", "pants", "pants", "socks", "socks", "socks"],
        "color": ["red", "blue", "green", "yellow", "orange", "purple", "black", "white", "grey"],
        "price": [100, 50, 75, 101, 51, 69, 99, 44, 32]
    }
)
print(df)

I figured out how to select one group (in this instance: all pants), and how to calculate the price difference between all items.

# for one group
df_pants = df.filter(pl.col('garment') == 'pants')
df_pants.join(df_pants, how="cross").filter(
    pl.col('color') != pl.col('color_right') # avoid self-joins
  ).with_columns(
    (pl.col('price') - pl.col('price_right')).abs().alias('difference')
)

But I would like to write a polars expression that first groups the entire dataset by "garment" and then performs joins within each group; and finally runs the calculations (in this example, the price difference).

Any help would be appreciated!

Edit:

Additionally, I would like to have only unique combinations. So only "orange" and "yellow", but not "yellow" and "orange".


Solution

  • Update: .join_where() has since been added.

    You can put the "filter" in the join predicate.

    df.join_where(df, 
       pl.col.garment == pl.col.garment_right,
       pl.col.index < pl.col.index_right
    )
    

    You can perform a self-join on the garment column.

    If you add a row index before joining, you can use that in the filter condition to prevent "duplicates".

    df = df.with_row_index()
    
    (df.join(df, on="garment")
       .filter(pl.col.index < pl.col.index_right)
    )
    
    shape: (9, 7)
    ┌───────┬─────────┬────────┬───────┬─────────────┬─────────────┬─────────────┐
    │ index ┆ garment ┆ color  ┆ price ┆ index_right ┆ color_right ┆ price_right │
    │ ---   ┆ ---     ┆ ---    ┆ ---   ┆ ---         ┆ ---         ┆ ---         │
    │ u32   ┆ str     ┆ str    ┆ i64   ┆ u32         ┆ str         ┆ i64         │
    ╞═══════╪═════════╪════════╪═══════╪═════════════╪═════════════╪═════════════╡
    │ 0     ┆ tshirt  ┆ red    ┆ 100   ┆ 1           ┆ blue        ┆ 50          │
    │ 0     ┆ tshirt  ┆ red    ┆ 100   ┆ 2           ┆ green       ┆ 75          │
    │ 1     ┆ tshirt  ┆ blue   ┆ 50    ┆ 2           ┆ green       ┆ 75          │
    │ 3     ┆ pants   ┆ yellow ┆ 101   ┆ 4           ┆ orange      ┆ 51          │
    │ 3     ┆ pants   ┆ yellow ┆ 101   ┆ 5           ┆ purple      ┆ 69          │
    │ 4     ┆ pants   ┆ orange ┆ 51    ┆ 5           ┆ purple      ┆ 69          │
    │ 6     ┆ socks   ┆ black  ┆ 99    ┆ 7           ┆ white       ┆ 44          │
    │ 6     ┆ socks   ┆ black  ┆ 99    ┆ 8           ┆ grey        ┆ 32          │
    │ 7     ┆ socks   ┆ white  ┆ 44    ┆ 8           ┆ grey        ┆ 32          │
    └───────┴─────────┴────────┴───────┴─────────────┴─────────────┴─────────────┘