I have a Polars DataFrame, and I would like to construct a new DataFrame that consists of all possible combinations, choosing 1 element from each row.
Visually like so:
An input DataFrame
| Column A | Column B | Column C |
| -------- | -------- | -------- |
| A1 | B1 | C1 |
| A2 | B2 | C2 |
| A3 | B3 | C3 |
would give
| Column A | Column B | Column C |
| -------- | -------- | -------- |
| A1 | A2 | A3 |
| A1 | A2 | B3 |
| A1 | A2 | C3 |
| A1 | B2 | A3 |
| A1 | B2 | B3 |
| A1 | B2 | C3 |
| A1 | C2 | A3 |
| A1 | C2 | B3 |
| A1 | C2 | C3 |
| B1 | A2 | A3 |
| B1 | A2 | B3 |
| B1 | A2 | C3 |
| B1 | B2 | A3 |
| B1 | B2 | B3 |
| B1 | B2 | C3 |
| B1 | C2 | A3 |
| B1 | C2 | B3 |
| B1 | C2 | C3 |
etc...
I have tried to implement this using a 2D array and a double for loop, which is fairly simple. However, I would really like to implement this with Polars DataFrames, in a way that is compliant with how Polars is built, as I'm hoping it can compute much faster than a double for loop would. This library is still fairly new to me, so please let me know if there is some kind of misunderstanding on my end.
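For reference, the nested-loop approach described here can be sketched with the standard library's itertools.product (the row lists below are made up to match the example table, not taken from real data):

```python
from itertools import product

# Hypothetical input rows, matching the example above.
rows = [
    ["A1", "B1", "C1"],
    ["A2", "B2", "C2"],
    ["A3", "B3", "C3"],
]

# One element from each row -> Cartesian product of the rows.
combos = list(product(*rows))

print(len(combos))  # 3 * 3 * 3 = 27
print(combos[0])    # ('A1', 'A2', 'A3')
print(combos[1])    # ('A1', 'A2', 'B3')
```

This is the pure-Python baseline the question is trying to beat.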
There's no direct "combinations" functionality as far as I am aware.
One possible approach is to .implode() then .explode() each column.
df = pl.from_repr("""
┌──────────┬──────────┬──────────┐
│ Column A ┆ Column B ┆ Column C │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ A1 ┆ B1 ┆ C1 │
│ A2 ┆ B2 ┆ C2 │
│ A3 ┆ B3 ┆ C3 │
└──────────┴──────────┴──────────┘
""")
(df.select(pl.all().implode())
   .explode("Column A")
   .explode("Column B")
   .explode("Column C")
)
shape: (27, 3)
┌──────────┬──────────┬──────────┐
│ Column A ┆ Column B ┆ Column C │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ A1 ┆ B1 ┆ C1 │
│ A1 ┆ B1 ┆ C2 │
│ A1 ┆ B1 ┆ C3 │
│ A1 ┆ B2 ┆ C1 │
│ … ┆ … ┆ … │
│ A3 ┆ B2 ┆ C3 │
│ A3 ┆ B3 ┆ C1 │
│ A3 ┆ B3 ┆ C2 │
│ A3 ┆ B3 ┆ C3 │
└──────────┴──────────┴──────────┘
You can add .unique() in the case of any duplicate values.
Instead of having to name each column, you could use the Lazy API and a loop:
out = df.lazy().select(pl.all().implode())

for col in df.columns:
    out = out.explode(col)

out.collect()