
Find all combinations of a Polars Dataframe selecting one element from each row


I have a Polars DataFrame, and I would like to construct a new DataFrame that consists of all possible combinations formed by choosing one element from each row.

Visually like so:

An input Dataframe

| Column A | Column B | Column C |
| -------- | -------- | -------- |
| A1       | B1       | C1       |
| A2       | B2       | C2       |
| A3       | B3       | C3       |

would give

| Column A | Column B | Column C |
| -------- | -------- | -------- |
| A1       | A2       | A3       |
| A1       | A2       | B3       |
| A1       | A2       | C3       |
| A1       | B2       | A3       |
| A1       | B2       | B3       |
| A1       | B2       | C3       |
| A1       | C2       | A3       |
| A1       | C2       | B3       |
| A1       | C2       | C3       |
| B1       | A2       | A3       |
| B1       | A2       | B3       |
| B1       | A2       | C3       |
| B1       | B2       | A3       |
| B1       | B2       | B3       |
| B1       | B2       | C3       |
| B1       | C2       | A3       |
| B1       | C2       | B3       |
| B1       | C2       | C3       |

etc...

I have tried implementing this with a 2D array and a double for loop, which is fairly simple. However, I would really like to implement it with Polars DataFrames, in a way that fits how Polars is designed, as I'm hoping it will compute much faster than a double for loop would. This library is still fairly new to me, so please let me know if I have misunderstood something.
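For reference, the output described above is the Cartesian product of the rows. A minimal plain-Python sketch of that baseline, assuming the data is held as a list of rows (the names `rows` and `combos` are illustrative and not from the original code), could look like this:

```python
from itertools import product

# Each inner list is one row of the input table.
rows = [
    ["A1", "B1", "C1"],
    ["A2", "B2", "C2"],
    ["A3", "B3", "C3"],
]

# product(*rows) picks exactly one element from each row,
# varying the last row fastest.
combos = list(product(*rows))

for combo in combos[:4]:
    print(combo)
```

This is only a correctness baseline to compare a Polars result against; it makes no claim about relative speed.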


Solution

  • There's no direct "combinations" functionality as far as I am aware.

    One possible approach is to .implode() then .explode() each column.

    import polars as pl

    df = pl.from_repr("""
    ┌──────────┬──────────┬──────────┐
    │ Column A ┆ Column B ┆ Column C │
    │ ---      ┆ ---      ┆ ---      │
    │ str      ┆ str      ┆ str      │
    ╞══════════╪══════════╪══════════╡
    │ A1       ┆ B1       ┆ C1       │
    │ A2       ┆ B2       ┆ C2       │
    │ A3       ┆ B3       ┆ C3       │
    └──────────┴──────────┴──────────┘
    """)
    
    (df.select(pl.all().implode())
       .explode("Column A")
       .explode("Column B")
       .explode("Column C")
    )
    
    shape: (27, 3)
    ┌──────────┬──────────┬──────────┐
    │ Column A ┆ Column B ┆ Column C │
    │ ---      ┆ ---      ┆ ---      │
    │ str      ┆ str      ┆ str      │
    ╞══════════╪══════════╪══════════╡
    │ A1       ┆ B1       ┆ C1       │
    │ A1       ┆ B1       ┆ C2       │
    │ A1       ┆ B1       ┆ C3       │
    │ A1       ┆ B2       ┆ C1       │
    │ …        ┆ …        ┆ …        │
    │ A3       ┆ B2       ┆ C3       │
    │ A3       ┆ B3       ┆ C1       │
    │ A3       ┆ B3       ┆ C2       │
    │ A3       ┆ B3       ┆ C3       │
    └──────────┴──────────┴──────────┘
    

    If the columns contain duplicate values, the result will contain duplicate rows; you can append .unique() to drop them.

    Instead of having to name each column, you could use the Lazy API and a loop:

    out = df.lazy().select(pl.all().implode())
    for col in df.columns:
        out = out.explode(col)
        
    out.collect()