Search code examples
python-3.xpython-polars

Explode out polars column of list items horizontally to new columns


Is there a polars native way to explode out a column of lists horizontally?

e.g. go from this:

df = pl.DataFrame(
    {
        "letters": ["a", "a", "b", "c"],
        "numbers": [[1, 1], [2, 3], [4, 5], [6, 7]],
    }
)

To this?

shape: (4, 3)
┌─────────┬───────────┬───────────┐
│ letters ┆ numbers_a ┆ numbers_b │
│ ---     ┆ ---       ┆ ---       │
│ str     ┆ i64       ┆ i64       │
╞═════════╪═══════════╪═══════════╡
│ a       ┆ 1         ┆ 1         │
│ a       ┆ 2         ┆ 3         │
│ b       ┆ 4         ┆ 5         │
│ c       ┆ 6         ┆ 7         │
└─────────┴───────────┴───────────┘

I am aware of the .explode() method but afaik this is only possible to use vertically


Solution

  • There are two ways you can do this.

    struct/unnest

    The easiest is to .list.to_struct and then unnest

    from string import ascii_lowercase
    
    (
        df
        .with_columns(
            pl.col('numbers').list.to_struct(
                fields=lambda idx: f"numbers_{ascii_lowercase[idx]}",
                n_field_strategy='max_width'
                )
            )
        .unnest('numbers')
    )
    

    get

    from string import ascii_lowercase
    max_width=df['numbers'].list.len().max()
    (
        df
        .select(
            pl.exclude('numbers'), 
            *[pl.col('numbers').list.get(x).alias(f"numbers_{ascii_lowercase[x]}") 
              for x in range(max_width)]
            )
        )
    

    In either case you're going to get:

    shape: (4, 3)
    ┌─────────┬───────────┬───────────┐
    │ letters ┆ numbers_a ┆ numbers_b │
    │ ---     ┆ ---       ┆ ---       │
    │ str     ┆ i64       ┆ i64       │
    ╞═════════╪═══════════╪═══════════╡
    │ a       ┆ 1         ┆ 1         │
    │ a       ┆ 2         ┆ 3         │
    │ b       ┆ 4         ┆ 5         │
    │ c       ┆ 6         ┆ 7         │
    └─────────┴───────────┴───────────┘
    

    In the former case you can more easily chain the methods but the second case relies on first determining the max_width which is less flexible. The second method might be more performant but I don't really know for sure so it's worth checking.