
How do you select fields from all structs in a list in Polars?


I'm working with a deeply nested DataFrame (not good practice, I know), and I'd like to express something like "select field X for all structs in list Y".

An example of the data structure:

import polars as pl

data = {
    "a": [
        [{
            "x": [1, 2, 3],
            "y": [4, 5, 6]
        },
        {
            "x": [2, 3, 4],
            "y": [3, 4, 5]
        }
        ]
    ],
}
df = pl.DataFrame(data)

In this case, I'd like to select field "x" from both structs and gather the results into a DataFrame with two columns, call them "x_1" and "x_2".

In other words, the desired output is:

┌───────────┬───────────┐
│ x_1       ┆ x_2       │
│ ---       ┆ ---       │
│ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╡
│ [1, 2, 3] ┆ [2, 3, 4] │
└───────────┴───────────┘

I don't know the length of the list ahead of time, and I'd like to do this dynamically (i.e. without hard-coding the field names). Is this possible using Polars expressions?

Thanks in advance!


Solution

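  • Update: A simpler approach may be to flatten the list, extract the field, and call .unstack(). With a step of 1, each flattened row becomes its own column, so this fits a single-row frame like the example. Note that the columns are named x_0 and x_1 rather than x_1 and x_2: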

    (df.select(pl.col("a").flatten().struct.field("x"))
       .unstack(1)
    )
    
    shape: (1, 2)
    ┌───────────┬───────────┐
    │ x_0       ┆ x_1       │
    │ ---       ┆ ---       │
    │ list[i64] ┆ list[i64] │
    ╞═══════════╪═══════════╡
    │ [1, 2, 3] ┆ [2, 3, 4] │
    └───────────┴───────────┘
    

    Original answer:

    df.select(
       pl.col("a").list.eval(pl.element().struct["x"])
         .list.to_struct("max_width", lambda idx: f"x_{idx + 1}")
    ).unnest("a")
    
    shape: (1, 2)
    ┌───────────┬───────────┐
    │ x_1       ┆ x_2       │
    │ ---       ┆ ---       │
    │ list[i64] ┆ list[i64] │
    ╞═══════════╪═══════════╡
    │ [1, 2, 3] ┆ [2, 3, 4] │
    └───────────┴───────────┘
    

    Explanation

    • Use .list.eval() to loop over each list element and extract the struct field "x" from it.
    df.select(
       pl.col("a").list.eval(pl.element().struct["x"])
    )
    
    # shape: (1, 1)
    # ┌────────────────────────┐
    # │ a                      │
    # │ ---                    │
    # │ list[list[i64]]        │
    # ╞════════════════════════╡
    # │ [[1, 2, 3], [2, 3, 4]] │
    # └────────────────────────┘
    
    • Use .list.to_struct() to convert the outer list into a struct with one field per inner list; the lambda names each field from its index (x_1, x_2, ...).
    df.select(
       pl.col("a").list.eval(pl.element().struct["x"])
         .list.to_struct("max_width", lambda idx: f"x_{idx + 1}")
    )
    
    # shape: (1, 1)
    # ┌───────────────────────┐
    # │ a                     │
    # │ ---                   │
    # │ struct[2]             │
    # ╞═══════════════════════╡
    # │ {[1, 2, 3],[2, 3, 4]} │
    # └───────────────────────┘
    
    • Finally, .unnest() the struct so that each field becomes its own column.