
How to destructure nested structs in polars (python api)?


Unfortunately, I have to work with some nested data in a polars dataframe (I know it is bad practice). Consider this data:

data = {
    "positions": [
        {
            "company": {
                "companyName": "name1"
            },
        },
        {
            "company": {
                "companyName": "name2"
            },
        },
        {
            "company": {
                "companyName": "name3"
            },
        }
    ]
}

positions is a column in the dataframe. I have explored the polars Python API docs but cannot figure out how to extract just the companyName fields into a separate list column.

I want to achieve the same result as this comprehension:


names = (
    [
        p["company"]["companyName"]
        for p in data["positions"]
        if p.get("company") and p.get("company").get("companyName")
    ]
    if data.get("positions")
    else None
)

Note the null checks.

I get the sense that I have to use .list.eval() along with pl.element(), but I am a bit foggy on the API.

Before:
shape: (3, 1)
┌─────────────┐
│ positions   │
│ ---         │
│ struct[1]   │
╞═════════════╡
│ {{"name1"}} │
│ {{"name2"}} │
│ {{"name3"}} │
└─────────────┘

After:
shape: (3, 1)
┌───────┐
│ names │
│ ---   │
│ str   │
╞═══════╡
│ name1 │
│ name2 │
│ name3 │
└───────┘

Solution

  • Structs

    You can use .struct.field() or .struct[] syntax to extract struct fields.

    import polars as pl

    df = pl.DataFrame(data)

    # Drill into the nested struct one field at a time.
    df.with_columns(
        pl.col("positions").struct["company"].struct["companyName"]
    )
    
    shape: (3, 2)
    ┌─────────────┬─────────────┐
    │ positions   ┆ companyName │
    │ ---         ┆ ---         │
    │ struct[1]   ┆ str         │
    ╞═════════════╪═════════════╡
    │ {{"name1"}} ┆ name1       │
    │ {{"name2"}} ┆ name2       │
    │ {{"name3"}} ┆ name3       │
    └─────────────┴─────────────┘
    

    Alternatively, you can work at the frame-level and .unnest() the structs into columns.

    df.unnest("positions").unnest("company")
    
    shape: (3, 1)
    ┌─────────────┐
    │ companyName │
    │ ---         │
    │ str         │
    ╞═════════════╡
    │ name1       │
    │ name2       │
    │ name3       │
    └─────────────┘
    

    List of structs

    If working with a list of structs you could use the .list.eval() API:

    # Wrapping data in a list gives a single row whose positions column is a list of structs.
    df = pl.DataFrame([data])
    
    df.with_columns(
       pl.col("positions").list.eval(
          pl.element().struct["company"].struct["companyName"]
       )
    )
    
    shape: (1, 1)
    ┌─────────────────────────────┐
    │ positions                   │
    │ ---                         │
    │ list[str]                   │
    ╞═════════════════════════════╡
    │ ["name1", "name2", "name3"] │
    └─────────────────────────────┘
    

    Or, at the frame level, use .explode() and .unnest():

    df.explode("positions").unnest("positions").unnest("company")
    
    shape: (3, 1)
    ┌─────────────┐
    │ companyName │
    │ ---         │
    │ str         │
    ╞═════════════╡
    │ name1       │
    │ name2       │
    │ name3       │
    └─────────────┘