Efficiently reparsing string series (in a dataframe) into a struct, recasting the fields of the struct and then unnesting it


Consider the following toy example:

import polars as pl

xs = pl.DataFrame(
    [
        pl.Series(
            "date",
            ["2024 Jan", "2024 Feb", "2024 Jan", "2024 Jan"],
            dtype=pl.String,
        )
    ]
)
ys = (
    xs.with_columns(
        pl.col("date").str.split(" ").list.to_struct(fields=["year", "month"]),
    )
    .with_columns(
        pl.col("date").struct.with_fields(pl.field("year").cast(pl.Int16()))
    )
    .unnest("date")
)
ys
shape: (4, 2)
┌──────┬───────┐
│ year ┆ month │
│ ---  ┆ ---   │
│ i16  ┆ str   │
╞══════╪═══════╡
│ 2024 ┆ Jan   │
│ 2024 ┆ Feb   │
│ 2024 ┆ Jan   │
│ 2024 ┆ Jan   │
└──────┴───────┘

I think it would be more efficient to parse only the unique values of the date column and then map the results back onto the original rows (I could use map_dict, but I opted for a join for no particular reason):

unique_dates = (
    pl.DataFrame([xs["date"].unique()])
    .with_columns(
        pl.col("date")
        .str.split(" ")
        .list.to_struct(fields=["year", "month"])
        .alias("struct_date")
    )
    .with_columns(
        pl.col("struct_date").struct.with_fields(
            pl.field("year").cast(pl.Int16())
        )
    )
)
unique_dates
shape: (2, 2)
┌──────────┬──────────────┐
│ date     ┆ struct_date  │
│ ---      ┆ ---          │
│ str      ┆ struct[2]    │
╞══════════╪══════════════╡
│ 2024 Jan ┆ {2024,"Jan"} │
│ 2024 Feb ┆ {2024,"Feb"} │
└──────────┴──────────────┘
zs = (
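    # Join on the shared string key "date" (passing both `on` and
    # `left_on`/`right_on` raises an error); a left join keeps every row of xs.
    xs.join(unique_dates, on="date", how="left")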
    .drop("date")
    .rename({"struct_date": "date"})
    .unnest("date")
)

zs
shape: (4, 2)
┌──────┬───────┐
│ year ┆ month │
│ ---  ┆ ---   │
│ i16  ┆ str   │
╞══════╪═══════╡
│ 2024 ┆ Jan   │
│ 2024 ┆ Feb   │
│ 2024 ┆ Jan   │
│ 2024 ┆ Jan   │
└──────┴───────┘

What can I do to improve the efficiency of this operation even further? Am I using polars idiomatically enough?


Solution

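  • .str.splitn() should be more efficient, as it returns a struct directly and so avoids creating a List and then calling .list.to_struct()

    .struct.field() accepts several field names, so it can also be used to "unnest" the fields directly, without a separate .unnest() call.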

    xs.select(
        pl.col.date.str.splitn(" ", 2)
          .struct.rename_fields(["year", "month"])
          .struct.with_fields(pl.field("year").cast(pl.Int16))
          .struct.field("year", "month")
    )
    
    shape: (4, 2)
    ┌──────┬───────┐
    │ year ┆ month │
    │ ---  ┆ ---   │
    │ i16  ┆ str   │
    ╞══════╪═══════╡
    │ 2024 ┆ Jan   │
    │ 2024 ┆ Feb   │
    │ 2024 ┆ Jan   │
    │ 2024 ┆ Jan   │
    └──────┴───────┘