Is it possible to update the type of a field inside of a struct column?
I have an explicit pyarrow schema defined which I have used to convert pandas to pyarrow, and I use it to alert me of new columns or to fill in missing columns with nulls. I am trying to replace pandas with polars, but I am running into errors when converting the dataframe into a pyarrow Table in order to cast the data types to match my historical data.
import pyarrow as pa

team_info = pa.struct(
    [
        ("_id", pa.string()),
        ("name", pa.string()),
        ("status", pa.dictionary(index_type=pa.int32(), value_type=pa.string())),
    ]
)
schema = pa.schema(
    [
        ("load_timestamp", pa.timestamp(unit="ns", tz="UTC")),
        ...
        ("team_info", team_info),
        ...
    ]
)
When I attempt to convert the DataFrame to a pa.Table with my pre-defined schema, I get an error: Polars expects the data type of all three nested fields to be "large_string".
return df.to_arrow().cast(schema)
I tried to create a function like this which casts the status column to categorical, but unfortunately this adds the field to the dataframe as a new column, instead of casting the nested field in-place.
def update_nested_status(df: pl.DataFrame, nested_columns: list[str]) -> pl.DataFrame:
    """Fixes data types in the agent_info and monitor_info columns"""
    cols = [df[col].struct.field("status").cast(pl.Categorical) for col in nested_columns]
    return df.with_columns(cols)
Edit:
This is the function I ended up with. It seems to work so far and it casts the polars dtypes to what I have explicitly defined in pyarrow schema objects. It also reorders the columns based on how they are ordered in the pyarrow schema.
def align_polars_schema(df: pl.DataFrame, schema: pa.Schema) -> pl.DataFrame:
    """
    Aligns the schema of a polars dataframe to a pyarrow schema

    Args:
        df: polars DataFrame
        schema: pyarrow Schema
    """
    schema = pl.from_arrow(schema.empty_table()).schema
    df = df.with_columns([pl.col(col).cast(dtype) for col, dtype in schema.items()])
    return df.select([pl.col(col) for col in schema.keys()])
As you've rightly pointed out, df[col].struct.field("status").cast(pl.Categorical) extracts the status field from the struct, casts it, and adds it to the DataFrame as a new top-level column.
If you want to cast it within the struct column, you need to cast the struct column directly.
import polars as pl

df = pl.DataFrame(
    {
        "foo": [
            {"name": "A", "status": "a"},
            {"name": "B", "status": "b"},
            {"name": "C", "status": "c"},
        ],
        "bar": [
            {"name": "A", "status": "a"},
            {"name": "B", "status": "b"},
            {"name": "C", "status": "c"},
        ],
        "other": ["1", "2", "3"],
    }
)

# Target dtype for the whole struct: list every field with the dtype it should end up with.
schema_after = pl.Struct(
    [
        pl.Field("name", pl.Categorical),
        pl.Field("status", pl.Utf8),
    ]
)

# Casting the struct column itself replaces it in place and keeps its name.
df = df.with_columns([df[col].cast(schema_after) for col in ["foo", "bar"]])