
Concatenate Polars DataFrames with columns of dtype Enum


Consider two pl.DataFrames with the same column names. One of the columns has dtype pl.Enum, but each frame uses a different set of categories.

import polars as pl

enum_col1 = pl.Enum(["type1"])
enum_col2 = pl.Enum(["type2"])
df1 = pl.DataFrame(
    {"enum_col": "type1", "value": 10},
    schema={"enum_col": enum_col1, "value": pl.Int64},
)
df2 = pl.DataFrame(
    {"enum_col": "type2", "value": 200},
    schema={"enum_col": enum_col2, "value": pl.Int64},
)

print(df1)
print(df2)

shape: (1, 2)
┌──────────┬───────┐
│ enum_col ┆ value │
│ ---      ┆ ---   │
│ enum     ┆ i64   │
╞══════════╪═══════╡
│ type1    ┆ 10    │
└──────────┴───────┘
shape: (1, 2)
┌──────────┬───────┐
│ enum_col ┆ value │
│ ---      ┆ ---   │
│ enum     ┆ i64   │
╞══════════╪═══════╡
│ type2    ┆ 200   │
└──────────┴───────┘

If I try to do a simple pl.concat([df1, df2]), I get the following error:

polars.exceptions.SchemaError: type Enum(Some(local), Physical) is incompatible with expected type Enum(Some(local), Physical)

You can get around this issue by "enlarging" the enums like this:

pl.concat(
    [
        df1.with_columns(pl.col("enum_col").cast(pl.Enum(["type1", "type2"]))),
        df2.with_columns(pl.col("enum_col").cast(pl.Enum(["type1", "type2"]))),
    ]
)

shape: (2, 2)
┌──────────┬───────┐
│ enum_col ┆ value │
│ ---      ┆ ---   │
│ enum     ┆ i64   │
╞══════════╪═══════╡
│ type1    ┆ 10    │
│ type2    ┆ 200   │
└──────────┴───────┘

Is there a more idiomatic way to do this?


Solution

  • You can cast enum_col to a combined Enum type:

    enum_col = enum_col1 | enum_col2
    
    pl.concat(
        df.with_columns(pl.col.enum_col.cast(enum_col)) for df in [df1, df2]
    )
    
    shape: (2, 2)
    ┌──────────┬───────┐
    │ enum_col ┆ value │
    │ ---      ┆ ---   │
    │ enum     ┆ i64   │
    ╞══════════╪═══════╡
    │ type1    ┆ 10    │
    │ type2    ┆ 200   │
    └──────────┴───────┘
    

    You can also build the combined Enum dynamically from the frames' schemas, for example:

    from functools import reduce
    
    enum_col = reduce(lambda x, y: x | y, [df.schema["enum_col"] for df in [df1, df2]])
    enum_col
    
    Enum(categories=['type1', 'type2'])