
In polars, can I create a categorical type with levels myself?


In Pandas, I can specify the levels of a Categorical type myself:

MyCat = pd.CategoricalDtype(categories=['A','B','C'], ordered=True)
my_data = pd.Series(['A','A','B'], dtype=MyCat)

This means that

  1. I can make sure that different columns and sets use the same dtype
  2. I can specify an ordering for the levels.

Is there a way to do this with Polars? I know the string cache feature can achieve 1) in a different way, but I'm interested in whether my dtype/levels can be specified directly. I'm not aware of any way to achieve 2); however, I think the categorical dtypes in Arrow do allow an optional ordering, so maybe it's possible?


Solution

  • EDIT 2024-02-29:

    This answer is outdated. You should use the Polars Enum type for this.

    Old answer

    Not directly, but we can influence how the global string cache is filled. The global string cache simply increments a counter for every new category it sees, so each category gets the next unused integer.

    So if we start with an empty cache and pre-fill it in the order we care about, any later cast to Categorical reuses the integers already stored in the cache.

    Here is an example:

    import string
    import polars as pl
    
    with pl.StringCache():
        # the first run will fill the global string cache counting from 0..25
        # for all 26 letters in the alphabet
        pl.Series(list(string.ascii_uppercase)).cast(pl.Categorical)
        
        # now the global string cache is populated with all categories
        # we cast the string columns
        df = (
            pl.DataFrame({
                "letters": ["A", "B", "D"],
                "more_letters": ["Z", "B", "J"]
            })
            .with_columns(pl.col(pl.Utf8).cast(pl.Categorical))
            # Expr.suffix was deprecated in favour of Expr.name.suffix
            .with_columns(pl.col(pl.Categorical).to_physical().name.suffix("_real_category"))
        )
    
    print(df)
    
    shape: (3, 4)
    ┌─────────┬──────────────┬───────────────────────┬────────────────────────────┐
    │ letters ┆ more_letters ┆ letters_real_category ┆ more_letters_real_category │
    │ ---     ┆ ---          ┆ ---                   ┆ ---                        │
    │ cat     ┆ cat          ┆ u32                   ┆ u32                        │
    ╞═════════╪══════════════╪═══════════════════════╪════════════════════════════╡
    │ A       ┆ Z            ┆ 0                     ┆ 25                         │
    ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ B       ┆ B            ┆ 1                     ┆ 1                          │
    ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ D       ┆ J            ┆ 3                     ┆ 9                          │
    └─────────┴──────────────┴───────────────────────┴────────────────────────────┘