python-polars, rust-polars

In polars, can I create a categorical type with levels myself?


In Pandas, I can specify the levels of a Categorical type myself:

import pandas as pd

MyCat = pd.CategoricalDtype(categories=['A','B','C'], ordered=True)
my_data = pd.Series(['A','A','B'], dtype=MyCat)

This means that

  1. I can make sure that different columns and datasets use the same dtype
  2. I can specify an ordering for the levels.

Is there a way to do this with Polars? I know the string cache feature can achieve 1) in a different way, but I'd like to know whether the dtype and its levels can be specified directly. I'm not aware of any way to achieve 2). However, I think the categorical dtypes in Arrow allow an optional ordering, so maybe it's possible?


Solution

  • Not directly, but we can influence how the global string cache is filled. The global string cache simply increments a counter for every new category added.

    So if we start with an empty cache and pre-fill it with the categories in the order we care about, any later cast reuses the integers already in the cache instead of assigning new ones.

    Here is an example:

    import string
    import polars as pl
    
    with pl.StringCache():
        # the first run will fill the global string cache counting from 0..25
        # for all 26 letters in the alphabet
        pl.Series(list(string.ascii_uppercase)).cast(pl.Categorical)
        
        # now the global string cache is populated with all categories
        # we cast the string columns
        df = (
            pl.DataFrame({
                "letters": ["A", "B", "D"],
                "more_letters": ["Z", "B", "J"]
            })
            .with_columns(pl.col(pl.Utf8).cast(pl.Categorical))
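            # Expr.suffix was later replaced by Expr.name.suffix in Polars;
            # use whichever spelling your installed version provides
            .with_columns(pl.col(pl.Categorical).to_physical().name.suffix("_real_category"))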
        )
    
    print(df)
    
    shape: (3, 4)
    ┌─────────┬──────────────┬───────────────────────┬────────────────────────────┐
    │ letters ┆ more_letters ┆ letters_real_category ┆ more_letters_real_category │
    │ ---     ┆ ---          ┆ ---                   ┆ ---                        │
    │ cat     ┆ cat          ┆ u32                   ┆ u32                        │
    ╞═════════╪══════════════╪═══════════════════════╪════════════════════════════╡
    │ A       ┆ Z            ┆ 0                     ┆ 25                         │
    ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ B       ┆ B            ┆ 1                     ┆ 1                          │
    ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ D       ┆ J            ┆ 3                     ┆ 9                          │
    └─────────┴──────────────┴───────────────────────┴────────────────────────────┘
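
The counter behaviour described above can be modelled in a few lines of plain Python. This is only a toy sketch, not Polars' actual implementation: the `ToyStringCache` class and its `encode` method are made-up names for illustration. It assumes nothing beyond "a new string gets the next integer, a known string reuses its integer", and shows why pre-filling the cache fixes the codes.

```python
import string


class ToyStringCache:
    """Hypothetical stand-in for a global string cache: string -> integer code."""

    def __init__(self):
        self._codes = {}

    def encode(self, values):
        # A string seen for the first time gets the next counter value;
        # a string seen before reuses the code it already has.
        return [self._codes.setdefault(v, len(self._codes)) for v in values]


cache = ToyStringCache()
cache.encode(string.ascii_uppercase)  # pre-fill: A -> 0, B -> 1, ..., Z -> 25

letters = cache.encode(["A", "B", "D"])
more_letters = cache.encode(["Z", "B", "J"])
print(letters)       # [0, 1, 3]
print(more_letters)  # [25, 1, 9]

# Without the pre-fill, codes simply follow the order of first appearance.
fresh = ToyStringCache()
print(fresh.encode(["Z", "B", "J"]))  # [0, 1, 2]
```

The pre-filled codes match the physical values in the table above. The last call shows that the same strings get different codes when the cache starts empty.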