Search code examples
data-sciencecategorical-datapython-polars

How to create non-alphabetically ordered Categorical column in Polars Dataframe?


In Pandas, you can create an "ordered" Categorical column from existing string column as follows:

column_values_with_custom_order = ["B", "A", "C"]

df["Column"] = pd.Categorical(df.Column, categories=column_values_with_custom_order, ordered=True)

In Polars documentation, I couldn't find such way to create ordered columns. However, I could reproduce this by using pl.from_pandas(df) so I suspect that this is possible with Polars as well.

What would be the recommended way to this?

I tried to create new column with:

polars_df.with_columns(col("Column").cast(pl.Categorical))

but I don't know how to include the custom ordering to this.

I also checked In polars, can I create a categorical type with levels myself?, but I would prefer not to add another column to my Dataframe only for ordering.


Solution

  • Say you have

    df = pl.DataFrame(
         {"cats": ["z", "z", "k", "a", "b"], "vals": [3, 1, 2, 2, 3]}
         )
    

    and you want to make cats a categorical but you want the categorical ordered as

    myorder=["k", "z", "b", "a"]
    

    There are two ways to do this. One way is with pl.StringCache() as in the question you reference and the other is more messy. The former does not require you add any columns to your df. It's actually very succinct.

    with pl.StringCache():
        pl.Series(myorder).cast(pl.Categorical)
        df=df.with_columns(pl.col('cats').cast(pl.Categorical))
    

    What happens is that everything in the StringCache gets the same key values so when the myorder list is casted that saves what keys get allocated to each string value. When your df gets casted under the same cache it gets the same key/string values which are in the order you wanted.

    The other way to do this is as follows:

    You have to sort your df by the ordering then you can do set_ordering('physical'). If you want to maintain your original order then you just have to use with_row_count at the beginning so you can restore that order.

    Putting it all together, it looks like this:

    df=df.with_row_index('i').join(
            pl.from_dicts([{'order':x, 'cats':y} for x,y in enumerate(myorder)]), on='cats') \
        .sort('order').drop('order') \
        .with_columns(pl.col('cats').cast(pl.Categorical('physical'))) \
        .sort('i').drop('i')
    

    You can verify by doing:

    df.select('cats',pl.col('cats').to_physical().alias('phys'))
    
    shape: (5, 2)
    ┌──────┬──────┐
    │ cats ┆ phys │
    │ ---  ┆ ---  │
    │ cat  ┆ u32  │
    ╞══════╪══════╡
    │ z    ┆ 1    │
    │ z    ┆ 1    │
    │ k    ┆ 0    │
    │ a    ┆ 3    │
    │ b    ┆ 2    │
    └──────┴──────┘