python-polars

Mapping arbitrary list of labels to finite set of boolean columns


I have a column that holds a list of labels, drawn from about 15 distinct values. The number of labels per row can vary. I’d like to map each label to its own boolean column that is true when the label is present in that row and false otherwise.

The question "Python-Polars: How to filter categorical column with string list" suggests that using the string cache is the right way.

Is the best approach a series of column expressions like this?

out = df.select(
    pl.col("labels").str.contains("expired", literal=True).alias("expired"),
    pl.col("labels").str.contains("discounted", literal=True).alias("discounted")
)

If labels is a categorical and I use a string cache, will this be reasonably efficient, given that the number of distinct label combinations is much smaller than the number of rows?


Solution

  • The categorical dtype stores a mapping between an integer key and a string. If you create two different DataFrames, or even just two different columns, then by default each of them gets its own key-string mapping. Equality comparisons and joins between two such categorical columns won't work, because there is no guarantee that the key-string mapping is the same on both sides. If the columns are created under the same string cache, they share one key-string mapping, so equality comparisons can work. The string cache does not give you string methods, though: even with it, you still need to cast the categorical back to a string before calling something like str.contains on it.