
Why does casting a column with numeric Categorical datatype to an integer in Polars result in unexpected behavior?


I have a Categorical column named decile in my Polars DataFrame df, with values ranging from "01" to "10". When I try to convert that column to a numerical representation via df.with_columns(pl.col('decile').cast(pl.Int8)), the cast values are not mapped as expected (i.e., "01" does not get mapped to 1, and so on), and the range is now 0 to 9 instead of 1 to 10.

The weird thing is that no matter what the original values of the decile column were, they always end up mapped to the range [0, 9] when the column is cast to an integer datatype.

I am trying to cast the values into integer datatype for plotting purposes.

Here is a minimal reproducible example:

import numpy as np
import polars as pl

size = 1_000  # number of rows
df = pl.DataFrame({
    "id": np.random.randint(50, size=size, dtype=np.uint16),
    "amount": np.round(np.random.uniform(10, 100000, size).astype(np.float32), 2),
    "quantity": np.random.randint(1, 7, size=size, dtype=np.uint16),
})
# Aggregate per id (groupby is spelled group_by in later Polars releases)
df = (df
      .groupby("id")
      .agg(revenue=pl.sum("amount"), tot_quantity=pl.sum("quantity"))
     )
# Bin revenue into 10 quantiles with Categorical labels q10 ... q01
df = (df.with_columns(
    pl.col('revenue')
    .qcut(10, labels=[f'q{i:02}' for i in range(10, 0, -1)])
    .alias("decile")
))

How can I make the cast map the values as one would expect, and keep them in the same range as the original values?


Solution

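  • Casting a pl.Categorical directly to an integer returns the underlying physical codes (the index of each category), not the number written in the label, which is why the result always lands in 0 to 9. Cast to string (pl.Utf8) first, then convert the string to an integer. In your example the labels carry a leading q, so a bit more than a straight cast is needed to strip it: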

    pl.col('decile').cast(pl.Utf8).str.slice(1).str.parse_int(10)
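
To see why the direct cast lands in 0 to 9, here is a rough pure-Python sketch of a dictionary-encoded column. It is a simplified model, not Polars' actual implementation: the names labels, code_of, column, codes and parsed are made up for illustration, and the assumption that codes follow the label order is only for the sake of the example.

```python
# Simplified model of a categorical column (an assumption for illustration,
# not Polars internals): each distinct label is stored once, and the column
# itself only holds small integer codes pointing at those labels.

# Same label order as the qcut call in the question: q10, q09, ..., q01
labels = [f"q{i:02}" for i in range(10, 0, -1)]

# Hypothetical encoding: the code of a label is its position in the list
code_of = {label: code for code, label in enumerate(labels)}

# A few example values as they would appear in the decile column
column = ["q10", "q07", "q01", "q10"]

# What a direct integer cast effectively exposes: the codes, always 0..9
codes = [code_of[value] for value in column]

# What the string route recovers: drop the leading "q", parse the digits
parsed = [int(value[1:]) for value in column]

print(codes)   # codes depend on category order, not on the label text
print(parsed)  # numbers actually written in the labels
```

In this model the codes depend only on how many categories there are and in what order they were registered, so any set of ten labels would map to 0 through 9. Going through the string representation, as in the expression above, is what recovers the number in the label.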