I would like to do an ordinal encoding of a column. Pandas has the convenient function pd.factorize(); I would like to achieve the same in Polars.
import polars as pl
df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
┌─────┬───────┐
│ a   ┆ b     │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 5   ┆ hi    │
│ 8   ┆ hello │
│ 10  ┆ hi    │
└─────┴───────┘
Desired result:
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0   ┆ 0   │
│ 1   ┆ 1   │
│ 2   ┆ 0   │
└─────┴─────┘
You can join with a small lookup DataFrame that contains the unique values and the ordinal encoding you are interested in:
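import polars as pl
df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
# Unique values of "b" in order of first appearance, numbered 0, 1, 2, ...
unique = df.select(
    pl.col("b").unique(maintain_order=True)
).with_row_index(name="ordinal")
# Attach the ordinal to every row by matching on "b"
df.join(unique, on="b")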
Or you could "misuse" the fact that categorical values are backed by u32 integers:
df.with_columns(
    pl.col("b").cast(pl.Categorical).to_physical().alias("ordinal")
)
Both methods output:
shape: (3, 3)
┌─────┬───────┬─────────┐
│ a   ┆ b     ┆ ordinal │
│ --- ┆ ---   ┆ ---     │
│ i64 ┆ str   ┆ u32     │
╞═════╪═══════╪═════════╡
│ 5   ┆ hi    ┆ 0       │
│ 8   ┆ hello ┆ 1       │
│ 10  ┆ hi    ┆ 0       │
└─────┴───────┴─────────┘