Ordinal encoding of column in polars

I would like to ordinal-encode a column. Pandas has the convenient pd.factorize() function for this, and I would like to achieve the same in polars.

 df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
│ a   ┆ b     │
│ --- ┆ ---   │
│ i64 ┆ str   │
│ 5   ┆ hi    │
│ 8   ┆ hello │
│ 10  ┆ hi    │

Desired result:

│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
│ 0   ┆ 0   │
│ 1   ┆ 1   │
│ 2   ┆ 0   │


  • You can join with a dummy DataFrame that contains the unique values and the ordinal encoding you are interested in:

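    import polars as pl

    df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
    # Assumed definition: unique "b" values in first-appearance order, numbered from 0 (with_row_count on older polars)
    unique = df.select(pl.col("b").unique(maintain_order=True)).with_row_index("ordinal")
    df.join(unique, on="b")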

    Or you could "misuse" the fact that categorical values are backed by u32 integers.


    Both methods output:

    shape: (3, 3)
    │ a   ┆ b     ┆ ordinal │
    │ --- ┆ ---   ┆ ---     │
    │ i64 ┆ str   ┆ u32     │
    │ 5   ┆ hi    ┆ 0       │
    │ 8   ┆ hello ┆ 1       │
    │ 10  ┆ hi    ┆ 0       │