Ordinal encoding of column in polars


I would like to ordinal-encode a column. pandas has the convenient function pd.factorize() for this, and I would like to achieve the same in Polars.

import polars as pl

df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
┌─────┬───────┐
│ a   ┆ b     │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 5   ┆ hi    │
│ 8   ┆ hello │
│ 10  ┆ hi    │
└─────┴───────┘

Desired result:

┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0   ┆ 0   │
│ 1   ┆ 1   │
│ 2   ┆ 0   │
└─────┴─────┘

Solution

  • You can join with a helper DataFrame that holds the unique values together with the ordinal codes you want:

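    import polars as pl

    df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})

    # Unique values of "b" in order of first appearance, numbered from 0
    unique = df.select(
        pl.col("b").unique(maintain_order=True)
    ).with_row_index(name="ordinal")

    # Attach the matching ordinal to every row
    df.join(unique, on="b")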

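    Alternatively, you can "misuse" the fact that Categorical values are backed by u32 integers. These integers come from the categorical's internal mapping, so they may not follow first-appearance order in every setup (for example, when a global string cache is active).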

    df.with_columns(
        pl.col("b").cast(pl.Categorical).to_physical().alias("ordinal")
    )
    

    Both methods output:

    shape: (3, 3)
    ┌─────┬───────┬─────────┐
    │ a   ┆ b     ┆ ordinal │
    │ --- ┆ ---   ┆ ---     │
    │ i64 ┆ str   ┆ u32     │
    ╞═════╪═══════╪═════════╡
    │ 5   ┆ hi    ┆ 0       │
    │ 8   ┆ hello ┆ 1       │
    │ 10  ┆ hi    ┆ 0       │
    └─────┴───────┴─────────┘