python, python-polars

Polars equivalent of pandas factorize


Does Polars have a function that encodes a string column as integers (1, 2, 3), like pandas.factorize?

I couldn't find one in the Polars documentation.


Solution

  • Perhaps you're looking for a dense rank or the categorical type.

    import polars as pl
    
    df = pl.DataFrame({"column": ["foo", "bar", "baz", "foo", "foo"]})
    
    # Dense rank: codes follow the sorted order of the distinct values
    df.with_columns(rank = pl.col("column").rank("dense"))
    
    shape: (5, 2)
    ┌────────┬──────┐
    │ column ┆ rank │
    │ ---    ┆ ---  │
    │ str    ┆ u32  │
    ╞════════╪══════╡
    │ foo    ┆ 3    │
    │ bar    ┆ 1    │
    │ baz    ┆ 2    │
    │ foo    ┆ 3    │
    │ foo    ┆ 3    │
    └────────┴──────┘
    

    As far as I know, "first seen" order is a little more involved: take the first row index of each value, then dense-rank those indices.

    (df.with_row_index("index")
       .with_columns(rank = pl.col("index").first().over("column").rank("dense"))
    )
    
    shape: (5, 3)
    ┌───────┬────────┬──────┐
    │ index ┆ column ┆ rank │
    │ ---   ┆ ---    ┆ ---  │
    │ u32   ┆ str    ┆ u32  │
    ╞═══════╪════════╪══════╡
    │ 0     ┆ foo    ┆ 1    │
    │ 1     ┆ bar    ┆ 2    │
    │ 2     ┆ baz    ┆ 3    │
    │ 3     ┆ foo    ┆ 1    │
    │ 4     ┆ foo    ┆ 1    │
    └───────┴────────┴──────┘