python-polars

Is there a good way to do `zfill` in polars?


Is it proper to use pl.Expr.map_elements to throw the python function zfill at my data? I'm not looking for a performant solution.

pl.col("column").map_elements(lambda x: str(x).zfill(5))

Is there a better way to do this?

And to follow up I'd love to chat about what a good implementation could look like in the discord if you have some insight (assuming one doesn't currently exist).


Solution

  • Edit: Polars 0.13.43 and later

    As of version 0.13.43, Polars has a str.zfill expression that accomplishes this. It is faster than the answer below and should be preferred.
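    A minimal sketch of the built-in expression (the column name and width of 5 are illustrative; str.zfill operates on strings, so integer columns need a cast first):

    ```python
    import polars as pl

    df = pl.DataFrame({"num": [1, 10, 100000]})

    # cast to String, then left-pad with zeros to width 5;
    # strings already longer than 5 characters are left unchanged
    out = df.with_columns(
        pl.col("num").cast(pl.String).str.zfill(5).alias("result")
    )
    print(out)
    ```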


    From your question, I'm assuming that you are starting with a column of integers.

    lambda x: str(x).zfill(5)

    If so, here's an approach that adheres rather strictly to pandas' zfill behavior:

    import polars as pl
    df = pl.DataFrame({"num": [-10, -1, 0, 1, 10, 100, 1000, 10000, 100000, 1000000, None]})
    
    z = 5
    df.with_columns(
        pl.when(pl.col("num").cast(pl.String).str.len_chars() > z)
        .then(pl.col("num").cast(pl.String))
        .otherwise(pl.concat_str(pl.lit("0" * z), pl.col("num").cast(pl.String)).str.slice(-z))
        .alias("result")
    )
    
    shape: (11, 2)
    ┌─────────┬─────────┐
    │ num     ┆ result  │
    │ ---     ┆ ---     │
    │ i64     ┆ str     │
    ╞═════════╪═════════╡
    │ -10     ┆ 00-10   │
    │ -1      ┆ 000-1   │
    │ 0       ┆ 00000   │
    │ 1       ┆ 00001   │
    │ 10      ┆ 00010   │
    │ …       ┆ …       │
    │ 1000    ┆ 01000   │
    │ 10000   ┆ 10000   │
    │ 100000  ┆ 100000  │
    │ 1000000 ┆ 1000000 │
    │ null    ┆ null    │
    └─────────┴─────────┘
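    The otherwise branch relies on a prepend-and-slice trick: prefix z zeros, then keep only the last z characters. A plain-Python sketch of the same idea (the helper name pad is illustrative):

    ```python
    z = 5

    def pad(s: str) -> str:
        # short strings: prepend z zeros, then keep the last z characters;
        # strings longer than z pass through unchanged
        return ("0" * z + s)[-z:] if len(s) <= z else s

    # mirrors the when/then/otherwise logic above
    print(pad("1"))        # 00001
    print(pad("-10"))      # 00-10 (zeros land before the sign, like pandas)
    print(pad("1000000"))  # 1000000
    ```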
    

    Comparing the output to pandas:

    df.with_columns(pl.col('num').cast(pl.String)).get_column('num').to_pandas().str.zfill(z)
    
    0       00-10
    1       000-1
    2       00000
    3       00001
    4       00010
    5       00100
    6       01000
    7       10000
    8      100000
    9     1000000
    10       None
    dtype: object
    

    If you are starting with strings, then you can simplify the code by getting rid of any calls to cast.
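    For instance, with a string column the expression reduces to the following (sketch; the sample values are illustrative):

    ```python
    import polars as pl

    df = pl.DataFrame({"num": ["-10", "0", "1000000"]})
    z = 5

    # same when/then/otherwise logic as above, minus the casts
    out = df.with_columns(
        pl.when(pl.col("num").str.len_chars() > z)
        .then(pl.col("num"))
        .otherwise(pl.concat_str(pl.lit("0" * z), pl.col("num")).str.slice(-z))
        .alias("result")
    )
    print(out)
    ```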

    Edit: On a dataset with 550 million records, this took about 50 seconds on my machine. (Note: this runs single-threaded)

    Edit2: To shave off some time, you can use the following:

    result = df.lazy().with_columns(
        pl.col('num').cast(pl.String).alias('tmp')
    ).with_columns(
        pl.when(pl.col("tmp").str.len_chars() > z)
        .then(pl.col("tmp"))
        .otherwise(pl.concat_str(pl.lit("0" * z), pl.col("tmp")).str.slice(-z))
        .alias("result")
    ).drop('tmp').collect()
    

    but it didn't save that much time.