Polars - Using pl.when for modification in 2 columns


I have this dataframe with values in thousands and millions:

import polars as pl
sample = pl.DataFrame({"a": [1, 6000, 7000, 2, 3, 8000, 4, 8000]})
shape: (8, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 6000 │
│ 7000 │
│ 2    │
│ 3    │
│ 8000 │
│ 4    │
│ 8000 │
└──────┘

My aim is to obtain this dataframe:

shape: (8, 1)
┌──────────────┐
│ a            │
│ ---          │
│ str          │
╞══════════════╡
│ 1.0 thousand │
│ 6.0 million  │
│ 7.0 million  │
│ 2.0 thousand │
│ 3.0 thousand │
│ 8.0 million  │
│ 4.0 thousand │
│ 8.0 million  │
└──────────────┘

I was able to obtain the desired result, but I used 2 with_columns and 2 pl.when. Is it possible to reduce this to 1 with_columns and 1 pl.when? My main concern is efficiency: if there is another way or, for some reason, 2 pl.when is faster, I'm fine with that.

I saw some solutions using map_batches, but they used fixed values, so I wasn't able to reproduce pl.col("a")/1000 with map_batches.

The code that gave me the result:

sample.with_columns(
    pl.when(pl.col("a") >= 1000)
    .then(pl.lit(" million"))
    .otherwise(pl.lit(" thousand"))
    .alias("string")
).with_columns(
    pl.when(pl.col("a") >= 1000)
    .then(pl.col("a")/1000)
    .otherwise(pl.col("a"))
    .cast(pl.String)
    + pl.col("string")
).drop("string")

Solution

  • It seems that the approach with two pl.when expressions is slightly faster once you have a sufficiently large amount of data.

    You can put both of them into a single .with_columns:

    sample.with_columns(
        pl.when(pl.col("a") >= 1000).then(pl.col("a") / 1000).otherwise(pl.col("a"))
        .cast(pl.String)
        + pl.when(pl.col("a") >= 1000)
        .then(pl.lit(" million"))
        .otherwise(pl.lit(" thousand"))
    )
    

    @Luca's suggestion reads much nicer though, and the time difference is negligible.