I have this dataframe with values in thousands and millions:
import polars as pl

sample = pl.DataFrame({"a": [1, 6000, 7000, 2, 3, 8000, 4, 8000]})
shape: (8, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 6000 │
│ 7000 │
│ 2    │
│ 3    │
│ 8000 │
│ 4    │
│ 8000 │
└──────┘
My aim is to obtain this dataframe:
shape: (8, 1)
┌──────────────┐
│ a            │
│ ---          │
│ str          │
╞══════════════╡
│ 1.0 thousand │
│ 6.0 million  │
│ 7.0 million  │
│ 2.0 thousand │
│ 3.0 thousand │
│ 8.0 million  │
│ 4.0 thousand │
│ 8.0 million  │
└──────────────┘
I was able to obtain the desired result. However, I used 2 with_columns and 2 pl.when. My question is whether it is possible to reduce this to 1 with_columns and 1 pl.when, focusing on efficiency. If there is another way, or if for some reason 2 pl.when is faster, I'm fine with that.
I did see some solutions with map_batches, but they used fixed values, so I wasn't able to reproduce pl.col("a") / 1000 with map_batches.
The code that gave me the result:
sample.with_columns(
    pl.when(pl.col("a") >= 1000)
    .then(pl.lit(" million"))
    .otherwise(pl.lit(" thousand"))
    .alias("string")
).with_columns(
    pl.when(pl.col("a") >= 1000)
    .then(pl.col("a") / 1000)
    .otherwise(pl.col("a"))
    .cast(pl.String)
    + pl.col("string")
).drop("string")
It seems that the 2x pl.when approach is slightly faster if you have a sufficiently large amount of data (a rough benchmark sketch follows the code below). You can put them into a single .with_columns:
sample.with_columns(
    pl.when(pl.col("a") >= 1000)
    .then(pl.col("a") / 1000)
    .otherwise(pl.col("a"))
    .cast(pl.String)
    + pl.when(pl.col("a") >= 1000)
    .then(pl.lit(" million"))
    .otherwise(pl.lit(" thousand"))
)
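For reference, a rough way to check the timing yourself. This is a sketch, not a rigorous benchmark: it assumes numpy is available, and the row count and repeat count are arbitrary.

import timeit

import numpy as np
import polars as pl

# Hypothetical setup: 10 million rows of mixed small/large values.
big = pl.DataFrame({"a": np.random.randint(1, 10_000, size=10_000_000)})

def two_with_columns():
    # The original two-step approach: build the suffix column, then concatenate.
    return big.with_columns(
        pl.when(pl.col("a") >= 1000)
        .then(pl.lit(" million"))
        .otherwise(pl.lit(" thousand"))
        .alias("string")
    ).with_columns(
        pl.when(pl.col("a") >= 1000)
        .then(pl.col("a") / 1000)
        .otherwise(pl.col("a"))
        .cast(pl.String)
        + pl.col("string")
    ).drop("string")

def one_with_columns():
    # The combined single-step approach from above.
    return big.with_columns(
        pl.when(pl.col("a") >= 1000)
        .then(pl.col("a") / 1000)
        .otherwise(pl.col("a"))
        .cast(pl.String)
        + pl.when(pl.col("a") >= 1000)
        .then(pl.lit(" million"))
        .otherwise(pl.lit(" thousand"))
    )

print("two with_columns:", timeit.timeit(two_with_columns, number=10))
print("one with_columns:", timeit.timeit(one_with_columns, number=10))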
@Luca's suggestion reads much nicer though, and the time difference is negligible.
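As for the map_batches question: you generally don't want it here, because it leaves the expression engine and runs a Python-level loop over the column. For completeness, a rough (untested) sketch of what it could look like:

sample.with_columns(
    pl.col("a").map_batches(
        # The function receives the whole column as a pl.Series;
        # the Python-level comprehension is what makes this slow.
        lambda s: pl.Series(
            [f"{v / 1000} million" if v >= 1000 else f"{float(v)} thousand" for v in s]
        ),
        return_dtype=pl.String,
    )
)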