I am new to Polars and I am not sure whether I am using .with_columns() correctly.
Here's a situation I encounter frequently:
There's a dataframe and in .with_columns(), I apply some operation to a column. For example, I convert some dates from str to date type and then want to compute the duration between start and end date. I'd implement this as follows.
import polars as pl

pl.DataFrame(
    {
        "start": ["01.01.2019", "01.01.2020"],
        "end": ["11.01.2019", "01.05.2020"],
    }
).with_columns(
    pl.col("start").str.to_date(),
    pl.col("end").str.to_date(),
).with_columns(
    (pl.col("end") - pl.col("start")).alias("duration"),
)
First, I convert the two columns, then I call .with_columns() again.
Something shorter like this does not work:
pl.DataFrame(
    {
        "start": ["01.01.2019", "01.01.2020"],
        "end": ["11.01.2019", "01.05.2020"],
    }
).with_columns(
    pl.col("start").str.to_date(),
    pl.col("end").str.to_date(),
    (pl.col("end") - pl.col("start")).alias("duration"),
)
# InvalidOperationError: sub operation not supported for dtypes `str` and `str`
Is there a way to avoid calling .with_columns() twice and to write this in a more compact way?
The second .with_columns() is needed.
I don't want this extra complexity in polars. If you want to use an updated column, you need two with_columns calls. This makes it much more readable, simple, and explainable.
In the given example, passing multiple names to col() could simplify it slightly.
# df is the DataFrame from the question
(
    df.with_columns(pl.col("start", "end").str.to_date())
    .with_columns(duration=pl.col("end") - pl.col("start"))
)
shape: (2, 3)
┌────────────┬────────────┬──────────────┐
│ start      ┆ end        ┆ duration     │
│ ---        ┆ ---        ┆ ---          │
│ date       ┆ date       ┆ duration[ms] │
╞════════════╪════════════╪══════════════╡
│ 2019-01-01 ┆ 2019-01-11 ┆ 10d          │
│ 2020-01-01 ┆ 2020-05-01 ┆ 121d         │
└────────────┴────────────┴──────────────┘
└────────────┴────────────┴──────────────┘