
Access newly created column in .with_columns() when using polars


I am new to Polars and I am not sure whether I am using .with_columns() correctly.

Here's a situation I encounter frequently: I have a DataFrame and apply some operation to a column inside .with_columns(). For example, I convert date strings to the Date type and then want to compute the duration between the start and end dates. I'd implement this as follows.

import polars as pl

pl.DataFrame(
    {
        "start": ["01.01.2019", "01.01.2020"],
        "end": ["11.01.2019", "01.05.2020"],
    }
).with_columns(
    [
        pl.col("start").str.to_date(),
        pl.col("end").str.to_date(),
    ]
).with_columns(
    [
        (pl.col("end") - pl.col("start")).alias("duration"),
    ]
)

First I convert the two columns, then I call .with_columns() again to compute the duration.

Something shorter like this does not work:

pl.DataFrame(
    {
        "start": ["01.01.2019", "01.01.2020"],
        "end": ["11.01.2019", "01.05.2020"],
    }
).with_columns(
    [
        pl.col("start").str.to_date(),
        pl.col("end").str.to_date(),
        (pl.col("end") - pl.col("start")).alias("duration"),
    ]
)
# InvalidOperationError: sub operation not supported for dtypes `str` and `str`

Is there a way to avoid calling .with_columns() twice and to write this in a more compact way?


Solution

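  • The second .with_columns() is needed. All expressions passed to a single .with_columns() call are evaluated against the DataFrame as it was before the call, so pl.col("start") and pl.col("end") in the subtraction still refer to the original string columns.

    From the Polars GitHub issues:

    "I don't want this extra complexity in polars. If you want to use an updated column, you need two with_columns. This makes it much more readable, simple, and explainable."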

    In the given example, passing multiple names to pl.col() simplifies the code slightly (df is the DataFrame from the question):

    (df.with_columns(pl.col("start", "end").str.to_date())
       .with_columns(duration = pl.col("end") - pl.col("start"))
    )
    
    shape: (2, 3)
    ┌────────────┬────────────┬──────────────┐
    │ start      ┆ end        ┆ duration     │
    │ ---        ┆ ---        ┆ ---          │
    │ date       ┆ date       ┆ duration[ms] │
    ╞════════════╪════════════╪══════════════╡
    │ 2019-01-01 ┆ 2019-01-11 ┆ 10d          │
    │ 2020-01-01 ┆ 2020-05-01 ┆ 121d         │
    └────────────┴────────────┴──────────────┘