Update: This issue has since been resolved. The output is a date
type as expected.
shape: (2, 3)
┌──────────┬────────────┬───────────────────────────┐
│ customer ┆ order_date ┆ startdate_before_override │
│ --- ┆ --- ┆ --- │
│ str ┆ null ┆ date │
╞══════════╪════════════╪═══════════════════════════╡
│ Mike ┆ null ┆ 2023-01-01 │
│ Jan ┆ null ┆ 2023-01-01 │
└──────────┴────────────┴───────────────────────────┘
Original question
I have a dataframe with customers and order dates. Order dates can be either a pl.Date or a null, but currently this column contains only null values. I want to create a new column, "startdate_before_override", which is set to either the "order_date" value of the same row or the "startdate" value (a constant of dtype pl.Date, declared at the third line).
I have tried doing this with the when/then/otherwise pattern, adding a check for nulls as the first "when" (edit: I see now that this check is actually not needed, but that's beside the point of my issue).
The problem I am trying to solve is that the values returned to startdate_before_override are pl.Float32
and not pl.Date
import polars as pl
startdate = pl.date(year=2023, month=1, day=1)
data = {"customer": ["Mike", "Jan"], "order_date": [None, None]}
df = pl.DataFrame(data)
(df
.with_columns(
pl
.when(pl.col("order_date").is_null())
.then(startdate)
.when(pl.col("order_date")>startdate)
.then(pl.col("order_date"))
.otherwise(startdate)
.alias("startdate_before_override"),
)
)
The output is not as I expect:
shape: (2, 3)
┌──────────┬────────────┬───────────────────────────┐
│ customer ┆ order_date ┆ startdate_before_override │
│ --- ┆ --- ┆ --- │
│ str ┆ f32 ┆ f32 │ # <- f32, why?
╞══════════╪════════════╪═══════════════════════════╡
│ Mike ┆ null ┆ 19358.0 │
│ Jan ┆ null ┆ 19358.0 │
└──────────┴────────────┴───────────────────────────┘
Since both rows have null values for the order_date col, I would expect the result to be 2023-01-01 in a pl.Date format.
In your example the order_date
column is all nulls and defaulting to the f32
dtype.
As your when/then can produce startdate
or order_date
, i.e. either an f32
or pl.Date
value, Polars calculates the "supertype", which in this case is f32
.
startdate
is being cast to a float, which is where 19358.0
is coming from.
>>> pl.select(startdate.cast(float))
shape: (1, 1)
┌─────────┐
│ date │
│ --- │
│ f64 │
╞═════════╡
│ 19358.0 │
└─────────┘
You can either explicitly set the type of the column to date
on creation, e.g. with schema=
/ schema_overrides=
df = pl.DataFrame(data, schema_overrides={"order_date": pl.Date})
df.with_columns(
pl.when(pl.col("order_date") > startdate)
.then(pl.col("order_date"))
.otherwise(startdate)
.alias("startdate_before_override"),
)
Or you can explicitly .cast()
the result of the .when().then()
df.with_columns(
pl.when(pl.col("order_date") > startdate)
.then(pl.col("order_date"))
.otherwise(startdate)
.cast(pl.Date)
.alias("startdate_before_override"),
)
shape: (2, 3)
┌──────────┬────────────┬───────────────────────────┐
│ customer ┆ order_date ┆ startdate_before_override │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ date │
╞══════════╪════════════╪═══════════════════════════╡
│ Mike ┆ null ┆ 2023-01-01 │
│ Jan ┆ null ┆ 2023-01-01 │
└──────────┴────────────┴───────────────────────────┘