I have a problem I'm trying to solve but can't figure it out.
I have something similar to this table:
df = pl.from_repr("""
┌─────┬────────────┬───────┐
│ id ┆ date ┆ sales │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 │
╞═════╪════════════╪═══════╡
│ 1 ┆ 2023-01-01 ┆ 10.0 │
│ 1 ┆ 2023-02-01 ┆ 20.0 │
│ 1 ┆ 2023-03-01 ┆ 30.0 │
│ 1 ┆ 2023-05-01 ┆ 40.0 │
│ 2 ┆ 2023-02-01 ┆ 50.0 │
│ 2 ┆ 2023-03-01 ┆ 60.6 │
│ 2 ┆ 2023-04-01 ┆ 70.2 │
│ 3 ┆ 2023-01-01 ┆ 80.5 │
│ 3 ┆ 2023-02-01 ┆ 90.0 │
└─────┴────────────┴───────┘
""")
as you can see, for each id i have dates and sales. I want to get for every id all dates from the minimum date in the data frame up to the maximum(included).
in addition, i want to fill in the sales column with 0 and the id column with the matching id, so it looks like this:
┌─────┬────────────┬───────┐
│ id ┆ date ┆ sales │
│ --- ┆ --- ┆ --- │
│ i32 ┆ date ┆ f64 │
╞═════╪════════════╪═══════╡
│ 1 ┆ 2023-01-01 ┆ 10.0 │
│ 1 ┆ 2023-01-02 ┆ 20.0 │
│ 1 ┆ 2023-01-03 ┆ 30.0 │
│ 1 ┆ 2023-01-04 ┆ 0.0 │
│ 1 ┆ 2023-01-05 ┆ 40.0 │
│ 2 ┆ 2023-01-01 ┆ 0.0 │
│ 2 ┆ 2023-01-02 ┆ 50.0 │
│ 2 ┆ 2023-01-03 ┆ 60.6 │
│ 2 ┆ 2023-01-03 ┆ 70.2 │
│ 2 ┆ 2023-01-04 ┆ 0.0 │
│ 2 ┆ 2023-01-05 ┆ 0.0 │
│ 3 ┆ 2023-01-01 ┆ 80.5 │
│ 3 ┆ 2023-01-02 ┆ 90.0 │
│ 3 ┆ 2023-01-03 ┆ 0.0 │
│ 3 ┆ 2023-01-04 ┆ 0.0 │
│ 3 ┆ 2023-01-05 ┆ 0.0 │
└─────┴────────────┴───────┘
so on and so forth.
i've tried to create a new dataframe by using the pl.date_range
function and then to join it against the main data, by using outer or cross, but to no avail, since it doesn't compute against each id.
maybe you have any ideas on how to go about it?
many thanks in advance for any input!
Try:
def fn(x, r):
print(pl.DataFrame({"id": x["id"][0], "date": r}))
print(x)
return (
pl.DataFrame({"id": x["id"][0], "date": r})
.join(x, on="date", how="left")[["id", "date", "sales"]]
.with_columns(pl.col("sales").fill_null(strategy="zero"))
)
# convert to date
df = df.with_columns(pl.col("date").str.to_date("%Y-%d-%m"))
# get min, max date
mn, mx = df["date"].min(), df["date"].max()
# construct the range
r = pl.date_range(mn, mx, "1d", eager=True)
# group by "id" and fill the missing dates
df = df.group_by("id", maintain_order=True).map_groups(lambda x: fn(x, r))
with pl.Config(tbl_rows=-1):
print(df)
Prints:
shape: (15, 3)
┌─────┬─────────────────────┬───────┐
│ id ┆ date ┆ sales │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ f64 │
╞═════╪═════════════════════╪═══════╡
│ 1 ┆ 2023-01-01 00:00:00 ┆ 10.0 │
│ 1 ┆ 2023-01-02 00:00:00 ┆ 20.0 │
│ 1 ┆ 2023-01-03 00:00:00 ┆ 30.0 │
│ 1 ┆ 2023-01-04 00:00:00 ┆ 0.0 │
│ 1 ┆ 2023-01-05 00:00:00 ┆ 40.0 │
│ 2 ┆ 2023-01-01 00:00:00 ┆ 0.0 │
│ 2 ┆ 2023-01-02 00:00:00 ┆ 50.0 │
│ 2 ┆ 2023-01-03 00:00:00 ┆ 60.6 │
│ 2 ┆ 2023-01-04 00:00:00 ┆ 70.2 │
│ 2 ┆ 2023-01-05 00:00:00 ┆ 0.0 │
│ 3 ┆ 2023-01-01 00:00:00 ┆ 80.5 │
│ 3 ┆ 2023-01-02 00:00:00 ┆ 90.0 │
│ 3 ┆ 2023-01-03 00:00:00 ┆ 0.0 │
│ 3 ┆ 2023-01-04 00:00:00 ┆ 0.0 │
│ 3 ┆ 2023-01-05 00:00:00 ┆ 0.0 │
└─────┴─────────────────────┴───────┘