I am familiar with fill missing data within Polars and Pandas as in the below quick example:
df = pl.DataFrame({"date": [dt.datetime(2023, 8, 2), dt.datetime(2023, 8, 3), dt.datetime(2023, 8, 4)], "b": ["test", "test", "test2"], "a": [1, None, 3]})
date | b | a |
---|---|---|
2023-08-02 00:00:00 | test | 1 |
2023-08-03 00:00:00 | test | null |
2023-08-04 00:00:00 | test2 | 3 |
df = df.with_columns(pl.col('a').fill_null(pl.lit(0)))
date | b | a |
---|---|---|
2023-08-02 00:00:00 | test | 1 |
2023-08-03 00:00:00 | test | 0 |
2023-08-04 00:00:00 | test2 | 3 |
I know there are other ways than the above such as filling with the previous group value of the group; however, what I cannot seem to find is a means to fill edge data such as in the above.
So if the above was collected on August 10th then dates August 5th to 10th are missing. I would like to fill in those dates for each group with zeroes (0).
Desired results:
date | b | a |
---|---|---|
2023-08-02 00:00:00 | test | 1 |
2023-08-03 00:00:00 | test | 0 |
2023-08-04 00:00:00 | test2 | 3 |
2023-08-05 00:00:00 | test | 0 |
2023-08-05 00:00:00 | test2 | 0 |
2023-08-06 00:00:00 | test | 0 |
2023-08-06 00:00:00 | test2 | 0 |
2023-08-07 00:00:00 | test | 0 |
2023-08-07 00:00:00 | test2 | 0 |
2023-08-08 00:00:00 | test | 0 |
2023-08-08 00:00:00 | test2 | 0 |
2023-08-09 00:00:00 | test | 0 |
2023-08-09 00:00:00 | test2 | 0 |
2023-08-10 00:00:00 | test | 0 |
2023-08-10 00:00:00 | test2 | 0 |
An outer join is probably best:
df = pl.DataFrame(
{
"date": [date(2023, 8, 2), date(2023, 8, 3), date(2023, 8, 4)],
"b": ["test", "test", "test2"],
"a": [1, None, 3],
}
)
df2 = pl.DataFrame(
{
"date": np.tile(
pl.date_range(date(2023, 8, 2), date(2023, 8, 10), eager=True), 2
),
'b': np.tile(['test', 'test2'], 9),
}
)
df.join(df2, on=['date', 'b'], how='outer').fill_null(0)