I am trying to fill null timestamps over groups, my dataframe looks like this
df = pl.from_repr("""
┌───────────────────────────────────┬──────────────────────────────────┬───────┐
│ start ┆ stop ┆ group │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═══════════════════════════════════╪══════════════════════════════════╪═══════╡
│ 2021-12-08 06:40:53.734941+01:00 ┆ 2022-05-16 10:16:18.717146+02:00 ┆ 1 │
│ 2021-12-08 06:40:55.191598+01:00 ┆ null ┆ 1 │
│ 2021-12-08 10:39:12.421402+01:00 ┆ 2022-05-16 10:16:19.816922+02:00 ┆ 2 │
│ 2021-12-08 10:39:12.634873+01:00 ┆ 2022-05-16 10:16:19.817304+02:00 ┆ 1 │
│ 2021-12-08 10:49:47.392815+01:00 ┆ 2022-05-16 10:16:20.178050+02:00 ┆ 5 │
└───────────────────────────────────┴──────────────────────────────────┴───────┘
""")
The stop timestamps should be imputed from the next start.
My current implantation is:
df = df.sort("start").with_columns(
pl.col("start").fill_null(strategy="backward").over("group")
)
df = df.group_by("group").map_groups(
lambda df: df.with_columns(
pl.col("stop").fill_null(
pl.col("start").shift(-1)
)
)
)
This current implementation is slow. Is there a faster way to solve this?
Thanks in advance
You can use pl.coalesce()
.
Here is an example:
df.with_columns(
stop = pl.coalesce(
pl.col('stop'), pl.col('start').shift(-1).over('group')
)
)
shape: (5, 4)
┌─────┬──────────────────────────────────┬──────────────────────────────────┬───────┐
│ ┆ start ┆ stop ┆ group │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════╪══════════════════════════════════╪══════════════════════════════════╪═══════╡
│ 0 ┆ 2021-12-08 06:40:53.734941+01:00 ┆ 2022-05-16 10:16:18.717146+02:00 ┆ 1 │
│ 1 ┆ 2021-12-08 06:40:55.191598+01:00 ┆ 2021-12-08 10:39:12.634873+01:00 ┆ 1 │
│ 2 ┆ 2021-12-08 10:39:12.421402+01:00 ┆ 2022-05-16 10:16:19.816922+02:00 ┆ 2 │
│ 3 ┆ 2021-12-08 10:39:12.634873+01:00 ┆ 2022-05-16 10:16:19.817304+02:00 ┆ 1 │
│ 4 ┆ 2021-12-08 10:49:47.392815+01:00 ┆ 2022-05-16 10:16:20.178050+02:00 ┆ 5 │
└─────┴──────────────────────────────────┴──────────────────────────────────┴───────┘
In the example I used .over() so that it finds the next start within the same group.