Search code examples
pythondataframepython-polars

polars - Fill null over Groups


I am trying to fill null timestamps over groups, my dataframe looks like this

df = pl.from_repr("""
┌───────────────────────────────────┬──────────────────────────────────┬───────┐
│ start                             ┆ stop                             ┆ group │
│ ---                               ┆ ---                              ┆ ---   │
│ str                               ┆ str                              ┆ i64   │
╞═══════════════════════════════════╪══════════════════════════════════╪═══════╡
│ 2021-12-08 06:40:53.734941+01:00  ┆ 2022-05-16 10:16:18.717146+02:00 ┆ 1     │
│ 2021-12-08 06:40:55.191598+01:00  ┆ null                             ┆ 1     │
│ 2021-12-08 10:39:12.421402+01:00  ┆ 2022-05-16 10:16:19.816922+02:00 ┆ 2     │
│ 2021-12-08 10:39:12.634873+01:00  ┆ 2022-05-16 10:16:19.817304+02:00 ┆ 1     │
│ 2021-12-08 10:49:47.392815+01:00  ┆ 2022-05-16 10:16:20.178050+02:00 ┆ 5     │
└───────────────────────────────────┴──────────────────────────────────┴───────┘
""")

The stop timestamps should be imputed from the next start.

My current implantation is:

df = df.sort("start").with_columns(
    pl.col("start").fill_null(strategy="backward").over("group")
)

df = df.group_by("group").map_groups(
    lambda df: df.with_columns(
        pl.col("stop").fill_null(
            pl.col("start").shift(-1)
        )
    )
)

This current implementation is slow. Is there a faster way to solve this?

Thanks in advance


Solution

  • You can use pl.coalesce().

    Here is an example:

    df.with_columns(
        stop = pl.coalesce(
            pl.col('stop'), pl.col('start').shift(-1).over('group')
        )
    )
    
    shape: (5, 4)
    ┌─────┬──────────────────────────────────┬──────────────────────────────────┬───────┐
    │     ┆ start                            ┆ stop                             ┆ group │
    │ --- ┆ ---                              ┆ ---                              ┆ ---   │
    │ i64 ┆ str                              ┆ str                              ┆ i64   │
    ╞═════╪══════════════════════════════════╪══════════════════════════════════╪═══════╡
    │ 0   ┆ 2021-12-08 06:40:53.734941+01:00 ┆ 2022-05-16 10:16:18.717146+02:00 ┆ 1     │
    │ 1   ┆ 2021-12-08 06:40:55.191598+01:00 ┆ 2021-12-08 10:39:12.634873+01:00 ┆ 1     │
    │ 2   ┆ 2021-12-08 10:39:12.421402+01:00 ┆ 2022-05-16 10:16:19.816922+02:00 ┆ 2     │
    │ 3   ┆ 2021-12-08 10:39:12.634873+01:00 ┆ 2022-05-16 10:16:19.817304+02:00 ┆ 1     │
    │ 4   ┆ 2021-12-08 10:49:47.392815+01:00 ┆ 2022-05-16 10:16:20.178050+02:00 ┆ 5     │
    └─────┴──────────────────────────────────┴──────────────────────────────────┴───────┘
    
    

    In the example I used .over() so that it finds the next start within the same group.