Search code examples
datepython-polarsconsecutive-months

Select the first rows with consecutive dates in Polars


I have a Polars DataFrame containing a monthly dates and a values column. I would like to select only the first rows where the months are consecutive, dropping the rest of the rows. For example, in the table that would be the first 5 rows. I tried to use .group_by_dynamic without success. Any ideas how to do this? Thank you!

from datetime import datetime
df = pl.DataFrame(
    {
        "time": pl.datetime_range(
            start=datetime(2020, 12, 1),
            end=datetime(2023, 12, 1),
            interval="1mo",
            eager=True,
        ),
        "n": range(37),
    }
)

df=df.sample(n=10, seed=0)  
df=df.sort(["time"], descending=False)
df
time:datetime[μs] n
2021-01-01 00:00:00 1
2021-02-01 00:00:00 2
2021-03-01 00:00:00 3
2021-04-01 00:00:00 4
2021-05-01 00:00:00 5
2022-03-01 00:00:00 15
2022-07-01 00:00:00 19
2023-04-01 00:00:00 28
2023-05-01 00:00:00 29
2023-06-01 00:00:00 30
2023-11-01 00:00:00 35

Solution

  • One possibility would be to filter with a boolean expression crafter from Expr.diff and cum_min:

    df.filter(pl.col('n').diff().eq(1).fill_null(True).cast(int).cum_min().cast(bool))
    

    For year-month you can do the same, building 12*year+month to be able to compute the diff:

    out = df.filter((pl.col('time').dt.year()*12+pl.col('time').dt.month()).diff()
                       .eq(1).fill_null(True).cast(int).cum_min().cast(bool)
                   )
    

    Output:

    shape: (5, 2)
    ┌─────────────────────┬─────┐
    │ time                ┆ n   │
    │ ---                 ┆ --- │
    │ datetime[μs]        ┆ i64 │
    ╞═════════════════════╪═════╡
    │ 2021-01-01 00:00:00 ┆ 1   │
    │ 2021-02-01 00:00:00 ┆ 2   │
    │ 2021-03-01 00:00:00 ┆ 3   │
    │ 2021-04-01 00:00:00 ┆ 4   │
    │ 2021-05-01 00:00:00 ┆ 5   │
    └─────────────────────┴─────┘