I have a Polars DataFrame containing a monthly dates and a values column. I would like to select only the first rows where the months are consecutive, dropping the rest of the rows. For example, in the table that would be the first 5 rows. I tried to use .group_by_dynamic without success. Any ideas how to do this? Thank you!
from datetime import datetime
df = pl.DataFrame(
{
"time": pl.datetime_range(
start=datetime(2020, 12, 1),
end=datetime(2023, 12, 1),
interval="1mo",
eager=True,
),
"n": range(37),
}
)
df=df.sample(n=10, seed=0)
df=df.sort(["time"], descending=False)
df
time:datetime[μs] | n |
---|---|
2021-01-01 00:00:00 | 1 |
2021-02-01 00:00:00 | 2 |
2021-03-01 00:00:00 | 3 |
2021-04-01 00:00:00 | 4 |
2021-05-01 00:00:00 | 5 |
2022-03-01 00:00:00 | 15 |
2022-07-01 00:00:00 | 19 |
2023-04-01 00:00:00 | 28 |
2023-05-01 00:00:00 | 29 |
2023-06-01 00:00:00 | 30 |
2023-11-01 00:00:00 | 35 |
One possibility would be to filter
with a boolean expression crafter from Expr.diff
and cum_min
:
df.filter(pl.col('n').diff().eq(1).fill_null(True).cast(int).cum_min().cast(bool))
For year-month you can do the same, building 12*year+month
to be able to compute the diff
:
out = df.filter((pl.col('time').dt.year()*12+pl.col('time').dt.month()).diff()
.eq(1).fill_null(True).cast(int).cum_min().cast(bool)
)
Output:
shape: (5, 2)
┌─────────────────────┬─────┐
│ time ┆ n │
│ --- ┆ --- │
│ datetime[μs] ┆ i64 │
╞═════════════════════╪═════╡
│ 2021-01-01 00:00:00 ┆ 1 │
│ 2021-02-01 00:00:00 ┆ 2 │
│ 2021-03-01 00:00:00 ┆ 3 │
│ 2021-04-01 00:00:00 ┆ 4 │
│ 2021-05-01 00:00:00 ┆ 5 │
└─────────────────────┴─────┘