Search code examples
datepython-polarsrle

How do I identify consecutive/contiguous dates in polars


I'm trying to isolate runs of dates using polars. For this, I've been playing with .rolling, .rle, and .rle_id but can't seem to fit them together to make it work.

Given this:

df = pl.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-05", "2024-01-06", "2024-01-07"]
})
df = df.with_columns(pl.col("date").str.to_date())

I would like to have a 2nd column with the id of the run if they're contiguous:

[1, 1, 1, 2, 2, 2]

Solution

  • You could take the diff of the successive dates, identify those that are not 1 day and compute a cum_sum:

    (pl.DataFrame({'date': ['2024-01-01', '2024-01-02', '2024-01-03',
                            '2024-01-05', '2024-01-06', '2024-01-07']})
      .with_columns(pl.col('date').str.to_date())
      .with_columns(id=pl.col('date').diff().ne(pl.duration(days=1))
                         .fill_null(True).cum_sum())
    )
    

    Alternatively, generate a range of the step (with pl.duration), subtract it to the date and use rle_id:

    (pl.DataFrame({'date': ['2024-01-01', '2024-01-02', '2024-01-03',
                            '2024-01-05', '2024-01-06', '2024-01-07']})
      .with_columns(pl.col('date').str.to_date())
      .with_columns(id=(pl.col('date')
                       -pl.duration(days=pl.arange(0, pl.col('date').len()))
                       ).rle_id()+1)
    )
    

    Output:

    ┌────────────┬─────┐
    │ date       ┆ id  │
    │ ---        ┆ --- │
    │ date       ┆ u32 │
    ╞════════════╪═════╡
    │ 2024-01-01 ┆ 1   │
    │ 2024-01-02 ┆ 1   │
    │ 2024-01-03 ┆ 1   │
    │ 2024-01-05 ┆ 2   │
    │ 2024-01-06 ┆ 2   │
    │ 2024-01-07 ┆ 2   │
    └────────────┴─────┘