Search code examples
python-polars

Get the first, last occurrence in a time series df in polars


How do I get the first, last occurrence (>0) of a particular column in a time series df,

df = pl.from_repr("""
┌────────────┬────────────┬────────────┐
│ date       ┆ column_one ┆ column_two │
│ ---        ┆ ---        ┆ ---        │
│ date       ┆ f64        ┆ i64        │
╞════════════╪════════════╪════════════╡
│ 2024-06-01 ┆ 0.0        ┆ 0          │
│ 2024-06-02 ┆ 0.0        ┆ 1          │
│ 2024-06-03 ┆ 1.0        ┆ 2          │
│ 2024-06-04 ┆ 1.2        ┆ 3          │
└────────────┴────────────┴────────────┘
""")

Desired result:

shape: (2, 3)
┌────────────┬──────────────────┬─────────────────┐
│ columns    ┆ first_appearance ┆ last_appearance │
│ ---        ┆ ---              ┆ ---             │
│ str        ┆ date             ┆ date            │
╞════════════╪══════════════════╪═════════════════╡
│ column_one ┆ 2024-06-03       ┆ 2024-06-04      │
│ column_two ┆ 2024-06-02       ┆ 2024-06-04      │
└────────────┴──────────────────┴─────────────────┘

Solution

  • The desired result can be obtained by first using pl.DataFrame.unpivot to obtain a long frame with columns date, columns, and value. Then, grouping by columns gives first and last appearance > 0 within each time-series.

    (
        df
        .unpivot(index="date", variable_name="columns")
        .group_by("columns")
        .agg(
            pl.col("date").filter(pl.col("value") > 0).min().alias("first_appearance"),
            pl.col("date").filter(pl.col("value") > 0).max().alias("last_appearance"),
        )
    )
    
    shape: (2, 3)
    ┌────────────┬──────────────────┬─────────────────┐
    │ columns    ┆ first_appearance ┆ last_appearance │
    │ ---        ┆ ---              ┆ ---             │
    │ str        ┆ date             ┆ date            │
    ╞════════════╪══════════════════╪═════════════════╡
    │ column_one ┆ 2024-06-03       ┆ 2024-06-04      │
    │ column_two ┆ 2024-06-02       ┆ 2024-06-04      │
    └────────────┴──────────────────┴─────────────────┘
    

    Note. If the dataframe is sorted by date, then pl.Expr.first / pl.Expr.last can be used instead of pl.Expr.min / pl.Expr.max. Moreover, the two filter expressions within the aggrecation could be replaced by a single pl.DataFrame.filter before the aggregation.