How do I get the first, last occurrence (>0) of a particular column in a time series df,
df = pl.from_repr("""
┌────────────┬────────────┬────────────┐
│ date ┆ column_one ┆ column_two │
│ --- ┆ --- ┆ --- │
│ date ┆ f64 ┆ i64 │
╞════════════╪════════════╪════════════╡
│ 2024-06-01 ┆ 0.0 ┆ 0 │
│ 2024-06-02 ┆ 0.0 ┆ 1 │
│ 2024-06-03 ┆ 1.0 ┆ 2 │
│ 2024-06-04 ┆ 1.2 ┆ 3 │
└────────────┴────────────┴────────────┘
""")
Desired result:
shape: (2, 3)
┌────────────┬──────────────────┬─────────────────┐
│ columns ┆ first_appearance ┆ last_appearance │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ date │
╞════════════╪══════════════════╪═════════════════╡
│ column_one ┆ 2024-06-03 ┆ 2024-06-04 │
│ column_two ┆ 2024-06-02 ┆ 2024-06-04 │
└────────────┴──────────────────┴─────────────────┘
The desired result can be obtained by first using pl.DataFrame.unpivot
to obtain a long frame with columns date
, columns
, and value
. Then, grouping by columns
gives first and last appearance > 0 within each time-series.
(
df
.unpivot(index="date", variable_name="columns")
.group_by("columns")
.agg(
pl.col("date").filter(pl.col("value") > 0).min().alias("first_appearance"),
pl.col("date").filter(pl.col("value") > 0).max().alias("last_appearance"),
)
)
shape: (2, 3)
┌────────────┬──────────────────┬─────────────────┐
│ columns ┆ first_appearance ┆ last_appearance │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ date │
╞════════════╪══════════════════╪═════════════════╡
│ column_one ┆ 2024-06-03 ┆ 2024-06-04 │
│ column_two ┆ 2024-06-02 ┆ 2024-06-04 │
└────────────┴──────────────────┴─────────────────┘
Note. If the dataframe is sorted by date
, then pl.Expr.first
/ pl.Expr.last
can be used instead of pl.Expr.min
/ pl.Expr.max
. Moreover, the two filter expressions within the aggrecation could be replaced by a single pl.DataFrame.filter
before the aggregation.