I am using polars with python and have this sample dataset:
df = pl.from_repr("""
┌─────────────────────┬────────────┬────────┐
│ Timestamps ┆ Date ┆ Values │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ date ┆ i64 │
╞═════════════════════╪════════════╪════════╡
│ 2022-10-01 04:00:00 ┆ 2022-10-01 ┆ 1 │
│ 2022-10-01 06:00:00 ┆ 2022-10-01 ┆ 2 │
│ 2022-10-01 08:30:00 ┆ 2022-10-01 ┆ 2 │
│ 2022-12-02 09:00:00 ┆ 2022-12-02 ┆ 5 │
│ 2022-12-02 09:15:00 ┆ 2022-12-02 ┆ 4 │
│ 2022-12-02 09:30:00 ┆ 2022-12-02 ┆ 3 │
│ 2022-12-02 09:45:00 ┆ 2022-12-02 ┆ 2 │
│ 2022-12-02 10:00:00 ┆ 2022-12-02 ┆ 1 │
│ 2022-12-02 10:15:00 ┆ 2022-12-02 ┆ 3 │
└─────────────────────┴────────────┴────────┘
""")
I would like to get the max and min value for every single Date together with the corresponding timestamp. The result for e.g. the max value should look like this:
┌────────────┬────────┬─────────────────────┐
│ Date ┆ Values ┆ Timestamps │
│ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ datetime[μs] │
╞════════════╪════════╪═════════════════════╡
│ 2022-12-02 ┆ 5 ┆ 2022-12-02 09:00:00 │
│ 2022-10-01 ┆ 2 ┆ 2022-10-01 06:00:00 │
└────────────┴────────┴─────────────────────┘
Currently I've got this line but I can't figure out how to extract a value (timestamp) out of a column next to the aggregated value:
df.group_by('Date').agg(pl.max('Values', 'Timestamps')).sort("Date", descending=True)
Let's say we start with...
dataset = pl.DataFrame(
    {
        "Timestamps": [
            "2022-10-01 06:00:00", "2022-10-01 08:30:00",
            "2022-12-02 09:00:00", "2022-12-02 09:15:00",
            "2022-12-02 09:30:00", "2022-12-02 09:45:00",
            "2022-12-02 10:00:00", "2022-12-02 10:15:00",
        ],
        "Date": [
            "2022-10-01", "2022-10-01",
            "2022-12-02", "2022-12-02", "2022-12-02",
            "2022-12-02", "2022-12-02", "2022-12-02",
        ],
        "Values": [2, 2, 1, 4, 3, 2, 1, 3],
    }
)
qa = dataset.lazy().with_columns(pl.col("Timestamps").str.to_datetime(), pl.col("Date").str.to_date())
df1 = qa.collect()
shape: (8, 3)
┌─────────────────────┬────────────┬────────┐
│ Timestamps ┆ Date ┆ Values │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ date ┆ i64 │
╞═════════════════════╪════════════╪════════╡
│ 2022-10-01 06:00:00 ┆ 2022-10-01 ┆ 2 │
│ 2022-10-01 08:30:00 ┆ 2022-10-01 ┆ 2 │
│ 2022-12-02 09:00:00 ┆ 2022-12-02 ┆ 1 │
│ 2022-12-02 09:15:00 ┆ 2022-12-02 ┆ 4 │
│ 2022-12-02 09:30:00 ┆ 2022-12-02 ┆ 3 │
│ 2022-12-02 09:45:00 ┆ 2022-12-02 ┆ 2 │
│ 2022-12-02 10:00:00 ┆ 2022-12-02 ┆ 1 │
│ 2022-12-02 10:15:00 ┆ 2022-12-02 ┆ 3 │
└─────────────────────┴────────────┴────────┘
In this case we see that on 10-01 there's a tie for Values and on 12-02 the highest Values corresponds to 9:15.
You could try two approaches that start the same...
(
    df1
    .group_by('Date')
    .agg(pl.col('Values').max())
    .join(df1, on=['Date', 'Values'])
    .sort('Timestamps')
    .unique(subset=['Date', 'Values'], keep='first')  # keep='first' so the sort decides ties
    .sort('Date', descending=True)
)
or
(
    df1
    .group_by('Date')
    .agg(pl.col('Values').max())
    .join(df1, on=['Date', 'Values'])
    .group_by(['Date', 'Values'])
    .agg(pl.col('Timestamps').min())
    .sort('Date', descending=True)
)
I'm not sure which is more efficient between the sort/unique and the second group_by/agg, but in either case you'd get:
shape: (2, 3)
┌────────────┬────────┬─────────────────────┐
│ Date ┆ Values ┆ Timestamps │
│ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ datetime[μs] │
╞════════════╪════════╪═════════════════════╡
│ 2022-12-02 ┆ 4 ┆ 2022-12-02 09:15:00 │
│ 2022-10-01 ┆ 2 ┆ 2022-10-01 06:00:00 │
└────────────┴────────┴─────────────────────┘