Search code examples
pythonpython-polars

How do I get corresponding values after group_by and agg


I am using polars with python and have this sample dataset:

df = pl.from_repr("""
┌─────────────────────┬────────────┬────────┐
│ Timestamps          ┆ Date       ┆ Values │
│ ---                 ┆ ---        ┆ ---    │
│ datetime[μs]        ┆ date       ┆ i64    │
╞═════════════════════╪════════════╪════════╡
│ 2022-10-01 04:00:00 ┆ 2022-10-01 ┆ 1      │
│ 2022-10-01 06:00:00 ┆ 2022-10-01 ┆ 2      │
│ 2022-10-01 08:30:00 ┆ 2022-10-01 ┆ 2      │
│ 2022-12-02 09:00:00 ┆ 2022-12-02 ┆ 5      │
│ 2022-12-02 09:15:00 ┆ 2022-12-02 ┆ 4      │
│ 2022-12-02 09:30:00 ┆ 2022-12-02 ┆ 3      │
│ 2022-12-02 09:45:00 ┆ 2022-12-02 ┆ 2      │
│ 2022-12-02 10:00:00 ┆ 2022-12-02 ┆ 1      │
│ 2022-12-02 10:15:00 ┆ 2022-12-02 ┆ 3      │
└─────────────────────┴────────────┴────────┘
""")

I would like to get the max and min value for every single Date together with the corresponding timestamp. The result for f.e the max value should look like this:

┌────────────┬────────┬─────────────────────┐
│ Date       ┆ Values ┆ Timestamps          │
│ ---        ┆ ---    ┆ ---                 │
│ date       ┆ i64    ┆ datetime[μs]        │
╞════════════╪════════╪═════════════════════╡
│ 2022-12-02 ┆ 5      ┆ 2022-12-02 09:00:00 │
│ 2022-10-01 ┆ 2      ┆ 2022-10-01 06:00:00 │
└────────────┴────────┴─────────────────────┘

Currently I've got this line but I can't figure out how to extract a value (timestamp) out of a column next to the aggregated value:

df.group_by('Date').agg(pl.max('Values', 'Timestamps')).sort("Date", descending=True)

Solution

  • Let's say we start with...

    dataset = pl.DataFrame(
        {
            "Timestamps": ["2022-10-01 06:00:00", "2022-10-01 08:30:00", "2022-12-02 09:00:00", "2022-12-02 09:15:00", "2022-12-02 09:30:00", "2022-12-02 09:45:00", "2022-12-02 10:00:00", "2022-12-02 10:15:00"],
            "Date": ["2022-10-01", "2022-10-01", "2022-12-02", "2022-12-02", "2022-12-02", "2022-12-02", "2022-12-02", "2022-12-02"],
            "Values": [2, 2, 1, 4, 3, 2, 1, 3],
        }
    )
    qa = dataset.lazy().with_columns(pl.col("Timestamps").str.to_datetime(), pl.col("Date").str.to_date())
    df1 = qa.collect()
    
    
    shape: (8, 3)
    ┌─────────────────────┬────────────┬────────┐
    │ Timestamps          ┆ Date       ┆ Values │
    │ ---                 ┆ ---        ┆ ---    │
    │ datetime[μs]        ┆ date       ┆ i64    │
    ╞═════════════════════╪════════════╪════════╡
    │ 2022-10-01 06:00:00 ┆ 2022-10-01 ┆ 2      │
    │ 2022-10-01 08:30:00 ┆ 2022-10-01 ┆ 2      │
    │ 2022-12-02 09:00:00 ┆ 2022-12-02 ┆ 1      │
    │ 2022-12-02 09:15:00 ┆ 2022-12-02 ┆ 4      │
    │ 2022-12-02 09:30:00 ┆ 2022-12-02 ┆ 3      │
    │ 2022-12-02 09:45:00 ┆ 2022-12-02 ┆ 2      │
    │ 2022-12-02 10:00:00 ┆ 2022-12-02 ┆ 1      │
    │ 2022-12-02 10:15:00 ┆ 2022-12-02 ┆ 3      │
    └─────────────────────┴────────────┴────────┘
    

    In this case we see that on 10-01 there's a tie for Values and on 12-02 the highest Values corresponds to 9:15.

    You could try two approaches that start the same...

    (
        df1
            .group_by('Date')
            .agg(pl.col('Values').max())
            .join(df1, on=['Date','Values'])
            .sort('Timestamps')
            .unique(subset=['Date','Values'])
            .sort("Date", descending=True)
    )
    

    or

    (
        df1
            .group_by('Date')
            .agg(pl.col('Values').max())
            .join(df1, on=['Date','Values'])
            .group_by(['Date','Values'])
            .agg(pl.col('Timestamps').min())
            .sort("Date", descending=True)
    
    )
    

    I'm not which is more efficient between the sort/unique or the second groupby/agg but in either case you'd get:

    shape: (2, 3)
    ┌────────────┬────────┬─────────────────────┐
    │ Date       ┆ Values ┆ Timestamps          │
    │ ---        ┆ ---    ┆ ---                 │
    │ date       ┆ i64    ┆ datetime[μs]        │
    ╞════════════╪════════╪═════════════════════╡
    │ 2022-10-01 ┆ 2      ┆ 2022-10-01 06:00:00 │
    │ 2022-12-02 ┆ 4      ┆ 2022-12-02 09:15:00 │
    └────────────┴────────┴─────────────────────┘