Search code examples
pythonpython-polars

polars equivalent to groupby.last


Say i have a polars dataframe:

import polars as pl 

df = pl.DataFrame({'index': [1,2,3,2,1],
                   'object': [1, 1, 1, 2, 2], 
                   'period': [1, 2, 4, 4, 23],
                   'value': [24, 67, 89, 5, 23],
})

How would i get dict of index -> to last value?

df.col('value').last().over('index').alias('last') is the last value but that requires a lot of extra computation and more work to get to the key value pairs.


Solution

  • The over function will keep all rows, which is probably not what you want. An easy way to get just the last value for index is to use unique.

    (
        df
        .select("index", "value")
        .unique(subset="index", keep="last")
    )
    
    shape: (3, 2)
    ┌───────┬───────┐
    │ index ┆ value │
    │ ---   ┆ ---   │
    │ i64   ┆ i64   │
    ╞═══════╪═══════╡
    │ 1     ┆ 23    │
    │ 2     ┆ 5     │
    │ 3     ┆ 89    │
    └───────┴───────┘
    

    From this point, you can use the to_dicts method to convert the DataFrame to a list of dictionaries.

    last_values = (
        df
        .select("index", "value")
        .unique(subset="index", keep="last")
        .to_dicts()
    )
    last_values
    
    [{'index': 1, 'value': 23}, {'index': 2, 'value': 5}, {'index': 3, 'value': 89}]
    

    If you are looking to later import this into a DataFrame, you'll want to stop at this point. For example:

    pl.DataFrame(last_values)
    
    >>> pl.DataFrame(last_values)
    shape: (3, 2)
    ┌───────┬───────┐
    │ index ┆ value │
    │ ---   ┆ ---   │
    │ i64   ┆ i64   │
    ╞═══════╪═══════╡
    │ 1     ┆ 23    │
    │ 2     ┆ 5     │
    │ 3     ┆ 89    │
    └───────┴───────┘
    

    However, if you want to collapse this into a single dictionary of index:value pairs, you can use a dictionary comprehension.

    
    {
        next_dict["index"]: next_dict["value"]
        for next_dict in last_values
    }
    
    {1: 23, 2: 5, 3: 89}
    

    Edit: Updating based on date

    Let's assume that we have this data:

    import polars as pl
    import datetime
    
    df = pl.DataFrame({
        "index": [1, 2, 3],
        "value": [10, 20, 30],
    }).join(
        pl.DataFrame({
            'date': pl.date_range(datetime.date(2021, 1, 1), datetime.date(2023, 1, 1), "1y", eager=True)
        }),
        how="cross"
    )
    df
    
    shape: (9, 3)
    ┌───────┬───────┬────────────┐
    │ index ┆ value ┆ date       │
    │ ---   ┆ ---   ┆ ---        │
    │ i64   ┆ i64   ┆ date       │
    ╞═══════╪═══════╪════════════╡
    │ 1     ┆ 10    ┆ 2021-01-01 │
    │ 1     ┆ 10    ┆ 2022-01-01 │
    │ 1     ┆ 10    ┆ 2023-01-01 │
    │ 2     ┆ 20    ┆ 2021-01-01 │
    │ 2     ┆ 20    ┆ 2022-01-01 │
    │ 2     ┆ 20    ┆ 2023-01-01 │
    │ 3     ┆ 30    ┆ 2021-01-01 │
    │ 3     ┆ 30    ┆ 2022-01-01 │
    │ 3     ┆ 30    ┆ 2023-01-01 │
    └───────┴───────┴────────────┘
    

    And we have these values that we want to update.

    update_df = pl.DataFrame({
        "index": [2, 3],
        "value": [200, 300],
    })
    update_df
    
    shape: (2, 2)
    ┌───────┬───────┐
    │ index ┆ value │
    │ ---   ┆ ---   │
    │ i64   ┆ i64   │
    ╞═══════╪═══════╡
    │ 2     ┆ 200   │
    │ 3     ┆ 300   │
    └───────┴───────┘
    

    Note: I've purposely left out index "1" (to show what will happen).

    If we want to update the value associated with each index, but only beyond a certain date, we can use a join_asof.

    Since this is an advanced method, we'll take it in steps.

    We'll add the current_date to the update_df as a literal. (The same value for all rows.)

    We also need to make sure both our DataFrames are sorted by the "as_of" column (date, not index). (update_df will already be sorted because its the same date on each row.)

    I'll also sort after the join_asof so that we can see what is happening more clearly. (You don't need to do this step.)

    current_date = datetime.date(2022, 1, 1)
    (
        df
        .sort("date")
        .rename({"value": "prev_value"})
        .join_asof(
            update_df.with_columns(pl.lit(current_date).alias("date")),
            on="date",
            by=["index"],
            strategy="backward"
        )
        .sort("index", "date")
    )
    
    shape: (9, 4)
    ┌───────┬────────────┬────────────┬───────┐
    │ index ┆ prev_value ┆ date       ┆ value │
    │ ---   ┆ ---        ┆ ---        ┆ ---   │
    │ i64   ┆ i64        ┆ date       ┆ i64   │
    ╞═══════╪════════════╪════════════╪═══════╡
    │ 1     ┆ 10         ┆ 2021-01-01 ┆ null  │
    │ 1     ┆ 10         ┆ 2022-01-01 ┆ null  │
    │ 1     ┆ 10         ┆ 2023-01-01 ┆ null  │
    │ 2     ┆ 20         ┆ 2021-01-01 ┆ null  │
    │ 2     ┆ 20         ┆ 2022-01-01 ┆ 200   │
    │ 2     ┆ 20         ┆ 2023-01-01 ┆ 200   │
    │ 3     ┆ 30         ┆ 2021-01-01 ┆ null  │
    │ 3     ┆ 30         ┆ 2022-01-01 ┆ 300   │
    │ 3     ┆ 30         ┆ 2023-01-01 ┆ 300   │
    └───────┴────────────┴────────────┴───────┘
    

    Notice that only those rows with a date >= 2022-01-01 have a non-null value for value. (I'll show how to do a > 2022-01-01 at the end.)

    Next we'll use fill_null to fill the null values in value with the prev_value column.

    current_date = datetime.date(2022, 1, 1)
    (
        df
        .sort("date")
        .rename({"value": "prev_value"})
        .join_asof(
            update_df.with_columns(pl.lit(current_date).alias("date")),
            on="date",
            by=["index"],
            strategy="backward"
        )
        .sort("index", "date")
        .with_columns(pl.col("value").fill_null(pl.col("prev_value")))
    )
    
    shape: (9, 4)
    ┌───────┬────────────┬────────────┬───────┐
    │ index ┆ prev_value ┆ date       ┆ value │
    │ ---   ┆ ---        ┆ ---        ┆ ---   │
    │ i64   ┆ i64        ┆ date       ┆ i64   │
    ╞═══════╪════════════╪════════════╪═══════╡
    │ 1     ┆ 10         ┆ 2021-01-01 ┆ 10    │
    │ 1     ┆ 10         ┆ 2022-01-01 ┆ 10    │
    │ 1     ┆ 10         ┆ 2023-01-01 ┆ 10    │
    │ 2     ┆ 20         ┆ 2021-01-01 ┆ 20    │
    │ 2     ┆ 20         ┆ 2022-01-01 ┆ 200   │
    │ 2     ┆ 20         ┆ 2023-01-01 ┆ 200   │
    │ 3     ┆ 30         ┆ 2021-01-01 ┆ 30    │
    │ 3     ┆ 30         ┆ 2022-01-01 ┆ 300   │
    │ 3     ┆ 30         ┆ 2023-01-01 ┆ 300   │
    └───────┴────────────┴────────────┴───────┘
    

    Now, to clean up, we can drop the prev_value column, and re-arrange the columns.

    current_date = datetime.date(2022, 1, 1)
    (
        df
        .sort("date")
        .rename({"value": "prev_value"})
        .join_asof(
            update_df.with_columns(pl.lit(current_date).alias("date")),
            on="date",
            by=["index"],
            strategy="backward"
        )
        .sort("index", "date")
        .with_columns(pl.col("value").fill_null(pl.col("prev_value")))
        .drop("prev_value")
        .select(
            pl.exclude("date"),
            pl.col("date")
        )
    )
    
    shape: (9, 3)
    ┌───────┬───────┬────────────┐
    │ index ┆ value ┆ date       │
    │ ---   ┆ ---   ┆ ---        │
    │ i64   ┆ i64   ┆ date       │
    ╞═══════╪═══════╪════════════╡
    │ 1     ┆ 10    ┆ 2021-01-01 │
    │ 1     ┆ 10    ┆ 2022-01-01 │
    │ 1     ┆ 10    ┆ 2023-01-01 │
    │ 2     ┆ 20    ┆ 2021-01-01 │
    │ 2     ┆ 200   ┆ 2022-01-01 │
    │ 2     ┆ 200   ┆ 2023-01-01 │
    │ 3     ┆ 30    ┆ 2021-01-01 │
    │ 3     ┆ 300   ┆ 2022-01-01 │
    │ 3     ┆ 300   ┆ 2023-01-01 │
    └───────┴───────┴────────────┘
    

    If you need to update only those rows that are strictly greater than current_date, you can simply add one day to your current_date. Polars makes this easy with the offset_by expression.

    (
        df
        .sort("date")
        .rename({"value": "prev_value"})
        .join_asof(
            update_df.with_columns(pl.lit(current_date).dt.offset_by("1d").alias("date")),
            on="date",
            by=["index"],
            strategy="backward"
        )
        .sort("index", "date")
        .with_columns(pl.col("value").fill_null(pl.col("prev_value")))
        .drop("prev_value")
        .select(
            pl.exclude("date"),
            pl.col("date")
        )
    )
    
    shape: (9, 3)
    ┌───────┬───────┬────────────┐
    │ index ┆ value ┆ date       │
    │ ---   ┆ ---   ┆ ---        │
    │ i64   ┆ i64   ┆ date       │
    ╞═══════╪═══════╪════════════╡
    │ 1     ┆ 10    ┆ 2021-01-01 │
    │ 1     ┆ 10    ┆ 2022-01-01 │
    │ 1     ┆ 10    ┆ 2023-01-01 │
    │ 2     ┆ 20    ┆ 2021-01-01 │
    │ 2     ┆ 20    ┆ 2022-01-01 │
    │ 2     ┆ 200   ┆ 2023-01-01 │
    │ 3     ┆ 30    ┆ 2021-01-01 │
    │ 3     ┆ 30    ┆ 2022-01-01 │
    │ 3     ┆ 300   ┆ 2023-01-01 │
    └───────┴───────┴────────────┘