Search code examples
pythonpython-polars

How can I rotate/shift/increment one particular column's values in Polars DataFrame?


I have a polars dataframe as follows:

df = pl.from_repr("""
┌─────┬───────┐
│ day ┆ value │
│ --- ┆ ---   │
│ i64 ┆ i64   │
╞═════╪═══════╡
│ 1   ┆ 1     │
│ 1   ┆ 2     │
│ 1   ┆ 2     │
│ 3   ┆ 3     │
│ 3   ┆ 5     │
│ 3   ┆ 2     │
│ 5   ┆ 1     │
│ 5   ┆ 2     │
│ 8   ┆ 7     │
│ 8   ┆ 3     │
│ 9   ┆ 5     │
│ 9   ┆ 3     │
│ 9   ┆ 4     │
└─────┴───────┘
""")

I want to incrementally rotate the values in column 'day'? By incremental rotation, I mean for each value, change it to its next larger value exists in the column, and if the value is the largest, then change it to null/None.

Basically, the result I expect should be the following:

┌──────┬───────┐
│ day  ┆ value │
│ ---  ┆ ---   │
│ i64  ┆ i64   │
╞══════╪═══════╡
│ 3    ┆ 1     │
│ 3    ┆ 2     │
│ 3    ┆ 2     │
│ 5    ┆ 3     │
│ 5    ┆ 5     │
│ 5    ┆ 2     │
│ 8    ┆ 1     │
│ 8    ┆ 2     │
│ 9    ┆ 7     │
│ 9    ┆ 3     │
│ null ┆ 5     │
│ null ┆ 3     │
│ null ┆ 4     │
└──────┴───────┘

Is there some particular polars-python idiomatic way to achieve this?


Solution

  • If day is sorted - you could group together - shift - then explode back?

    (df.groupby("day", maintain_order=True)
       .agg_list()  
       .with_columns(pl.col("day").shift(-1))
       .explode(pl.exclude("day")))
    
    shape: (13, 2)
    ┌──────┬───────┐
    │ day  | value │
    │ ---  | ---   │
    │ i64  | i64   │
    ╞══════╪═══════╡
    │ 3    | 1     │
    │ 3    | 2     │
    │ 3    | 2     │
    │ 5    | 3     │
    │ 5    | 5     │
    │ 5    | 2     │
    │ 8    | 1     │
    │ 8    | 2     │
    │ 9    | 7     │
    │ 9    | 3     │
    │ null | 5     │
    │ null | 3     │
    │ null | 4     │
    └──────┴───────┘
    

    Perhaps another approach is to .rank() the column.

    .search_sorted() for rank + 1 could find the positions of the next "group".

    The max values could be nulled out then passed to .take() to get the new values.

    (df.with_columns(
       pl.col("day").rank("dense")
         .cast(pl.Int64)
         .alias("rank"))
    .with_columns(   
       pl.col("rank")
         .search_sorted(pl.col("rank") + 1)
         .alias("idx"))
    .with_columns(
       pl.when(pl.col("idx") != pl.col("idx").max())
         .then(pl.col("idx"))
         .alias("idx"))
    .with_columns(
       pl.col("day").take(pl.col("idx"))
         .alias("new"))
    )
    
    shape: (13, 5)
    ┌─────┬───────┬──────┬──────┬──────┐
    │ day | value | rank | idx  | new  │
    │ --- | ---   | ---  | ---  | ---  │
    │ i64 | i64   | i64  | u32  | i64  │
    ╞═════╪═══════╪══════╪══════╪══════╡
    │ 1   | 1     | 1    | 3    | 3    │
    │ 1   | 2     | 1    | 3    | 3    │
    │ 1   | 2     | 1    | 3    | 3    │
    │ 3   | 3     | 2    | 6    | 5    │
    │ 3   | 5     | 2    | 6    | 5    │
    │ 3   | 2     | 2    | 6    | 5    │
    │ 5   | 1     | 3    | 8    | 8    │
    │ 5   | 2     | 3    | 8    | 8    │
    │ 8   | 7     | 4    | 10   | 9    │
    │ 8   | 3     | 4    | 10   | 9    │
    │ 9   | 5     | 5    | null | null │
    │ 9   | 3     | 5    | null | null │
    │ 9   | 4     | 5    | null | null │
    └─────┴───────┴──────┴──────┴──────┘
    

    Feels like I'm missing an obvious simpler approach here..