Polars: set missing value from another row

The following data frame represents basic flatten tree structure, as shown below, where pairs (id, sub-id) and (sub-id, key) are always unique and key always represents the same thing under the same id

id1
└─┬─ sub-id 
│ │   └─── key1
│ │        └─── value
│ └─ sub-id2 
│     └─── key1
│          └─── None
id2
└─── sub-id3 
      └─── key2
           └─── value

with graphical representation out of the way, below is the definition as polars.DataFrame

df = pl.DataFrame(
    {
        "id": [1, 1, 2, 2, 2, 3],
        "sub-id": [1, 1, 2, 3, 3, 4],
        "key": ["key_1_1", "key_1_2", "key_2_1", "key_2_1", "key_2_2", "key_3"],
        "value": ["value1 1", "value 1 2", None, "value 2 1", "value 2 2", "value 3"],
    }
)

The same data frame in table representation:

shape: (6, 4)

┌────┬────────┬─────────┬───────────┐
│ ID │ sub-id │ key     │ value     │
╞════╪════════╪═════════╪═══════════╡
│ 1  │ 1      │ key_1_1 │ value 1   │
│ 1  │ 1      │ key_1_2 │ value 2   │
│ 2  │ 2      │ key_2_1 │ value 2 1 │
│ 2  │ 3      │ key_2_1 │ None      │
│ 2  │ 3      │ key_2_2 │ value 2 2 │
│ 3  │ 4      │ key_3   │ value 3   │
└────┴────────┴─────────┴───────────┘

How would I to fill gaps like shown below using polars. total size of data is about 100k rows.

shape: (6, 4)

┌────┬────────┬─────────┬───────────┐
│ ID │ sub-id │ key     │ value     │
╞════╪════════╪═════════╪═══════════╡
│ 1  │ 1      │ key_1_1 │ value 1   │
│ 1  │ 1      │ key_1_2 │ value 2   │
│ 2  │ 2      │ key_2_1 │ value 2 1 │
│ 2  │ 3      │ key_2_1 │ value 2 1 │
│ 2  │ 3      │ key_2_2 │ value 2 2 │
│ 3  │ 4      │ key_3   │ value 3   │
└────┴────────┴─────────┴───────────┘

Solution

There is pl.Expr.fill_null to fill missing values.

As fill value, we use the first non-null value with the same id and key. As we assume that all values for the same id and key are the same, taking the first value is reasonable. It can be constructed as follows:

pl.Expr.filter and pl.Expr.is_not_null to filter for non-null values,
pl.Expr.first to select the first such value,
the window function pl.Expr.over to evaluate the expression separately for each id and key pair.

df.with_columns(
    pl.col("value").fill_null(
        pl.col("value").filter(pl.col("value").is_not_null()).first().over("id", "key")
    )
)

shape: (6, 4)
┌─────┬────────┬─────────┬───────────┐
│ id  ┆ sub-id ┆ key     ┆ value     │
│ --- ┆ ---    ┆ ---     ┆ ---       │
│ i64 ┆ i64    ┆ str     ┆ str       │
╞═════╪════════╪═════════╪═══════════╡
│ 1   ┆ 1      ┆ key_1_1 ┆ value 1   │
│ 1   ┆ 1      ┆ key_1_2 ┆ value 2   │
│ 2   ┆ 2      ┆ key_2_1 ┆ value 2 1 │
│ 2   ┆ 3      ┆ key_2_1 ┆ value 2 1 │
│ 2   ┆ 3      ┆ key_2_2 ┆ value 2 2 │
│ 3   ┆ 4      ┆ key_3   ┆ value 3   │
└─────┴────────┴─────────┴───────────┘

Note. I needed to adapt your code slightly to match the example you described in the text.

df = pl.DataFrame({
    "id": [1, 1, 2, 2, 2, 3],
    "sub-id": [1, 1, 2, 3, 3, 4],
    "key": ["key_1_1", "key_1_2", "key_2_1", "key_2_1", "key_2_2", "key_3"],
    "value": ["value 1", "value 2", None, "value 2 1", "value 2 2", "value 3"],
})