Search code examples
pythondataframepython-polars

Polars: set missing value from another row


The following data frame represents basic flatten tree structure, as shown below, where pairs (id, sub-id) and (sub-id, key) are always unique and key always represents the same thing under the same id

id1
└─┬─ sub-id 
│ │   └─── key1
│ │        └─── value
│ └─ sub-id2 
│     └─── key1
│          └─── None
id2
└─── sub-id3 
      └─── key2
           └─── value

with graphical representation out of the way, below is the definition as polars.DataFrame

df = pl.DataFrame(
    {
        "id": [1, 1, 2, 2, 2, 3],
        "sub-id": [1, 1, 2, 3, 3, 4],
        "key": ["key_1_1", "key_1_2", "key_2_1", "key_2_1", "key_2_2", "key_3"],
        "value": ["value1 1", "value 1 2", None, "value 2 1", "value 2 2", "value 3"],
    }
)

The same data frame in table representation:

shape: (6, 4)

┌────┬────────┬─────────┬───────────┐
│ ID │ sub-id │ key     │ value     │
╞════╪════════╪═════════╪═══════════╡
│ 1  │ 1      │ key_1_1 │ value 1   │
│ 1  │ 1      │ key_1_2 │ value 2   │
│ 2  │ 2      │ key_2_1 │ value 2 1 │
│ 2  │ 3      │ key_2_1 │ None      │
│ 2  │ 3      │ key_2_2 │ value 2 2 │
│ 3  │ 4      │ key_3   │ value 3   │
└────┴────────┴─────────┴───────────┘

How would I to fill gaps like shown below using polars. total size of data is about 100k rows.

shape: (6, 4)

┌────┬────────┬─────────┬───────────┐
│ ID │ sub-id │ key     │ value     │
╞════╪════════╪═════════╪═══════════╡
│ 1  │ 1      │ key_1_1 │ value 1   │
│ 1  │ 1      │ key_1_2 │ value 2   │
│ 2  │ 2      │ key_2_1 │ value 2 1 │
│ 2  │ 3      │ key_2_1 │ value 2 1 │
│ 2  │ 3      │ key_2_2 │ value 2 2 │
│ 3  │ 4      │ key_3   │ value 3   │
└────┴────────┴─────────┴───────────┘

Solution

  • There is pl.Expr.fill_null to fill missing values.

    As fill value, we use the first non-null value with the same id and key. As we assume that all values for the same id and key are the same, taking the first value is reasonable. It can be constructed as follows:

    df.with_columns(
        pl.col("value").fill_null(
            pl.col("value").filter(pl.col("value").is_not_null()).first().over("id", "key")
        )
    )
    
    shape: (6, 4)
    ┌─────┬────────┬─────────┬───────────┐
    │ id  ┆ sub-id ┆ key     ┆ value     │
    │ --- ┆ ---    ┆ ---     ┆ ---       │
    │ i64 ┆ i64    ┆ str     ┆ str       │
    ╞═════╪════════╪═════════╪═══════════╡
    │ 1   ┆ 1      ┆ key_1_1 ┆ value 1   │
    │ 1   ┆ 1      ┆ key_1_2 ┆ value 2   │
    │ 2   ┆ 2      ┆ key_2_1 ┆ value 2 1 │
    │ 2   ┆ 3      ┆ key_2_1 ┆ value 2 1 │
    │ 2   ┆ 3      ┆ key_2_2 ┆ value 2 2 │
    │ 3   ┆ 4      ┆ key_3   ┆ value 3   │
    └─────┴────────┴─────────┴───────────┘
    

    Note. I needed to adapt your code slightly to match the example you described in the text.

    df = pl.DataFrame({
        "id": [1, 1, 2, 2, 2, 3],
        "sub-id": [1, 1, 2, 3, 3, 4],
        "key": ["key_1_1", "key_1_2", "key_2_1", "key_2_1", "key_2_2", "key_3"],
        "value": ["value 1", "value 2", None, "value 2 1", "value 2 2", "value 3"],
    })