The following data frame represents basic flatten tree structure, as shown below, where pairs (id, sub-id)
and (sub-id, key)
are always unique and key
always represents the same thing under the same id
id1
└─┬─ sub-id
│ │ └─── key1
│ │ └─── value
│ └─ sub-id2
│ └─── key1
│ └─── None
id2
└─── sub-id3
└─── key2
└─── value
with graphical representation out of the way, below is the definition as polars.DataFrame
df = pl.DataFrame(
{
"id": [1, 1, 2, 2, 2, 3],
"sub-id": [1, 1, 2, 3, 3, 4],
"key": ["key_1_1", "key_1_2", "key_2_1", "key_2_1", "key_2_2", "key_3"],
"value": ["value1 1", "value 1 2", None, "value 2 1", "value 2 2", "value 3"],
}
)
The same data frame in table representation:
shape: (6, 4)
┌────┬────────┬─────────┬───────────┐
│ ID │ sub-id │ key │ value │
╞════╪════════╪═════════╪═══════════╡
│ 1 │ 1 │ key_1_1 │ value 1 │
│ 1 │ 1 │ key_1_2 │ value 2 │
│ 2 │ 2 │ key_2_1 │ value 2 1 │
│ 2 │ 3 │ key_2_1 │ None │
│ 2 │ 3 │ key_2_2 │ value 2 2 │
│ 3 │ 4 │ key_3 │ value 3 │
└────┴────────┴─────────┴───────────┘
How would I to fill gaps like shown below using polars
. total size of data is about 100k rows.
shape: (6, 4)
┌────┬────────┬─────────┬───────────┐
│ ID │ sub-id │ key │ value │
╞════╪════════╪═════════╪═══════════╡
│ 1 │ 1 │ key_1_1 │ value 1 │
│ 1 │ 1 │ key_1_2 │ value 2 │
│ 2 │ 2 │ key_2_1 │ value 2 1 │
│ 2 │ 3 │ key_2_1 │ value 2 1 │
│ 2 │ 3 │ key_2_2 │ value 2 2 │
│ 3 │ 4 │ key_3 │ value 3 │
└────┴────────┴─────────┴───────────┘
There is pl.Expr.fill_null
to fill missing values.
As fill value, we use the first non-null value with the same id
and key
. As we assume that all values for the same id
and key
are the same, taking the first value is reasonable. It can be constructed as follows:
pl.Expr.filter
and pl.Expr.is_not_null
to filter for non-null values,pl.Expr.first
to select the first such value,pl.Expr.over
to evaluate the expression separately for each id
and key
pair.df.with_columns(
pl.col("value").fill_null(
pl.col("value").filter(pl.col("value").is_not_null()).first().over("id", "key")
)
)
shape: (6, 4)
┌─────┬────────┬─────────┬───────────┐
│ id ┆ sub-id ┆ key ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ str │
╞═════╪════════╪═════════╪═══════════╡
│ 1 ┆ 1 ┆ key_1_1 ┆ value 1 │
│ 1 ┆ 1 ┆ key_1_2 ┆ value 2 │
│ 2 ┆ 2 ┆ key_2_1 ┆ value 2 1 │
│ 2 ┆ 3 ┆ key_2_1 ┆ value 2 1 │
│ 2 ┆ 3 ┆ key_2_2 ┆ value 2 2 │
│ 3 ┆ 4 ┆ key_3 ┆ value 3 │
└─────┴────────┴─────────┴───────────┘
Note. I needed to adapt your code slightly to match the example you described in the text.
df = pl.DataFrame({
"id": [1, 1, 2, 2, 2, 3],
"sub-id": [1, 1, 2, 3, 3, 4],
"key": ["key_1_1", "key_1_2", "key_2_1", "key_2_1", "key_2_2", "key_3"],
"value": ["value 1", "value 2", None, "value 2 1", "value 2 2", "value 3"],
})