Search code examples
pythonwindow-functionslarge-datapython-polarsunique-values

Fill nulls in Python Polars lazyframe by groups, conditional on the number of unique values in each group


I have a large (~300M rows x 44 cols) dataframe and I need to fill in null values in certain ways depending on the characteristics of each group.

For example, say we have

lf = pl.LazyFrame(
    {'group':(1,1,1,2,2,2,3,3,3),
     'val':('yes', None, 'no', '2', '2', '2', 'answer', None, 'answer')
     }
)
┌───────┬────────┐
│ group ┆ val    │
│ ---   ┆ ---    │
│ i64   ┆ str    │
╞═══════╪════════╡
│ 1     ┆ yes    │
│ 1     ┆ null   │
│ 1     ┆ no     │
│ 2     ┆ 2      │
│ 2     ┆ 2      │
│ 2     ┆ 2      │
│ 3     ┆ answer │
│ 3     ┆ null   │
│ 3     ┆ answer │
└───────┴────────┘

I want to fill in nulls if and only if the group contains a single non-null unique value in the other cells, since in my context that's the expectation of the data and the presense of more than one unique value (or all nulls) in the group signals another issue that will be handled differently.

I'm able to fill null values for each group with the following:

filled_lf = (
    lf
    .with_columns(
        pl.col('val')
        .fill_null(pl.col('val').unique().first().over('group')).alias('filled_val')
        )
    )

However, for one, it seems that pl.col('val').unique() includes 'null' as one of the values, and the ordering is stochastic so choosing the first value on the list has inconsistent results. Secondly, it doesn't include the condition I need.

Desired result:

┌───────┬────────┬────────────┐
│ group ┆ val    ┆ filled_val │
│ ---   ┆ ---    ┆ ---        │
│ i64   ┆ str    ┆ str        │
╞═══════╪════════╪════════════╡
│ 1     ┆ yes    ┆ yes        │
│ 1     ┆ null   ┆ null       │
│ 1     ┆ no     ┆ no         │
│ 2     ┆ 2      ┆ 2          │
│ 2     ┆ 2      ┆ 2          │
│ 2     ┆ 2      ┆ 2          │
│ 3     ┆ answer ┆ answer     │
│ 3     ┆ null   ┆ answer     │
│ 3     ┆ answer ┆ answer     │
└───────┴────────┴────────────┘

Pandas 3.12 Polars 0.20.1

Thanks in advance for your advice!


Solution

  • You can add:

    unique = pl.col("val").drop_nulls().unique(maintain_order=True)
    
    df.with_columns(
       pl.when(unique.len().over("group") == 1)
         .then(pl.col("val").fill_null(unique.first().over("group")))
         .otherwise(pl.col("val"))
         .alias("filled")
    )
    
    shape: (9, 3)
    ┌───────┬────────┬────────┐
    │ group ┆ val    ┆ filled │
    │ ---   ┆ ---    ┆ ---    │
    │ i64   ┆ str    ┆ str    │
    ╞═══════╪════════╪════════╡
    │ 1     ┆ yes    ┆ yes    │
    │ 1     ┆ null   ┆ null   │
    │ 1     ┆ no     ┆ no     │
    │ 2     ┆ 2      ┆ 2      │
    │ 2     ┆ 2      ┆ 2      │
    │ 2     ┆ 2      ┆ 2      │
    │ 3     ┆ answer ┆ answer │
    │ 3     ┆ null   ┆ answer │
    │ 3     ┆ answer ┆ answer │
    └───────┴────────┴────────┘