Search code examples
pythonpython-polars

How to ignore EMPTY/NULL value columns while grouping in python polars?


I have a dataframe.

df = pl.DataFrame({
    'last_name':['James','Warner','Marino','James','Warner','Marino','James'],
    'first_name':['Horn','Bro','Kach','Horn','Bro','Kach','Horn'],
    'dob':['03/06/1990','09/16/1990','03/06/1990','','03/06/1990','','']
})

I'm applying a grouping on last,first and dob columns to get the counts as

df.group_by('last_name','first_name','dob').len()
shape: (6, 4)
┌───────────┬────────────┬────────────┬─────┐
│ last_name ┆ first_name ┆ dob        ┆ len │
│ ---       ┆ ---        ┆ ---        ┆ --- │
│ str       ┆ str        ┆ str        ┆ u32 │
╞═══════════╪════════════╪════════════╪═════╡
│ Warner    ┆ Bro        ┆ 09/16/1990 ┆ 1   │
│ Marino    ┆ Kach       ┆ 03/06/1990 ┆ 1   │
│ Marino    ┆ Kach       ┆            ┆ 1   │
│ James     ┆ Horn       ┆            ┆ 2   │ # NOT OK
│ James     ┆ Horn       ┆ 03/06/1990 ┆ 1   │
│ Warner    ┆ Bro        ┆ 03/06/1990 ┆ 1   │
└───────────┴────────────┴────────────┴─────┘

Here i would like to ignore the NULL/EMPTY values on grouping columns.

As James Horn has two empty DOB's these should not be included to grouping operation.

Here is the expected output.

shape: (4, 4)
┌───────────┬────────────┬────────────┬─────┐
│ last_name ┆ first_name ┆ dob        ┆ len │
│ ---       ┆ ---        ┆ ---        ┆ --- │
│ str       ┆ str        ┆ str        ┆ u32 │
╞═══════════╪════════════╪════════════╪═════╡
│ Warner    ┆ Bro        ┆ 03/06/1990 ┆ 1   │
│ James     ┆ Horn       ┆ 03/06/1990 ┆ 1   │
│ Marino    ┆ Kach       ┆ 03/06/1990 ┆ 1   │
│ Warner    ┆ Bro        ┆ 09/16/1990 ┆ 1   │
└───────────┴────────────┴────────────┴─────┘

Of course we can do filter on the column as below before pass to grouping as

df.filter(pl.col.dob != '').group_by('last_name','first_name','dob').len()

But what if I have 10 columns to be specified in filter operation? I need to write them one after another.

Is there any other solution for it?


Solution

  • First replace empty strings with null values and then use drop_nulls

    (
        df
        .with_columns(
            pl.when(pl.col(group_columns).str.len_chars() == 0)
            .then(None)
            .otherwise(pl.col(group_columns))
            .name.keep()
        )
        .drop_nulls(group_columns)
        .group_by(group_columns)
        .len()
    )
    
    shape: (4, 4)
    ┌───────────┬────────────┬────────────┬───────┐
    │ last_name ┆ first_name ┆ dob        ┆ len   │
    │ ---       ┆ ---        ┆ ---        ┆ ---   │
    │ str       ┆ str        ┆ str        ┆ u32   │
    ╞═══════════╪════════════╪════════════╪═══════╡
    │ Warner    ┆ Bro        ┆ 09/16/1990 ┆ 1     │
    │ Marino    ┆ Kach       ┆ 03/06/1990 ┆ 1     │
    │ Warner    ┆ Bro        ┆ 03/06/1990 ┆ 1     │
    │ James     ┆ Horn       ┆ 03/06/1990 ┆ 1     │
    └───────────┴────────────┴────────────┴───────┘