Search code examples
pythonpython-polars

How to roll up duplicate observation in Python polars?


I have a data frame as-

df = pl.DataFrame({'last_name':['mallesh','bhavik','jagarini','mallesh','jagarini'],
                  'first_name':['yamulla','vemulla','yegurla','yamulla','yegurla'],
                  'ssn':['1234','7847','0648','4567','0648']})

Here I would like to find out duplicates considering last_name and firs_name columns and if any duplicates found their respective ssn needs to be rolled up with semicolon(;) if SSN are not different. if SSN are also same only one SSN needs to be present.

enter image description here

the expected output as:

enter image description here

Here since mallesh yamulla is duplicated and has different SSN's they are rolled up with ';'

and in case of jagarini yegurla it has a unique SSN hence one SSN is only taken.

enter image description here

Added one more case:

Here on given any set of column it should rollup the unique values using ; from the remaining columns. here on last and first name, roll up should be done on both DOB and SSN.

df = pl.DataFrame({'last_name':['mallesh','bhavik','jagarini','mallesh','jagarini'],
                  'first_name':['yamulla','vemulla','yegurla','yamulla','yegurla'],
                  'ssn':['1234','7847','0648','4567','0648'],
                  'dob':['10/11/1990','09/16/1991','01/01/1990','10/11/1990','02/14/1983']   })

enter image description here

Another case as:

df = pl.DataFrame({'last_name':['mallesh','bhavik','jagarini','mallesh','jagarini'],
                  'first_name':['yamulla','vemulla','yegurla','yamulla','yegurla'],
                  'ssn':['1234','7847','0648','4567','0648'],
                  'dob':['10/11/1990','09/16/1991','01/01/1990','','02/14/1983']   })

In case of having null values in a field it should treat as empty not as a value.

";10/11/1990" it should just be "10/11/1990" for mallesh yamulla entry.

enter image description here


Solution

  • Use a group_by and unique to remove duplicates. From there, you can use list.join on the resulting list.

    (
        df
        .group_by('last_name', 'first_name')
        .agg(
            pl.col('ssn').unique()
        )
        .with_columns(
            pl.col('ssn').list.join(';')
        )
    )
    
    shape: (3, 3)
    ┌───────────┬────────────┬───────────┐
    │ last_name ┆ first_name ┆ ssn       │
    │ ---       ┆ ---        ┆ ---       │
    │ str       ┆ str        ┆ str       │
    ╞═══════════╪════════════╪═══════════╡
    │ mallesh   ┆ yamulla    ┆ 4567;1234 │
    │ bhavik    ┆ vemulla    ┆ 7847      │
    │ jagarini  ┆ yegurla    ┆ 0648      │
    └───────────┴────────────┴───────────┘
    

    Edit: if you want to ensure that the rolled up list is sorted:

    (
        df
        .groupby('last_name', 'first_name')
        .agg(
            pl.col('ssn')
            .unique()
            .sort()
        )
        .with_columns(
            pl.col('ssn')
            .list.join(';')
        )
    )
    
    shape: (3, 3)
    ┌───────────┬────────────┬───────────┐
    │ last_name ┆ first_name ┆ ssn       │
    │ ---       ┆ ---        ┆ ---       │
    │ str       ┆ str        ┆ str       │
    ╞═══════════╪════════════╪═══════════╡
    │ jagarini  ┆ yegurla    ┆ 0648      │
    │ mallesh   ┆ yamulla    ┆ 1234;4567 │
    │ bhavik    ┆ vemulla    ┆ 7847      │
    └───────────┴────────────┴───────────┘
    

    Edit: Rolling up multiple columns

    We can roll up multiple columns elegantly as follows:

    (
        df
        .group_by("last_name", "first_name")
        .agg(
            pl.all().unique().sort().cast(pl.String)
        )
        .with_columns(
            pl.exclude('last_name', 'first_name').list.join(";")
        )
    )
    
    shape: (3, 4)
    ┌───────────┬────────────┬───────────┬───────────────────────┐
    │ last_name ┆ first_name ┆ ssn       ┆ dob                   │
    │ ---       ┆ ---        ┆ ---       ┆ ---                   │
    │ str       ┆ str        ┆ str       ┆ str                   │
    ╞═══════════╪════════════╪═══════════╪═══════════════════════╡
    │ bhavik    ┆ vemulla    ┆ 7847      ┆ 1991-09-16            │
    │ jagarini  ┆ yegurla    ┆ 0648      ┆ 1983-02-14;1990-01-01 │
    │ mallesh   ┆ yamulla    ┆ 1234;4567 ┆ 1990-10-11            │
    └───────────┴────────────┴───────────┴───────────────────────┘
    
    

    Edit: eliminating empty strings and null values from rollup

    We can add a filter step just before the list.join to filter out both null and empty string "" values.

    (
        df.group_by("last_name", "first_name")
        .agg(pl.all().unique().sort().cast(pl.String))
        .with_columns(
            pl.exclude("last_name", "first_name")
            .list.eval(
                pl.element().filter(pl.element().is_not_null() & (pl.element() != ""))
            )
            .list.join(";")
        )
    )