
How to apply frozenset on polars dataframe?


I have a pandas dataframe as:

import pandas as pd

df_names = pd.DataFrame({'last_name':['Williams','Henry','XYX','Smith','David','Freeman','Walter','Test_A'],
                        'first_name':['Henry','Williams','ABC','David','Smith','Walter','Freeman','Test_B']})

Here I applied frozenset to the last_name and first_name columns of each row to check whether the names are interchanged, e.g. "Williams Henry" vs. "Henry Williams".

df_names[['last_name','first_name']].apply(frozenset,axis=1)

I need the same kind of check on a Polars DataFrame. How can I do it?


Solution

  • As @ritchie46 mentioned, you'll want to avoid embedding Python objects (like frozensets) into a Polars DataFrame. The performance is not good, and columns of type object have limited functionality.

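    Here's a Polars algorithm that will be very performant and will accomplish what you need. It assumes df_names is already a Polars DataFrame (for example, converted from pandas with pl.from_pandas(df_names)).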

    (
        df_names
        .with_columns(
            pl.concat_list(
                pl.col("first_name").str.replace_all(r'\s','').str.to_uppercase(),
                pl.col("last_name").str.replace_all(r'\s','').str.to_uppercase(),
            )
            .list.sort()
            .list.join('|')
            .alias('name_key')
        )
        .filter(pl.len().over('name_key') > 1)
    )
    
    shape: (6, 3)
    ┌───────────┬────────────┬────────────────┐
    │ last_name ┆ first_name ┆ name_key       │
    │ ---       ┆ ---        ┆ ---            │
    │ str       ┆ str        ┆ str            │
    ╞═══════════╪════════════╪════════════════╡
    │ Williams  ┆ Henry      ┆ HENRY|WILLIAMS │
    │ Henry     ┆ Williams   ┆ HENRY|WILLIAMS │
    │ Smith     ┆ David      ┆ DAVID|SMITH    │
    │ David     ┆ Smith      ┆ DAVID|SMITH    │
    │ Freeman   ┆ Walter     ┆ FREEMAN|WALTER │
    │ Walter    ┆ Freeman    ┆ FREEMAN|WALTER │
    └───────────┴────────────┴────────────────┘
    

    To help with matching, I've converted names to all uppercase and eliminated white space. (You can drop that if you don't find it useful.)
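
To make the key construction concrete, here is a minimal plain-Python sketch of the same idea, run on the sample names from the question. It is an illustration only, not how Polars evaluates the expression: the helper `name_key` and the variables `rows`, `keys`, `counts`, and `swapped` are hypothetical names introduced here.

```python
import re
from collections import Counter


def name_key(first_name: str, last_name: str) -> str:
    """Build an order-insensitive key from two names (hypothetical helper)."""
    # Remove all whitespace and uppercase each name, mirroring the
    # str.replace_all(r'\s', '') and str.to_uppercase() steps.
    parts = [re.sub(r"\s", "", name).upper() for name in (first_name, last_name)]
    # Sorting makes ("Henry", "Williams") and ("Williams", "Henry") identical,
    # which is the role frozenset played in the pandas version.
    return "|".join(sorted(parts))


# Sample data from the question, as (last_name, first_name) pairs.
rows = [
    ("Williams", "Henry"), ("Henry", "Williams"), ("XYX", "ABC"),
    ("Smith", "David"), ("David", "Smith"), ("Freeman", "Walter"),
    ("Walter", "Freeman"), ("Test_A", "Test_B"),
]

keys = [name_key(first, last) for last, first in rows]
counts = Counter(keys)

# Keep only rows whose key occurs more than once, like the
# filter(pl.len().over('name_key') > 1) step.
swapped = [row for row, key in zip(rows, keys) if counts[key] > 1]
```

Unlike a frozenset, the sorted-and-joined key keeps both entries when first and last name are identical, so a row such as "Smith Smith" would produce the key SMITH|SMITH rather than collapsing to a single element.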