How to get fuzzy matches for a given set of names in a Python Polars dataframe?


I'm trying to implement name deduplication for one of our use cases.

I have a set of 10 names along with an index column, as shown below.

import polars as pl

df = pl.from_repr("""
┌───────┬───────────────────┐
│ index ┆ full_name         │
│ ---   ┆ ---               │
│ u32   ┆ str               │
╞═══════╪═══════════════════╡
│ 0     ┆ Mallesham Yamulla │
│ 1     ┆ Velmala Sharath   │
│ 2     ┆ Jagarini Yegurla  │
│ 3     ┆ Sharath Velmala   │
│ 4     ┆ Bhavik Vemulla    │
│ 5     ┆ Yegurla Mahesh    │
│ 6     ┆ Yegurla Jagarini  │
│ 7     ┆ Vermula Bhavik    │
│ 8     ┆ Mahesh Yegurla    │
│ 9     ┆ Yamulla Mallesham │
└───────┴───────────────────┘
""")

I would like to calculate fuzzy metrics (Levenshtein, Jaro-Winkler) for each combination of names using the rapidfuzz module, as below.

from rapidfuzz.distance import Levenshtein, JaroWinkler

name_0, name_1 = "Mallesham Yamulla", "Velmala Sharath"
round(Levenshtein.normalized_similarity(name_0, name_1), 5)
round(JaroWinkler.similarity(name_0, name_1), 5)
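
To make the Levenshtein score concrete, here is a minimal pure-Python sketch of a normalized similarity, assuming unit edit costs and normalization by the longer string's length. The helper names `levenshtein_distance` and `normalized_similarity` are hypothetical and this is not rapidfuzz's implementation; it only illustrates the idea behind the number.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance with unit costs, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 when every position must change."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - levenshtein_distance(a, b) / longest


# "Aaaa aaaa" and "Baaa abba" differ by 3 substitutions over 9 characters.
print(round(normalized_similarity("Aaaa aaaa", "Baaa abba"), 6))
```

In practice rapidfuzz should be preferred, since it is considerably faster than a pure-Python loop.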

For example, the name at index 0 (Mallesham Yamulla) should be paired with the names at indexes 1 through 9, giving the pairs [(0,1),(0,2),(0,3),(0,4),(0,5),(0,6),(0,7),(0,8),(0,9)], and the Levenshtein and Jaro-Winkler similarity should be calculated for each pair.

Next, the name at index 1 is paired with indexes 2 through 9, index 2 with indexes 3 through 9, index 3 with indexes 4 through 9, and so on, up to the final pair (8,9).
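
The pairing described above is every unordered pair of indexes taken once. As a rough sketch using only the standard library (the variable names are illustrative), `itertools.combinations` produces exactly that sequence:

```python
from itertools import combinations

names = [
    "Mallesham Yamulla", "Velmala Sharath", "Jagarini Yegurla",
    "Sharath Velmala", "Bhavik Vemulla", "Yegurla Mahesh",
    "Yegurla Jagarini", "Vermula Bhavik", "Mahesh Yegurla",
    "Yamulla Mallesham",
]

# Each index is paired only with the indexes that come after it.
pairs = list(combinations(range(len(names)), 2))

print(len(pairs))   # number of pairs for 10 names
print(pairs[:3])    # first few pairs, starting at (0, 1)
print(pairs[-1])    # last pair
```

For 10 names this gives 10 * 9 / 2 = 45 pairs, so the pair count grows quadratically with the number of names.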

The expected output would be:

[Image: expected output table (not reproduced)]


Solution

  • Create an example dataframe.

    import polars as pl
    from rapidfuzz.distance import Levenshtein, JaroWinkler

    df = pl.DataFrame(
        pl.Series("full_name", ["Aaaa aaaa", "Baaa abba", "Acac acca", "Dada dddd"])
    ).with_row_index()
    
    shape: (4, 2)
    ┌───────┬───────────┐
    │ index ┆ full_name │
    │ ---   ┆ ---       │
    │ u32   ┆ str       │
    ╞═══════╪═══════════╡
    │ 0     ┆ Aaaa aaaa │
    │ 1     ┆ Baaa abba │
    │ 2     ┆ Acac acca │
    │ 3     ┆ Dada dddd │
    └───────┴───────────┘
    

  • Cross join the dataframe with itself and remove rows where index == index_2, so that no name is paired with itself.

    df_combinations = df.join(
        df,
        how="cross",
        suffix="_2",
    ).filter(
        pl.col("index") != pl.col("index_2")
    )
    
    shape: (12, 4)
    ┌───────┬───────────┬─────────┬─────────────┐
    │ index ┆ full_name ┆ index_2 ┆ full_name_2 │
    │ ---   ┆ ---       ┆ ---     ┆ ---         │
    │ u32   ┆ str       ┆ u32     ┆ str         │
    ╞═══════╪═══════════╪═════════╪═════════════╡
    │ 0     ┆ Aaaa aaaa ┆ 1       ┆ Baaa abba   │
    │ 0     ┆ Aaaa aaaa ┆ 2       ┆ Acac acca   │
    │ 0     ┆ Aaaa aaaa ┆ 3       ┆ Dada dddd   │
    │ 1     ┆ Baaa abba ┆ 0       ┆ Aaaa aaaa   │
    │ 1     ┆ Baaa abba ┆ 2       ┆ Acac acca   │
    │ …     ┆ …         ┆ …       ┆ …           │
    │ 2     ┆ Acac acca ┆ 1       ┆ Baaa abba   │
    │ 2     ┆ Acac acca ┆ 3       ┆ Dada dddd   │
    │ 3     ┆ Dada dddd ┆ 0       ┆ Aaaa aaaa   │
    │ 3     ┆ Dada dddd ┆ 1       ┆ Baaa abba   │
    │ 3     ┆ Dada dddd ┆ 2       ┆ Acac acca   │
    └───────┴───────────┴─────────┴─────────────┘
    

  • Run rapidfuzz on each pair using map_elements.

    df_combinations.with_columns(
        # Combine "index" and "index_2" columns to one struct column.
        pl.struct("index", "index_2").alias("index_comb"),
        # Combine "full_name" and "full_name_2" columns to one struct column.
        pl.struct("full_name", "full_name_2").alias("full_name_comb"),
    ).with_columns(
        # Run custom functions on struct column.
        pl.col("full_name_comb").map_elements(
            lambda t: Levenshtein.normalized_similarity(t["full_name"], t["full_name_2"]),
            return_dtype=pl.Float64,
        ).alias("levenshtein"),
        pl.col("full_name_comb").map_elements(
            lambda t: JaroWinkler.similarity(t["full_name"], t["full_name_2"]),
            return_dtype=pl.Float64,
        ).alias("jarowinkler"),
    )
    
    shape: (12, 8)
    ┌───────┬───────────┬─────────┬─────────────┬────────────┬───────────────────────────┬─────────────┬─────────────┐
    │ index ┆ full_name ┆ index_2 ┆ full_name_2 ┆ index_comb ┆ full_name_comb            ┆ levenshtein ┆ jarowinkler │
    │ ---   ┆ ---       ┆ ---     ┆ ---         ┆ ---        ┆ ---                       ┆ ---         ┆ ---         │
    │ u32   ┆ str       ┆ u32     ┆ str         ┆ struct[2]  ┆ struct[2]                 ┆ f64         ┆ f64         │
    ╞═══════╪═══════════╪═════════╪═════════════╪════════════╪═══════════════════════════╪═════════════╪═════════════╡
    │ 0     ┆ Aaaa aaaa ┆ 1       ┆ Baaa abba   ┆ {0,1}      ┆ {"Aaaa aaaa","Baaa abba"} ┆ 0.666667    ┆ 0.777778    │
    │ 0     ┆ Aaaa aaaa ┆ 2       ┆ Acac acca   ┆ {0,2}      ┆ {"Aaaa aaaa","Acac acca"} ┆ 0.555556    ┆ 0.637037    │
    │ 0     ┆ Aaaa aaaa ┆ 3       ┆ Dada dddd   ┆ {0,3}      ┆ {"Aaaa aaaa","Dada dddd"} ┆ 0.333333    ┆ 0.555556    │
    │ 1     ┆ Baaa abba ┆ 0       ┆ Aaaa aaaa   ┆ {1,0}      ┆ {"Baaa abba","Aaaa aaaa"} ┆ 0.666667    ┆ 0.777778    │
    │ 1     ┆ Baaa abba ┆ 2       ┆ Acac acca   ┆ {1,2}      ┆ {"Baaa abba","Acac acca"} ┆ 0.444444    ┆ 0.546296    │
    │ …     ┆ …         ┆ …       ┆ …           ┆ …          ┆ …                         ┆ …           ┆ …           │
    │ 2     ┆ Acac acca ┆ 1       ┆ Baaa abba   ┆ {2,1}      ┆ {"Acac acca","Baaa abba"} ┆ 0.444444    ┆ 0.546296    │
    │ 2     ┆ Acac acca ┆ 3       ┆ Dada dddd   ┆ {2,3}      ┆ {"Acac acca","Dada dddd"} ┆ 0.111111    ┆ 0.444444    │
    │ 3     ┆ Dada dddd ┆ 0       ┆ Aaaa aaaa   ┆ {3,0}      ┆ {"Dada dddd","Aaaa aaaa"} ┆ 0.333333    ┆ 0.555556    │
    │ 3     ┆ Dada dddd ┆ 1       ┆ Baaa abba   ┆ {3,1}      ┆ {"Dada dddd","Baaa abba"} ┆ 0.333333    ┆ 0.555556    │
    │ 3     ┆ Dada dddd ┆ 2       ┆ Acac acca   ┆ {3,2}      ┆ {"Dada dddd","Acac acca"} ┆ 0.111111    ┆ 0.444444    │
    └───────┴───────────┴─────────┴─────────────┴────────────┴───────────────────────────┴─────────────┴─────────────┘