Search code examples
pythonpandasfuzzywuzzyrapidfuzz

Fuzzy Matching with different fuzz ratios


I have two large datasets. df1 is about 1m lines, and df2 is about 10m lines. I need to find matches for lines in df1 from df2.
I have posted an original version of this question separately. See here. Well answered by @laurent but I have some added specificities now. I would now like to:

  1. Get the fuzz ratios for each of fname and lname in a column in my final matched dataframe

  2. Write the code such that fuzz ratio for fname is set to >60, while fuzz ratio for lname is set to >75. In other words, a true match occurs if fuzz_ratio for fname>60 and fuzz ratio for lname>75; otherwise not a true match. A match would not be true if fuzz ratio for fname==80 while fuzz ratio for lname==60. While I understand that this can be done from (1) as a post-hoc filtering, it would make sense to do this at the stage of coding for a different matching.

I post here an example of my data. The solution by @laurent for the original problem can be found in the above link.

import pandas as pd

df1 = pd.DataFrame(
    {
        "ein": {0: 1001, 1: 1500, 2: 3000},
        "ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
        "lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
        "fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
    }
)

df2 = pd.DataFrame(
    {
        "lname": {0: "Cupper", 1: "Cruise", 2: "Cruz", 3: "Couper"},
        "fname": {0: "Bradley", 1: "Tom", 2: "Thomas", 3: "M Brad"},
        "score": {0: 3, 1: 3.5, 2: 4, 3: 2.5},
    }
)

Expected output is:

df3 = pd.DataFrame(
    {
        "df1_ein": {0: 1001, 1: 1500, 2: 3000},
        "df1_ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
        "df1_lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
        "df1_fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
        "fuzz_ratio_lname": {0: 83, 1: 100, 2: NA},
        "fuzz_ratio_fname": {0: 62, 1: 67, 2: NA},
        "df2_lname": {0: "Couper", 1: "Cruise", 2: "NA"},
        "df2_fname": {0: "M Brad", 1: "Tom", 2: "NA"},
        "df2_score": {0: 2.5, 1: 3.5, 2: NA},

    }
)

Note from the above expected output: Bradley Cupper is a bad match for Bradley Cooper based on the fuzz ratios that I assigned. The better match for Bradley Cooper is M Brad Couper. Similarly, Thomas Cruise matches with Tom Cruise rather than with Thomas Cruz.

I am a user of Stata primarily (haha) and the reclink2 ado file can do the above in theory, i.e. if Stata can handle the size of the data. However, with the size of data I have, nothing even starts after hours.


Solution

  • Here is one way to do it:

    import pandas as pd
    from fuzzywuzzy import fuzz
    
    # Setup
    df1.columns = [f"df1_{col}" for col in df1.columns]
    
    # Add new columns
    df1["fuzz_ratio_lname"] = (
        df1["df1_lname"]
        .apply(
            lambda x: max(
                [(value, fuzz.ratio(x, value)) for value in df2["lname"]],
                key=lambda x: x[1],
            )
        )
        .apply(lambda x: x if x[1] > 75 else pd.NA)
    )
    
    df1[["df2_lname", "fuzz_ratio_lname"]] = pd.DataFrame(
        df1["fuzz_ratio_lname"].tolist(), index=df1.index
    )
    df1 = (
        pd.merge(left=df1, right=df2, how="left", left_on="df2_lname", right_on="lname")
        .drop(columns="lname")
        .rename(columns={"fname": "df2_fname"})
    )
    
    df1["df2_fname"] = df1["df2_fname"].fillna(value="")
    for i, (x, value) in enumerate(zip(df1["df1_fname"], df1["df2_fname"])):
        ratio = fuzz.ratio(x, value)
        df1.loc[i, "fuzz_ratio_fname"] = ratio if ratio > 60 else pd.NA
    
    # Cleanup
    df1["df2_fname"] = df1["df2_fname"].replace("", pd.NA)
    df1 = df1[
        [
            "df1_ein",
            "df1_ein_name",
            "df1_lname",
            "df1_fname",
            "fuzz_ratio_lname",
            "fuzz_ratio_fname",
            "df2_lname",
            "df2_fname",
            "score",
        ]
    ]
    
    print(df1)
    # Output
       df1_ein         df1_ein_name df1_lname df1_fname  fuzz_ratio_lname  \
    0     1001       H for Humanity    Cooper   Bradley              83.0   
    1     1500          Labor Union    Cruise    Thomas             100.0   
    2     3000  Something something      Pitt      Brad               NaN   
    
      fuzz_ratio_fname df2_lname df2_fname  score  
    0             62.0    Couper    M Brad    2.5  
    1             67.0    Cruise       Tom    3.5  
    2             <NA>      <NA>      <NA>    NaN