Search code examples
pythonpandasdifffuzzywuzzysequencematcher

Replacing similar strings in the column by using the same for both


I'm encountering the following issue during a small project of mine. I'm having a large dataset where some string values are accidentally not written properly. My goal is to write a function that ensures that all names that look fairly similar (.75) will be looked for in a loop, and will get the same name. In the example below I described a subset of the data where "Bob Fisherman", "Bob Felony" & "Bob Haris" are the correct names. I would like to have the misspelled names changes to the above if they match

Here is a subset of the dataframe:

columns = ["Name", "Type","Amount", "Year"]
data = [("Bob fisherman", "Income", 150, 2022), ("Bob fisherman","Income", 100, 2021), ("Bob Felony", "Income", 100, 2021), ("Bob Felany", "Expense", 50, 2022), ("Bob Haris", "Expense", 100, 2022), ("Bob Disherman", "Expense", 100, 2021)]
data = spark.createDataFrame(data).toDF(*columns)

So eventually I would like to have something like this:

Name Type Amount Year
Bob Fisherman Income 150 2022
Bob Fisherman Income 100 2021
Bob Felony Income 100 2021
Bob Felany Income 50 2022
Bob Haris Income 100 2022
Bob Felony Income 100 2021
Bob Fisherman Income 100 2022

In the example it only goes about Bob. But in the total sample, I have much more names so the use of pre-specified list is not going to cut it unfortunately.

I tried to get some inspiration from the following question but I didn't seem to make it work: Replace similar strings in a column with the same string


Solution

  • With the following toy dataframe:

    import pandas as pd
    
    df = pd.DataFrame(
        [
            ("Bob Fisherman", "Income", 150, 2022),
            ("Bob Heris", "Income", 100, 2021),
            ("Bob Felony", "Income", 100, 2021),
            ("Bob Felany", "Expense", 50, 2022),
            ("Bob Haris", "Expense", 100, 2022),
            ("Bob Disherman", "Expense", 100, 2021),
        ],
        columns=["Name", "Type", "Amount", "Year"],
    )
    print(df)
    # Output
                Name     Type  Amount  Year
    0  Bob Fisherman   Income     150  2022
    1      Bob Heris   Income     100  2021
    2     Bob Felony   Income     100  2021
    3     Bob Felany  Expense      50  2022
    4      Bob Haris  Expense     100  2022
    5  Bob Disherman  Expense     100  2021
    

    Here is one way to do it with the help of SequenceMatcher class from Python standard library difflib module:

    from difflib import SequenceMatcher
    
    def similar(a, b):
        """Get similarity ratio between a and b.
    
        Args:
            a: value.
            b: other value.
    
        Returns:
            Similatity ratio.
    
        """
        return SequenceMatcher(None, a, b).ratio()
    
    
    def replace_similar_values(df, col, value):
        """Main helper function.
    
        Args:
            df: target dataframe.
            col: target column.
            value: value to use for replacement.
    
        Returns:
            Modified dataframe.
    
        """
        df = df.assign(
            Match=df[col].map(
                lambda x: max(
                    [similar(x, value)],
                    key=lambda x: x if x != 1 else 0,
                )
            )
        )
        df["Name"] = df.apply(lambda x: value if x["Match"] >= 0.8 else x["Name"], axis=1)
        df = df.drop(columns="Match")
        return df
    

    And then:

    correct_names = ["Bob Fisherman", "Bob Felony", "Bob Haris"]
    
    for correct_name in correct_names:
        df = replace_similar_values(df, "Name", correct_name)
    
    print(df)
    # Output
                Name     Type  Amount  Year
    0  Bob Fisherman   Income     150  2022
    1      Bob Haris   Income     100  2021
    2     Bob Felony   Income     100  2021
    3     Bob Felony  Expense      50  2022
    4      Bob Haris  Expense     100  2022
    5  Bob Fisherman  Expense     100  2021