Search code examples
python-3.xpandaspandas-groupbysklearn-pandas

Cleaning up a column based on spelling? Pandas


I've got two very important, user entered, information columns in my data frame. They are mostly cleaned up except for one issue: the spelling, and the way names are written differ. For example I have five entries for one name: "red rocks canyon", "redrcks", "redrock canyon", "red rocks canyons". This data set is too large for me to go through and clean this manually (2 million entries). Are there any strategies to clean these features up with code?

Screen_shot_of_data


Solution

  • I would look into doing phonetic string matching here. The basic idea behind this approach is to obtain a phonetic encoding for each entered string, and then group spelling variations by their encoding. Then, you could choose the most frequent variation in each group to be the "correct" spelling.

    There are several different variations on phonetic encoding, and a great package in Python for trying some of them out is jellyfish. Here is an example of how to use it with the Soundex encoding:

    import jellyfish
    import pandas as pd
    
    data = pd.DataFrame({
        "name": [
            "red rocks canyon",
            "redrcks",
            "redrock canyon",
            "red rocks canyons",
            "bosque",
            "bosque escoces",
            "bosque escocs",
            "borland",
            "borlange"
        ]
    })
    data["soundex"] = data.name.apply(lambda x: jellyfish.soundex(x))
    print(data.groupby("soundex").agg({"name": lambda x: ", ".join(x)}))
    

    This prints:

                                                          name
    soundex                                                   
    B200                                                bosque
    B222                         bosque escoces, bosque escocs
    B645                                     borland, borlange
    R362     red rocks canyon, redrcks, redrock canyon, red...
    

    This definitely won't be perfect and you'll have to be careful as it might group things too aggressively, but I hope it gives you something to try!