I've got two very important, user entered, information columns in my data frame. They are mostly cleaned up except for one issue: the spelling, and the way names are written differ. For example I have five entries for one name: "red rocks canyon", "redrcks", "redrock canyon", "red rocks canyons". This data set is too large for me to go through and clean this manually (2 million entries). Are there any strategies to clean these features up with code?
I would look into doing phonetic string matching here. The basic idea behind this approach is to obtain a phonetic encoding for each entered string, and then group spelling variations by their encoding. Then, you could choose the most frequent variation in each group to be the "correct" spelling.
There are several different variations on phonetic encoding, and a great package in Python for trying some of them out is jellyfish. Here is an example of how to use it with the Soundex encoding:
import jellyfish
import pandas as pd
data = pd.DataFrame({
"name": [
"red rocks canyon",
"redrcks",
"redrock canyon",
"red rocks canyons",
"bosque",
"bosque escoces",
"bosque escocs",
"borland",
"borlange"
]
})
data["soundex"] = data.name.apply(lambda x: jellyfish.soundex(x))
print(data.groupby("soundex").agg({"name": lambda x: ", ".join(x)}))
This prints:
name
soundex
B200 bosque
B222 bosque escoces, bosque escocs
B645 borland, borlange
R362 red rocks canyon, redrcks, redrock canyon, red...
This definitely won't be perfect and you'll have to be careful as it might group things too aggressively, but I hope it gives you something to try!