sql sql-server data-analysis data-cleaning

Data cleansing - how to decide which names are misspellings or are equivalent but slightly different?

We have table with companies names and a numeric primary key identity. We are cleaning up the data and we have discovered the name column is full of similar names that represent the same company.

E.g. BA and Ba or GTC Ltd and GTC Limited.

Is there anyway using SQL server that we can get counts and summary of all items that have similar names and list of IDs. I wondered if there was some sort of similarity comparison we could set a threshold value for etc

We need to present a list of names to the client that look like they need merged.

Solution

The basic answer is "No". Name rectification is a hard problem. Two names like "GTC Ltd" and "GTC Limited" are more different than "GTC" and "GTE" (by more obvious measures). There are outside service bureaus and special-purpose software for this purpose.

If you are dealing with a smallish amount of data, I would suggest that you alphabetize the values, load them into Excel, and add a column in Excel with the "official" name. You can then re-import this as a table in the database to do what you want. It might help if you remove known suffixes and prefixes, such as "ltd", "bros", "partners" and so on.

If you do try to go down the path of something like soundex(), then be sure that you understand it well. For instance, the soundex() values of the following two strings are the same: "gte, blah blah blah" and "gdteey, junk goes here".