Search code examples
sqlsql-servert-sqlgroupingfuzzy

TSQL grouping on fuzzy column


I would like to group all the merchant transactions from a single table, and just get a count. The problem is, the merchant, let's say redbox, will have a redbox plus the store number added to the end(redbox 4562,redbox*1234). I will also include the category for grouping purpose.

Category      Merchant
restaurant    bruger king 123 main st
restaurant    burger king 456 abc ave
restaurant    mc donalds * 45877d2d
restaurant    mc 'donalds *888544d
restaurant    subway 454545
travelsubway  MTA
gas station   mc donalds gas
travel        nyc taxi
travel        nyc-taxi

The question: How can I group the merchants when they have address or store locations added on to them.All I need is a count for each merchant.


Solution

  • The short answer is there is no way to accurately do this, especially with just pure SQL.

    You can find exact matches, and you can find wildcard matches using the LIKE operator or a (potentially huge) series of regular expressions, but you cannot find similar matches nor can you find potential misspellings of matches.

    There's a few potential approaches I can think of to solve this problem, depending on what type of application you're building.

    First, normalize the merchant data in your database. I'd recommend against storing the exact, unprocessed string such as Bruger King in your database. If you come across a merchant that doesn't match a known set of merchants, ask the user if it already matches something in your database. When data goes in, process it then and match it to an existing known merchant.

    Store a similarity coefficient. You might have some luck using something like a Jaccard index to judge how similar two strings are. Perhaps after stripping out the numbers, this could work fairly well. At the very least, it could allow you to create a user interface that can attempt to guess what merchant it is. Also, some database engines have full-text indexing operators that can descibe things like similar to or sounds like. Those could potentially be worth investigating.

    Remember merchant matches per user. If a user corrects bruger king 123 main st to Burger King, store that relation and remember it in the future without having to prompt the user. This data could also be used to help other users correct their data.

    But what if there is no UI? Perhaps you're trying to do some automated data processing. I really see no way to handle this without some sort of human intervention, though some of the techniques described above could help automate this process. I'd also look at the source of your data. Perhaps there's a distinct merchant ID you can use as a key, or perhaps there exists somewhere a list of all known merchants (maybe credit card companies provide this API?) If there's boat loads of data to process, another option would be to partially automate it using a service such as Amazon's Mechanical Turk.