Tags: algorithm, search, e-commerce, grouping, fuzzy-search

most efficient way to group search results by string similarity


I am working on a SQL Server 2008 database and an ASP.NET MVC e-commerce web app.

I have different users feeding their products into the DB, and I want to compare the prices of products with similar names. I know that string matching is domain-specific, but I still need the best generic solution.

What is the most efficient way to group the search results? Should I compare each of the records recursively using the Levenshtein distance algorithm? Should I do it in the database or in the code? Is there a way to use SSIS Fuzzy Grouping in real time for this task? Is there an efficient way to do it with SQL Server 2008 full-text search?

Edit 1: What about network-graph analysis? If I define a distance matrix using the Levenshtein distance algorithm, I could apply a clustering algorithm (for example, Clauset-Newman-Moore) and separate groups that have no connecting path between them. I have attached Nick Johnson's (see comment) cat-dog example (the red lines are the clusters): by using Clauset-Newman-Moore I get two different clusters, separating cats from dogs. A minimal sketch of this idea appears below the figure.

What do you think?

[Figure: cat-dog similarity graph; the red lines mark the two clusters]
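For illustration, here is a minimal sketch of that approach in Python. It is not a production implementation: the edit-distance threshold of 2 and the toy name list are arbitrary assumptions, and `greedy_modularity_communities` is networkx's implementation of Clauset-Newman-Moore greedy modularity maximization.

```python
# Sketch: build a similarity graph from pairwise Levenshtein distances,
# then cluster it with greedy modularity maximization (Clauset-Newman-Moore).
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def group_names(names, max_distance=2):
    """Link names whose edit distance is small, then find communities."""
    g = nx.Graph()
    g.add_nodes_from(names)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if levenshtein(a, b) <= max_distance:
                g.add_edge(a, b)
    return [set(c) for c in greedy_modularity_communities(g)]


if __name__ == "__main__":
    products = ["cat", "cats", "catt", "dog", "dogs", "dogg"]
    print(group_names(products))  # e.g. [{'cat', 'cats', 'catt'}, {'dog', 'dogs', 'dogg'}]
```

Note that this does O(n²) distance computations, so on a large product table you would want some blocking step (for example, only comparing names that share a token or prefix) before the pairwise pass.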


Solution

  • If you can get ahold of a suitable thesaurus/ontology, that basically provides the best clustering possible, since words are leaves in a concept tree and distance in the tree is the distance between words in a semantic sense. Thus cat and dog aren't nearly as close as tabby and calico (kinds of cat), but they're substantially closer than cat and banana, which themselves are closer than cat (n.) and jump (v.). (A toy sketch of this tree distance follows at the end of this answer.)

    Allowing for small spelling errors (by matching words that aren't in the thesaurus to similarly spelled words that are) could increase robustness, but it could also create unexpected results because of homonyms.

    As for doing it in the database or in code, do it in code. To the extent that you can cache results, that will be faster.
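As a rough illustration of the tree-distance idea (plus a spelling-error fallback), here is a toy sketch in Python; the miniature taxonomy, the helper names, and the `difflib` cutoff are made up for the example rather than taken from any real thesaurus.

```python
# Sketch: word distance = number of edges between two leaves of a concept tree,
# plus a small fuzzy lookup that maps misspelled inputs onto known entries.
import difflib

# child -> parent edges of a made-up concept tree (illustrative only)
PARENT = {
    "tabby": "cat", "calico": "cat",
    "cat": "feline", "dog": "canine",
    "feline": "mammal", "canine": "mammal",
    "mammal": "animal", "animal": "organism",
    "banana": "fruit", "fruit": "plant", "plant": "organism",
}


def path_to_root(word):
    """Chain of ancestors from the word up to the root of the tree."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def tree_distance(a, b):
    """Edges from a up to the lowest common ancestor, then down to b."""
    depth_of = {node: depth for depth, node in enumerate(path_to_root(a))}
    for depth_b, node in enumerate(path_to_root(b)):
        if node in depth_of:
            return depth_of[node] + depth_b
    return float("inf")  # no common ancestor


def fuzzy_lookup(word, vocabulary):
    """Map a possibly misspelled word onto a known entry (stdlib similarity)."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else None


print(tree_distance("tabby", "calico"))       # 2  (siblings under "cat")
print(tree_distance("cat", "dog"))            # 4  (split at "mammal")
print(tree_distance("cat", "banana"))         # 7  (only related via "organism")
print(fuzzy_lookup("bannana", list(PARENT)))  # "banana"
```

In practice you would load the tree from a real ontology (WordNet, a product taxonomy, etc.) and precompute or cache the lookups rather than walking the tree on every comparison.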