mysql ruby-on-rails matching similarity affiliate

Rails: A way to check for duplicate item in DB? Affiliate data feeds


I have a problem with affiliate data feeds, for example from Amazon or other e-shop partners. I am importing their product data, but I want to avoid duplicates when two shops sell the same product.

For example, Amazon uses the product title: iPhone 5 16GB Black

and another shop uses the product title: iPhone 5 16GB.

They should be listed as one product. Now imagine I have 10 shops selling the iPhone 5.

Of course, there are many more parameters. Still, I need an algorithm to prevent duplicates, such as a similarity match on the product parameters.

Does anyone have experience with this and can tell me what kind of algorithm is advisable for this scenario?

A detailed list of parameters can be found here: GET Products Documentation WebApi

Thank you very much!

It could be done by EAN number, but what if this number is not provided?


Solution

  • Before developing an algorithm, you need to define business rules. If all attributes match except the title, you can try a substring match (one title is contained in the other) or a fuzzy match on the title.

    We are using the fuzzy-string-match gem to find duplicate companies.

    Assuming that the discrepancy is only in the title, you can put more intelligence into the algorithm by analyzing the parts of the title. In your example, the title parts could be model, version, capacity and color. For this example:

     required_attributes = [:model, :version, :capacity]
     optional_attributes = [:color]
    

    Then define these attributes for each product category. Combined with fuzzy matching, this should give a good match even with spelling errors, so the following titles should all match:

     iPhone 5 16GB Black
     iPhone 5 16GB
     iPone 5 16GB White
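
The approach above could be sketched roughly as follows. This is only an illustration under several assumptions: the helper names (`levenshtein`, `similarity`, `parse_title`, `same_product?`), the vocabulary lists and the 0.8 threshold are all hypothetical, and a small pure-Ruby Levenshtein distance stands in for the fuzzy-string-match gem so that the example runs without dependencies. A real feed would need per-category vocabularies and tuning.

```ruby
# Hypothetical per-category vocabulary; a real import would load this per product category.
KNOWN_MODELS = %w[iPhone Galaxy].freeze
KNOWN_COLORS = %w[black white].freeze
REQUIRED_ATTRIBUTES = %i[model version capacity].freeze
MODEL_THRESHOLD = 0.8 # assumed cut-off for accepting a misspelled model name

# Edit distance between two strings (classic dynamic programming).
# Stand-in for the gem's matcher, not the gem's own algorithm.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Case-insensitive similarity in 0.0..1.0, where 1.0 means identical.
def similarity(a, b)
  a = a.downcase
  b = b.downcase
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein(a, b).to_f / longest
end

# Split a title into attributes; unknown words are fuzzy-matched against known models.
def parse_title(title)
  attrs = {}
  title.split.each do |token|
    case token
    when /\A\d+(GB|TB)\z/i then attrs[:capacity] = token.upcase
    when /\A\d+[a-z]?\z/i then attrs[:version] = token.downcase
    else
      if KNOWN_COLORS.include?(token.downcase)
        attrs[:color] = token.downcase
      else
        best = KNOWN_MODELS.max_by { |model| similarity(model, token) }
        attrs[:model] = best if similarity(best, token) >= MODEL_THRESHOLD
      end
    end
  end
  attrs
end

# Two titles describe the same product when every required attribute is present and equal.
# Optional attributes such as color are parsed but not compared.
def same_product?(title_a, title_b)
  a = parse_title(title_a)
  b = parse_title(title_b)
  REQUIRED_ATTRIBUTES.all? { |key| !a[key].nil? && a[key] == b[key] }
end
```

With these assumptions, `same_product?("iPhone 5 16GB Black", "iPone 5 16GB White")` treats the two titles as one product, while a 16GB and a 32GB listing stay separate because capacity is a required attribute. When an EAN is present on both records, comparing it first and falling back to this kind of title matching would be the cheaper check.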