python machine-learning nlp artificial-intelligence gensim

What kind of model/technique should I use to compare supermarket product names?


I have a database of supermarket product items (each has a name, description, price, stock, etc.).

I want to compare prices across those supermarkets, but to do that I need to know whether supermarkets A and B are referring to the same product.

For example, supermarket A has a product called "Leche Evaporada GLORIA Azul Paquete 6un Lata 400g" and supermarket B has a product named "Leche Evaporada Gloria Azul Pack 6 Unid x 400 g", and both refer to the same product.

I figured that I will need some kind of semantic comparison for these cases. I'm new to this kind of problem, so I don't really know what the best solution is that neither underestimates the problem nor is overkill for it.

What I'm doing right now, with not-so-great results:

  1. Use only the product names.
  2. Remove stop words from those product names.
  3. Convert each name into an array of words.
  4. Count the frequency of every word across all names.
  5. Delete any word with frequency <= 1.
  6. From the remaining words, create a dictionary (bag of words) that I use to map an array of words (a converted name) to a feature vector.
  7. "Train" a TF-IDF model on all the feature vectors.
  8. Make comparisons (with no great results).
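
The steps above can be sketched roughly as follows. This is a dependency-free illustration, not the actual gensim code: the stop-word list, the function names (`tokenize`, `build_tfidf`, `cosine`) and the smoothed IDF formula are all assumptions made for the example, and gensim's own TF-IDF weighting differs in detail.

```python
import math
import re
from collections import Counter

# Assumed, illustrative stop-word list; a real one would be larger.
STOP_WORDS = {"de", "con", "y", "sin", "x"}


def tokenize(name):
    """Steps 2-3: lowercase, split into words, drop stop words."""
    return [t for t in re.findall(r"\w+", name.lower()) if t not in STOP_WORDS]


def build_tfidf(names, min_count=2):
    """Steps 4-7: filter rare words, then build L2-normalised TF-IDF vectors."""
    docs = [tokenize(n) for n in names]

    # Steps 4-5: count every word and drop those seen only once overall.
    freq = Counter(t for d in docs for t in d)
    docs = [[t for t in d if freq[t] >= min_count] for d in docs]

    # Steps 6-7: document frequency and a smoothed IDF (assumed variant,
    # so that a word present in every name still keeps some weight).
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(len(docs) / df[t]) + 1.0 for t in df}

    vectors = []
    for d in docs:
        vec = {t: count * idf[t] for t, count in Counter(d).items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Step 8: cosine similarity of two already-normalised sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


names = [
    "Leche Evaporada GLORIA Azul Paquete 6un Lata 400g",
    "Leche Evaporada Gloria Azul Pack 6 Unid x 400 g",
    "Yogurt Griego Gloria con Miel y Granola Vaso 115 g",
    "Yogurt Griego GLORIA Batido con Miel Vaso 115g",
]
vectors = build_tfidf(names)
print(cosine(vectors[0], vectors[1]))  # the two evaporated-milk names
print(cosine(vectors[0], vectors[2]))  # milk vs. yogurt
```

Note how step 5 throws away tokens such as "400g" and "400" because each spelling occurs only once, which is one plausible reason the comparisons lose useful information.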

I'm using Python as the programming language and gensim to create the models and dictionaries (bag of words) and to make the comparisons.

EDIT: More examples:

Leche Fresca UHT GLORIA Entera Bolsa 946ml == Leche Entera UHT Gloria Bolsa 946 ml
Yogurt Griego Gloria con Miel y Granola Vaso 115 g == Yogurt Griego GLORIA Batido con Miel Vaso 115g
Leche sin Lactosa GLORIA Mocaccino Botella 330ml == Shake Mocaccino UHT Gloria Frasco 330 ml.

Solution

  • I think a good solution for this problem is to compare the products based on a similarity score. For instance, I would use the Jaro-Winkler distance to compare two product descriptions and, if their similarity reaches a defined threshold, treat them as the same product and compare the prices.
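
A minimal sketch of that idea, assuming a plain-Python implementation of Jaro-Winkler similarity (1.0 means identical). The helper names `normalize` and `is_same_product` and the threshold of 0.85 are hypothetical choices for illustration; a suitable threshold would have to be tuned on pairs that are known to match or not match.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def normalize(name):
    """Hypothetical clean-up: lowercase and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def is_same_product(name_a, name_b, threshold=0.85):
    """Treat two names as the same product if they are similar enough.

    The default threshold is an assumed value, not a recommendation.
    """
    return jaro_winkler(normalize(name_a), normalize(name_b)) >= threshold


score = jaro_winkler(
    normalize("Leche Evaporada GLORIA Azul Paquete 6un Lata 400g"),
    normalize("Leche Evaporada Gloria Azul Pack 6 Unid x 400 g"),
)
print(score)
```

Only pairs for which `is_same_product` returns `True` would then go on to the price comparison.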