python machine-learning sentence-similarity

Find Similarity on Excel Column - Brand, Product Name and Weight

I'll try to explain my new Python challenge as well as I can. We have two Excel datasets from two different retailers (supermarkets). Each one contains information about its products (name, brand, weight, etc.), but the two are structured differently (see example tables 1 and 2 below).

Brand 1	Name 1	Weight 1
Ferrero spa	Nutella	250g
Barilla	Pasta Inegrale	500g
Coca Cola	Zero Sugar	500ml

Brand 2	Name 2
Ferrero	Nutella 250gr
Barilla	Pasta Ineg. 500gr
Pepsi	Zero Sug. 500gml

The goal is an NLP text-similarity algorithm that identifies the products common to the two retailers as accurately as possible. (To be honest, I can't give more guidance on the expected output at the moment, because I don't know much about the world of ML yet.)

Thank you very much in advance for your help, and good luck!

Solution

You can use an NLP library such as spaCy or NLTK to tokenize the product names and compute similarity scores between them. The code below iterates over every pair of products from the two datasets, computes a similarity score for each pair, and treats a pair as the same product when the score exceeds a threshold. At the end, it prints the list of products common to both datasets.

Therefore, you will need to do the following:

import pandas as pd
import spacy

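# Note: the small model ships without static word vectors, so similarity()
# scores are rough; a model with vectors (e.g. "en_core_web_md") may work better.
nlp = spacy.load("en_core_web_sm")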

df1 = pd.read_excel("retailer1.xlsx")
df2 = pd.read_excel("retailer2.xlsx")

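def preprocess(text):
  # Lowercase the text so the comparison is case-insensitive
  text = text.lower()

  # Tokenize, then drop stop words and non-alphabetic tokens such as numbers and punctuation
  doc = nlp(text)
  tokens = [token.text for token in doc if not token.is_stop and token.is_alpha]
  return " ".join(tokens)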

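# The column names used here ("Name", "Brand", "Weight") are assumptions;
# adjust them to the actual headers in your files (e.g. "Name 1", "Name 2").
df1["Name"] = df1["Name"].apply(preprocess)
df2["Name"] = df2["Name"].apply(preprocess)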

def get_similarity_score(text1, text2):
  doc1 = nlp(text1)
  doc2 = nlp(text2)
  return doc1.similarity(doc2)

common_products = []
for index1, row1 in df1.iterrows():
  for index2, row2 in df2.iterrows():
    similarity_score = get_similarity_score(row1["Name"], row2["Name"])
    if similarity_score > 0.9:
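      # The second dataset may have no separate "Weight" column, so .get() is used to avoid a KeyError
      common_products.append((row1["Brand"], row1["Name"], row1["Weight"], row2["Brand"], row2["Name"], row2.get("Weight")))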

for product in common_products:
  print(product)
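
Vector-based similarity is aimed at word meaning, so it may struggle with abbreviations such as "Ineg." or unit variants such as "250gr". As a possible alternative, here is a minimal sketch that compares the strings character by character with `difflib.SequenceMatcher` from the standard library. The sample rows are copied from the tables in the question. The helper names (`normalize`, `similarity`, `match_products`), the "gr" to "g" rule, and the 0.75 threshold are my own assumptions for illustration, not something taken from your data, so tune them on real rows.

```python
import re
from difflib import SequenceMatcher

# Sample rows copied from the example tables in the question.
retailer1 = [
    ("Ferrero spa", "Nutella", "250g"),
    ("Barilla", "Pasta Inegrale", "500g"),
    ("Coca Cola", "Zero Sugar", "500ml"),
]
retailer2 = [
    ("Ferrero", "Nutella 250gr"),
    ("Barilla", "Pasta Ineg. 500gr"),
    ("Pepsi", "Zero Sug. 500gml"),
]


def normalize(text):
    """Lowercase, rewrite '<number>gr' as '<number>g', and strip punctuation."""
    text = text.lower()
    text = re.sub(r"(\d+)\s*gr\b", r"\1g", text)  # assumed unit rule: "250gr" -> "250g"
    text = re.sub(r"[^a-z0-9 ]", " ", text)       # replace punctuation with spaces
    return " ".join(text.split())                 # collapse repeated spaces


def similarity(a, b):
    """Character-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def match_products(rows1, rows2, threshold=0.75):
    """For each row of rows1, keep its best-scoring row of rows2 if it reaches the threshold."""
    matches = []
    for brand1, name1, weight1 in rows1:
        # Compare brand, name and weight together, since retailer 2 stores
        # the weight inside the name column.
        key1 = normalize(f"{brand1} {name1} {weight1}")
        best_score, best_row = 0.0, None
        for brand2, name2 in rows2:
            key2 = normalize(f"{brand2} {name2}")
            score = similarity(key1, key2)
            if score > best_score:
                best_score, best_row = score, (brand2, name2)
        if best_row is not None and best_score >= threshold:
            matches.append(((brand1, name1, weight1), best_row, round(best_score, 2)))
    return matches


matches = match_products(retailer1, retailer2)
for left, right, score in matches:
    print(left, "<->", right, score)
```

On the three sample rows this pairs the Ferrero and Barilla products and leaves Coca Cola unmatched, because its best candidate (the Pepsi row) stays below the threshold. With real data you would read the rows from the Excel files instead of hard-coding them, and for large catalogues the nested loop gets slow, so comparing only products with a similar brand first is one way to cut the number of pairs.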