I'll try to explain, as best I can, my new Python challenge! We have two datasets in Excel for two different retailers (supermarkets). Each contains some information about the retailer's products (name, brand, weight, etc.), but with different structures (see example tables 1 and 2 below).
Brand 1 | Name 1 | Weight 1 |
---|---|---|
Ferrero spa | Nutella | 250g |
Barilla | Pasta Inegrale | 500g |
Coca Cola | Zero Sugar | 500ml |

Brand 2 | Name 2 |
---|---|
Ferrero | Nutella 250gr |
Barilla | Pasta Ineg. 500gr |
Pepsi | Zero Sug. 500gml |
The goal would be an NLP text-similarity algorithm able to identify the products common to the two retailers as accurately as possible. (To be honest, at the moment I can't give you more guidance on the output because I don't know much about this world of ML myself.)
Thank you very much in advance for your help, and good luck!
For that, you can use an NLP library such as spaCy or NLTK to tokenize the product names and compute similarity scores. The idea is to iterate over every pair of products across the two datasets, compute a similarity score for each pair, and treat the pair as the same product when the score exceeds a threshold. The output is the list of products common to both datasets.
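To make the "compare every pair, keep those above a threshold" idea concrete before bringing in a language model, here is a minimal sketch that uses only the standard library. It swaps in `difflib.SequenceMatcher` (character-level similarity) for an NLP similarity score. The function names `similarity` and `match_products`, the 0.6 threshold, and joining name and weight from the first table into one string are assumptions for illustration; the sample strings come from the first two rows of the example tables.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for an NLP score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_products(names1, names2, threshold=0.6):
    """Return (name1, name2, score) for every pair scoring at or above threshold."""
    matches = []
    for n1 in names1:
        for n2 in names2:
            score = similarity(n1, n2)
            if score >= threshold:
                matches.append((n1, n2, round(score, 2)))
    return matches


# Retailer 1: name and weight joined into one string (assumption).
retailer1 = ["Nutella 250g", "Pasta Inegrale 500g"]
# Retailer 2: names as they appear, with the weight embedded.
retailer2 = ["Nutella 250gr", "Pasta Ineg. 500gr"]

matches = match_products(retailer1, retailer2)
for m in matches:
    print(m)
```

The threshold is the knob to tune: too low and unrelated products pair up, too high and abbreviated names such as "Ineg." stop matching. Comparing names alone can also pair products from different brands (the "Zero Sugar" rows in the tables are an example), so a brand check on top of the name score is probably worth adding.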
To do that with pandas and spaCy, you could use something like the following:
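```python
import pandas as pd
import spacy

# The small model ("en_core_web_sm") ships without word vectors, so
# doc.similarity() is unreliable with it; a medium model is used here instead.
nlp = spacy.load("en_core_web_md")

df1 = pd.read_excel("retailer1.xlsx")
df2 = pd.read_excel("retailer2.xlsx")


def preprocess(text):
    # Lowercase, then drop stop words and non-alphabetic tokens.
    # Note: this also drops weights such as "250gr".
    doc = nlp(str(text).lower())
    tokens = [token.text for token in doc if not token.is_stop and token.is_alpha]
    return " ".join(tokens)


# Column names follow the example tables; adjust them to your files.
df1["Name 1"] = df1["Name 1"].apply(preprocess)
df2["Name 2"] = df2["Name 2"].apply(preprocess)


def get_similarity_score(text1, text2):
    return nlp(text1).similarity(nlp(text2))


# Compare every product of retailer 1 with every product of retailer 2.
common_products = []
for _, row1 in df1.iterrows():
    for _, row2 in df2.iterrows():
        similarity_score = get_similarity_score(row1["Name 1"], row2["Name 2"])
        if similarity_score > 0.9:
            # The second retailer has no separate weight column.
            common_products.append((row1["Brand 1"], row1["Name 1"], row1["Weight 1"], row2["Brand 2"], row2["Name 2"]))

for product in common_products:
    print(product)
```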
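One thing the name comparison does not cover is the weight: the first retailer stores it in its own column ("250g"), while the second embeds it in the name ("Nutella 250gr"). Below is a rough sketch of how the weight could be pulled out of a string and normalized so that both sides become comparable. The helper `split_name_and_weight`, the regular expression, and the `UNIT_ALIASES` table are hypothetical and would need extending for the spellings found in the real files.

```python
import re

# Hypothetical unit aliases; extend for the spellings found in the real data.
UNIT_ALIASES = {"g": "g", "gr": "g", "kg": "kg", "ml": "ml", "l": "l"}

# A number (optionally with a decimal part) followed by letters at the end of the string.
WEIGHT_RE = re.compile(r"(\d+(?:[.,]\d+)?)\s*([a-zA-Z]+)\s*$")


def split_name_and_weight(text):
    """Split 'Nutella 250gr' into ('Nutella', '250g').

    The weight is None when no trailing quantity with a known unit is found.
    """
    match = WEIGHT_RE.search(text)
    if not match:
        return text.strip(), None
    number = match.group(1).replace(",", ".")
    unit = UNIT_ALIASES.get(match.group(2).lower())
    if unit is None:
        # Unknown unit (for example a typo such as 'gml'): leave the text untouched.
        return text.strip(), None
    name = text[:match.start()].strip()
    return name, f"{number}{unit}"


name, weight = split_name_and_weight("Nutella 250gr")
print(name, weight)  # Nutella 250g
```

With something like this, the weight from the first retailer's column and the weight extracted from the second retailer's name could be compared exactly, leaving the similarity score to deal only with the name part. Entries with an unrecognized unit come back with `None`, so they can be flagged for manual review instead of being matched silently.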