Tags: python, machine-learning, nlp, string-matching

Python + Machine Learning : String matching problem


I have been given the following problem to solve:

The company maintains a dataset of specifications for all the products it sells (nearly 4,500 at present). Each customer shares the details (name, quantity, brand, etc.) of the products they want to buy from the company. While entering these details, the customer may misspell the name of a product. A single product can also be referred to in many different ways in the company dataset. For example, red chilly can appear as guntur chilly, whole red chilly, red chilly with stem, red chilly without stem, and so on.

I am completely confused about how to approach this problem. Should I use a machine learning technique? If so, please explain what to do. If the problem can be solved without machine learning, please explain that approach as well. I am using Python.

The challenge: a customer can refer to a product in many ways, and the company also stores a single product in many ways, with variations in name, quantity, unit of measurement, and so on. With a labeled dataset I can work out that red bull energy drink (entered by the customer) is red bull (the label), and that red bull (entered by the customer) is also red bull. But what is the use of finding this label? Red bull also appears under many names in the company dataset, so I would still have to find all the different names under which it appears there.

My approach: I will prepare a Python dictionary like this:

{
"red chilly" : ['red chilly', 'guntur chilly', 'red chilly with stem'],
"red bull" : ['red bull energy drink', 'red bull']
}

Each entry in the dictionary is a product: the key is a kind of stem name for the product, and the value is the list of all possible names for it. Suppose the customer enters a product name, say red bull energy drink. I will check the values under each key. If any value matches, I will know that the product is actually red bull and that it can be referred to as both red bull and red bull energy drink in the company dataset. How is this approach?
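The lookup described above could be sketched roughly as follows. Scanning every key's list for each query works, but inverting the dictionary once (variant name to stem name) makes each lookup a single dictionary access. The fallback to `difflib.get_close_matches` from the standard library is an addition for misspelled input, and the helper names and the `cutoff` value of 0.8 are assumptions that would need tuning on real data.

```python
from difflib import get_close_matches

# Hypothetical catalogue in the shape proposed above: stem name -> known variants.
PRODUCT_VARIANTS = {
    "red chilly": ["red chilly", "guntur chilly", "red chilly with stem"],
    "red bull": ["red bull energy drink", "red bull"],
}


def normalise(name):
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def build_reverse_index(variants_by_product):
    """Map every normalised variant to the stem name it belongs to."""
    index = {}
    for stem, variants in variants_by_product.items():
        for variant in variants:
            index[normalise(variant)] = stem
    return index


def find_product(query, index, cutoff=0.8):
    """Return the stem name for a customer-entered name, or None.

    An exact match is tried first. If it fails, the closest known variant
    is used, provided its similarity ratio reaches `cutoff` (an assumed value).
    """
    key = normalise(query)
    if key in index:
        return index[key]
    close = get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else None


INDEX = build_reverse_index(PRODUCT_VARIANTS)
```

Once `find_product("red chily", INDEX)` has resolved the stem name, `PRODUCT_VARIANTS[stem]` gives every name under which that product appears in the company dataset, which is the second half of the problem.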


Solution

  • Best situation

    The best situation is having access to every name in use for each product. Then all you have to do is check whether the name entered by the user is among the synonyms. 5,000 products with, say, 10 synonyms each (about 50,000 rows) should be easily handled by a database system with a well-designed schema.
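One possible schema for the synonym case, sketched with the standard-library `sqlite3` module and an in-memory database. The table names, column names, and sample rows are all assumptions made for illustration; any relational database would do.

```python
import sqlite3

# In-memory database for illustration; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE synonym (
        name       TEXT PRIMARY KEY,  -- stored lower-cased; the key doubles as an index
        product_id INTEGER NOT NULL REFERENCES product(id)
    );
    """
)
conn.executemany(
    "INSERT INTO product (id, name) VALUES (?, ?)",
    [(1, "red chilly"), (2, "red bull")],
)
conn.executemany(
    "INSERT INTO synonym (name, product_id) VALUES (?, ?)",
    [
        ("red chilly", 1),
        ("guntur chilly", 1),
        ("red chilly with stem", 1),
        ("red bull", 2),
        ("red bull energy drink", 2),
    ],
)


def lookup(connection, user_input):
    """Return the product name whose synonym list contains the input, or None."""
    row = connection.execute(
        "SELECT p.name FROM synonym AS s "
        "JOIN product AS p ON p.id = s.product_id "
        "WHERE s.name = ?",
        (user_input.strip().lower(),),
    ).fetchone()
    return row[0] if row else None
```

Because `synonym.name` is the primary key, each lookup is an indexed equality match, so a table of this size stays fast.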

  • Search engine based solution

    Suppose you don't have the synonyms but do have a detailed English description of each product. Then you can search for the user-entered name in the descriptions. You can use a search engine such as Apache Solr, which uses an inverted index with TF-IDF-based scoring. The document that Solr returns as the top result will be the corresponding product. In short, index your product descriptions into Solr and search for the user-entered product name there. Mind that this is lexical matching, not semantic matching, but lexical matching will suffice as long as your users don't call a banana a "yellow color cylinder shaped fruit".
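To make the idea concrete without setting up Solr, here is a toy stand-in in plain Python. It is not Solr's API and not its exact scoring: it only scores each description by summing term frequency times inverse document frequency over the query words. The descriptions are invented for illustration, and there is no stemming, spelling tolerance, or length normalisation.

```python
import math
from collections import Counter

# Hypothetical product descriptions; real ones would come from the company dataset.
DESCRIPTIONS = {
    "red chilly": "dried whole red chilly from guntur sold with or without stem",
    "red bull": "red bull energy drink in a 250 ml can",
    "banana": "fresh yellow banana fruit sold by the dozen",
}


def tokenize(text):
    """Very naive tokenizer: lower-case and split on whitespace."""
    return text.lower().split()


def build_index(descriptions):
    """Return per-document term counts and an IDF weight for every term."""
    term_counts = {pid: Counter(tokenize(text)) for pid, text in descriptions.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # each document counts once per term
    n_docs = len(descriptions)
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    return term_counts, idf


def search(query, term_counts, idf):
    """Return product ids ranked by summed TF * IDF over the query terms."""
    scores = {}
    for pid, counts in term_counts.items():
        score = sum(counts[t] * idf[t] for t in tokenize(query) if t in counts)
        if score > 0:
            scores[pid] = score
    return sorted(scores, key=scores.get, reverse=True)


TERM_COUNTS, IDF = build_index(DESCRIPTIONS)
```

With this index, `search("red bull drink", TERM_COUNTS, IDF)` ranks the red bull description first, because "bull" and "drink" are rarer across the descriptions than "red" and therefore weigh more. A real search engine adds tokenisation, fuzzy matching, and better ranking on top of the same idea.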

  • ML based

    There are good distributed vector representations called embeddings (word2vec, GloVe). The important property of embeddings is that the distance between related words is small. However, these vectors are not a good fit for you, because what you have are phrases, not words (red is a word, but red chilly is a phrase). There are no good pre-trained phrase-to-vector embeddings available in open source. If you want to use a model based on vector similarity, you will have to build your own phrase2vec model. Assuming you are able to build one, you then have to find the product vector that is closest to the vector of the product name typed by your customer.
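The final nearest-vector step might look like the sketch below. It does not build a phrase2vec model: as a crude stand-in, a phrase vector is the average of its word vectors, and the 3-dimensional word vectors are hand-made toy values, not real word2vec or GloVe output. Only the cosine-similarity search is meant to carry over to a real model.

```python
import math

# Toy, hand-made word vectors; the values are purely hypothetical.
WORD_VECTORS = {
    "red": [0.9, 0.1, 0.0],
    "chilly": [0.8, 0.0, 0.3],
    "guntur": [0.7, 0.1, 0.4],
    "bull": [0.1, 0.9, 0.1],
    "energy": [0.0, 0.8, 0.2],
    "drink": [0.1, 0.7, 0.3],
}


def phrase_vector(phrase):
    """Average the vectors of the known words (a stand-in for a phrase model)."""
    vectors = [WORD_VECTORS[w] for w in phrase.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return None
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_product(query, product_names):
    """Return the product whose vector is most similar to the query's, or None."""
    query_vec = phrase_vector(query)
    if query_vec is None:
        return None
    best_name, best_score = None, -1.0
    for name in product_names:
        vec = phrase_vector(name)
        if vec is None:
            continue
        score = cosine(query_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

With these toy vectors, `nearest_product("energy drink", ["red chilly", "red bull"])` picks red bull even though the query shares no word with the product name, which is the kind of match a purely lexical search would miss.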