machine-learning, svm, similarity

Techniques for similarity matching to find similar customers with non-textual attributes


I am a beginner in machine learning and its techniques.

I need suggestions for building a model. Here is the problem statement:

I have a data set of customers who own all four products of a particular company X. Call this set Cust4.
I also have another data set of customers who own only three products of the same company X. Call this set Cust3.
I have collected numerous categorical and numerical attributes for both data sets (there is no text data).
I would like to sell the fourth product to the customers who own three, so I want to know how similar each Cust3 customer is to the Cust4 set. That way I can sell only to customers who are highly similar to the customers in Cust4.

Is there a technique (or techniques) that would tell me that a particular test customer in Cust3 is, say, 70% or 80% similar to the Cust4 set?

Research so far:
I am trying to frame this as a one-class classification problem and have looked into one-class classification, especially the One-Class SVM (in R). This does build a model and classify the data, but the R package e1071 does not support probability predictions for it for now.

Pointers to other techniques that might work for this kind of problem would be helpful. I appreciate all the help.


Solution

  • Of course, this is a one-class classification (or look-alike) problem, because you are looking for customers who look like Cust4. You will not get a probability, because you have no prior probability for the 4th product, but you can get a similarity distance between the characteristics of Cust3 and Cust4 customers.

    For that I recommend a clustering algorithm:

    1. First, cluster your Cust4 customers (into one or more clusters). You will get one or more centroids (cluster centers).

    2. For each customer in Cust3, compute the distance between that customer and each centroid (use the same variables that were used for the clustering). If the distance to the nearest centroid is less than a certain threshold, that customer is a good prospect for product 4.
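
The two clustering steps above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: it uses a plain k-means written with the Python standard library, and it assumes the attributes are already numeric and on comparable scales (categorical attributes would first need some encoding). The data, the choice of `k=2`, and the `threshold` value are all made up for the example, and it treats a smaller distance as meaning "more similar to Cust4".

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iters):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids


def nearest_centroid_distance(customer, centroids):
    """Distance from one customer to the closest Cust4 centroid."""
    return min(euclidean(customer, c) for c in centroids)


# Hypothetical, already-scaled numeric attributes (two per customer).
cust4 = [[0.9, 0.8], [1.0, 0.9], [0.8, 1.0], [0.1, 0.2], [0.2, 0.1], [0.0, 0.0]]
cust3 = {"a": [0.85, 0.9], "b": [0.5, 0.5], "c": [3.0, 3.0]}

# Step 1: cluster Cust4 only.
centroids = kmeans(cust4, k=2)

# Step 2: score each Cust3 customer by distance to the nearest centroid.
threshold = 0.3  # assumed cut-off; in practice it would be tuned on real data
distances = {name: nearest_centroid_distance(x, centroids) for name, x in cust3.items()}
targets = [name for name, d in distances.items() if d <= threshold]

print(distances)
print(targets)
```

One way to pick the threshold, under the same assumptions, would be to look at how far the Cust4 customers themselves sit from their own centroids and use a high percentile of those distances.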

    There are also other techniques, such as k-nearest neighbors, but they are very expensive in computation time.
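
As a rough illustration of the k-nearest-neighbors idea, and of why it costs more, the sketch below scores a Cust3 customer by the mean distance to its k closest Cust4 customers. Every Cust3 customer is compared against every Cust4 customer, which is where the computation time goes. The function name `knn_distance`, the data, and `k=3` are assumptions made for the example, and the attributes are again assumed to be numeric and scaled.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_distance(customer, reference, k=3):
    """Mean distance from `customer` to its k nearest rows in `reference`."""
    dists = sorted(euclidean(customer, r) for r in reference)
    nearest = dists[:k]  # fewer than k rows are used if `reference` is small
    return sum(nearest) / len(nearest)


# Hypothetical, already-scaled numeric attributes.
cust4 = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
cust3 = {"near": [0.1, 0.1], "far": [4.0, 4.0]}

# Lower score = closer to the Cust4 customers = more similar.
scores = {name: knn_distance(x, cust4, k=3) for name, x in cust3.items()}
print(scores)
```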

    Hope that helps.