I have a dataset of instances from a binary classification problem. The twist is that I only have instances from the positive class and none from the negative one. Or rather, I want to extract, from the negatives, those that are closest to the positives.
To make this concrete, let's say we have data on people who bought from our store and asked for a loyalty card, either at the time of purchase or later, of their own volition. Privacy concerns aside (it's just an example), we have different attributes like age, postcode, etc.
The other set, continuing with our example, consists of clients who did not apply for the card.
What we want is to find a subset of those that are most similar to the ones that applied for the loyalty card in the first group, so that we can send them an offer to apply for the loyalty program.
It's not exactly a classification problem because we are trying to get instances from within the group of "negatives".
It's not exactly clustering, which is typically unsupervised, because we already know a cluster (the loyalty card clients).
I thought about using kNN, but I don't really know what my options are here.
I would also like to know whether (and how) this can be achieved with Weka or another Java library, and whether I should normalize all the attributes.
You could use anomaly detection algorithms. These algorithms tell you whether your new client belongs to the group of clients who got a loyalty card or not (in which case they would be an anomaly).
There are two basic ideas (coming from the article I linked below):
The Statistical Approach: You transform the feature vectors of your positively labelled data (clients with a card) into a vector space of lower dimensionality (e.g. using PCA). You can then estimate the probability distribution of the transformed data and check whether a new client belongs to the same distribution. Alternatively, you can compute the distance of a new client to the centroid of the transformed data and, using the standard deviation of the distribution, decide whether it is still close enough.
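A minimal sketch of the centroid-distance variant using scikit-learn, assuming purely numerical features; the variable names (`X_card`, `X_candidates`) and the random data are illustrative stand-ins for your real attributes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_card = rng.normal(size=(200, 5))        # clients who have a card (positives)
X_candidates = rng.normal(size=(50, 5))   # clients without a card

# Standardize using statistics of the positives, then project to 2 dimensions.
scaler = StandardScaler().fit(X_card)
pca = PCA(n_components=2).fit(scaler.transform(X_card))
Z_card = pca.transform(scaler.transform(X_card))
Z_cand = pca.transform(scaler.transform(X_candidates))

# Distance of each point to the centroid of the card holders.
centroid = Z_card.mean(axis=0)
dist_card = np.linalg.norm(Z_card - centroid, axis=1)

# Threshold chosen from the positives' own spread, e.g. two standard deviations.
threshold = dist_card.mean() + 2 * dist_card.std()

dist_cand = np.linalg.norm(Z_cand - centroid, axis=1)
similar = dist_cand <= threshold   # candidates worth sending the offer to
```

The two-standard-deviation cutoff is just one reasonable choice; you could instead rank all candidates by `dist_cand` and take the top k, which directly matches "find the subset most similar to the card holders".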
The Machine Learning Approach: You train an auto-encoder network on the data of clients with a card. An auto-encoder has a bottleneck in its architecture: it compresses the input into a feature vector of lower dimensionality and then tries to reconstruct the input from that compressed vector. If training is done correctly, the reconstruction error for inputs similar to the clients-with-card dataset should be smaller than for inputs that are not similar to it (hopefully, clients who do not want a card).
Have a look at this tutorial for a start: https://towardsdatascience.com/how-to-use-machine-learning-for-anomaly-detection-and-condition-monitoring-6742f82900d7
Both methods require standardizing the attributes first (e.g. zero mean and unit variance); otherwise, distances and reconstruction errors are dominated by whichever attributes happen to have the largest numeric ranges.