machine-learning · scikit-learn · classification · multilabel-classification · multiclass-classification

Balance problem for classification on Cleveland Dataset


I’ve questioned the way the famous Cleveland heart disease dataset labels its objects here.

This dataset is very imbalanced (most objects belong to the “no disease” class). I noticed that many papers using this dataset combine all the disease classes and reduce the task to a binary classification (disease vs. no disease).

Are there other ways to deal with this class imbalance problem, rather than reducing the number of classes, to get good results from a classifier?


Solution

  • Generally speaking, when handling an imbalanced dataset, one option is to use an unsupervised learning approach.

    You may use the Multivariate Normal Distribution. In your case, where one class has many elements and the other very few, a standard supervised learning method is not well suited. The Multivariate Normal Distribution, used as an unsupervised approach, may be the solution: the algorithm learns from the data the parameters that describe its dominant part (here, the “no disease” cases). Once those parameters are estimated, you can search for the elements that do not fit them; these are the so-called “abnormal elements” or “anomalies”, which in your case are the “disease” individuals. A minimal sketch of this idea follows.
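    The sketch below is only an illustration of that density-based idea, assuming `X_train` holds mostly “no disease” rows and `X_val` is a labelled validation split; the variable names and the threshold `epsilon` are placeholders, not part of the original question.

    ```python
    # Hedged sketch: fit a multivariate Gaussian to the dominant "no disease"
    # pattern and flag low-density rows as anomalies ("disease" candidates).
    import numpy as np
    from scipy.stats import multivariate_normal

    def fit_gaussian(X):
        """Estimate the mean vector and covariance matrix of the data."""
        mu = X.mean(axis=0)
        sigma = np.cov(X, rowvar=False)
        return mu, sigma

    def flag_anomalies(X, mu, sigma, epsilon):
        """Return True for rows whose density under the Gaussian is below epsilon."""
        density = multivariate_normal(mean=mu, cov=sigma).pdf(X)
        return density < epsilon

    # Usage (epsilon is typically chosen on the labelled validation split,
    # e.g. by scanning a grid of thresholds and keeping the best F1 score):
    # mu, sigma = fit_gaussian(X_train)
    # is_disease = flag_anomalies(X_val, mu, sigma, epsilon=1e-4)
    ```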

    A second solution would be to balance your dataset and keep the original supervised learning algorithm. You can do that with the following techniques. These suggestions are generally sound, but they depend a lot on the data you have (mind, I do not have access to your input data!), so you should test them and see which one best fits your purpose; a resampling sketch follows the list.

    1. Collect more elements for the class that has few of them.

    2. Duplicate the elements of the smaller class until both classes have as many elements as the larger class. There is a problem with this solution when the two classes differ greatly in size and you use a neural network: the class built from duplicated elements will have little variety, and neural networks give good results only when trained on a large amount of varied data.

    3. Use less data from the larger class, so that both classes have as many elements as the smaller class. Here too there can be a problem when using a neural network, because training it with less data might not give good results. Also be careful to keep more input elements than features, otherwise it will not work.
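    As a rough sketch of options 2 and 3, resampling can be done with `sklearn.utils.resample`; the DataFrame `df`, the `target` column name and the label encoding below are assumptions about your data, not part of the original question.

    ```python
    # Hedged sketch: oversample the minority class (option 2) or undersample
    # the majority class (option 3) so both classes end up the same size.
    import pandas as pd
    from sklearn.utils import resample

    majority = df[df["target"] == 0]   # "no disease" rows (assumed encoding)
    minority = df[df["target"] != 0]   # any disease class

    # Option 2: duplicate minority rows by sampling with replacement.
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=42)
    oversampled = pd.concat([majority, minority_up])

    # Option 3: keep only as many majority rows as there are minority rows.
    majority_down = resample(majority, replace=False,
                             n_samples=len(minority), random_state=42)
    undersampled = pd.concat([majority_down, minority])
    ```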