machine-learning scikit-learn dataset classification multiclass-classification

Cleveland heart disease dataset - can’t describe the class

I’m using the Cleveland Heart Disease dataset from UCI for classification, but I don’t understand the target attribute.

The dataset description says that the values go from 0 to 4 but the attribute description says:

0: < 50% coronary disease

1: > 50% coronary disease

I’d like to know how to interpret this. Is this dataset meant to be a multiclass or a binary classification problem? And must I group values 1-4 into a single class (presence of disease)?

Solution

If you are working with an imbalanced dataset, a resampling technique can give better results. On imbalanced data, a classifier can end up predicting the most common class for almost every sample, effectively ignoring the features.

You should try SMOTE, which synthesizes new elements for the minority class based on those that already exist. It works by randomly picking a point from the minority class, computing the k nearest neighbors of that point, and adding synthetic points between the chosen point and its neighbors.
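To make the mechanism concrete, here is a minimal pure-Python sketch of the interpolation step. It is not the imbalanced-learn implementation; the function name `smote_sketch`, the toy points, and the choice of `k=3` are assumptions made for illustration only.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority-class points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # 1. Pick a random point from the minority class.
        base = rng.choice(minority)
        # 2. Find its k nearest minority neighbors (excluding the point itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # 3. Pick one neighbor and interpolate at a random position
        #    on the line segment between the two points.
        neighbor = rng.choice(neighbors)
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Toy minority class with two features (hypothetical values).
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
new_points = smote_sketch(minority, n_new=6)
print(new_points)
```

Every synthetic point lies on a segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.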

I also used k-fold cross-validation along with SMOTE. Cross-validation checks that the model has learned patterns that generalize, rather than patterns specific to a single train/test split.

When measuring the performance of the model, the accuracy metric can mislead: it shows high accuracy even when there are many false positives or false negatives. Use metrics such as F1-score and MCC instead.
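A small worked example shows the gap between the metrics. The labels below are made up: 90 negatives, 10 positives, and a classifier that always predicts the majority class. The helper `binary_metrics` is hypothetical and written out by hand so the formulas are visible; it returns 0 for F1 and MCC when their denominators are zero, which is one common convention.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return (accuracy, F1, MCC) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)

    # F1 is the harmonic mean of precision and recall.
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0

    # MCC uses all four cells of the confusion matrix.
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0

    return accuracy, f1, mcc


# Made-up imbalanced labels and an "always predict 0" classifier.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

accuracy, f1, mcc = binary_metrics(y_true, y_pred)
print(accuracy, f1, mcc)  # 0.9 0.0 0.0
```

Accuracy looks good at 0.9, while F1 and MCC are both 0 because no positive case was ever detected.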

References:

https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets