Tags: machine-learning, weka, decision-tree, j48

Generating a Decision Tree that Perfectly Models the Training Set?


I have a data set of rules, and I want to generate a decision tree that classifies those rules with at least 100% accuracy, but I can never reach 100%. I set minNumObj to 1 and made the tree unpruned, yet I only get 84% correctly classified instances.
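For reference, the settings described above can also be applied from the Weka command line: `-U` builds an unpruned tree and `-M 1` sets minNumObj to 1. This is a sketch; the `weka.jar` classpath and `rules.arff` file name are placeholders for your own setup, and `-no-cv` (evaluate on the training data rather than cross-validating) assumes a recent Weka version:

```shell
# Unpruned J48 with minNumObj = 1, evaluated on the training set itself.
# weka.jar and rules.arff are placeholders for your own paths.
java -cp weka.jar weka.classifiers.trees.J48 -U -M 1 -t rules.arff -no-cv
```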

My attributes are:

@attribute users numeric
@attribute bandwidth numeric
@attribute latency numeric
@attribute mode {C,H,DCF,MP,DC,IND}

ex data:

2,200000,0,C
2,200000,1000,C
2,200000,2000,MP
2,200000,5000,C
2,400000,0,C
2,400000,1000,DCF

Can someone help me understand why I can never get 100% of my instances classified correctly, and how I can achieve that (while still keeping my attributes numeric)?

Thanks


Solution

  • It is sometimes impossible to get 100% accuracy due to identical feature vectors having different labels. I am guessing in your case that users, bandwidth, and latency are the features, while mode is the label that you are trying to predict. If so, then there may be identical values of {users, bandwidth, latency} that happen to have different mode labels.

    In general, different labels for the same features may occur in one of several ways:

    1. There is noise in the data, e.g., bad readings when the data was collected.
    2. There is a source of randomness that is not captured.
    3. There are additional features that could distinguish between the labels, but they are not in your data set.

    One thing you can do now is to run your training set through the decision tree and find the items that were misclassified. Try to determine why they are wrong and see if any data instances exhibit what I wrote above (namely that there are some data instances with the same features but different labels).
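The check above can be automated without Weka: group the rows by feature vector and flag any vector that maps to more than one label. This is a minimal sketch in plain Python; the data below mirrors the question's rows, plus one hypothetical conflicting duplicate added purely to illustrate the output.

```python
from collections import defaultdict

# Toy data in the same shape as the question's ARFF rows:
# (users, bandwidth, latency) are the features, mode is the label.
rows = [
    (2, 200000, 0, "C"),
    (2, 200000, 1000, "C"),
    (2, 200000, 2000, "MP"),
    (2, 200000, 5000, "C"),
    (2, 400000, 0, "C"),
    (2, 400000, 1000, "DCF"),
    (2, 400000, 1000, "C"),  # hypothetical conflicting duplicate
]

def conflicting_feature_vectors(rows):
    """Return feature vectors that appear with more than one label.

    Any such vector makes 100% training accuracy impossible for any
    deterministic classifier, including an unpruned decision tree.
    """
    labels_by_features = defaultdict(set)
    for *features, label in rows:
        labels_by_features[tuple(features)].add(label)
    return {f: sorted(ls) for f, ls in labels_by_features.items() if len(ls) > 1}

for features, labels in conflicting_feature_vectors(rows).items():
    print(features, "->", labels)
```

If this prints nothing on your real data, the conflict explanation does not apply and the remaining accuracy gap must come from the tree itself (e.g., residual pruning or stopping behavior).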