I have a data set containing my rules, and I want to generate a decision tree that classifies those rules with at least 100% accuracy, but I can never reach 100%. I set minNumObj to 1 and made the tree unpruned, yet I still only get 84% correctly classified instances.
My attributes are:
@attribute users numeric
@attribute bandwidth numeric
@attribute latency numeric
@attribute mode {C,H,DCF,MP,DC,IND}
Example data:
2,200000,0,C
2,200000,1000,C
2,200000,2000,MP
2,200000,5000,C
2,400000,0,C
2,400000,1000,DCF
Can someone help me understand why I can never get 100% of my instances classified correctly, and how I can achieve that (while still keeping my attributes numeric)?
Thanks
It is sometimes impossible to get 100% accuracy because identical feature vectors can have different labels. I am guessing that in your case `users`, `bandwidth`, and `latency` are the features, while `mode` is the label you are trying to predict. If so, there may be identical values of {`users`, `bandwidth`, `latency`} that happen to have different `mode` labels.
In general, different labels for the same feature vector can arise in several ways: noise or errors in the labeling, features that do not capture everything that determines the label, or genuine nondeterminism in the underlying process.
One thing you can do now is run your training set through the decision tree and inspect the misclassified instances. Try to determine why they are wrong, and check whether any of them exhibit what I described above (namely, data instances with identical features but different labels).
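To check for this directly, you can group your rows by their feature vector and flag any group with more than one label. Here is a minimal Python sketch, assuming your data rows have been parsed into tuples in the same shape as the ARFF data above; the conflicting pair in the sample is invented purely for illustration:

```python
from collections import defaultdict

def find_label_conflicts(rows):
    """Group rows by their feature vector (all columns but the last)
    and return the groups whose label (last column) is not unique."""
    labels_by_features = defaultdict(set)
    for *features, label in rows:
        labels_by_features[tuple(features)].add(label)
    return {feats: labels
            for feats, labels in labels_by_features.items()
            if len(labels) > 1}

# Hypothetical rows: (users, bandwidth, latency, mode).
# The last two rows conflict on purpose to show the output.
rows = [
    (2, 200000, 0, "C"),
    (2, 200000, 1000, "C"),
    (2, 200000, 2000, "MP"),
    (2, 400000, 0, "C"),
    (2, 400000, 0, "DCF"),  # same features as the previous row, different mode
]

conflicts = find_label_conflicts(rows)
print(conflicts)  # a non-empty dict means 100% training accuracy is impossible
```

If this prints any entries, no classifier can reach 100% on that training set, since the same input is mapped to two different outputs; you would need to fix the labels, add a distinguishing feature, or accept the error.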