validation tree weka cross-validation pruning

How does cross-validation work for these two trees?


I have a decision tree (ID3 or J48) in Weka. The training set has only 25 instances, and the tree reaches 100% accuracy on it. I think this is too high for training-set accuracy. How can I tell whether the tree is overfitting? (I have to take my test set from these 25 training instances, because I don't have any separate test data.) I know cross-validation helps guard against overfitting, but I want to show that the tree overfits before I rely on cross-validation. I actually pruned this tree and compared the cross-validation accuracy of the pruned and unpruned trees, but I can't explain or understand how the accuracy should change between the overfitted tree and the pruned tree. (In this case I know that my tree overfits, but how can I infer that?) Is there another way? Can you suggest one? Note that no test data is available.


Solution

  • This is what I would do:

    1. Take the 25 data points and run 10-fold cross-validation. Record the accuracy (accuracy is a fair measure provided that your classes are balanced or nearly balanced).
    2. Compare the training accuracy with the cross-validation accuracy. If they differ significantly (say 100% training accuracy vs. 85% cross-validation accuracy), that is a sign of overfitting to me. From that point on, I would try to collect more data points and plot learning curves as the data set grows.

    NOTE: If you do not have any test data, then CV is the only choice, and the results you obtain from CV should be treated as test results.
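
The comparison above can be sketched outside Weka in plain Python. This is only an illustration under stated assumptions: the 25 instances are synthetic (random categorical features, labels with some noise), the tree is a small ID3-style learner written for this example rather than Weka's ID3 or J48, the folds are not stratified, and `max_depth` is a simple depth limit used as a stand-in for pruning, not J48's actual pruning procedure. All function names are made up for the sketch.

```python
import random
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def build_tree(rows, labels, features, max_depth=None, depth=0):
    """ID3-style tree on categorical features.

    Returns a class label (leaf) or a (feature, branches, majority) tuple.
    max_depth is a crude depth limit standing in for pruning (assumption).
    """
    majority = Counter(labels).most_common(1)[0][0]
    depth_reached = max_depth is not None and depth >= max_depth
    if len(set(labels)) == 1 or not features or depth_reached:
        return majority

    def split_entropy(f):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[f], []).append(y)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

    best = min(features, key=split_entropy)  # lowest entropy = highest gain
    rest = [f for f in features if f != best]
    branches = {}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     rest, max_depth, depth + 1)
    return (best, branches, majority)


def predict(tree, row):
    while isinstance(tree, tuple):
        feature, branches, majority = tree
        # Unseen feature value: fall back to the node's majority class.
        tree = branches.get(row[feature], majority)
    return tree


def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(labels)


def cross_val_accuracy(rows, labels, features, k=10, max_depth=None, seed=0):
    """Plain (unstratified) k-fold CV; every instance is tested exactly once."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    correct = 0
    for fold in range(k):
        test_idx = order[fold::k]
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        tree = build_tree([rows[i] for i in train_idx],
                          [labels[i] for i in train_idx],
                          features, max_depth)
        correct += sum(predict(tree, rows[i]) == labels[i] for i in test_idx)
    return correct / len(rows)


# Synthetic stand-in for the 25 training instances: 4 categorical features,
# all rows distinct, label driven by feature 0 with roughly 20% label noise.
rng = random.Random(42)
rows = []
while len(rows) < 25:
    row = tuple(rng.randrange(3) for _ in range(4))
    if row not in rows:
        rows.append(row)
labels = []
for row in rows:
    y = "yes" if row[0] == 0 else "no"
    if rng.random() < 0.2:
        y = "no" if y == "yes" else "yes"
    labels.append(y)

features = [0, 1, 2, 3]
full_train = accuracy(build_tree(rows, labels, features), rows, labels)
small_train = accuracy(build_tree(rows, labels, features, max_depth=1), rows, labels)
full_cv = cross_val_accuracy(rows, labels, features, k=10)
small_cv = cross_val_accuracy(rows, labels, features, k=10, max_depth=1)

print(f"unpruned:      train={full_train:.2f}  cv={full_cv:.2f}")
print(f"depth-limited: train={small_train:.2f}  cv={small_cv:.2f}")
```

Because every row is distinct, the fully grown tree memorizes the training set and its training accuracy is 100%, just as in the question. The number to look at is the gap between that and the cross-validation accuracy. If the smaller tree gives up training accuracy but matches or beats the full tree under cross-validation, the extra branches were fitting noise rather than structure. With only 25 instances the cross-validation estimate is itself noisy, so it is worth repeating with several values of `seed` before drawing conclusions.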