Tags: machine-learning, weka

How to purposely overfit Weka tree classifiers?


I have a binary class dataset (0 / 1) with a large skew towards the "0" class (about 30000 vs 1500). There are 7 features for each instance, no missing values.

When I use the J48 or any other tree classifier, I get almost all of the "1" instances misclassified as "0".

Setting the classifier to "unpruned", setting the minimum number of instances per leaf to 1, setting the confidence factor to 1, adding a dummy attribute with the instance ID number - none of this helped.
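
For reference, the same settings through Weka's Java API look roughly like this (the ARFF path is a placeholder; setCollapseTree only exists in newer Weka releases, where it maps to -O and turns off the subtree-collapsing step that can still shrink an "unpruned" tree):

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OverfitJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);    // class is the last attribute

        J48 tree = new J48();
        tree.setUnpruned(true);      // -U: disable pruning
        tree.setMinNumObj(1);        // -M 1: allow single-instance leaves
        tree.setCollapseTree(false); // -O: newer Weka only; don't collapse subtrees
        tree.buildClassifier(data);

        System.out.println(tree);
    }
}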

I just can't create a model that overfits my data!

I've also tried almost all of the other classifiers Weka provides, but got similar results.

Using IB1 gives 100% accuracy (trainset on trainset), so it's not a problem of multiple instances having the same feature values but different classes.
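
Roughly, that sanity check with Weka's Evaluation class (the ARFF path is a placeholder; IB1 ships with classic Weka releases, and IBk with setKNN(1) is the equivalent in newer ones):

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IB1;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainOnTrainCheck {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        IB1 knn = new IB1(); // 1-nearest-neighbour classifier
        knn.buildClassifier(data);

        // Evaluate on the training data itself: 100% accuracy here means
        // no two instances share feature values but differ in class.
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(knn, data);
        System.out.println(eval.toSummaryString());
    }
}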

How can I create a completely unpruned tree? Or otherwise force Weka to overfit my data?

Thanks.

Update: Okay, this is absurd. I've used only about 3100 negative and 1200 positive examples, and this is the tree I got (unpruned!):

J48 unpruned tree
------------------

F <= 0.90747: 1 (201.0/54.0)
F > 0.90747: 0 (4153.0/1062.0)

Needless to say, IB1 still gives 100% train-on-train accuracy.

Update 2: I don't know how I missed it - unpruned SimpleCart works and gives 100% accuracy train on train; pruned SimpleCart is not as biased as J48 and has decent false-positive and false-negative rates.


Solution

  • The quick and dirty solution is to resample: throw away all but 1500 of your majority-class ("0") examples and train on a balanced data set. I am pretty sure there is a resample component in Weka to do this.
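
    A sketch of how that downsampling might look with Weka's SpreadSubsample filter (the ARFF path is a placeholder; the supervised Resample filter with biasToUniformClass set to 1.0 is an alternative):

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.instance.SpreadSubsample;

    public class BalancedTrain {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("mydata.arff"); // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            // A spread of 1.0 randomly drops majority-class instances
            // until both classes are equally represented.
            SpreadSubsample balance = new SpreadSubsample();
            balance.setDistributionSpread(1.0);
            balance.setInputFormat(data);
            Instances balanced = Filter.useFilter(data, balance);

            J48 tree = new J48();
            tree.buildClassifier(balanced);
            System.out.println(tree);
        }
    }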

    The other solution is to use a classifier with a variable cost for each class. I'm pretty sure libSVM allows you to do this, and I know Weka can wrap libSVM. However, I haven't used Weka in a while, so I can't be of much practical help here.
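
    For completeness, Weka's own CostSensitiveClassifier meta-learner is one way to get per-class costs without leaving Weka. A sketch, where the 20.0 penalty is an arbitrary illustration:

    import weka.classifiers.CostMatrix;
    import weka.classifiers.meta.CostSensitiveClassifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CostSensitiveTrain {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("mydata.arff"); // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            // 2x2 cost matrix: rows = actual class, columns = predicted class.
            // Misclassifying a minority "1" as "0" costs 20 times more than
            // the reverse; tune this ratio to your false-negative tolerance.
            CostMatrix costs = new CostMatrix(2);
            costs.setElement(0, 1, 1.0);  // actual 0, predicted 1
            costs.setElement(1, 0, 20.0); // actual 1, predicted 0

            CostSensitiveClassifier csc = new CostSensitiveClassifier();
            csc.setClassifier(new J48());
            csc.setCostMatrix(costs);
            csc.buildClassifier(data);
            System.out.println(csc);
        }
    }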