WEKA classifier evaluation

I'm trying to evaluate the performance of a classifier using 10-fold CV in WEKA. I have about 32,000 records split across three classes: "po" (~950), "ng" (~1,200) and "ne" (~30,000).

How should I split the dataset for performing CV? Am I right in assuming that for CV I should have a roughly equal number of records for each class, so as to prevent unfair weighting towards the "ne" class?

Solution

First, no: you do not need an equal number of cases in each class. Not all datasets are balanced. That said, with an imbalance this strong the results can look unrealistically good, because a classifier can score high accuracy by mostly predicting "ne". Class imbalance is a common phenomenon, and there are a few tactics to handle it:

1) Resampling the dataset

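Undersampling: deleting records from the majority class.

Oversampling: adding records to the minority classes.

For oversampling, you can use the SMOTE algorithm, which generates synthetic minority-class records instead of just duplicating existing ones.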
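To make the undersampling idea concrete, here is a minimal plain-Java sketch (it does not use the Weka API). It assumes the class labels are available as a list of strings and cuts every class down to the size of the smallest one. The class name UndersampleSketch, its methods and the seed are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper for illustration only; not part of Weka.
class UndersampleSketch {

    /** Returns the indices kept after cutting every class down to the smallest class size. */
    static List<Integer> undersample(List<String> labels, long seed) {
        // Group record indices by class label.
        Map<String, List<Integer>> byClass = new LinkedHashMap<>();
        for (int i = 0; i < labels.size(); i++) {
            byClass.computeIfAbsent(labels.get(i), key -> new ArrayList<>()).add(i);
        }
        // The smallest class sets the target size for all classes.
        int target = Integer.MAX_VALUE;
        for (List<Integer> idx : byClass.values()) {
            target = Math.min(target, idx.size());
        }
        // Shuffle each class and keep only the first `target` indices.
        Random rng = new Random(seed);
        List<Integer> kept = new ArrayList<>();
        for (List<Integer> idx : byClass.values()) {
            Collections.shuffle(idx, rng);
            kept.addAll(idx.subList(0, target));
        }
        Collections.sort(kept);
        return kept;
    }

    /** Builds a label list with the approximate class counts from the question. */
    static List<String> exampleLabels() {
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < 950; i++) labels.add("po");
        for (int i = 0; i < 1200; i++) labels.add("ng");
        for (int i = 0; i < 30000; i++) labels.add("ne");
        return labels;
    }

    public static void main(String[] args) {
        List<Integer> kept = undersample(exampleLabels(), 42L);
        System.out.println("records kept: " + kept.size());
    }
}
```

With the counts from the question this keeps 950 records per class, 2,850 in total, so most of the "ne" data is thrown away. That loss is the usual argument for trying oversampling or the other tactics as well.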

2) Performance Metrics

Metrics such as Cohen's kappa can work well here. Kappa normalizes classification accuracy by the agreement you would expect by chance given the class distribution, so a classifier that only predicts the majority class does not get credit for it.
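As a worked illustration, kappa can be computed from a confusion matrix as (observed - expected) / (1 - expected), where "observed" is plain accuracy and "expected" is the chance agreement derived from the row and column totals. The sketch below is plain Java, not Weka code. The class name KappaSketch is made up, and the matrix describes a hypothetical classifier that labels every record "ne", using the approximate counts from the question.

```java
// Hypothetical helper for illustration only; not part of Weka.
class KappaSketch {

    /** Fraction of records on the diagonal (rows = actual class, columns = predicted class). */
    static double accuracy(long[][] m) {
        long total = 0, correct = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m.length; j++) {
                total += m[i][j];
                if (i == j) correct += m[i][j];
            }
        }
        return correct / (double) total;
    }

    /** Cohen's kappa; returns NaN when chance agreement is exactly 1. */
    static double kappa(long[][] m) {
        int n = m.length;
        long total = 0;
        long[] rowSum = new long[n];
        long[] colSum = new long[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                total += m[i][j];
                rowSum[i] += m[i][j];
                colSum[j] += m[i][j];
            }
        }
        double observed = accuracy(m);
        // Chance agreement: sum over classes of P(actual = c) * P(predicted = c).
        double expected = 0.0;
        for (int c = 0; c < n; c++) {
            expected += (rowSum[c] / (double) total) * (colSum[c] / (double) total);
        }
        return (observed - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        // Class order: po, ng, ne. Every record is predicted as "ne".
        long[][] alwaysNe = {
            {0, 0, 950},
            {0, 0, 1200},
            {0, 0, 30000}
        };
        System.out.println("accuracy = " + accuracy(alwaysNe));
        System.out.println("kappa    = " + kappa(alwaysNe));
    }
}
```

Here accuracy is about 0.93 while kappa is 0, which is why accuracy alone is misleading on this dataset.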

3) Cost-Sensitive Classifier

Weka has a CostSensitiveClassifier that can wrap any classifier and apply a custom penalty matrix for misclassification. The challenge is deciding on the costs, because they should come from the domain rather than from the data.
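To show what a penalty matrix does to a single prediction, here is a small plain-Java sketch of the minimum-expected-cost decision rule. It is not Weka's implementation. The class name MinExpectedCostSketch, the class probabilities and the cost values are all invented for illustration, and I am assuming rows are the actual class and columns the predicted class.

```java
// Hypothetical helper for illustration only; not part of Weka.
class MinExpectedCostSketch {

    /**
     * Returns the index of the class with the lowest expected cost.
     * probs[a] is the estimated probability that the true class is a;
     * cost[a][p] is the penalty for predicting p when the true class is a.
     */
    static int minExpectedCostClass(double[] probs, double[][] cost) {
        int best = -1;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int predicted = 0; predicted < cost[0].length; predicted++) {
            double expected = 0.0;
            for (int actual = 0; actual < probs.length; actual++) {
                expected += probs[actual] * cost[actual][predicted];
            }
            if (expected < bestCost) {
                bestCost = expected;
                best = predicted;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] classes = {"po", "ng", "ne"};
        // Made-up probabilities for one record: the model leans towards "ne".
        double[] probs = {0.15, 0.05, 0.80};
        // Made-up costs: mistaking a true "po" or "ng" for "ne" is 10x worse than other errors.
        double[][] cost = {
            {0, 1, 10},
            {1, 0, 10},
            {1, 1, 0}
        };
        System.out.println("prediction: " + classes[minExpectedCostClass(probs, cost)]);
    }
}
```

With these invented numbers the expected costs are 0.85 for "po", 0.95 for "ng" and 2.0 for "ne", so the record is labelled "po" even though "ne" is the most probable class. With equal costs for every error the same record would be labelled "ne".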

For cross-validation specifically, I found this link useful: http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation
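On the original question of how to split: the usual approach is a stratified split, where every fold keeps roughly the same class ratio as the full dataset. As far as I know, Weka's built-in cross-validation already stratifies when the class is nominal, so the plain-Java sketch below is only meant to show the idea. The class name StratifiedFoldsSketch, its methods and the seed are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper for illustration only; not part of Weka.
class StratifiedFoldsSketch {

    /** Assigns each record a fold number in 0..k-1 so every fold keeps roughly the overall class ratio. */
    static int[] assignFolds(List<String> labels, int k, long seed) {
        // Group record indices by class label.
        Map<String, List<Integer>> byClass = new LinkedHashMap<>();
        for (int i = 0; i < labels.size(); i++) {
            byClass.computeIfAbsent(labels.get(i), key -> new ArrayList<>()).add(i);
        }
        Random rng = new Random(seed);
        int[] fold = new int[labels.size()];
        for (List<Integer> idx : byClass.values()) {
            // Shuffle within the class, then deal the records out to the folds like cards.
            Collections.shuffle(idx, rng);
            for (int j = 0; j < idx.size(); j++) {
                fold[idx.get(j)] = j % k;
            }
        }
        return fold;
    }

    /** Counts how many records of the given class landed in the given fold. */
    static int countInFold(List<String> labels, int[] fold, int foldId, String label) {
        int count = 0;
        for (int i = 0; i < fold.length; i++) {
            if (fold[i] == foldId && labels.get(i).equals(label)) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Approximate class counts from the question.
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < 950; i++) labels.add("po");
        for (int i = 0; i < 1200; i++) labels.add("ng");
        for (int i = 0; i < 30000; i++) labels.add("ne");

        int[] fold = assignFolds(labels, 10, 1L);
        for (String c : new String[] {"po", "ng", "ne"}) {
            System.out.println(c + " in fold 0: " + countInFold(labels, fold, 0, c));
        }
    }
}
```

With the counts from the question, each of the 10 folds ends up with 95 "po", 120 "ng" and 3,000 "ne" records. If you combine this with the resampling from tactic 1, the common advice is to resample only the nine training folds in each round and leave the held-out fold at its natural class ratio, so the evaluation reflects the data the classifier will actually see.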

Hope it helps.