I'm trying to evaluate the performance of a classifier using 10-fold CV in WEKA. I have 32,000 records split across three different classes, "po", "ng", "ne". po: ~950, ng: ~1,200, ne: ~30,000.
How should I split the dataset for performing CV? Am I right in assuming that for CV I should have a roughly equal number of records for each class, so as to prevent unfair weighting towards the "ne" class?
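First, no: you do not need an equal number of cases in each class. Not all datasets are balanced, and for a nominal class Weka's cross-validation stratifies the folds by default, so each fold keeps roughly the original class proportions. You are right, though, that the imbalance can give an unrealistic picture: a classifier that always predicts "ne" would already be about 93% accurate on your data. Class imbalance is a common phenomenon, and there are a few tactics for handling it: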
1) Resampling the dataset
Undersampling: deleting records from the majority class.
Oversampling: adding records to the minority classes.
The SMOTE algorithm can do the oversampling for you by generating synthetic minority-class records instead of plain copies.
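To make the oversampling idea concrete, here is a minimal sketch of the interpolation step behind SMOTE: pick a minority record, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. This is a simplified illustration, not Weka's SMOTE filter; the function name `smote_like`, the parameter `k` and the sample points are made up for the example.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Simplified illustration of the SMOTE idea, not a full implementation.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical 2-D minority-class points, for illustration only.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
print(len(new_points), "synthetic minority points created")
```

Because every synthetic point lies between two real minority records, the new records stay inside the region the minority class already occupies.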
2) Performance Metrics
Plain accuracy is misleading on imbalanced data. Metrics such as Cohen's kappa work much better, because kappa normalizes classification accuracy by the agreement you would expect by chance given the class distribution.
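As a rough worked example, the sketch below computes accuracy and Cohen's kappa from a confusion matrix, using the approximate class counts from the question and a hypothetical classifier that always predicts "ne". The function name `accuracy_and_kappa` is just for illustration.

```python
def accuracy_and_kappa(confusion):
    """confusion[i][j] = number of records of actual class i predicted as class j."""
    n = sum(sum(row) for row in confusion)
    k = len(confusion)
    # Observed agreement: fraction of records on the diagonal (plain accuracy).
    observed = sum(confusion[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the actual and predicted class totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    if expected == 1.0:
        return observed, 0.0  # kappa is undefined here; report 0 by convention
    return observed, (observed - expected) / (1.0 - expected)

# Classes ordered po, ng, ne; hypothetical classifier that always says "ne".
always_ne = [
    [0, 0, 950],
    [0, 0, 1200],
    [0, 0, 30000],
]
acc, kappa = accuracy_and_kappa(always_ne)
print(f"accuracy={acc:.3f} kappa={kappa:.3f}")
```

The accuracy comes out at about 93% while kappa is 0, which shows that this classifier has learned nothing beyond the class distribution.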
3) Cost-sensitive classifier
Weka has a CostSensitiveClassifier that can wrap any classifier and apply a custom penalty matrix for misclassification. The challenge is how to determine the costs, because they should depend on the domain, not on the data.
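To illustrate what a penalty matrix does, here is a small sketch of the minimum-expected-cost decision rule, which is one common way of applying a cost matrix to a classifier's class probabilities. The cost values and the probabilities are hypothetical, and this is not a reproduction of Weka's internals.

```python
CLASSES = ["po", "ng", "ne"]

# COST[actual][predicted]. Hypothetical values: missing a rare "po" or "ng"
# record is treated as far worse than a false alarm on an "ne" record.
COST = {
    "po": {"po": 0, "ng": 5, "ne": 10},
    "ng": {"po": 5, "ng": 0, "ne": 10},
    "ne": {"po": 1, "ng": 1, "ne": 0},
}

def min_expected_cost_class(probs):
    """Return the prediction with the lowest expected misclassification cost."""
    def expected_cost(predicted):
        return sum(probs[actual] * COST[actual][predicted] for actual in CLASSES)
    return min(CLASSES, key=expected_cost)

# Hypothetical class probabilities produced by some base classifier.
probs = {"po": 0.20, "ng": 0.05, "ne": 0.75}
print("most probable class:", max(probs, key=probs.get))
print("cost-sensitive choice:", min_expected_cost_class(probs))
```

With these numbers the most probable class is "ne", but the cost-sensitive choice is "po", because wrongly calling a "po" record "ne" is penalized so heavily. The result is only as good as the costs, which is why they have to come from the domain.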
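For cross-validation combined with resampling, I found this link useful: http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation
Its main point is that oversampling should be done inside each fold, on the training data only. If you oversample before splitting, copies of the same record can end up in both the training and the test set, and the results look better than they really are.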
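A minimal sketch of that procedure is below: build stratified folds first, then oversample only the training part of each fold, so the held-out records are never duplicated into the training data. The function names, the scaled-down class counts and the plain random oversampling are assumptions made for the example; the model fitting itself is left out.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split record indices into k folds that keep the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's records round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def oversample(indices, labels, seed=0):
    """Randomly duplicate minority records until every class matches the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

# Scaled-down stand-in for the class counts in the question.
labels = ["po"] * 95 + ["ng"] * 120 + ["ne"] * 3000

folds = stratified_folds(labels, k=10)
for fold_no, test_idx in enumerate(folds):
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    train_balanced = oversample(train_idx, labels)  # training data only
    # A model would be fitted on train_balanced and evaluated on test_idx here.
    if fold_no == 0:
        print("test fold:", dict(Counter(labels[i] for i in test_idx)))
        print("balanced training set:", dict(Counter(labels[i] for i in train_balanced)))
```

The test fold keeps the original, imbalanced proportions, which is what you want: the evaluation should reflect the data the classifier will actually see.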
Hope it helps.